Automatically Constructing a Tagging System Overview
This review delves into the process of automatically constructing a tagging system by examining tag extraction, assignment, and recommendation in software engineering. It discusses the advantages of tagging, current website tagging scenarios, and the techniques used for tag extraction, with a focus on supervised and unsupervised methods. The challenges and improvements in supervised tag extraction are also explored, providing insights into enhancing the efficiency and accuracy of this system. Ultimately, the aim is to enhance the organization and understanding of content through an automated tagging approach.
A Review of Automatically Constructing A Tagging System Cai Xuyang 2/16/2025
Outline Overview Tag Extraction Tag Assignment Tag Recommendation in Software Engineering
Overview What is a tag? Advantages: 1. Easier classification by administrators 2. An abstract representation that is easier for users to understand 3. Discovery of deep relations (ontology / taxonomy) 4. Tracing the evolution history of objects
Overview Current situation on websites: 1. Manual tag systems (Weibo / Twitter / StackOverflow) 2. No tag system (GitHub). Important: automatically constructing a tagging system
Tag Extraction Tag extraction mainly concerns how to select important and topical phrases from a textual description. Supervised: 1. Binary classification. Unsupervised: 1. Graph-based approaches 2. Clustering-based approaches
Tag Extraction Supervised Tag Extraction: a binary classification task. Given a set of candidate phrases, decide whether each phrase is suitable as a tag. Naive Bayes (Domain-specific keyphrase extraction; Kea: Practical automatic keyphrase extraction). Decision Tree (Learning to extract keyphrases from text). Multi-layer Perceptron (Finding advertising keywords on web pages; Re-examining automatic keyphrase extraction approaches in scientific articles)
Tag Extraction Supervised Tag Extraction: Kea. Candidate phrases: 1. Input cleaning 2. Phrase identification (heuristic rules) 3. Case-folding and stemming. Feature calculation: 1. TF-IDF 2. First occurrence. Training. Extraction of new keyphrases.
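The two Kea features can be illustrated with a minimal sketch. It assumes the usual definitions (TF-IDF as the phrase's relative frequency in the document times the negative log of its document frequency, and first occurrence as the fraction of the document that precedes the phrase's first appearance). The function names, the whitespace tokenisation, and the `doc_freq` / `num_docs` parameters are illustrative and not taken from the slides.

```python
import math


def phrase_positions(phrase_tokens, doc_tokens):
    """Return every start index at which the phrase occurs in the document."""
    n = len(phrase_tokens)
    return [i for i in range(len(doc_tokens) - n + 1)
            if doc_tokens[i:i + n] == phrase_tokens]


def kea_features(phrase, doc_tokens, doc_freq, num_docs):
    """Compute (tfidf, first_occurrence) for one candidate phrase.

    doc_freq is the number of corpus documents containing the phrase
    (assumed to be at least 1) and num_docs is the corpus size.
    """
    positions = phrase_positions(phrase.split(), doc_tokens)
    if not positions:
        # Phrase absent: no weight, and "first occurrence" at the very end.
        return 0.0, 1.0
    tf = len(positions) / len(doc_tokens)
    idf = -math.log2(doc_freq / num_docs)
    first_occurrence = positions[0] / len(doc_tokens)
    return tf * idf, first_occurrence


doc = "tag extraction selects tag phrases".split()
print(kea_features("tag", doc, doc_freq=1, num_docs=4))
```

In Kea these two values are the inputs to a Naive Bayes classifier, and the training and extraction steps on the slide operate on such feature pairs.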
Tag Extraction Drawbacks of supervised tag extraction: 1. Training data may be unbalanced 2. It is difficult to distinguish representative tags from the other positive examples 3. A training set with manual annotations is required. Improvement: a pairwise ranking-based approach (A ranking approach to keyphrase extraction). Learning to rank -> Ranking SVM. Training set: individual (non-)keyphrase labels -> (keyphrase, non-keyphrase) pairs
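The change of training set for the pairwise approach can be sketched as follows. This is the standard reduction used with Ranking SVM, in which each (keyphrase, non-keyphrase) pair becomes a feature-difference vector. The function name and the data layout are assumptions made for illustration, not details from the cited paper.

```python
from itertools import product


def to_pairwise(candidates):
    """Turn labelled candidates into pairwise ranking examples.

    candidates: list of (feature_vector, is_keyphrase) for ONE document.
    Returns a list of (difference_vector, label) with label +1 when the
    first element of the pair should rank higher, and -1 otherwise.
    """
    positives = [f for f, is_key in candidates if is_key]
    negatives = [f for f, is_key in candidates if not is_key]
    pairs = []
    for pos, neg in product(positives, negatives):
        diff = [p - n for p, n in zip(pos, neg)]
        pairs.append((diff, 1))                  # keyphrase above non-keyphrase
        pairs.append(([-d for d in diff], -1))   # mirrored pair
    return pairs


examples = to_pairwise([([3, 1], True), ([1, 2], False)])
print(examples)
```

A linear classifier trained on these difference vectors learns a scoring function whose ordering of candidates matters, not the absolute class of any single candidate.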
Tag Extraction Unsupervised Tag Extraction: Graph-based. Main concept: build a graph from the input document and rank its nodes using a graph-based ranking method. TextRank (Textrank: Bringing order into texts): it treats all the phrases as nodes, and phrases that co-occur within a fixed-size window are linked by unweighted edges.
Tag Extraction TextRank 1. It treats all the phrases as nodes, and phrases that co-occur within a fixed-size window are linked by unweighted edges. 2. The traditional PageRank algorithm is applied to this graph. 3. Phrases with high PageRank values are selected as tags. Improvement: neighborhood knowledge sharing (Single document keyphrase extraction using neighborhood knowledge). It leverages a small number of nearest neighboring documents for tag extraction.
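The three TextRank steps can be sketched in a few lines. This is a minimal illustration, not the reference implementation: it uses single tokens as nodes, and the window size, the damping factor of 0.85, and the fixed iteration count are assumed values.

```python
def textrank(tokens, window=2, damping=0.85, iterations=50):
    """Rank tokens with PageRank over an unweighted co-occurrence graph."""
    # Step 1: link tokens that co-occur within the window (undirected edges).
    neighbors = {t: set() for t in tokens}
    for i, t in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            u = tokens[j]
            if u != t:
                neighbors[t].add(u)
                neighbors[u].add(t)

    # Step 2: iterate the PageRank update on this graph.
    scores = {t: 1.0 for t in neighbors}
    for _ in range(iterations):
        updated = {}
        for t in neighbors:
            rank = sum(scores[u] / len(neighbors[u]) for u in neighbors[t])
            updated[t] = (1 - damping) + damping * rank
        scores = updated

    # Step 3: tokens with the highest scores come first.
    return sorted(scores, key=scores.get, reverse=True)


tokens = ["tag", "system", "tag", "extraction", "tag", "ranking"]
print(textrank(tokens)[:2])
```

In practice the candidate nodes are usually filtered first (for example to nouns and adjectives), and adjacent high-ranking words are merged into multi-word tags.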
Tag Extraction Unsupervised Tag Extraction: Clustering-based. Exemplar Terms (Clustering to find exemplar terms for keyphrase extraction): find exemplar terms by leveraging clustering techniques, which ensures that the document is semantically covered. Select candidate terms using heuristic rules. Calculate term relatedness: 1. Cooccurrence-based 2. Knowledge bases. Clustering algorithms applied: 1. Hierarchical Clustering 2. Spectral Clustering 3. Affinity Propagation
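As one possible reading of the cooccurrence-based relatedness step, the sketch below counts how often two candidate terms appear within a small window of each other. The window size, the function name, and the use of raw counts as the relatedness score are assumptions. The resulting scores would then be handed to one of the clustering algorithms listed above.

```python
from collections import Counter


def cooccurrence_relatedness(tokens, candidates, window=3):
    """Count co-occurrences of candidate terms within a sliding window.

    Returns a Counter keyed by frozenset({term_a, term_b}).
    """
    counts = Counter()
    for i, term in enumerate(tokens):
        if term not in candidates:
            continue
        # Look only forward so each co-occurring position pair is counted once.
        for other in tokens[i + 1:i + window]:
            if other in candidates and other != term:
                counts[frozenset((term, other))] += 1
    return counts


tokens = "tag extraction ranks tag phrases".split()
related = cooccurrence_relatedness(tokens, {"tag", "extraction", "phrases"})
print(related[frozenset(("tag", "extraction"))])
```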
Tag Extraction Drawbacks of tag extraction: 1. The description may be inadequate for tag extraction. 2. Tags can only be extracted from the object's textual description. 3. Not erratic
Tag Assignment Tag assignment aims to assign tags to each document, even when these tags do not occur in its description. A predefined dictionary (hierarchical concept tree): Open Directory Project, Wikipedia. The tag assignment problem is a multi-class, multi-label classification problem.
Tag Assignment Solutions to multi-class multi-label problems (Multi-label classification: An overview). Two main categories: Problem transformation methods: transform the multi-label classification problem into one or more single-label classification or regression problems. Algorithm adaptation methods: extend specific learning algorithms so that they handle multi-label data directly.
Tag Assignment Problem transformation methods: 1. Select one label for each multi-label instance 2. Discard multi-label instances 3. Treat each set of labels as a single label 4. Train multiple binary classifiers, one per label
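Two of these transformations can be shown on a toy dataset: treating a whole label set as a single label, and building multiple binary classification problems. The function names and the data layout (a list of `(instance, set_of_labels)` pairs) are illustrative choices.

```python
def label_powerset(instances):
    """Treat each distinct SET of labels as one single-label class."""
    return [(x, frozenset(labels)) for x, labels in instances]


def binary_relevance(instances, all_labels):
    """Build one binary dataset per label: does the instance carry it?"""
    return {
        label: [(x, label in labels) for x, labels in instances]
        for label in all_labels
    }


data = [("doc1", {"python", "web"}), ("doc2", {"web"})]

single_label = label_powerset(data)
per_label = binary_relevance(data, ["python", "web"])
print(per_label["python"])
```

Any ordinary single-label learner can then be trained on `single_label`, or one binary learner per key of `per_label`. The first keeps label correlations but can produce many rare classes, while the second ignores correlations but scales with the number of labels.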
Tag Assignment Algorithm adaptation methods: 1. C4.5 2. Adaboost.MH and Adaboost.MR 3. kNN lazy learning algorithm 4. SVM
Tag Assignment Clare and King (2001) adapted the C4.5 algorithm for multi-label data. They modified the formula of entropy calculation as follows: $\mathrm{entropy}(S) = -\sum_{i=1}^{N} \left( p(c_i) \log p(c_i) + q(c_i) \log q(c_i) \right)$, where $p(c_i)$ is the relative frequency of class $c_i$ and $q(c_i) = 1 - p(c_i)$. They also allowed multiple labels in the leaves of the tree.
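A short sketch of this multi-label entropy: for every class it adds the uncertainty of an instance having the class and of not having it. The base-2 logarithm and the function name are assumptions, and terms with probability 0 are skipped so that `0 * log 0` counts as 0.

```python
import math


def multilabel_entropy(label_sets, classes):
    """Entropy of a set of multi-label instances, summed over all classes."""
    n = len(label_sets)
    total = 0.0
    for c in classes:
        p = sum(c in labels for labels in label_sets) / n  # relative frequency
        q = 1 - p
        for value in (p, q):
            if value > 0:
                total -= value * math.log2(value)
    return total


# Class "a" is on every instance (no uncertainty); class "b" is on half.
print(multilabel_entropy([{"a"}, {"a", "b"}], ["a", "b"]))
```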
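Tag Assignment Adaboost.MH and Adaboost.MR are two extensions of AdaBoost (Freund & Schapire, 1997) for multi-label classification. Adaboost.MH: the sign of a weak classifier's output indicates whether an example is positively labelled with a given label. Adaboost.MR: the outputs of the weak classifiers are used for ranking each of the labels in L.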
Tag Assignment Several Optimizations. Semi-Supervised Algorithm (Mining multi-label data): addresses the case where the dictionary is large and the training data for each tag is small. Deep Classification (Deep classification in large-scale text hierarchies): proposes a search heuristic that reduces the number of candidate tags by exploiting the hierarchical relations between them.
Tag Assignment Drawbacks of tag assignment: 1. Needs a predefined dictionary 2. Omits the shared knowledge among repositories 3. Cannot find new tags
Tag Recommendation in SE Tag recommendation in software engineering sites is in high demand. EnTagRec (Entagrec: an enhanced tag recommendation system for software information sites): 1. Bayesian Inference 2. Frequentist Inference 3. Spreading activation technique. Freecode & StackOverflow. Tag recommendation (Tag recommendation in software information sites): 1. Multi-label classification
Thanks Q&A