Recent Work Review


StackOverflow and Github are crucial platforms in the IT field, yet Github still makes it hard to search repositories by category and to understand repositories quickly. To address these issues, this work presents a graph-based tag assignment method that links Github repositories with relevant StackOverflow tags. The approach builds a graph representation with similarity-based relations, then constructs a transition graph to infer tag assignments for repositories. By leveraging data from both StackOverflow and Github, the study aims to improve repository categorization and user collaboration within the developer community.

  • Graph-based
  • Tag assignment
  • Github
  • StackOverflow
  • Repository

Uploaded on Feb 19, 2025



Presentation Transcript


  1. Recent Work Review A Graph-Based Tag Assignment Approach for Github Repository Cai Xuyang 2015.10.08

  2. Outline Introduction Approach Experiment Future work

  3. Introduction StackOverflow has become one of the most popular Q&A websites in the IT field worldwide. Github is likewise the most popular open-source project community for developers.

  4. Introduction Question text Annotated tags

  5. Introduction Project Description Readme Text

  6. Introduction What is lacking in Github? 1. It is hard to search for repositories in the same category. 2. It is not easy to understand quickly what a repository does. 3. It is hard to find users with the same skills or interests. Tagging can help. Why do we use StackOverflow? 1. It is full of manually labeled data. 2. It is rich in domain-specific text information. 3. It has user collaboration activities similar to Github's.

  7. Data Selection & Preprocessing Identify more than 35,000 linked-user pairs. Collect all the useful questions in StackOverflow and repositories in Github. Select representative data in both communities. Remove code, stop words, and symbols.
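The preprocessing step above (removing code, stop words, and symbols) might be sketched as follows; the `<code>` pattern assumes StackOverflow-style HTML markup, and the stop-word list here is only a small illustrative sample, not the one used in the work:

```python
import re

# Small illustrative sample; the actual stop-word list is not given in the slides.
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "how"}

def preprocess(text):
    """Strip code blocks and symbols, lowercase, tokenize, drop stop words."""
    # Remove StackOverflow-style <code>...</code> spans (assumption).
    text = re.sub(r"<code>.*?</code>", " ", text, flags=re.DOTALL)
    # Keep word-like tokens (letters first, then letters/digits/+#.-).
    tokens = re.findall(r"[a-zA-Z][a-zA-Z0-9+#.-]*", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```

For example, `preprocess("How to use <code>int x=0;</code> the Java API")` drops the code span and stop words, leaving the content tokens.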

  8. Problem Definition Given an unlabeled data set in Github, the task of tag assignment is to discover a list of tags in StackOverflow for each repository in Github.

  9. Approach Graph Representation Dependent Relation Similarity-Based Relation

  10. Approach Transition Graph Building The transition graph explains the propagation relationship between entities.

  11. Approach Transition Graph Building We consider two factors to compute wij based on similarity or relatedness. The details of these factors are as follows: I. Lexical Factor: this factor measures the similarity between two vertices in terms of their text information. 1) Use TF-IDF to compute word weights. 2) Construct a VSM (vector space model) for each vertex's document.
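The lexical factor can be illustrated with a minimal TF-IDF / vector-space sketch. The slide does not show the exact weighting or similarity function, so cosine similarity over TF-IDF vectors is assumed here:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per tokenized document."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({w: (c / len(doc)) * math.log(n / df[w])
                     for w, c in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

The lexical factor between two vertices would then be `cosine(vecs[i], vecs[j])` over their documents.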

  12. Approach Transition Graph Building II. User Collaboration Factor: people who use Github to develop in specific technical fields tend to search, ask, or answer the corresponding content in StackOverflow, and vice versa. We therefore use Jaccard similarity to compute the user collaboration factor.
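The Jaccard similarity for the user collaboration factor can be sketched as below, computed over the user sets assumed to be attached to each vertex (question askers/answerers, repository contributors):

```python
def jaccard(users_a, users_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two user sets."""
    a, b = set(users_a), set(users_b)
    if not a and not b:
        return 0.0  # avoid division by zero for two empty sets
    return len(a & b) / len(a | b)
```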

  13. Approach Transition Graph Building Compute wij as a combination of the two factors:
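The combination formula itself is not shown on the slide, so the sketch below assumes a simple linear mix of the two factors with a hypothetical mixing parameter `lam`:

```python
def edge_weight(lex_sim, collab_sim, lam=0.5):
    """Combine the lexical and user-collaboration similarities into w_ij.
    A linear mix is assumed; `lam` is a hypothetical mixing parameter,
    not a value taken from the slides."""
    return lam * lex_sim + (1.0 - lam) * collab_sim
```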

  14. Approach Transition Graph Building Make MT sparser: for each row, keep only the top-weighted entries of each of the two vertex types. Then normalize MT.
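The sparsification and normalization step might look like the following simplified sketch. It keeps the top-k entries per row and then row-normalizes; the slide's per-vertex-type selection is not reproduced here:

```python
def sparsify_and_normalize(M, k=2):
    """Keep the top-k entries in each row (zero out the rest), then
    row-normalize so each row sums to 1 (all-zero rows stay zero)."""
    out = []
    for row in M:
        keep = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        sparse = [row[j] if j in keep else 0.0 for j in range(len(row))]
        s = sum(sparse)
        out.append([x / s for x in sparse] if s else sparse)
    return out
```

Row-normalizing makes the matrix row-stochastic, which the random-walk step later relies on.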

  15. Approach Target Graph Building The target graph explains the relationship between tags and entities.

  16. Approach Target Graph Building In the target graph GO, we use two methods to compute the dependent relationship dij, depending on the entity type: Question Initialization and Repository Initialization.

  17. Approach Question Initialization Since questions in StackOverflow have already been tagged by users, we use the user-labeled tags to initialize the lower half of the matrix MO, from the |R + 1|th to the |VT|th row. Repository Initialization Identify and extract key words. Set an initial weight for each key word contained in T.

  18. Approach Subgraph Extraction 1) Filter the question set & user set, where: user count per question > Su, document length > L. Result: Q -> Q', U -> U'. 2) Filter the repository set, where: the repository contains a user in U'. Result: R -> R'. Repeat until: repository set size > Sr.
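A single pass of the filtering above might be sketched as follows; the thresholds `su` and `l` stand in for the slide's Su and L, and the dictionary fields (`users`, `text`) are hypothetical names for illustration:

```python
def extract_subgraph(questions, repos, su=2, l=10):
    """One filtering pass: keep questions with more than `su` users and
    text longer than `l`, then keep repositories that share a user with
    the kept questions. The slides repeat this until the repository set
    exceeds a size threshold Sr (not shown here)."""
    q = [x for x in questions
         if len(x["users"]) > su and len(x["text"]) > l]
    kept_users = set().union(*(x["users"] for x in q)) if q else set()
    r = [x for x in repos if x["users"] & kept_users]
    return q, r
```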

  19. Approach (example) Step 1: user count per question > 2. Step 2: repeat until repository set size > 2. Step 3: repeat until repository set size > 2.

  20. Iterative Algorithm We implement our algorithm as a random walk with restart. Following the standard convention, we set α = 0.85. The algorithm converges when:
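A random walk with restart of the kind described can be sketched as below. The restart vector `v0` stands for the initialized tag weights, and since the slide's convergence criterion is not shown, an L1 change below a tolerance is assumed:

```python
def random_walk_with_restart(M, v0, alpha=0.85, tol=1e-8, max_iter=1000):
    """Iterate v <- alpha * M^T v + (1 - alpha) * v0 until the L1 change
    between iterations falls below tol. M is a row-stochastic transition
    matrix (list of lists); v0 is the restart vector."""
    n = len(v0)
    v = v0[:]
    for _ in range(max_iter):
        nxt = [(1 - alpha) * v0[j] +
               alpha * sum(M[i][j] * v[i] for i in range(n))
               for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, v)) < tol:
            return nxt
        v = nxt
    return v
```

Because M is row-stochastic and the restart redistributes the remaining mass, the scores stay a probability distribution across iterations.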

  21. Experiment Our approach, the Graph-based Tagging Approach (GTA), vs. Supervised methods: 1. K-Nearest Neighbours (KNN) 2. Labeled Latent Dirichlet Allocation (LLDA) Unsupervised methods: 1. Term Frequency-Inverse Document Frequency (TF-IDF) 2. Latent Dirichlet Allocation (LDA)

  22. Experiment Evaluation Steps Randomly select 500 repositories from our results. Annotate 1-5 tags for each repository manually. Predict 5 tags per repository with each method. Calculate precision, recall, and F1-score.
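The per-repository scoring in the last step can be sketched as follows, treating the predicted and annotated tags as sets:

```python
def precision_recall_f1(predicted, annotated):
    """Precision, recall, and F1 between predicted and annotated tag sets."""
    pred, gold = set(predicted), set(annotated)
    hit = len(pred & gold)
    p = hit / len(pred) if pred else 0.0
    r = hit / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```

For example, predicting 5 tags of which 2 match 2 annotated tags gives precision 0.4 and recall 1.0.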

  23. Experiment Performance with different parameter values [bar chart: parameter values 0.1 through 1.0; scores range from 2.41% to 33.41%]

  24. Experiment Precision [bar chart: precision for 1-5 predicted tags, comparing TF-IDF, KNN, LDA, LLDA, and GTA; values range from 2.22% to 48.89%]

  25. Experiment Recall [bar chart: recall for 1-5 predicted tags, comparing TF-IDF, KNN, LDA, LLDA, and GTA; values range from 0.75% to 46.37%]

  26. Experiment F1-Score [bar chart: F1-score for 1-5 predicted tags, comparing TF-IDF, KNN, LDA, LLDA, and GTA; values range from 1.12% to 34.02%]

  27. Future Work Import transfer-learning concepts. Combine with a knowledge base. Design more comparison experiments.

  28. Some Advice for You Never be afraid of failure! Try one more time! Succeed on the 191st trial!

  29. Thanks Q&A
