
Concept-Based Keyword Extraction for Arabic Documents
Enhance keyword extraction from Arabic texts with a concept-based approach. Explore the importance of automated keyword extraction for various applications like NLP and information retrieval. Review related works and the proposed system using Bag-of-Words (BoW) methodology.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
BAG-OF-CONCEPT BASED KEYWORD EXTRACTION FROM ARABIC DOCUMENTS Dima Suleiman Dr. Arafat Awajan 1
Agenda Introduction Related Works Proposed System Performance Evaluation Conclusion 2
Introduction Nowadays there are a large number of documents available in electronic form. Many available online documents don t have keywords which limit the access to their contents. The rapid growth in the Internet and its unstructured content increase the importance of keyword extraction. 3
Introduction Keyword extraction are very important for many applications: Natural Language Processing Information retrieval tasks Summarization and document clustering Manual extraction of keywords is a time consuming process. Automatic keyword extraction is one of the most important tasks in text mining that replace the manual extraction of keywords 4
Related Works Keyword extraction from Arabic document is a new area and has few researches. (El-Beltagy et al. 2008) presented approach called KP-Miner for extracting keywords from both Arabic and English documents. KP-Miner used TFxIDF weights. KP-Miner also used two conditions The first one is that the phrase from which the keyword will be extracted must occur in the text at least n times. 5 The second one is the position of the keyword in the document
Related Works (El-Shishtawy et al. 2015) used machine learning and linguistic methods to extract keywords. They used linguistic information in three phases: Tokenization Word abstract form Part of speech tagging False results may occur since the system deals with all types of the word such as verb, names, functions and adjectives equally. 6
Related Works (AWAJAN 2015) Combined statistical analysis of the text with linguistics. Preprocessing phase is very important such as extracting the roots of derivative words. The next phase is to create classes of clustered words that contain most frequent words Experiments showed that the average precision was 31% and average recall was 53% which considered good. 7
Proposed System In the Bag-of-Words (BoW) a vector is used to represents the text. The elements of the vector are the weight of frequency-based of the text words. 8
Proposed System Problems: Ignores the meaning of words The dimension of (BoW) in general huge Solution using Bag-of-Concept (BoC) Bag-of-Concepts (BoC): concerned with grouping words with similar meaning together 9
Proposed System The new algorithm will extract keywords from individual documents without any pre knowledge about its domain Start Convert text to words(extracting the root/stem) Part of Speech (POS) Tokenization Calculate the weight of each stem Filtering or cleaning Bag-of-Concept (BoC) Create classes for synonym words Clustering N- grams frequency Unigram, Bigram, Trigrams 10 Extraction of keywords
Proposed System The main idea of keyword extraction is to use the frequency of the word to extract keywords, The words that occur frequently are most often represent the meaning of the text. However words such as stop words and punctuation must be removed. 11
Proposed System . . . . . 12 .
Proposed System Table I. Classes and their counts Class Label Count Symbol A B C D E F Root/Stem 29 24 22 21 19 14 G 13 H 12 I 11 13
Proposed System Table I. Classes and their counts Class Label Count Symbol A B C D E F Root/Stem 29 24 22 21 19 14 G 13 H 12 I 11 14
Proposed System Table II. Classes of synonym words and their counts Class Label Count Symbol Root/Stem A 29 B 24 C 22 D 40 E 14 F 13 G 12 H 11 I 10 15
Proposed System Table II. Classes of synonym words and their counts Class Label Count Symbol Root/Stem A 29 B 24 C 22 D 40 E 14 F 13 G 12 H 11 I 10 16
Proposed System Table III. Bigrams and their scores Weight of the first term 40 40 40 40 40 40 29 40 29 40 24 40 Weight of the second term 29 22 24 14 13 11 22 10 24 9 22 8 Weight of bigram/count Score 21 12 8 6 6 7 7 7 3 6 8 5 Bigram 90 74 72 60 59 58 58 57 56 55 54 53 18
Proposed System Table IV. Trigrams and their scores Weight of the third term 22 24 22 13 11 14 10 8 24 9 24 22 Weight of the second term 29 29 24 29 29 29 29 29 14 29 13 13 Weight of the first term 40 40 40 40 40 40 40 40 40 40 40 40 Trigram Weight of trigram/count Score 6 3 6 4 5 2 4 5 3 2 2 4 97 96 92 86 85 85 83 82 81 80 79 79 19
Proposed System Table V. Candidate keywords and their score Keyword Score 97 96 92 90 86 85 85 20
Performance Evaluation The experiments divided into three groups; in each group we considered different number of keywords. The number of keywords in the first group will be 5, in the second group 10 and finally 15 keywords in the last group The results of applying the new approach on three documents will be displayed in Table VIII display the results. 21
Performance Evaluation The title of the three documents: Table VIII display the results.. Document number Number of keywords per document Number of extracted keywords 10 F P R 5 15 R P R F P F 15 9 0.48 0.53 0.9 0.53 - 1 2 0.8 0.27 1 0.4 0.67 0.9 0.6 0.4 0.9 0.5 - 0.5 - 22 0.4 15 0.5 0.33 0.4 0.4 0.4 3 0.8 0.27 0.4 5
Conclusion This research uses linguistics and statistical analysis of data to extract keywords automatically. The new approach group words that have the same stem or root into the same class in addition to grouping the words that have the same synonyms together into same class, this process will enforce the score of candidate keyword and make the extraction process more accurate. Experiments showed that the extraction process become more accurate due to the recall, precision and F-measure values. 23
Thank you 24