Extracting Domain Entities from Scientific Papers Leveraging Author Keywords

Discover how to perform named entity recognition in specialized domains such as artificial intelligence by leveraging existing knowledge resources. Explore the methodology, experiments, and results of extracting domain entities from scientific papers using author keywords.

  • Scientific Papers
  • Author Keywords
  • Named Entity Recognition
  • Domain Entities
  • Artificial Intelligence




Presentation Transcript


  1. EEKE 2021 Extracting Domain Entities from Scientific Papers Leveraging Author Keywords Jiabin Peng, Jing Chen and Guo Chen

  2. CONTENTS: 1 Background & Related Studies; 2 Methodology; 3 Experiment & Results; 4 Conclusion

  3. Background The rise of data-hungry deep-learning systems has increased the performance of named entity recognition (NER). However, in many subdivided domains annotated corpora are scarce and expensive to build. Meanwhile, many domains have accumulated a large number of knowledge resources, such as knowledge bases, gazetteers, glossaries, and dictionaries. Key task: how to make full use of existing resources in domain NER methods to minimize annotation cost. Taking the domain of artificial intelligence (AI) as an example, we recognized problem and solution entities. Problem: research objectives, domains, applications, tasks. Solution: schemes, models, technologies, tools, software, algorithms, theories.

  4. Related Studies Transfer learning; distant supervision, semi-supervision, weak supervision. Shortcoming: these cannot avoid manual participation in the construction of datasets. New ideas: zero-shot learning; learning with noisy labels.

  5. Methodology Framework Domain NER was divided into two subtasks: entity boundary recognition & entity classification

  6. Methodology Entity Boundary Recognition The domain keyword set can cover the labeled entities to a great extent (about 90%). Steps: English word segmentation, then forward maximum string matching against the keyword lexicon. Lexicon = {convolutional neural network, image recognition task, image recognition, …} Sentence = We proposed an improved algorithm which made convolutional neural network more robust in general image recognition task because the algorithm could dynamically model the task. Result = We proposed an improved algorithm which made [convolutional neural network] more robust in general [image recognition task] because the algorithm could dynamically model the task.
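The forward maximum string matching step on this slide can be sketched as follows. This is a hypothetical helper, not the authors' code: at each position in the tokenized sentence it greedily tries the longest window that appears in the keyword lexicon.

```python
def forward_max_match(tokens, lexicon, max_len=6):
    """Segment tokens, merging runs that match a multi-word lexicon phrase."""
    phrases = {tuple(p.split()) for p in lexicon}
    segments, i = [], 0
    while i < len(tokens):
        match = None
        # try the longest window first (forward maximum matching)
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            cand = tuple(tokens[i:i + n])
            if cand in phrases:
                match = " ".join(cand)
                break
        if match:
            segments.append(match)
            i += len(match.split())
        else:
            segments.append(tokens[i])
            i += 1
    return segments

lexicon = {"convolutional neural network", "image recognition task", "image recognition"}
sentence = ("We proposed an improved algorithm which made convolutional neural "
            "network more robust in general image recognition task").split()
print(forward_max_match(sentence, lexicon))
```

Because longer windows are tried first, "image recognition task" wins over its prefix "image recognition", which is the behavior the slide's example relies on.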

  7. Methodology Entity Boundary Recognition There would be some errors. Word-stemming error matching: Lexicon = {convolutional neural network, image recognition task, image recognition, dynamical model, …} Result = We proposed an improved algorithm which made [convolutional neural network] more robust in general [image recognition task] because the algorithm could [dynamically model] the task. After stemming, "dynamically model" wrongly matches the lexicon entry "dynamical model"; this kind of noise is passed on to the classification model, where part-of-speech information helps identify it.

  8. Methodology Entity Boundary Recognition Another error type: Lexicon = {convolutional neural network, image recognition task, image recognition, general image, …} Result = We proposed an improved algorithm which made [convolutional neural network] more robust in [general image] recognition task because the algorithm could dynamically model the task. Here the greedy match of "general image" breaks the true entity "image recognition task"; avoiding this would require a better matching algorithm.

  9. Methodology Entity Classification A multi-class classification task. Construct training data: entities as positive samples; word-level non-entities and phrase-level non-entities as negative samples, each with labels. Construct text features: word vector (basic input), part of speech, word case. Models: Random Forest, K-Nearest Neighbor, Support Vector Machine, Multilayer Perceptron, TextCNN.
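The training-data construction described on this slide can be sketched as below. The label names and helper function are our own illustration, not the paper's code: entities become positive samples under their class, and both kinds of non-entities share a negative label.

```python
def build_training_data(problem_entities, solution_entities,
                        word_nonentities, phrase_nonentities):
    """Assemble (text, label) pairs for the multi-class classifier."""
    data = []
    data += [(e, "problem") for e in problem_entities]
    data += [(e, "solution") for e in solution_entities]
    # word-level and phrase-level non-entities are both negative samples
    data += [(w, "non-entity") for w in word_nonentities]
    data += [(p, "non-entity") for p in phrase_nonentities]
    return data

samples = build_training_data(
    problem_entities=["image recognition task"],
    solution_entities=["convolutional neural network", "support vector machine"],
    word_nonentities=["robust"],
    phrase_nonentities=["improved algorithm"],
)
print(len(samples))  # 5
```

Each sample would then be turned into the feature vector (word vector, POS, case) described on the following slides before being fed to any of the listed models.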

  10. Methodology Feature Processing Word Vector Model: Word2Vec. Corpus processing: word stemming, then concatenating the words in a phrase with "_". Original sentence: Support vector machine and random forest were used in this paper. Processed sentence: support_vector_machin and random_forest were use in thi paper . Model parameter selection: algorithm Skip-gram or CBOW; window size 5 or 10.
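A minimal sketch (not the authors' code) of this corpus-processing step: lexicon phrases are merged into single underscore-joined tokens so each phrase receives one embedding. The paper also stems every word first (hence "machin", "thi"); stemming, e.g. via nltk's PorterStemmer, is assumed to have already been applied to both the tokens and the lexicon here.

```python
def join_phrases(tokens, lexicon, max_len=6):
    """Greedily merge the longest lexicon phrase at each position with '_'."""
    phrases = {tuple(p.split()) for p in lexicon}
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in phrases:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:  # no phrase matched at position i, keep the single word
            out.append(tokens[i])
            i += 1
    return out

# already-stemmed sentence from the slide
stemmed = "support vector machin and random forest were use in thi paper".split()
lexicon = {"support vector machin", "random forest"}
print(join_phrases(stemmed, lexicon))
# ['support_vector_machin', 'and', 'random_forest', 'were', 'use', 'in', 'thi', 'paper']

# The processed sentences would then train embeddings, e.g. with gensim's
# Word2Vec using sg=1 (Skip-gram) or sg=0 (CBOW) and window size 5 or 10.
```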

  11. Methodology Feature Processing Part of Speech (POS) Toolkit: nltk (36 POS tags in total). Corpus counts per word:

                VB    VBG   NN    NNS   (other 32 POS)
    support     100   80    90    30    0, 0, …, 0
    vector      0     0     140   60    0, 0, …, 0
    machin      30    10    100   60    0, 0, …, 0

  After row normalization:

                VB     VBG    NN    NNS   (other 32 POS)
    support     0.333  0.267  0.3   0.1   0, 0, …, 0
    vector      0      0      0.7   0.3   0, 0, …, 0
    machin      0.15   0.05   0.5   0.3   0, 0, …, 0

  12. Methodology Feature Processing Part of Speech (POS)
    Vec[support] = [0.333, 0.267, 0.3, 0.1, 0, 0, …, 0]  (len = 36)
    Vec[vector]  = [0, 0, 0.7, 0.3, 0, 0, …, 0]  (len = 36)
    Vec[machin]  = [0.15, 0.05, 0.5, 0.3, 0, 0, …, 0]  (len = 36)
  Phrase vector length = 36 * Max(len(phrase in lexicon))
    Vec[support vector machin] = [0.333, 0.267, 0.3, 0.1, 0, 0, …, 0, 0, 0, 0.7, 0.3, 0, 0, …, 0, 0.15, 0.05, 0.5, 0.3, 0, 0, …, 0, 0, 0, …, 0]
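The POS feature construction above can be sketched as follows. This is an illustrative reimplementation, not the authors' code: each word gets a 36-dimensional distribution over POS tags (row-normalized corpus counts), and a phrase's vector concatenates its words' vectors, zero-padded to 36 times the longest phrase length in the lexicon. Only 4 of the 36 tags carry counts in this toy example.

```python
N_POS = 36
TAGS = ["VB", "VBG", "NN", "NNS"]  # first 4 of the 36 nltk tags; rest stay 0 here

def pos_vector(counts):
    """counts: dict mapping POS tag -> corpus count for one word."""
    total = sum(counts.values())
    vec = [0.0] * N_POS
    for j, tag in enumerate(TAGS):
        vec[j] = counts.get(tag, 0) / total if total else 0.0
    return vec

def phrase_pos_vector(word_counts, max_phrase_len):
    """Concatenate per-word POS vectors and zero-pad to a fixed length."""
    vec = []
    for counts in word_counts:
        vec += pos_vector(counts)
    vec += [0.0] * (N_POS * max_phrase_len - len(vec))  # zero-padding
    return vec

support = {"VB": 100, "VBG": 80, "NN": 90, "NNS": 30}
vector_ = {"NN": 140, "NNS": 60}
machin  = {"VB": 30, "VBG": 10, "NN": 100, "NNS": 60}

v = phrase_pos_vector([support, vector_, machin], max_phrase_len=4)
print(len(v))                           # 144 = 36 * 4
print([round(x, 3) for x in v[:4]])     # [0.333, 0.267, 0.3, 0.1]
```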

  13. Methodology Feature Processing Case Three types of phrase case were defined: initial uppercase, all uppercase, all lowercase. Corpus counts:

                              initial upper  all upper  all lower
    support_vector_machin     40             0          60
    svm                       10             80         10

  After normalization:

                              initial upper  all upper  all lower
    support_vector_machin     0.4            0          0.6
    svm                       0.1            0.8        0.1

    Vec[support_vector_machin] = [0.4, 0, 0.6]
    Vec[svm] = [0.1, 0.8, 0.1]
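The case feature can be sketched like this (an illustrative helper, not the paper's code): count how often a phrase's surface form appears with initial uppercase, all uppercase, or all lowercase, then normalize the three counts. The occurrence lists below are made up to reproduce the slide's "svm" row.

```python
def case_vector(surface_forms):
    """Normalized [initial-upper, all-upper, all-lower] counts for one term."""
    counts = [0, 0, 0]
    for s in surface_forms:
        letters = [c for c in s if c.isalpha()]
        if all(c.isupper() for c in letters):
            counts[1] += 1                      # all uppercase, e.g. "SVM"
        elif s[:1].isupper():
            counts[0] += 1                      # initial uppercase, e.g. "Svm"
        elif all(c.islower() for c in letters):
            counts[2] += 1                      # all lowercase, e.g. "svm"
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]

print(case_vector(["SVM"] * 8 + ["Svm"] * 1 + ["svm"] * 1))  # [0.1, 0.8, 0.1]
```

The resulting 3-dimensional vector is appended to the word-vector and POS features before classification.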

  14. Experiment & Result Experimental Data Abstracts and bibliographic data; author keywords serve as the word segmentation lexicon; a glossary provides the problem entities and solution entities.

  15. Experiment & Result Experimental Data Training data (1080 pieces): entities = 360 problem entities + 360 solution entities; non-entities = about 240 phrase-level non-entities + about 120 word-level non-entities. Test data (3000 pieces): 2000 pieces for BERT-BiLSTM-CRF, 1000 pieces as the common test set.

  16. Experiment & Result Result Analysis

  Table 1. Macro F1-measure of models using different word vectors on the test set (sg=1: Skip-gram; sg=0: CBOW; w: window size)

                RF     KNN    SVM    MLP    TextCNN
    sg=1, w=5   0.672  0.55   0.736  0.672  0.695
    sg=1, w=10  0.681  0.672  0.701  0.685  0.685
    sg=0, w=5   0.666  0.151  0.709  0.428  0.67
    sg=0, w=10  0.655  0.146  0.679  0.478  0.67

  17. Experiment & Result Result Analysis

  Table 2. Macro P, R, F1-measure of models using different features on the test set (f1: word vector; f2: POS feature; f3: case feature)

              f1                  f1+f2               f1+f3               f1+f2+f3
              P     R     F1      P     R     F1      P     R     F1      P     R     F1
    RF        0.588 0.812 0.681   0.629 0.824 0.713   0.618 0.812 0.702   0.661 0.833 0.736
    KNN       0.593 0.78  0.672   0.603 0.773 0.676   0.604 0.784 0.681   0.614 0.783 0.687
    SVM       0.677 0.81  0.736   0.701 0.809 0.749   0.69  0.812 0.744   0.706 0.812 0.753
    MLP       0.593 0.813 0.685   0.604 0.815 0.694   0.605 0.814 0.694   0.621 0.826 0.709
    TextCNN   0.63  0.785 0.695   0.631 0.815 0.701   0.64  0.765 0.697   0.65  0.81  0.715
    Voting    -     -     -       -     -     -       -     -     -       0.689 0.831 0.752

  BERT-BiLSTM-CRF (baseline): P = 0.756, R = 0.789, F1 = 0.772

  18. Conclusion A two-stage knowledge entity extraction methodology was proposed, which removes the dependence on manually annotated data. The methodology generalizes well across domains because it needs no manual annotation. Future work: a better algorithm for English word segmentation; trying dynamic word embeddings; adding more features.

  19. EEKE 2021 Thank you for listening! If you have any questions, please contact us. Guo Chen: delphi1987@qq.com Jiabin Peng: 2542505085@qq.com 2021.9.30
