Analysis and Visualization of McGill Theses Data Using Word2Vec by Rajat Bhateja

Explore the semantic context analysis and visualization of McGill theses data utilizing Word2Vec, as presented by Rajat Bhateja. The process involves parsing and cleaning corpora, handling XML and HTML sources, solving parsing issues, Word2Vec application, and utilizing tools like Beautiful Soup, PostgreSQL, and Gensim in Python. The journey also covers combining metadata and raw theses text, creating W2V models, and implementing frontend technologies like Express.js and D3 for web development and visualization.

Uploaded by jveron on Apr 03, 2025

Presentation Transcript

Semantic, Context Analysis and Visualization of McGill theses data using word2vec By Rajat Bhateja

1 Corpora Parsing and Cleaning

XML and HTML sources
XML files: ~49,000 XML files containing metadata such as year of publication, discipline, author and title of publication.
HTML files: ~39,000 files, mostly scanned and then OCRed, containing the raw text of the theses.
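
A minimal sketch of reading the metadata fields out of one XML record, using Python's standard `xml.etree.ElementTree`. The tag names (`pid`, `title`, `author`, `discipline`, `year`) and the sample record are assumptions for illustration; the slides do not show the actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: these tag names are assumptions, not the real McGill schema.
SAMPLE_XML = """<record>
  <pid>12345</pid>
  <title>A Study of Glacial Lakes</title>
  <author>Jane Doe</author>
  <discipline>Geography</discipline>
  <year>1998</year>
</record>"""

FIELDS = ("pid", "title", "author", "discipline", "year")


def parse_metadata(xml_text):
    """Return a dict holding the metadata fields found in one XML record."""
    root = ET.fromstring(xml_text)
    record = {}
    for field in FIELDS:
        node = root.find(field)
        # Missing or empty elements become None instead of raising.
        record[field] = node.text.strip() if node is not None and node.text else None
    return record


metadata = parse_metadata(SAMPLE_XML)
print(metadata)
```

Returning `None` for absent fields keeps one malformed record from stopping a run over tens of thousands of files.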

2 A Problem in Parsing and how a raw mapping helped solve it

XML within an XML
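
The slide does not say what form the nesting took. One common case, assumed here purely for illustration, is an inner XML document stored as escaped text inside an element of the outer document. The sketch below parses the outer document, then parses the recovered inner text a second time. The element names (`object`, `datastream`, `meta`) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical outer document whose <datastream> holds an escaped inner XML document.
OUTER_XML = (
    "<object>"
    "<pid>12345</pid>"
    "<datastream>&lt;meta&gt;&lt;year&gt;1998&lt;/year&gt;&lt;/meta&gt;</datastream>"
    "</object>"
)


def parse_nested(outer_text):
    """Parse the outer XML, then parse the inner XML carried as its text content."""
    outer = ET.fromstring(outer_text)
    pid = outer.findtext("pid")
    # The parser has already turned &lt; and &gt; back into < and >.
    inner_text = outer.findtext("datastream")
    inner = ET.fromstring(inner_text)
    return pid, inner.findtext("year")


pid, year = parse_nested(OUTER_XML)
print(pid, year)
```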

HTML parsing and cleaning
Parsing: Used basic functionality of the Beautiful Soup library to parse the content of each file.
Cleaning: Removed all punctuation, digits and http links; stripped spaces and converted the text to lowercase.
Saving: Saved the resulting text from each file alongside its corresponding PID in a Postgres database using psycopg2.
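
A rough sketch of the parsing and cleaning steps. To stay dependency-free it uses the standard-library `html.parser` as a stand-in for Beautiful Soup (whose `BeautifulSoup(html, "html.parser").get_text()` does the equivalent extraction). The exact regular expressions are assumptions; the slides only list what was removed.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def clean_text(text):
    """Drop links, punctuation and digits; collapse whitespace; lowercase."""
    text = re.sub(r"https?://\S+", " ", text)    # links first, while they are still intact
    text = re.sub(r"[^\w\s]|[\d_]", " ", text)   # punctuation, digits, underscores
    text = re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace
    return text.lower()


sample = (
    "<html><body><h1>Chapter 1: Results</h1>"
    "<p>See http://example.com/a?b=1 for 42 details.</p></body></html>"
)
cleaned = clean_text(html_to_text(sample))
print(cleaned)
```

The pattern `[^\w\s]` keeps accented letters, which matters if some theses are in French; a narrower `[^a-z]` filter would silently strip them.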

3 Context and Semantics Using word2vec

Word2vec
Represents words as vectors called word embeddings. A vector space model places them in a continuous vector space. We use the skip-gram version of the Gensim implementation in Python.
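
To make "skip-gram" concrete, here is a small sketch of how that variant frames its training data: each word is paired with the words around it, and the model learns to predict the context word from the center word. This only illustrates the pair generation; it is not Gensim's code, which also applies subsampling and a randomly shrunk window. The sentence and window size are made up.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for each token and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the thesis studies glacial lakes".split()
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

In Gensim, the skip-gram variant is selected by passing `sg=1` to `gensim.models.Word2Vec`; the default `sg=0` trains CBOW instead.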

Until now (diagram labels: XML, HTML, PID, Final Corpora, W2V Model)
1. Metadata and raw theses text
2. Combined by raw mapping
3. To get the final cleaned corpora
4. Used as input to make W2V models
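
The combining step can be sketched as a join on PID. The slides do not describe the mapping itself, so the sketch assumes a simple lookup from HTML file name to PID; every name and value below is hypothetical.

```python
# All file names, PIDs and records below are hypothetical illustrations.
metadata_by_pid = {
    "101": {"year": 1995, "discipline": "Physics"},
    "102": {"year": 2004, "discipline": "History"},
}
text_by_file = {
    "thesis_a.html": "quantum wells in thin films",
    "thesis_b.html": "trade routes of new france",
    "thesis_c.html": "orphan text with no metadata",
}
# The assumed "raw mapping": which HTML file belongs to which PID.
file_to_pid = {"thesis_a.html": "101", "thesis_b.html": "102"}


def build_corpus(metadata_by_pid, text_by_file, file_to_pid):
    """Attach metadata to each cleaned text via its PID; skip unmatched texts."""
    corpus = []
    for filename, text in text_by_file.items():
        pid = file_to_pid.get(filename)
        meta = metadata_by_pid.get(pid)
        if meta is None:
            continue  # no mapping or no metadata for this file
        corpus.append({"pid": pid, "tokens": text.split(), **meta})
    return corpus


corpus = build_corpus(metadata_by_pid, text_by_file, file_to_pid)
print(len(corpus), "documents matched")
```

Each entry's `tokens` list is already in the form a Word2Vec trainer expects: one list of words per document or sentence.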

4 Time to conFRONT END Using Express and D3

Express JS
Node.js-based web framework
Asynchronous
Follows REST API conventions
Handles routing easily
Adapts to middleware

D3 for visualization

D3 (Data-Driven Documents)
General-purpose visualization library. Data can be transformed into information, elements and documents. Primarily uses JSON and CSV data, but can work with other file types as well. Implemented a bubble chart using the D3 force layout.

D3 Force Layout
A strategy for displaying data elements visually that positions linked nodes using a physical simulation. It draws a collection of data elements (nodes) and their relationships (links), using algorithms that mimic how objects interact in the physical world.
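
The idea of a physical simulation can be illustrated with a toy version in Python: every pair of nodes repels, every link acts like a spring, and positions are nudged along the net force for a fixed number of steps. This is a simplified stand-in, not D3's implementation (D3 runs in JavaScript and uses its own integrator and force set); the constants and node names are arbitrary assumptions.

```python
import math


def force_layout(positions, links, steps=200, repulsion=0.05,
                 spring=0.1, rest_length=1.0, step_size=0.1):
    """Relax positions: all pairs repel, linked pairs are pulled toward rest_length."""
    pos = {name: list(xy) for name, xy in positions.items()}
    names = list(pos)
    for _ in range(steps):
        force = {name: [0.0, 0.0] for name in names}
        # Inverse-square repulsion between every pair, equal and opposite.
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                dx = pos[b][0] - pos[a][0]
                dy = pos[b][1] - pos[a][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = repulsion / dist ** 2
                force[a][0] -= push * dx / dist
                force[a][1] -= push * dy / dist
                force[b][0] += push * dx / dist
                force[b][1] += push * dy / dist
        # Spring along each link: pulls together when stretched, apart when compressed.
        for a, b in links:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            dist = math.hypot(dx, dy) or 1e-9
            pull = spring * (dist - rest_length)
            force[a][0] += pull * dx / dist
            force[a][1] += pull * dy / dist
            force[b][0] -= pull * dx / dist
            force[b][1] -= pull * dy / dist
        # Move every node a small step along its net force.
        for name in names:
            pos[name][0] += step_size * force[name][0]
            pos[name][1] += step_size * force[name][1]
    return {name: tuple(xy) for name, xy in pos.items()}


start = {"a": (0.0, 0.0), "b": (5.0, 0.0), "c": (0.0, 0.5)}
layout = force_layout(start, links=[("a", "b")])
print(layout)
```

In this run the linked nodes `a` and `b` are drawn together from their starting distance of 5, while the unlinked node `c` drifts away from `a`.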

Further overview (diagram labels: HTML, CSS, D3, Express, W2V Model)
1. User enters the word to find in the corpora.
2. A Python script fetches the most similar words from the W2V model.
3. Express handles the requests following REST fundamentals.
4. D3 visualizes the returned data for the user.
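
Step 2 can be sketched without a trained model: rank words by cosine similarity to the query and emit JSON that a front end could draw as bubbles. The toy vectors and the JSON shape (`query`, `nodes`, `word`, `score`) are assumptions for illustration; with a real Gensim model the ranking would come from `model.wv.most_similar(word, topn=...)`.

```python
import json
import math

# Toy 3-dimensional vectors; real embeddings would come from the trained model.
VECTORS = {
    "glacier": [0.9, 0.1, 0.0],
    "ice": [0.8, 0.2, 0.1],
    "lake": [0.6, 0.5, 0.1],
    "policy": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, vectors, topn=2):
    """Return the topn (word, score) pairs closest to `word`, best first."""
    if word not in vectors:
        return []
    target = vectors[word]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


def to_bubble_json(word, vectors, topn=2):
    """Serialize the neighbours in a (hypothetical) shape a bubble chart could consume."""
    nodes = [{"word": w, "score": round(s, 3)}
             for w, s in most_similar(word, vectors, topn)]
    return json.dumps({"query": word, "nodes": nodes})


print(to_bubble_json("glacier", VECTORS))
```

Returning an empty list for unknown words lets the web layer answer with an empty chart rather than an error.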

Corpora split
The corpora could be split by year and by degree discipline. Tried creating a w2v model for every year. Decided to split at Y2K for the first implementation.
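
A small sketch of the Y2K split, assuming it means "published before 2000" versus "published in 2000 or later" and that each document carries a `year` field. Both are assumptions, as are the sample records.

```python
def split_by_y2k(documents, cutoff=2000):
    """Partition documents into those before the cutoff year and those from it onward."""
    before, after = [], []
    for doc in documents:
        (before if doc["year"] < cutoff else after).append(doc)
    return before, after


# Hypothetical documents for illustration.
documents = [
    {"pid": "101", "year": 1995},
    {"pid": "102", "year": 2004},
    {"pid": "103", "year": 2000},
    {"pid": "104", "year": 1987},
]
pre_2000, post_2000 = split_by_y2k(documents)
print(len(pre_2000), "before 2000;", len(post_2000), "from 2000 onward")
```

Each half would then be used to train its own model, so the same query word can be compared across the two periods.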

Live Demo*
*if everything works fine

5 Future Implementation: more corpora and more time splits

Future Implementation
Split the bubbles by decade. Add another university's corpora for more interesting comparisons. Any other suggestions welcome. ETA: end of the year.

Thanks! Any questions?
