
Introduction to Text Analysis with Anaconda and Virtual Environments
Explore the world of text analysis using tools like Anaconda and virtual environments. Learn about the importance of text analysis, creating virtual environments, the text analysis pipeline, and real-world examples like the American Community Survey. Discover the case for text analysis, feature extraction, named entity recognition, text generation, and more. Dive into a variety of applications for analyzing text data efficiently.
Intro to Text Analysis Quant Methods Working Group Megan Wisniewski November 8, 2024
Overview Anaconda; The case for text analysis; The text analysis pipeline; Group activity
Anaconda An open-source distribution for Python. Automatically contains 250 packages as well as the conda package manager. Anaconda Prompt: a command-line interface for managing conda environments and packages.
Virtual Environment A virtual environment is an insulated space where one can work on specific Python projects. Packages installed in a virtual environment are isolated from other packages. Protects base Python. Think of it as a folder!
Creating a Virtual Environment Open Anaconda Prompt. Type: conda create -n UNIQUE_NAME python=3.11 anaconda spacy
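Once the environment has been created and activated (for example with conda activate UNIQUE_NAME), it can help to confirm from inside Python which environment is running. The sketch below is only an illustration: the function name describe_environment is hypothetical, and it assumes conda has set the CONDA_DEFAULT_ENV variable, which happens when an environment is activated.

```python
import importlib.util
import os
import sys


def describe_environment(packages=("spacy",)):
    """Report the active environment and whether the given packages are importable."""
    return {
        # Set by conda when an environment is activated; None otherwise.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # Folder that holds the running interpreter (the "folder" from the slide).
        "prefix": sys.prefix,
        # True if the package can be found in this environment.
        "installed": {
            name: importlib.util.find_spec(name) is not None for name in packages
        },
    }


if __name__ == "__main__":
    print(describe_environment())
```

If spacy shows up as not installed, the most likely cause is that the script ran in the base environment rather than the new one.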
The Case for Text Analysis An abundance of data! News outlets; Digitized texts; Social media: Twitter, Facebook, LinkedIn; Open-ended survey questions; Systematizing qualitative research
The Case for Text Analysis (cont'd) Feature extraction; Named entity recognition; Text generation (ChatGPT); Text summarization; Counts, descriptives; Text translation; Text comparisons; Similarity measures: Word Mover's Distance, cosine similarity, etc.
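As an illustration of one similarity measure named above, here is a minimal sketch of cosine similarity on simple word counts. It assumes whitespace tokenization and raw counts rather than any particular weighting; the function name cosine_similarity is chosen for this example.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between two texts using raw word counts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words the two texts share.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity("registered nurse", "nurse practitioner"))
```

A value near 1 means the two texts use the same words in similar proportions; 0 means they share no words.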
The Text Analysis Pipeline 1. Acquire data 2. Structure text 3. Analyze text
Example: ACS Write-in Data The American Community Survey (ACS) is a nationally representative survey administered to over 3.5 million addresses annually by the U.S. Census Bureau. The ACS collects verbatim responses from respondents regarding their job title and description of job duties. Both responses are used to assign a formal Census occupation code.
ACS Write-In Data The 2017 ACS Industry and Occupation Write-In File contains about 1.8 million cases. We primarily examined two questions regarding job title and job duties. Job title: What was this person's main occupation? Job duties: Describe this person's most important activities or duties.
Data Cleaning: Abbreviations & Spelling We began by replacing abbreviations in the text file, creating a dictionary of common occupational abbreviations using data from O*NET's Alternate Titles resource and the National Processing Center. Example abbreviations: CPA for certified public accountant, PA for physician assistant, and RN for registered nurse. We then used the SymSpell package to conduct spelling corrections.
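The abbreviation step can be sketched as a dictionary lookup applied to whole words. This is not the authors' code: the dictionary below holds only the three examples from the slide, and the helper name expand_abbreviations and the choice to uppercase the text are assumptions made for illustration. The spelling-correction step is not shown.

```python
import re

# Only the three examples from the slide; the real dictionary was much larger.
ABBREVIATIONS = {
    "CPA": "CERTIFIED PUBLIC ACCOUNTANT",
    "PA": "PHYSICIAN ASSISTANT",
    "RN": "REGISTERED NURSE",
}

# Match abbreviations only as whole words, so "PA" does not fire inside "PARK".
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in ABBREVIATIONS) + r")\b"
)


def expand_abbreviations(text):
    """Uppercase the text and replace known abbreviations with full titles."""
    return _PATTERN.sub(lambda match: ABBREVIATIONS[match.group(1)], text.upper())


if __name__ == "__main__":
    print(expand_abbreviations("rn at a hospital"))
```

Matching on word boundaries is the important design choice here, since short abbreviations often appear as substrings of unrelated words.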
Data Cleaning: Tokenization & Stop Words Next, we tokenized all text responses. Tokenization is the process of separating words in a sentence into small chunks or "tokens". Example: [FOR, MY, DAILY, TASKS, I, AM, ANALYZING, DATA, AND, CODING]. We then removed all stop words. Stop words are common words that do not add much substantive information to the text. Stop words include pronouns, prepositions, and conjunctions. Example with stop words: [FOR, MY, DAILY, TASKS, I, AM, ANALYZING, DATA, AND, CODING]. Without stop words: [DAILY, TASKS, ANALYZING, DATA, CODING].
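The two steps on this slide can be sketched in a few lines. The stop-word list below is a tiny hand-picked set that covers only the slide's example, and the regular-expression tokenizer is an assumption; the slides do not say which tokenizer or stop-word list was actually used.

```python
import re

# Tiny illustrative list; real stop-word lists contain a few hundred words.
STOP_WORDS = {"FOR", "MY", "I", "AM", "AND"}


def tokenize(text):
    """Split text into uppercase word tokens, dropping punctuation."""
    return re.findall(r"[A-Z0-9']+", text.upper())


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


if __name__ == "__main__":
    tokens = tokenize("For my daily tasks, I am analyzing data and coding")
    print(tokens)
    print(remove_stop_words(tokens))
```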
Data Cleaning: Part of Speech Tagging Part of speech tagging was conducted on the tokenized data, classifying each word as an adjective, verb, noun, or adverb. Example: [DAILY, TASKS, ANALYZING, DATA, CODING] becomes [(DAILY, JJ), (TASKS, NNS), (ANALYZING, VBG), (DATA, NNS), (CODING, VBG)], where JJ = adjective; NNS = noun, plural; VBG = verb, gerund.
Data Cleaning: Lemmatization Part of speech tagging is conducted to improve the accuracy of lemmatization. Lemmatization converts a word to its base form. Example: [DAILY, TASKS, ANALYZING, DATA, CODING] becomes [DAILY, TASK, ANALYZE, DATA, CODE]. After data cleaning was completed, about 1.7 million records remained in the 2017 ACS Industry and Occupation Write-In File.
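To show why the part-of-speech tag matters for lemmatization, here is a toy sketch that looks up the base form by (word, tag) pair. The lookup table is hypothetical and covers only the slide's example plus one contrasting entry; a real lemmatizer relies on a full lexicon and rules.

```python
# Hypothetical lookup table keyed by (word, part-of-speech tag).
LEMMAS = {
    ("TASKS", "NNS"): "TASK",
    ("ANALYZING", "VBG"): "ANALYZE",
    ("CODING", "VBG"): "CODE",
    # Tagged as a singular noun, the same word is already in its base form.
    ("CODING", "NN"): "CODING",
}


def lemmatize(tagged_tokens):
    """Map each (word, tag) pair to its base form; unknown pairs pass through."""
    return [LEMMAS.get((word, tag), word) for word, tag in tagged_tokens]


if __name__ == "__main__":
    tagged = [
        ("DAILY", "JJ"),
        ("TASKS", "NNS"),
        ("ANALYZING", "VBG"),
        ("DATA", "NNS"),
        ("CODING", "VBG"),
    ]
    print(lemmatize(tagged))
```

The two CODING entries show the point of tagging first: the same surface word maps to a different base form depending on whether it is used as a verb or a noun.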
2017 ACS Write-in Data
Occupation group | Number of Detailed Census Occupations | Number of Job Titles (rounded)
All | 535 | 371,000
Management, Business, Science, and Arts Occupations | 194 | 135,000
Service Occupations | 68 | 46,500
Sales and Office Occupations | 70 | 72,500
Natural Resources, Construction, and Maintenance Occupations | 86 | 35,500
Production, Transportation, and Material Moving Occupations | 117 | 82,000
Segregation Across Detailed Occupations and Job Titles
Occupation group | Detailed Occupations Segregation Index | Job Titles Segregation Index | Prop. Female
Management, Business, Science, and Arts Occupations | 0.42 | 0.57 | 0.53
Service Occupations | 0.52 | 0.65 | 0.58
Sales and Office Occupations | 0.41 | 0.57 | 0.64
Natural Resources, Construction, and Maintenance Occupations | 0.43 | 0.70 | 0.05
Production, Transportation, and Material Moving Occupations | 0.40 | 0.67 | 0.22
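The slides do not say which segregation index was computed. One common choice for occupational sex segregation is the index of dissimilarity, D = 0.5 * sum over groups of |women_i / women_total - men_i / men_total|. The sketch below assumes that index and uses made-up counts, so it will not reproduce the figures above.

```python
def dissimilarity_index(women_by_group, men_by_group):
    """Index of dissimilarity between two count distributions over groups.

    Each argument maps a group (an occupation or job title) to a head count.
    Returns 0.0 for identical distributions and 1.0 for complete separation.
    """
    total_women = sum(women_by_group.values())
    total_men = sum(men_by_group.values())
    if total_women == 0 or total_men == 0:
        raise ValueError("both populations must be non-empty")
    groups = women_by_group.keys() | men_by_group.keys()
    return 0.5 * sum(
        abs(
            women_by_group.get(group, 0) / total_women
            - men_by_group.get(group, 0) / total_men
        )
        for group in groups
    )


if __name__ == "__main__":
    # Made-up counts for illustration only.
    women = {"NURSE": 90, "ENGINEER": 10}
    men = {"NURSE": 10, "ENGINEER": 90}
    print(dissimilarity_index(women, men))
```

Under this definition, computing the index over finer groups such as job titles can only match or exceed the value computed over the coarser occupations that contain them, which is consistent with the job-title column being higher in every row.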