
Text Representation and Mining for Business Intelligence and Analytics
Discover the challenges and strategies in representing and mining text data for business intelligence and analytics. Learn about text preprocessing, term frequency, IDF, and the Bag of Words approach. Understand why text data can be difficult to work with and the importance of proper text representation. Dive into the world of natural language processing to unlock valuable insights from unstructured text data.
Presentation Transcript
Business Intelligence and Analytics: Representing and Mining Text Session 11
Agenda: Text Representation; Text Preprocessing; Term Frequency and IDF; Case Study
Dealing with Text: Data are represented in ways natural to the problems from which they were derived. Vast amount of text. If we want to apply the many data mining tools that we have at our disposal, we must either engineer the data representation to match the tools (representation engineering), or build new tools to match the data.
Why Text is Difficult: Text is unstructured. Linguistic structure is intended for human communication and not computers. Word order matters sometimes. Text can be dirty: people write ungrammatically, misspell words, abbreviate unpredictably, and punctuate randomly. Synonyms, homographs, abbreviations, etc. Context matters.
Text Representation: The goal is to take a set of documents, each of which is a relatively free-form sequence of words, and turn it into our familiar feature-vector form. A collection of documents is called a corpus. A document is composed of individual tokens or terms. Each document is one instance, but we don't know in advance what the features will be.
Bag of Words: Treat every document as just a collection of individual words. Ignore grammar, word order, sentence structure, and (usually) punctuation. Treat every word in a document as a potentially important keyword of the document. What will be the feature's value in a given document? Each token is a feature whose value is one if the token is present in the document and zero if it is not. Straightforward representation. Inexpensive to generate. Tends to work well for many tasks.
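The one-or-zero representation described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `binary_bag_of_words`, the whitespace tokenizer, and the two toy documents are assumptions, not part of the original slides.

```python
def binary_bag_of_words(documents):
    """Return (vocabulary, vectors) with one 0/1 feature per distinct token."""
    # Assumption: lowercase + whitespace splitting is a good enough tokenizer here.
    tokenized = [doc.lower().split() for doc in documents]
    # The features are only known after scanning the whole corpus.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        present = set(tokens)
        vectors.append([1 if term in present else 0 for term in vocabulary])
    return vocabulary, vectors


docs = ["jazz is great", "great jazz great sax"]  # toy corpus (hypothetical)
vocab, vectors = binary_bag_of_words(docs)
print(vocab)    # ['great', 'is', 'jazz', 'sax']
print(vectors)  # [[1, 1, 1, 0], [1, 0, 1, 1]]
```

Note that the second document mentions "great" twice but still gets a value of one, since this representation records presence only.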
Pre-processing of Text: The following steps should be performed. The case should be normalized: every term is in lowercase. Words should be stemmed: suffixes are removed (e.g., noun plurals are transformed to singular forms). Stop-words should be removed: a stop-word is a very common word in English (or whatever language is being parsed), and typical stop-words such as "the", "and", "of", and "on" are removed.
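A minimal sketch of these three steps, using only the Python standard library, might look as follows. The stop-word list and the `naive_stem` rule are toy assumptions made for illustration; a real pipeline would more likely rely on an established stemmer (such as the Porter stemmer) and a fuller stop-word list.

```python
import re

# Assumption: a tiny illustrative stop-word list, not a standard one.
STOP_WORDS = {"the", "and", "of", "on", "a", "in"}


def naive_stem(token):
    """Toy stemmer: strip a trailing plural 's' (not a real stemming algorithm)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text):
    # 1. Normalize case and keep only alphabetic tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    # 2. Remove stop-words (done before stemming so the list matches raw forms).
    kept = [t for t in tokens if t not in STOP_WORDS]
    # 3. Stem the remaining tokens.
    return [naive_stem(t) for t in kept]


print(preprocess("The saxophonists and the trumpets of jazz"))
# ['saxophonist', 'trumpet', 'jazz']
```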
Term Frequency: Use the word count (frequency) in the document instead of just a zero or one. Differentiates between how many times a word is used.
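As a small illustration, raw term frequency is just a per-document word count. The sketch below uses `collections.Counter`; the token list is a made-up example.

```python
from collections import Counter


def term_frequencies(tokens):
    """Count how many times each term occurs in one document."""
    return Counter(tokens)


tokens = "jazz sax jazz bebop jazz".split()  # hypothetical pre-processed document
tf = term_frequencies(tokens)
print(tf["jazz"], tf["sax"], tf["piano"])  # 3 1 0
```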
Normalized Term Frequency: Documents of various lengths. Words of different frequencies. Words should not be too common or too rare: both an upper and a lower limit are set on the number (or fraction) of documents in which a word may occur. Feature selection is often employed. The raw term frequencies are normalized in some way, such as by dividing each by the total number of words in the document or by the frequency of the specific term in the corpus.
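The two ideas on this slide, length normalization and document-frequency limits, can be sketched as below. The function names, the thresholds `min_docs` and `max_fraction`, and the toy corpus are assumptions chosen for illustration; the slide does not prescribe specific values.

```python
from collections import Counter


def normalized_tf(tokens):
    """Divide each raw count by the total number of words in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def filter_by_document_frequency(tokenized_docs, min_docs, max_fraction):
    """Keep terms that appear in at least min_docs documents and in at most
    max_fraction of all documents (not too rare, not too common)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for tokens in tokenized_docs for term in set(tokens))
    return {
        term
        for term, df in doc_freq.items()
        if df >= min_docs and df / n_docs <= max_fraction
    }


corpus = [["jazz", "sax", "jazz"], ["jazz", "piano"], ["jazz", "sax"]]
print(normalized_tf(corpus[0])["jazz"])  # 2 of the 3 words, i.e. about 0.667
print(filter_by_document_frequency(corpus, min_docs=2, max_fraction=0.9))  # {'sax'}
```

Here "jazz" is dropped for being too common (it occurs in every document) and "piano" for being too rare (it occurs in only one).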
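TF-IDF: $\mathrm{TFIDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$, where the Inverse Document Frequency (IDF) of a term $t$ is $\mathrm{IDF}(t) = 1 + \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing } t}\right)$.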
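A rough sketch of this definition, with IDF computed as one plus the logarithm of the total number of documents divided by the number of documents containing the term, is given below. Two choices are assumptions not fixed by the slide: the natural logarithm is used, and TF is taken to be the raw count. The toy corpus is hypothetical.

```python
import math
from collections import Counter


def idf(term, tokenized_docs):
    """IDF(t) = 1 + log(total documents / documents containing t)."""
    n_docs = len(tokenized_docs)
    containing = sum(1 for tokens in tokenized_docs if term in tokens)
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return 1 + math.log(n_docs / containing)  # assumption: natural log


def tfidf(tokens, tokenized_docs):
    """TFIDF(t, d) = TF(t, d) * IDF(t), with TF as the raw count (assumption)."""
    counts = Counter(tokens)
    return {term: count * idf(term, tokenized_docs) for term, count in counts.items()}


corpus = [["jazz", "sax", "jazz"], ["jazz", "piano"], ["jazz", "sax"]]
scores = tfidf(corpus[1], corpus)
print(scores["jazz"])   # 1.0, since "jazz" appears in every document
print(scores["piano"])  # 1 + log(3), since "piano" appears in one of three
```

The rarer term "piano" receives the higher weight even though both terms occur once in the document, which is the effect the IDF factor is meant to produce.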
TFIDF Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.
Example: Jazz Musicians: 15 prominent jazz musicians and excerpts of their biographies from Wikipedia. Nearly 2,000 features after stemming and stop-word removal! Consider the sample phrase "Famous jazz saxophonist born in Kansas who played bebop and latin".
Example: Jazz Musicians Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.
Beyond Bag of Words: N-gram Sequences; Named Entity Extraction; Topic Models
N-gram Sequences: In some cases, word order is important and you want to preserve some information about it in the representation. A next step up in complexity is to include sequences of adjacent words as terms. Adjacent pairs are commonly called bi-grams. Example: "The quick brown fox jumps" would be transformed into {quick, brown, fox, jumps, quick_brown, brown_fox, fox_jumps}. N-grams greatly increase the size of the feature set.
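The bi-gram example on this slide can be reproduced with a short sketch. The function name `unigrams_and_bigrams` is an assumption, and the input is assumed to be already pre-processed (lowercased, with the stop-word "the" removed), which is why "the" does not appear in the result.

```python
def unigrams_and_bigrams(tokens):
    """Return the individual tokens plus every pair of adjacent tokens."""
    bigrams = [f"{first}_{second}" for first, second in zip(tokens, tokens[1:])]
    return tokens + bigrams


tokens = ["quick", "brown", "fox", "jumps"]  # "the" already removed as a stop-word
terms = unigrams_and_bigrams(tokens)
print(terms)
# ['quick', 'brown', 'fox', 'jumps', 'quick_brown', 'brown_fox', 'fox_jumps']
```

Four tokens already yield seven terms, which hints at how quickly the feature set grows once adjacent sequences are included.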
Topic Models Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.
Text Mining Example: The task is to predict the stock market based on the stories that appear on the news wires.
Mining News Stories to Predict Stock Price Movement Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.
Thank You
References
Provost, F., Fawcett, T. (2013). Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking. O'Reilly, CA 95472.
Frank, E., Hall, M. A., Witten, I. H. (2016). The Weka Workbench. Morgan Kaufmann, Elsevier.
Brownlee, J. (2017). Machine Learning Mastery With Weka. E-Book.
Sharda, R., Delen, D., Turban, E. (2018). Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition. Pearson.