Text Representation and Mining for Business Intelligence and Analytics

Discover the challenges and strategies in representing and mining text data for business intelligence and analytics. Learn about text preprocessing, term frequency, IDF, and the Bag of Words approach. Understand why text data can be difficult to work with and the importance of proper text representation. Dive into the world of natural language processing to unlock valuable insights from unstructured text data.

  • Text Mining
  • Data Analytics
  • Natural Language Processing
  • Text Representation
  • Business Intelligence

Presentation Transcript


  1. Business Intelligence and Analytics: Representing and Mining Text (Session 11)

  2. Agenda
     • Text Representation
     • Text Preprocessing
     • Term Frequency and IDF
     • Case Study

  3. Dealing with Text
     • Data are represented in ways natural to problems from which they were derived
     • Vast amount of text
     • If we want to apply the many data mining tools that we have at our disposal, we must either engineer the data representation to match the tools (representation engineering), or build new tools to match the data

  4. Why Text is Difficult
     • Text is unstructured: linguistic structure is intended for human communication and not computers
     • Word order matters sometimes
     • Text can be dirty: people write ungrammatically, misspell words, abbreviate unpredictably, and punctuate randomly
     • Synonyms, homographs, abbreviations, etc.
     • Context matters

  5. Text Representation
     • Goal: take a set of documents, each of which is a relatively free-form sequence of words, and turn it into our familiar feature-vector form
     • A collection of documents is called a corpus
     • A document is composed of individual tokens or terms
     • Each document is one instance, but we don't know in advance what the features will be

  6. Bag of Words
     • Treat every document as just a collection of individual words
     • Ignore grammar, word order, sentence structure, and (usually) punctuation
     • Treat every word in a document as a potentially important keyword of the document
     • What will be the feature's value in a given document? For each token, the value is a one (if the token is present in the document) or a zero (if the token is not present in the document)
     • Straightforward representation
     • Inexpensive to generate
     • Tends to work well for many tasks
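
The binary bag-of-words representation described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the two-sentence corpus, the regular-expression tokenizer, and the function names are all assumptions made for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and keep only alphabetic tokens (punctuation is dropped)."""
    return re.findall(r"[a-z]+", text.lower())

def binary_bag_of_words(documents):
    """Return (vocabulary, matrix); matrix[i][j] is 1 if term j occurs in document i."""
    token_sets = [set(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*token_sets))
    matrix = [[1 if term in tokens else 0 for term in vocabulary]
              for tokens in token_sets]
    return vocabulary, matrix

# Hypothetical two-document corpus, used only for illustration.
corpus = [
    "Jazz is played on the saxophone.",
    "The saxophone is a woodwind instrument.",
]
vocabulary, matrix = binary_bag_of_words(corpus)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document (one instance) and each column is one term, so the feature set is only known once the whole corpus has been scanned.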

  7. Pre-processing of Text
     • The following steps should be performed:
     • The case should be normalized: every term is in lowercase
     • Words should be stemmed: suffixes are removed (e.g., noun plurals are transformed to singular forms)
     • Stop-words should be removed: a stop-word is a very common word in English (or whatever language is being parsed); typical words such as the, and, of, and on are removed
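
The three preprocessing steps can be sketched as a small pipeline. The stop-word list and the plural-stripping rule below are deliberately crude assumptions for illustration; a real project would more likely use an established stemmer (such as a Porter stemmer) and a fuller stop-word list.

```python
import re

# Illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "and", "of", "on", "a", "in", "is"}

def crude_stem(token):
    """Turn simple noun plurals into singular forms; a rough stand-in for a real stemmer."""
    if len(token) > 4 and token.endswith("ies"):
        return token[:-3] + "y"          # biographies -> biography
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]                # musicians -> musician
    return token

def preprocess(text):
    """Normalize case, stem, then remove stop-words (the order used on the slide)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stemmed = [crude_stem(t) for t in tokens]
    return [t for t in stemmed if t not in STOP_WORDS]

result = preprocess("The Biographies of the Musicians and the Bands")
print(result)
```

The suffix rule will mis-handle many English words (for example, it leaves "boxes" as "boxe"), which is why the sketch should be read as a demonstration of the steps rather than a usable stemmer.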

  8. Term Frequency
     • Use the word count (frequency) in the document instead of just a zero or one
     • Differentiates between how many times a word is used
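
As a small sketch of the idea, term frequency replaces the zero/one flag with a count per term. The example sentence and the tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter

def term_frequencies(text):
    """Count how many times each lowercase alphabetic token occurs in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

# Hypothetical document, used only for illustration.
tf = term_frequencies("Jazz, jazz, and more jazz: bebop and swing.")
print(tf.most_common())
```

A `Counter` returns 0 for terms that never occur, which matches the zero entries of the feature vector.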

  9. Normalized Term Frequency
     • Documents of various lengths
     • Words of different frequencies
     • Words should not be too common or too rare: both an upper and a lower limit on the number (or fraction) of documents in which a word may occur
     • Feature selection is often employed
     • The raw term frequencies are normalized in some way, such as by dividing each by the total number of words in the document or the frequency of the specific term in the corpus
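
Two of the ideas on this slide can be sketched as follows: normalizing raw counts by document length, and dropping terms that appear in too few or too many documents. The parameter names `min_df` and `max_fraction`, their default values, and the toy corpus are assumptions chosen for the example.

```python
from collections import Counter

def normalized_tf(tokens):
    """Divide each raw term count by the total number of tokens in the document."""
    total = len(tokens)
    return {term: count / total for term, count in Counter(tokens).items()}

def filter_by_document_frequency(tokenized_docs, min_df=2, max_fraction=0.8):
    """Keep terms found in at least min_df documents and at most max_fraction of them."""
    n_docs = len(tokenized_docs)
    # set(doc) ensures a term is counted once per document, however often it repeats.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term for term, n in df.items()
            if n >= min_df and n / n_docs <= max_fraction}

# Hypothetical pre-tokenized corpus, used only for illustration.
docs = [
    ["jazz", "saxophone", "bebop", "jazz"],
    ["jazz", "piano", "swing"],
    ["jazz", "saxophone", "latin"],
    ["jazz", "trumpet", "bebop"],
]
weights = normalized_tf(docs[0])
kept = filter_by_document_frequency(docs)
print(weights)
print(sorted(kept))
```

In this toy corpus "jazz" appears in every document and is dropped as too common, while terms that appear in only one document are dropped as too rare.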

  10. TF-IDF
     • $\mathrm{TFIDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$
     • Inverse Document Frequency (IDF) of a term: $\mathrm{IDF}(t) = 1 + \log\left(\dfrac{\text{Total number of documents}}{\text{Number of documents containing } t}\right)$
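
A minimal sketch of the TF-IDF computation follows. It assumes raw counts for TF and the natural logarithm for IDF, since the slide does not state a log base; the toy corpus and function names are also illustrative assumptions.

```python
import math
from collections import Counter

def idf(term, tokenized_docs):
    """IDF(t) = 1 + log(total documents / documents containing t), natural log assumed."""
    n_docs = len(tokenized_docs)
    n_containing = sum(1 for doc in tokenized_docs if term in doc)
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return 1 + math.log(n_docs / n_containing)

def tfidf(term, doc, tokenized_docs):
    """TFIDF(t, d) = TF(t, d) * IDF(t), using the raw count of t in d as TF."""
    return Counter(doc)[term] * idf(term, tokenized_docs)

# Hypothetical pre-tokenized corpus, used only for illustration.
docs = [
    ["jazz", "saxophone", "bebop", "jazz"],
    ["jazz", "piano", "swing"],
    ["jazz", "saxophone", "latin"],
    ["jazz", "trumpet", "bebop"],
]
print(tfidf("jazz", docs[0], docs))   # common term: IDF is 1 + log(4/4)
print(tfidf("bebop", docs[0], docs))  # rarer term: IDF is 1 + log(4/2)
```

A term that occurs in every document gets the minimum IDF of 1, so its TF-IDF weight equals its raw count, while rarer terms are boosted.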

  11. TFIDF. Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.

  12. Example: Jazz Musicians
     • 15 prominent jazz musicians and excerpts of their biographies from Wikipedia
     • Nearly 2,000 features after stemming and stop-word removal!
     • Consider the sample phrase "Famous jazz saxophonist born in Kansas who played bebop and latin"

  13. Example: Jazz Musicians. Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.

  14. Example: Jazz Musicians. Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.

  15. Example: Jazz Musicians. Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.

  16. Example: Jazz Musicians

  17. Beyond Bag of Words
     • N-gram Sequences
     • Named Entity Extraction
     • Topic Models

  18. N-gram Sequences
     • In some cases, word order is important and you want to preserve some information about it in the representation
     • A next step up in complexity is to include sequences of adjacent words as terms
     • Adjacent pairs are commonly called bi-grams
     • Example: "The quick brown fox jumps" would be transformed into {quick, brown, fox, jumps, quick_brown, brown_fox, fox_jumps}
     • N-grams greatly increase the size of the feature set
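
The slide's example can be reproduced with a short sketch that emits single words plus adjacent pairs. It assumes that "the" is removed as a stop-word before pairing, which is inferred from the slide's output set; the function name and one-word stop list are illustrative.

```python
import re

STOP_WORDS = {"the"}  # assumption: just enough to match the slide's example

def unigrams_and_bigrams(text):
    """Return the remaining single words followed by adjacent word pairs joined by '_'."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOP_WORDS]
    bigrams = [f"{first}_{second}" for first, second in zip(tokens, tokens[1:])]
    return tokens + bigrams

terms = unigrams_and_bigrams("The quick brown fox jumps")
print(terms)
```

A document with n remaining tokens contributes up to n - 1 bi-gram terms on top of its n single-word terms, which is where the growth in the feature set comes from.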

  19. Topic Models. Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.

  20. Text Mining Example
     • Task: predict the stock market based on the stories that appear on the news wires

  21. Mining News Stories to Predict Stock Price Movement. Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.

  22. Mining News Stories to Predict Stock Price Movement. Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.

  23. Mining News Stories to Predict Stock Price Movement. Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.

  24. Mining News Stories to Predict Stock Price Movement. Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.

  25. Mining News Stories to Predict Stock Price Movement

  26. Thank You

  27. References
     • Provost, F.; Fawcett, T.: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking. O'Reilly, CA 95472, 2013.
     • Frank, E.; Hall, M. A.; Witten, I. H.: The Weka Workbench. Morgan Kaufmann, Elsevier, 2016.
     • Brownlee, J.: Machine Learning Mastery With Weka. E-Book, 2017.
     • Sharda, R.; Delen, D.; Turban, E.: Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition. Pearson, 2018.
