FlashExtract: A General Framework for Data Extraction by Examples

Framework overview: FlashExtract is a general framework for extracting data from documents by examples, developed by Vu Le and Sumit Gulwani. An extraction is described by an output schema plus one field extraction program per field in that schema. Each program is written in a domain-specific language (DSL) whose grammar defines the extraction strategies and is built from a small core algebra of operators (Map, Filter, Merge, Pair). The presentation covers the learning algorithm, the instantiations for text files, web pages, and spreadsheets, and an evaluation on real-world files.

  • Data Extraction
  • Framework
  • Examples
  • Schema
  • Programming

Uploaded on Mar 03, 2025 | 2 Views



Presentation Transcript


  1. FlashExtract: A General Framework for Data Extraction by Examples
     Vu Le (UC Davis), Sumit Gulwani (MSR)

  2. motivation

  3. demo

  4. schema extraction program
     o Output schema
     o Field extraction programs for all fields in the schema

  5. output schema
     o XML-like: sequence and structure
       Seq([blue] Struct(Name: [green] String, City: [yellow] String))
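As a rough illustration of the schema on this slide, the sketch below models sequences, structures, and colored leaf fields as Python dataclasses. The class names `Leaf`, `Struct`, and `Seq` and the helpers `render` and `colors` are hypothetical and chosen only for this example; they are not taken from the FlashExtract implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class Leaf:
    """A colored leaf field with a primitive type (hypothetical model)."""
    type_name: str
    color: str


@dataclass
class Struct:
    """A structure with named elements and an optional color."""
    elements: Dict[str, "Schema"]
    color: Optional[str] = None


@dataclass
class Seq:
    """A sequence whose items all share one schema."""
    item: "Schema"


Schema = Union[Leaf, Struct, Seq]


def render(schema: Schema) -> str:
    """Print a schema in the notation used on the slide."""
    if isinstance(schema, Seq):
        return f"Seq({render(schema.item)})"
    if isinstance(schema, Struct):
        body = ", ".join(f"{name}: {render(child)}"
                         for name, child in schema.elements.items())
        prefix = f"[{schema.color}] " if schema.color else ""
        return f"{prefix}Struct({body})"
    return f"[{schema.color}] {schema.type_name}"


def colors(schema: Schema) -> List[str]:
    """List the field colors, outermost first."""
    if isinstance(schema, Seq):
        return colors(schema.item)
    if isinstance(schema, Struct):
        own = [schema.color] if schema.color else []
        return own + [c for child in schema.elements.values()
                      for c in colors(child)]
    return [schema.color]


# The schema from the slide: a sequence of blue structs,
# each holding a green Name and a yellow City.
SCHEMA = Seq(Struct({"Name": Leaf("String", "green"),
                     "City": Leaf("String", "yellow")}, color="blue"))

print(render(SCHEMA))
```

Each color returned by `colors` corresponds to one field, so a complete schema extraction program would need one field extraction program per entry.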

  6. field extraction program
     o An ancestor
     o A program in the DSL
     Examples
     o Green = <Blue, PRegion>
     o Yellow = < , PSeqRegion>

  7. data extraction DSL
     o DSL is a tuple (G, N1, N2)
     o G: grammar defining extraction strategies
     o N1: top-level SeqRegion nonterminal
     o N2: top-level Region nonterminal
     o Each nonterminal has a learn method

  8. core algebra
     o Decomposable Map Operator
     o Filter Operators
     o Merge Operator
     o Pair Operator
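The slide lists only the operator names. As a rough illustration, and assuming the usual list-processing reading of those names (the exact semantics are not given in this transcript), the operators could be sketched in Python as follows. The function names `map_op`, `filter_bool`, `filter_int`, `merge`, and `pair` are hypothetical; the two filters mirror the FilterBool and FilterInt names that appear on the extraction meta-synthesis slide.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def map_op(f: Callable[[T], U], seq: Sequence[T]) -> List[U]:
    """Map: apply a scalar function to every element of a sequence."""
    return [f(x) for x in seq]


def filter_bool(pred: Callable[[T], bool], seq: Sequence[T]) -> List[T]:
    """FilterBool: keep the elements that satisfy a Boolean predicate."""
    return [x for x in seq if pred(x)]


def filter_int(init: int, step: int, seq: Sequence[T]) -> List[T]:
    """FilterInt (assumed reading): keep elements at init, init+step, ..."""
    return list(seq[init::step])


def merge(*seqs: Sequence[T]) -> List[T]:
    """Merge (assumed reading): combine sequences into one ordered sequence.

    Elements are assumed to be comparable, e.g. document positions.
    """
    return sorted(x for seq in seqs for x in seq)


def pair(a: T, b: U) -> Tuple[T, U]:
    """Pair: combine two results, e.g. a start and an end position."""
    return (a, b)


lines = ["a, WA", "b, OR", "c, WA", "d, CA"]
print(filter_bool(lambda line: line.endswith("WA"), lines))
print(filter_int(1, 2, lines))
```

Because each operator is a small pure function, a DSL grammar can compose them freely, which is what makes a per-operator learner plausible.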

  9. city example

  10. city example
      1. Filter lines that end with WA

  11. city example
      1. Filter lines that end with WA
      2. Map each selected line to a pair of positions

  12. city example
      1. Filter lines that end with WA
      2. Map each selected line to a pair of positions
      3. Learn two leaf exprs for the two positions
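The three steps of the city example can be sketched by hand in Python. This is a minimal illustration of the kind of program being described, not the output of FlashExtract: the sample document `DOC` and the two position functions `start_pos` and `end_pos` are hypothetical stand-ins for the document shown on the slide and for the learned leaf expressions.

```python
import re
from typing import List, Tuple

# Hypothetical sample document; the slide's actual document is not in the transcript.
DOC = """Ana Trujillo
357 21th Place SE
Redmond, WA
(757) 555-1634

Antonio Moreno
515 93th Lane
Renton, WA
(411) 555-2786
"""


def ends_with_wa(line: str) -> bool:
    """Step 1: the Boolean filter that selects city lines."""
    return line.endswith("WA")


def start_pos(line: str) -> int:
    """Step 3, first leaf expression (assumed): the start of the line."""
    return 0


def end_pos(line: str) -> int:
    """Step 3, second leaf expression (assumed): just before ', WA'."""
    match = re.search(r",\s*WA$", line)
    return match.start() if match else len(line)


def extract_cities(text: str) -> List[str]:
    lines = text.splitlines()
    selected = [line for line in lines if ends_with_wa(line)]
    # Step 2: map each selected line to a pair of positions.
    pairs: List[Tuple[str, Tuple[int, int]]] = [
        (line, (start_pos(line), end_pos(line))) for line in selected
    ]
    return [line[start:end] for line, (start, end) in pairs]


print(extract_cities(DOC))
```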

  13. learning algorithm
      o Inductive on the grammar structure
      o Learn city = learn a map operator
        o The lines that hold the city
        o The pair that identifies the city within a line

  14. learning algorithm
      o Inductive on the grammar structure
      o Learn city = learn a map operator
        o The lines that hold the city
        o The pair that identifies the city within a line
      o Learn lines = learn a Boolean filter
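To make "learn lines = learn a Boolean filter" concrete, here is a deliberately small sketch of one way such a filter could be learned from example lines. It assumes a single hypothetical predicate family, "line ends with a fixed string", and picks the longest suffix shared by all positive examples; the real system searches a richer predicate space, and `learn_ends_with_filter` is not a FlashExtract API.

```python
import os
from typing import Callable, Iterable, Optional, Tuple


def common_suffix(strings: Iterable[str]) -> str:
    """Longest string that every input ends with."""
    reversed_strings = [s[::-1] for s in strings]
    return os.path.commonprefix(reversed_strings)[::-1]


def learn_ends_with_filter(
    positives: Iterable[str], negatives: Iterable[str] = ()
) -> Optional[Tuple[str, Callable[[str], bool]]]:
    """Return (suffix, predicate) consistent with the examples, or None.

    The longest common suffix is the most specific 'ends with' predicate
    that accepts every positive example. If it still accepts a negative
    example, no predicate in this family can be consistent.
    """
    suffix = common_suffix(list(positives))
    if not suffix:
        return None

    def predicate(line: str) -> bool:
        return line.endswith(suffix)

    if any(predicate(line) for line in negatives):
        return None
    return suffix, predicate


learned = learn_ends_with_filter(["Redmond, WA", "Renton, WA"])
if learned is not None:
    suffix, keep = learned
    document = ["Ana Trujillo", "Redmond, WA", "Seattle, WA", "Portland, OR"]
    print(repr(suffix), [line for line in document if keep(line)])
```

Two examples are enough here to generalize to a line the user never highlighted, which is the behavior the later evaluation slides measure in number of examples.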

  15. inductive synthesis
      1. Problem Definition: Identify a vertical domain of tasks that users struggle with
      2. Domain-Specific Language (DSL): Design a DSL that can succinctly describe tasks in that domain
      3. Synthesis Algorithm: Develop an algorithm that can efficiently translate examples into likely programs in DSL
      4. Machine Learning: Rank the various programs
      5. User Interface: Provide an appropriate interaction mechanism to resolve ambiguities

  16. pros & cons
      o Advantages
        o Efficient synthesizer
        o Easier ranking control
        o Tighter integration with user interaction model
      o Disadvantages
        o Non-constructive: require thinking & implementation
        o Non-modular: DSL is not extensible

  17. inductive meta-synthesis
      o A synthesizer for a related family of DSLs that supports a common user interaction model
      o Alleviate disadvantages of the generic methodology

  18. inductive meta-synthesis
      o Identify a family of vertical task domains
      o Design an algebra for DSLs
      o Implement a search algorithm for each algebra operator

  19. inductive meta-synthesis
      o Identify a family of vertical task domains
      o Design an algebra for DSLs
      o Implement a search algorithm for each algebra operator

  20. extraction meta-synthesis
      o Identify a family of vertical task domains
        o Extraction of semi-structured documents
      o Design an algebra for DSLs
        o Merge, Map, FilterBool, FilterInt, Pair
      o Implement a search algorithm for each algebra operator
        o Compositional and inductive learners

  21. synthesis algorithm
      o Top-down
        o Top-level SeqRegion, Region symbols N1, N2
      o Grammar-guided
        o Grammar built from the algebra operators

  22. key insight
      o Reduce learning task for an expression to learning tasks for its sub-expressions
      o Examples: Learn Map(λx : F, S)
        o Learn the scalar expression F
        o Learn the sequence expression S
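The decomposition on this slide can be illustrated with a toy learner for a Map program over lines. It is a sketch under strong assumptions: the sequence expression S is restricted to "lines ending with a common suffix", and the scalar expression F to "a fixed offset from the left and a fixed offset from the right". The names `learn_sequence`, `learn_scalar`, and `learn_map` are hypothetical.

```python
import os
from typing import Callable, List, Tuple

# One highlighted region: (line index, start offset, end offset).
Example = Tuple[int, int, int]


def learn_sequence(example_lines: List[str]) -> Callable[[str], bool]:
    """Sub-task S: learn which lines the Map operator iterates over."""
    suffix = os.path.commonprefix([l[::-1] for l in example_lines])[::-1]
    if not suffix:
        raise ValueError("no common suffix among the example lines")
    return lambda line: line.endswith(suffix)


def learn_scalar(
    located: List[Tuple[str, int, int]]
) -> Callable[[str], Tuple[int, int]]:
    """Sub-task F: learn a function from one line to a (start, end) pair."""
    starts = {start for _, start, _ in located}
    from_right = {len(line) - end for line, _, end in located}
    if len(starts) != 1 or len(from_right) != 1:
        raise ValueError("examples disagree within this toy position language")
    start, offset = starts.pop(), from_right.pop()
    return lambda line: (start, len(line) - offset)


def learn_map(lines: List[str], examples: List[Example]):
    """Learn Map(F, S) by solving the two sub-tasks independently."""
    keep = learn_sequence([lines[i] for i, _, _ in examples])
    locate = learn_scalar([(lines[i], s, e) for i, s, e in examples])

    def program(doc_lines: List[str]) -> List[str]:
        result = []
        for line in doc_lines:
            if keep(line):
                start, end = locate(line)
                result.append(line[start:end])
        return result

    return program


LINES = ["Ana Trujillo", "Redmond, WA", "Antonio Moreno",
         "Renton, WA", "Thomas Hardy", "Seattle, WA"]
# Two examples: "Redmond" on line 1 and "Renton" on line 3.
program = learn_map(LINES, [(1, 0, 7), (3, 0, 6)])
print(program(LINES))
```

The point of the split is that the examples for the whole expression are first turned into examples for S (the lines containing a highlight) and for F (the highlight within each line), so neither sub-learner needs to know about the other.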

  23. instantiations
      o Text files
      o Web pages
      o Spreadsheets

  24. demo

  25. evaluation
      o Can FlashExtract extract data from real-world files?
      o How many interactions typically required?
      o How efficient/real-time is FlashExtract?

  26. expressiveness
      o Can FlashExtract extract data from real-world files?
      o How many interactions typically required?
      o How efficient/real-time is FlashExtract?

  27. benchmarks
      o 25 text files
        o System log files
        o Copied texts from web pages and PDFs
        o Samples from Pro Perl Parsing
      o 25 webpages from [1]
        o Add two more test cases for each web page
      o 25 spreadsheets
        o 7 from [2] that are applicable for extracting
        o 18 from EUSES corpus
      [1] E. Oro, M. Ruffolo, and S. Staab. SXPath: Extending XPath towards spatial querying on web documents. Proc. VLDB Endow., 2010.
      [2] B. Harris and S. Gulwani. Spreadsheet table transformations from examples. In PLDI, 2011.

  28. effectiveness
      o Can FlashExtract extract data from real-world files? Yes
      o How many interactions typically required? 2.36 examples
      o How efficient/real-time is FlashExtract?

  29. efficiency
      o Can FlashExtract extract data from real-world files? Yes
      o How many interactions typically required? 2.36 examples
      o How efficient/real-time is FlashExtract? 0.82s last interaction

  30. conclusion
      o Inductive meta-synthesis
      o FlashExtract is general
        o Text file, web page, spreadsheet instantiations
      o FlashExtract is practical
        o Extract real-world data, in real time, within a few examples

  31. thank you
      Questions?
