FlashExtract: A General Framework for Data Extraction by Examples
Framework overview: FlashExtract is a framework for extracting data from semi-structured documents by examples. Developed by Vu Le and Sumit Gulwani, it pairs an output schema with field extraction programs written in a grammar-defined DSL, which is built from a small core algebra of operators (Map, Filter, Merge, Pair). The talk presents instantiations for text files, web pages, and spreadsheets.
Uploaded on Mar 03, 2025 | 2 Views
Presentation Transcript
FlashExtract: A General Framework for Data Extraction by Examples
Vu Le (UC Davis), Sumit Gulwani (MSR)
motivation
demo
schema extraction program
o Output schema
o Field extraction programs for all fields in the schema
output schema
o XML-like: sequence and structure
  Seq([blue] Struct(Name: [green] String, City: [yellow] String))
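The schema on this slide can be modelled as a small tree. The sketch below is only an illustration: the class names `Leaf`, `Struct`, `Seq` and the helper `field_colors` are hypothetical and are not FlashExtract's API. It assumes each colour on the slide marks one field.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Leaf:
    color: str       # highlight colour that marks this field
    type_name: str   # e.g. "String"


@dataclass
class Struct:
    color: str
    members: List[Tuple[str, "Schema"]]  # (member name, member schema)


@dataclass
class Seq:
    element: "Schema"  # schema of each item in the sequence


Schema = Union[Leaf, Struct, Seq]


def field_colors(schema: Schema) -> List[str]:
    """Collect the field colours in the order they appear in the schema."""
    if isinstance(schema, Seq):
        return field_colors(schema.element)
    if isinstance(schema, Struct):
        colors = [schema.color]
        for _, member in schema.members:
            colors.extend(field_colors(member))
        return colors
    return [schema.color]


# Seq([blue] Struct(Name: [green] String, City: [yellow] String))
schema = Seq(Struct("blue", [("Name", Leaf("green", "String")),
                             ("City", Leaf("yellow", "String"))]))
print(field_colors(schema))
```

Each colour returned here is a field that needs its own field extraction program.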
field extraction program
o An ancestor
o A program in the DSL
Examples
o Green = <Blue, PRegion>
o Yellow = < , PSeqRegion>
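One way to read the pair of ancestor and program is sketched below. This is a sketch under stated assumptions, not the authors' implementation: `FieldProgram`, `extract`, `blank_line_blocks`, and `first_line` are hypothetical names, the sample text is made up, and it assumes that a field without an ancestor runs over the whole document while a field with an ancestor runs once inside each ancestor region.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Region = Tuple[int, int]  # (start, end) character offsets into the document


@dataclass
class FieldProgram:
    ancestor: Optional[str]  # ancestor field name; None = whole document (assumption)
    run: Callable[[str, Region], List[Region]]  # regions found inside one input region


def blank_line_blocks(doc: str, region: Region) -> List[Region]:
    """Hypothetical sequence program: split a region into blank-line-separated blocks."""
    start, end = region
    regions, pos = [], start
    for chunk in doc[start:end].split("\n\n"):
        regions.append((pos, pos + len(chunk)))
        pos += len(chunk) + 2  # skip the two newline characters
    return regions


def first_line(doc: str, region: Region) -> List[Region]:
    """Hypothetical region program: the first line of the ancestor region."""
    start, end = region
    newline = doc.find("\n", start, end)
    return [(start, end if newline == -1 else newline)]


def extract(doc: str, programs: Dict[str, FieldProgram], field: str) -> List[Region]:
    """Run a field's program over the document or over each ancestor region."""
    prog = programs[field]
    if prog.ancestor is None:
        return prog.run(doc, (0, len(doc)))
    found: List[Region] = []
    for parent in extract(doc, programs, prog.ancestor):
        found.extend(prog.run(doc, parent))
    return found


PROGRAMS = {
    "Blue": FieldProgram(None, blank_line_blocks),
    "Green": FieldProgram("Blue", first_line),
}
DOC = "Ana Trujillo\nRedmond, WA\n\nAntonio Moreno\nRenton, WA"  # made-up sample
names = [DOC[s:e] for s, e in extract(DOC, PROGRAMS, "Green")]
print(names)
```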
data extraction DSL
o DSL is a tuple (G, N1, N2)
  o G: grammar defining extraction strategies
  o N1: top-level SeqRegion nonterminal
  o N2: top-level Region nonterminal
o Each nonterminal has a learn method
core algebra
o Decomposable Map Operator
o Filter Operators
o Merge Operator
o Pair Operator
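A rough sketch of what these operators could compute is shown below. The slide gives only the operator names, so the semantics are assumptions: `filter_int` is assumed to keep every `step`-th element starting at index `init`, and `merge` is assumed to combine several sequences into one ordered sequence. The two filter variants follow the FilterBool and FilterInt names that appear later in the talk. All Python function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def map_op(f: Callable[[T], U], seq: Sequence[T]) -> List[U]:
    """Map: apply a scalar function to every element of a sequence."""
    return [f(x) for x in seq]


def filter_bool(pred: Callable[[T], bool], seq: Sequence[T]) -> List[T]:
    """FilterBool: keep the elements that satisfy a Boolean predicate."""
    return [x for x in seq if pred(x)]


def filter_int(init: int, step: int, seq: Sequence[T]) -> List[T]:
    """FilterInt (assumed semantics): keep elements init, init+step, init+2*step, ..."""
    return list(seq[init::step])


def merge(*seqs: Sequence[int]) -> List[int]:
    """Merge (assumed semantics): union of several sequences, in sorted order."""
    return sorted(x for seq in seqs for x in seq)


def pair(a: T, b: U) -> Tuple[T, U]:
    """Pair: combine two scalar results, e.g. a start and an end position."""
    return (a, b)


print(filter_int(1, 2, [10, 20, 30, 40, 50]))
print(merge([1, 4], [2, 3]))
```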
city example
1. Filter lines that end with WA
2. Map each selected line to a pair of positions
3. Learn two leaf expressions for the two positions
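The steps of the city example can be sketched in a few lines. The sample document is made up for illustration, and the two position functions stand in for the leaf expressions (here: start of line, and the last comma). They are hypothetical placeholders, not the position language FlashExtract actually learns.

```python
# Made-up sample document, loosely in the spirit of the demo.
DOCUMENT = """Ana Trujillo
357 21st Place SE
Redmond, WA
Antonio Moreno
515 93rd Lane
Renton, WA"""


def ends_with_wa(line: str) -> bool:
    """Step 1: Boolean filter that selects the lines holding a city."""
    return line.endswith("WA")


def start_pos(line: str) -> int:
    """Leaf expression for the first position: start of the line (assumption)."""
    return 0


def end_pos(line: str) -> int:
    """Leaf expression for the second position: the last comma (assumption)."""
    return line.rindex(",")


def city_span(line: str) -> tuple:
    """Step 2: map a selected line to a pair of positions."""
    return (start_pos(line), end_pos(line))


selected = [line for line in DOCUMENT.splitlines() if ends_with_wa(line)]
spans = [city_span(line) for line in selected]
cities = [line[s:e] for line, (s, e) in zip(selected, spans)]
print(cities)
```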
learning algorithm
o Inductive on the grammar structure
o Learn city = learn a map operator
  o The lines that hold the city
  o The pair that identifies the city within a line
o Learn lines = learn a Boolean filter
inductive synthesis
1. Problem Definition: Identify a vertical domain of tasks that users struggle with
2. Domain-Specific Language (DSL): Design a DSL that can succinctly describe tasks in that domain
3. Synthesis Algorithm: Develop an algorithm that can efficiently translate examples into likely programs in DSL
4. Machine Learning: Rank the various programs
5. User Interface: Provide an appropriate interaction mechanism to resolve ambiguities
pros & cons
o Advantages
  o Efficient synthesizer
  o Easier ranking control
  o Tighter integration with user interaction model
o Disadvantages
  o Non-constructive: require thinking & implementation
  o Non-modular: DSL is not extensible
inductive meta-synthesis
o A synthesizer for a related family of DSLs that supports a common user interaction model
o Alleviate disadvantages of the generic methodology
inductive meta-synthesis
o Identify a family of vertical task domains
o Design an algebra for DSLs
o Implement a search algorithm for each algebra operator
extraction meta-synthesis
o Identify a family of vertical task domains
  o Extraction of semi-structured documents
o Design an algebra for DSLs
  o Merge, Map, FilterBool, FilterInt, Pair
o Implement a search algorithm for each algebra operator
  o Compositional and inductive learners
synthesis algorithm
o Top-down
  o Top-level SeqRegion, Region symbols N1, N2
o Grammar-guided
  o Grammar built from the algebra operators
key insight
o Reduce learning task for an expression to learning tasks for its sub-expressions
o Examples: Learn Map(λx: F, S)
  o Learn the scalar expression F
  o Learn the sequence expression S
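The reduction can be illustrated with a toy learner. The sketch below is not the FlashExtract algorithm: it enumerates two small, hypothetical candidate pools (`LINE_PREDICATES`, `POSITIONS`) instead of deriving candidates from a DSL grammar, it uses positive examples only, and it does no ranking. It only shows how learning a Map splits into learning the scalar expression F and the sequence expression S independently.

```python
from typing import List, Tuple

Span = Tuple[int, int]

# Hypothetical candidate pools; a real synthesizer would derive these from the DSL.
LINE_PREDICATES = {
    "ends_with_WA": lambda line: line.endswith("WA"),
    "contains_comma": lambda line: "," in line,
    "starts_with_digit": lambda line: line[:1].isdigit(),
}
POSITIONS = {
    "line_start": lambda line: 0,
    "first_comma": lambda line: line.find(","),
    "line_end": lambda line: len(line),
}


def learn_scalar(examples: List[Tuple[str, Span]]) -> List[Tuple[str, str]]:
    """Learn F: pairs of position functions consistent with every (line, span) example."""
    return [(s, e) for s in POSITIONS for e in POSITIONS
            if all((POSITIONS[s](line), POSITIONS[e](line)) == span
                   for line, span in examples)]


def learn_sequence(example_lines: List[str]) -> List[str]:
    """Learn S: line predicates that select every example line."""
    return [name for name, pred in LINE_PREDICATES.items()
            if all(pred(line) for line in example_lines)]


def learn_map(lines: List[str], examples: List[Tuple[int, Span]]):
    """Learn Map(F, S) by solving the two sub-problems and combining the results."""
    witnesses = [(lines[i], span) for i, span in examples]
    scalars = learn_scalar(witnesses)
    sequences = learn_sequence([line for line, _ in witnesses])
    return [(f, s) for f in scalars for s in sequences]


def run_map(program, lines: List[str]) -> List[str]:
    """Execute a learned program: filter the lines, then cut out each span."""
    (start, end), pred = program
    return [line[POSITIONS[start](line):POSITIONS[end](line)]
            for line in lines if LINE_PREDICATES[pred](line)]


LINES = ["Ana Trujillo", "357 21st Place SE", "Redmond, WA",
         "Antonio Moreno", "515 93rd Lane", "Renton, WA"]  # made-up sample
programs = learn_map(LINES, [(2, (0, 7))])  # one example: "Redmond" on line 2
print(programs[0], run_map(programs[0], LINES))
```

With a single example, more than one program stays consistent, which is where ranking and further user interaction come in.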
instantiations
o Text files
o Web pages
o Spreadsheets
demo
evaluation
o Can FlashExtract extract data from real-world files?
o How many interactions typically required?
o How efficient/real-time is FlashExtract?
expressiveness
o Can FlashExtract extract data from real-world files?
o How many interactions typically required?
o How efficient/real-time is FlashExtract?
benchmarks
o 25 text files
  o System log files
  o Copied texts from web pages and PDFs
  o Samples from Pro Perl Parsing
o 25 webpages from [1]
  o Add two more test cases for each web page
o 25 spreadsheets
  o 7 from [2] that are applicable for extracting
  o 18 from EUSES corpus
[1] E. Oro, M. Ruffolo, and S. Staab. SXPath: extending XPath towards spatial querying on web documents. Proc. VLDB Endow., 2010.
[2] B. Harris and S. Gulwani. Spreadsheet table transformations from examples. In PLDI, 2011.
effectiveness
o Can FlashExtract extract data from real-world files? Yes
o How many interactions typically required? 2.36 examples
o How efficient/real-time is FlashExtract?
efficiency
o Can FlashExtract extract data from real-world files? Yes
o How many interactions typically required? 2.36 examples
o How efficient/real-time is FlashExtract? 0.82s last interaction
conclusion
o Inductive meta-synthesis
o FlashExtract is general
  o Text file, web page, spreadsheet instantiations
o FlashExtract is practical
  o Extract real-world data, in real time, within a few examples
thank you
Questions?