
Building a Semantic Parser Overnight - Data Challenges & Solutions
Explore the journey of building a semantic parser overnight, facing data challenges with limited samples and lack of critical functionality. Discover solutions for seed lexicon, domain vocabulary, and logical forms, paving the way for a powerful semantic parser.
Building a Semantic Parser Overnight
Which country has the highest CO2 emissions? Which had the highest increase since last year? What fraction is from the five countries with highest GDP?
The data problem: The main dataset (GEO880) has only 600 samples. For comparison, labeled photo datasets contain millions.
Not only quantity: The data can lack critical functionality
The process: Domain -> Seed lexicon -> Logical forms and canonical utterances -> Paraphrases -> Semantic parser
The database: Triples (e1, p, e2), where e1 and e2 are entities (e.g., article1, 2015) and p is a property (e.g., publicationDate).
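To make the triple format concrete, here is a minimal sketch of such a database in Python. The facts and the helper name `index_by_property` are illustrative assumptions, not taken from the presentation.

```python
from collections import defaultdict

# Hypothetical toy database: each fact is a triple (e1, p, e2).
TRIPLES = [
    ("article1", "publicationDate", "2015"),
    ("article2", "publicationDate", "2010"),
    ("article1", "cites", "article2"),
    ("article1", "author", "alice"),
]


def index_by_property(triples):
    """Group the (e1, e2) pairs under their property p."""
    index = defaultdict(set)
    for e1, p, e2 in triples:
        index[p].add((e1, e2))
    return index


KB = index_by_property(TRIPLES)
print(sorted(KB["publicationDate"]))
```

Indexing by property means each property can be treated directly as a set of entity pairs, which is the view lambda DCS takes later in the deck.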
Seed lexicon: For every property p, a lexical entry of the form <t -> s[p]>, where t is a natural language phrase and s is a syntactic category, e.g., <publication date -> RELNP[publicationDate]>.
Seed lexicon: In addition, the lexicon L contains two typical entities for each semantic type in the database, e.g., <alice -> NP[alice]>.
Unaries: TYPENP, ENTITYNP, and verb phrases VP ("has a private bath"). Binaries: RELNP for functional properties (e.g., "publication date") and VP/NP for transitive verbs ("cites", "is the president of").
Grammar: Rules of the form <α1 ... αn -> s[z]>, where α1 ... αn are tokens or categories, s is a syntactic category, and z is the logical form constructed.
Grammar: <RELNP[r] of NP[x] -> NP[R(r).x]>. z: R(publicationDate).article1. c: publication date of article 1.
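The rule above builds a canonical utterance and a logical form in lockstep. A minimal sketch of applying it, assuming a hypothetical `Derivation` record and helper `apply_relnp_of_np` (neither name comes from the presentation):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Derivation:
    category: str      # syntactic category s, e.g. "NP"
    utterance: str     # canonical utterance c
    logical_form: str  # logical form z


def apply_relnp_of_np(relnp, np_):
    """Apply the rule <RELNP[r] of NP[x] -> NP[R(r).x]>."""
    if relnp.category != "RELNP" or np_.category != "NP":
        raise ValueError("rule expects a RELNP followed by an NP")
    return Derivation(
        category="NP",
        utterance=f"{relnp.utterance} of {np_.utterance}",
        logical_form=f"R({relnp.logical_form}).{np_.logical_form}",
    )


# Seed lexicon entries, written as derivations of length one.
relnp = Derivation("RELNP", "publication date", "publicationDate")
np_ = Derivation("NP", "article 1", "article1")

result = apply_relnp_of_np(relnp, np_)
print(result.utterance)     # canonical utterance c
print(result.logical_form)  # logical form z
```

Logical forms are kept as plain strings here for brevity; a fuller implementation would probably use a tree structure so they can be executed.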
Crowdsourcing: x: "when was article 1 published?" D = {(x, c, z)} for each (z, c) ∈ GEN(G ∪ L) and x ∈ P(c), where P(c) is the set of paraphrases of c.
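The construction of D can be sketched as a simple cross product of generated pairs and their paraphrases. The function name `build_dataset` and the second paraphrase below are assumptions made for illustration.

```python
# Output of GEN(G ∪ L): pairs (z, c) of logical form and canonical utterance.
generated = [
    ("R(publicationDate).article1", "publication date of article 1"),
]

# P(c): crowdsourced paraphrases per canonical utterance.
# The second paraphrase is a made-up example, not from the presentation.
paraphrases = {
    "publication date of article 1": [
        "when was article 1 published?",
        "what is the publication date of article 1?",
    ],
}


def build_dataset(generated, paraphrases):
    """Return D = [(x, c, z)] for each (z, c) and each paraphrase x in P(c)."""
    return [
        (x, c, z)
        for z, c in generated
        for x in paraphrases.get(c, [])
    ]


D = build_dataset(generated, paraphrases)
for x, c, z in D:
    print(x, "|", c, "|", z)
```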
Training: log-linear distribution pθ(z, c | x, w)
Lambda DCS: An entity denotes the singleton set {e}. A property denotes a set of pairs (e1, e2).
Lambda DCS: A binary b and a unary u can be joined: b.u denotes the entities e1 for which some e2 ∈ [u] satisfies (e1, e2) ∈ [b].
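Lambda DCS: Set operations on unaries: negation ¬u, and the intersection and union of two unaries u1 and u2 (u1 ⊓ u2, u1 ⊔ u2).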
Lambda DCS: Reversal R(b): (e1, e2) ∈ [b] -> (e2, e1) ∈ [R(b)]
Lambda DCS: Aggregation and superlatives: count(u), sum(u), average(u, b), argmax(u, b)
Lambda DCS: λx.u denotes the set of pairs (e1, e2) such that e1 ∈ [u[x/e2]]w. Example: R(λx.count(R(cites).x)) denotes the pairs (e1, e2) where e2 is the number of entities that e1 cites.
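The lambda DCS operators can be read as ordinary set operations. The sketch below evaluates a few of them over a made-up toy world. The data, the function names, and the reading of argmax(u, b) as "members of u with the largest b value" are assumptions for illustration only.

```python
# Hypothetical toy world: each property denotes a set of pairs (e1, e2).
WORLD = {
    "cites": {("article1", "article2"), ("article1", "article3"),
              ("article2", "article3")},
    "publicationDate": {("article1", 2015), ("article2", 2010),
                        ("article3", 2005)},
}
ENTITIES = {"article1", "article2", "article3"}


def join(b, u):
    """b.u: entities e1 such that (e1, e2) is in b for some e2 in u."""
    return {e1 for (e1, e2) in b if e2 in u}


def reverse(b):
    """R(b): swap the two arguments of every pair in b."""
    return {(e2, e1) for (e1, e2) in b}


def count(u):
    """count(u): number of entities in the unary u."""
    return len(u)


def argmax(u, b):
    """Members of u whose b value is largest (assumed reading)."""
    values = {e1: v for (e1, v) in b if e1 in u}
    best = max(values.values())
    return {e for e, v in values.items() if v == best}


cites = WORLD["cites"]
# R(publicationDate).article1
pub_date = join(reverse(WORLD["publicationDate"]), {"article1"})
# count(R(cites).article1)
num_cited = count(join(reverse(cites), {"article1"}))
# argmax(article, publicationDate), i.e. "newest article"
newest = argmax(ENTITIES, WORLD["publicationDate"])
# R(λx.count(R(cites).x)): pairs (e1, number of entities e1 cites)
cite_counts = {(e, count(join(reverse(cites), {e}))) for e in ENTITIES}
print(pub_date, num_cited, newest, sorted(cite_counts))
```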
Seed lexicon: "article", "publication date", "cites", "won an award"
Grammar Assumption 1 (Canonical compositionality): Using a small grammar, all logical forms expressible in natural language can be realized compositionally based on the logical form.
Grammar: Functionality-driven. It generates superlatives, comparatives, negation, and coordination.
Grammar: From the seed (types, entities, and properties), build noun phrases (NP), verb phrases (VP), and complementizer phrases (CP), e.g., "that cites Building a Semantic Parser Overnight", "that cites more than three article".
Paraphrasing: "meeting whose attendee is alice" -> "meeting with alice". "author of article 1" -> "who wrote article 1". "player whose number of points is 15" -> "player who scored 15 points".
Paraphrasing: "article that has the largest publication date" -> "newest article". "housing unit whose housing type is apartment" -> "apartment". "university of student alice whose field of study is music" -> "At which university did Alice study music?", "Which university did Alice attend?"
Sublexical compositionality: "parent of alice whose gender is female" -> "mother of alice". "person that is author of paper whose author is X" -> "co-author of X". "person whose birthdate is birthdate of X" -> "person born on the same day as X". "meeting whose start time is 3pm and whose end time is 5pm" -> "meetings between 3pm and 5pm". "that allows cats and that allows dogs" -> "that allows pets". "author of article that article whose author is X cites" -> "who does X cite".
Crowdsourcing in numbers: Each turker paraphrased 4 utterances. 28 seconds on average per paraphrase. 38,360 responses. 26,098 examples remained.
Paraphrasing: 17% noise in the data, e.g., ("player that has the least number of team" -> "player with the lowest jersey number") and ("restaurant whose star rating is 3 stars" -> "hotel which has a 3 star rating").
Model and Learning: Numbers, dates, and database entities in the utterance x are detected first.
Model and Learning: For candidates (z, c) ∈ GEN(G ∪ Lx): pθ(z, c | x, w) ∝ exp(φ(c, z, x, w)ᵀθ)
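The log-linear model scores each candidate (z, c) by a dot product of features and weights, then normalizes over all candidates. A minimal sketch, where the feature names and weight values are invented for illustration:

```python
import math


def log_linear_probs(feature_vectors, theta):
    """Return p(candidate) ∝ exp(phi · theta) over a list of candidates.

    feature_vectors: one sparse feature dict phi per candidate (z, c).
    theta: sparse weight dict; missing weights count as 0.
    """
    scores = [
        sum(theta.get(name, 0.0) * value for name, value in phi.items())
        for phi in feature_vectors
    ]
    # Subtract the max score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical features for two candidates of the same utterance x.
candidates = [
    {"word_overlap": 3.0, "length_diff": 1.0},
    {"word_overlap": 1.0, "length_diff": 4.0},
]
theta = {"word_overlap": 0.5, "length_diff": -0.2}

probs = log_linear_probs(candidates, theta)
print(probs)
```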
Model and Learning Features
Model and Learning: Maximize the regularized log-likelihood O(θ) = Σ over (x, c, z) ∈ D of log pθ(z, c | x, w) − λ‖θ‖1, using AdaGrad (Duchi et al., 2010)
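As a rough illustration of the learning step, the sketch below performs AdaGrad updates on the log-likelihood of one training example. It is a simplification under stated assumptions: the L1 penalty is omitted, the features are invented, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def candidate_probs(candidates, theta):
    """Log-linear probabilities over candidate feature dicts."""
    scores = [sum(theta[k] * v for k, v in phi.items()) for phi in candidates]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adagrad_step(candidates, gold_index, theta, sum_sq, lr=0.1, eps=1e-8):
    """One AdaGrad ascent step on log p(gold | x). The L1 term is omitted."""
    probs = candidate_probs(candidates, theta)
    # Gradient of the log-likelihood: phi(gold) minus expected phi.
    grad = defaultdict(float)
    for name, value in candidates[gold_index].items():
        grad[name] += value
    for p, phi in zip(probs, candidates):
        for name, value in phi.items():
            grad[name] -= p * value
    # Per-feature step size shrinks with the accumulated squared gradient.
    for name, g in grad.items():
        sum_sq[name] += g * g
        theta[name] += lr * g / (math.sqrt(sum_sq[name]) + eps)


# Hypothetical example: candidate 0 is the correct (z, c) pair.
candidates = [{"match": 1.0}, {"length_diff": 1.0}]
theta = defaultdict(float)
sum_sq = defaultdict(float)
for _ in range(20):
    adagrad_step(candidates, 0, theta, sum_sq)

final_probs = candidate_probs(candidates, theta)
print(final_probs)
```

Handling the λ‖θ‖1 term would need an extra step, such as a subgradient or a proximal update, which is left out here.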