Representation and Construction of Gender in Women Writers Project
Delve into the complexities of gender representation and construction in the Women Writers Project, exploring how gender interacts with racial and ethnic backgrounds. This study examines the grouping patterns based on gender and race/ethnicity, shedding light on societal categorizations. Additionally, the narratives of Margaret and Phillip Flower offer intriguing insights into historical perspectives on gender.
Presentation Transcript
Extracting Patterns and Relations from the World Wide Web
  Sergey Brin, WebDB 1998
  16 Aug 2013, SNU IDB Lab., Inhoe Lee
Outline
  Introduction
  Approach
    Duality of patterns and relations
    DIPRE Algorithm
    Pattern generation
  Example
  Experiment
  Other issues
  Conclusions
Introduction
  Google (by Larry Page and Sergey Brin)
    Ranks Web pages by link authority
    Crawls a large amount of Web data
      A small mirror of the World Wide Web
      Provides document content and hyperlink structure
  Web content mining
    What can we do with these data?
      Re-discover the information encoded by the authors
      Structure the data into more useful information
Introduction
  The World Wide Web provides a vast source of information
    Integrating information using hand-coded wrappers or filters
    Can be time-consuming
  Goal
    Discover information sources
    Extract the relevant information from them entirely automatically (with very minimal human intervention)
Constructing a Book DB
  A book is represented as a relation (title, author), extracted from Web data
    Title                          Author
    -----------------------------  ---------------
    The Robots of Dawn             Isaac Asimov
    Startide Rising                David Brin
    Chaos: Making a New Science    James Gleick
    Great Expectations             Charles Dickens
    The Comedy of Errors           W. Shakespeare
  Challenges for extracting information from the Web
    Distributed information sources
    Many different formats
Proposed Approach
  Duality of patterns and relations
    A good set of patterns -> a good set of tuples
    A good set of tuples -> a good set of patterns
  DIPRE (Dual Iterative Pattern Relation Expansion)
  Tuples (title, author)
    Title                          Author
    -----------------------------  ---------------
    The Robots of Dawn             Isaac Asimov
    Startide Rising                David Brin
    Chaos: Making a New Science    James Gleick
    Great Expectations             Charles Dickens
    The Comedy of Errors           W. Shakespeare
  Patterns found in the Web data
    <LI><B>title</B> by author (
    <I>title</I> by author (
    author || title || (
DIPRE Algorithm
  1. R' <- Sample
     Start with a small sample, R', of the target relation.
  2. O <- FindOccurrences(R', D)
     Then, find all occurrences of tuples of R' in D.
  3. P <- GenPatterns(O)
     Generate patterns based on the set of occurrences.
  4. R' <- M_D(P)
     Search the database for tuples matching any of the patterns.
  5. If R' is large enough, return. Else go to step 2.
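The five steps above can be read as a simple driver loop. The following is a minimal Python sketch of that loop, not the paper's implementation. The names `dipre`, `find_occurrences`, `gen_patterns`, `match_patterns`, `target_size`, and `max_iters` are hypothetical, the three step functions are passed in as callables, and two choices are assumptions of this sketch: new tuples are unioned with the existing ones, and the loop also stops when an iteration adds nothing.

```python
from typing import Callable, Iterable, Set, Tuple

Pair = Tuple[str, str]  # (title, author)


def dipre(
    seed: Iterable[Pair],
    documents: list,
    find_occurrences: Callable,
    gen_patterns: Callable,
    match_patterns: Callable,
    target_size: int,
    max_iters: int = 10,  # assumption: a safety cap, not part of the slide
) -> Set[Pair]:
    """Grow a relation from a small seed by alternating patterns and tuples."""
    relation = set(seed)                                        # step 1: R' <- Sample
    for _ in range(max_iters):
        occurrences = find_occurrences(relation, documents)     # step 2
        patterns = gen_patterns(occurrences)                    # step 3
        found = set(match_patterns(patterns, documents))        # step 4
        grown = relation | found  # assumption: keep tuples already known
        if len(grown) >= target_size or grown == relation:      # step 5
            return grown
        relation = grown
    return relation
```

Passing the step functions as arguments keeps the loop independent of how occurrences and patterns are represented, so each step can be swapped out or tested separately.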
Representation
  A book
    (title, author)
  Occurrences of books
    (author, title, order, url, prefix, middle, suffix)
  Patterns for books
    (order, urlprefix, prefix, middle, suffix)
    e.g., if order = T, a pattern matches when the url matches urlprefix* and the text matches *prefix, author, middle, title, suffix*
  Author: [A-Z][A-Za-z .,&]{5,30}[A-Za-z.]
  Title: [A-Z0-9][A-Za-z0-9 .,: #!?;&]{4,45}[A-Za-z0-9!]
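To make the pattern representation concrete, here is a small Python sketch that turns a (order, urlprefix, prefix, middle, suffix) pattern into a regular expression and applies it to one page. It assumes the author and title expressions above were written with `{m,n}` repetition braces, and the names `Pattern`, `match_pattern`, `AUTHOR_RE`, and `TITLE_RE` are hypothetical, chosen for illustration only.

```python
import re
from typing import List, NamedTuple, Tuple

# Author and title expressions from the slide, assuming {m,n} repetition counts.
AUTHOR_RE = r"[A-Z][A-Za-z .,&]{5,30}[A-Za-z.]"
TITLE_RE = r"[A-Z0-9][A-Za-z0-9 .,: #!?;&]{4,45}[A-Za-z0-9!]"


class Pattern(NamedTuple):
    order: bool      # True: author appears before title in the text
    urlprefix: str
    prefix: str
    middle: str
    suffix: str


def match_pattern(p: Pattern, url: str, text: str) -> List[Tuple[str, str]]:
    """Return the (title, author) pairs that pattern p extracts from one page."""
    if not url.startswith(p.urlprefix):          # url must match urlprefix*
        return []
    first, second = (AUTHOR_RE, TITLE_RE) if p.order else (TITLE_RE, AUTHOR_RE)
    regex = re.compile(
        re.escape(p.prefix) + "(" + first + ")"
        + re.escape(p.middle) + "(" + second + ")"
        + re.escape(p.suffix)
    )
    pairs = []
    for m in regex.finditer(text):
        a, b = m.group(1), m.group(2)
        # Always report (title, author), whatever order the text used.
        pairs.append((b, a) if p.order else (a, b))
    return pairs
```

The fixed parts of the pattern are escaped with `re.escape` so that HTML fragments such as `(` are treated as literal text rather than regex syntax.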
Pattern Generation
  GenOnePattern(O)
    1. Verify that the order and the middle are the same for all occurrences
       op.order <- order and op.middle <- middle
    2. op.urlprefix <- longest matching prefix of all the urls
    3. op.prefix <- longest matching suffix of all the prefixes
    4. op.suffix <- longest matching prefix of all the suffixes
  GenPattern(O)
    1. Split O into O1, ..., Ok by order and middle
    2. For each group Oi, p <- GenOnePattern(Oi)
       If p meets the specificity requirements, then output p
       Otherwise:
         If all o in Oi have the same URL, then reject Oi
         Else, split Oi into subgroups by the characters in their urls and repeat step 2 for these subgroups
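The two procedures above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the author's code: `Occurrence`, `Pattern`, `gen_one_pattern`, `gen_patterns`, and the `is_specific` predicate are hypothetical names, the specificity requirement is supplied by the caller, and subgroups are formed on the URL character just past the shared url prefix (one reading of "by the characters in their urls").

```python
import os
from collections import defaultdict
from typing import Callable, List, NamedTuple, Optional


class Occurrence(NamedTuple):
    title: str
    author: str
    order: bool
    url: str
    prefix: str
    middle: str
    suffix: str


class Pattern(NamedTuple):
    order: bool
    urlprefix: str
    prefix: str
    middle: str
    suffix: str


def common_prefix(strings: List[str]) -> str:
    return os.path.commonprefix(strings)  # character-wise common prefix


def common_suffix(strings: List[str]) -> str:
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def gen_one_pattern(occs: List[Occurrence]) -> Optional[Pattern]:
    first = occs[0]
    # The order and the middle must agree across all occurrences.
    if any(o.order != first.order or o.middle != first.middle for o in occs):
        return None
    return Pattern(
        order=first.order,
        urlprefix=common_prefix([o.url for o in occs]),
        prefix=common_suffix([o.prefix for o in occs]),
        middle=first.middle,
        suffix=common_prefix([o.suffix for o in occs]),
    )


def gen_patterns(occs: List[Occurrence], is_specific: Callable) -> List[Pattern]:
    groups = defaultdict(list)
    for o in occs:                              # split by order and middle
        groups[(o.order, o.middle)].append(o)
    pending, patterns = list(groups.values()), []
    while pending:
        group = pending.pop()
        p = gen_one_pattern(group)
        if is_specific(p, group):
            patterns.append(p)
        elif len({o.url for o in group}) > 1:   # different URLs: split further
            cut = len(p.urlprefix)
            subgroups = defaultdict(list)
            for o in group:
                subgroups[o.url[cut:cut + 1]].append(o)
            pending.extend(subgroups.values())
        # otherwise all occurrences share one URL and the group is rejected
    return patterns
```

Applied to the two seed occurrences on www.sff.net/locus/, `gen_one_pattern` would yield the url prefix `www.sff.net/locus/c`, the prefix `<LI><B>`, and a suffix beginning with ` (`.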
Example
  Seeds
    Title                          Author
    -----------------------------  ---------------
    The Robots of Dawn             Isaac Asimov
    Startide Rising                David Brin
    Chaos: Making a New Science    James Gleick
    Great Expectations             Charles Dickens
    The Comedy of Errors           W. Shakespeare
  Find occurrences of the seeds in the Web data (fgrep title, fgrep author)
    www.sff.net/locus/c3.html
      <LI><B>The Robots of Dawn</B> by Isaac Asimov (Bantam Spectra, Jan ’90)
    www.sff.net/locus/c5.html
      <LI><B>Startide Rising</B> by David Brin (Pulphouse, Jul ’90)
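The fgrep step in this example amounts to locating a seed title and author on the same line and recording the text around them. A rough Python sketch of that idea follows. The function name `find_occurrence` and the `context` parameter (how many characters of prefix and suffix to keep) are assumptions made for illustration, and the order flag is taken to be true when the author precedes the title.

```python
from typing import Optional


def find_occurrence(title: str, author: str, url: str, line: str,
                    context: int = 10) -> Optional[dict]:
    """Locate one (title, author) seed in a line of text and record its context."""
    t, a = line.find(title), line.find(author)
    if t < 0 or a < 0:
        return None                       # both strings must be present
    spans = sorted([(t, t + len(title)), (a, a + len(author))])
    (s1, e1), (s2, e2) = spans
    if e1 > s2:
        return None                       # overlapping matches are ignored
    return {
        "title": title,
        "author": author,
        "order": a < t,                   # True when the author comes first
        "url": url,
        "prefix": line[max(0, s1 - context):s1],
        "middle": line[e1:s2],
        "suffix": line[e2:e2 + context],
    }
```

For the first line shown above, this yields the prefix `<LI><B>`, the middle `</B> by `, and a suffix that starts with ` (`.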
Example
  Seeds: the five (title, author) pairs listed above
  Occurrences (fgrep title, fgrep author)
    www.sff.net/locus/c3.html
      <LI><B>The Robots of Dawn</B> by Isaac Asimov (Bantam Spectra, Jan ’90)
      title = The Robots of Dawn, author = Isaac Asimov, order = F, url = www.sff.net/locus/c3.html
      prefix = <LI><B>, middle = </B> by, suffix = (Bantam Spectra, Jan ’90)
    www.sff.net/locus/c5.html
      <LI><B>Startide Rising</B> by David Brin (Pulphouse, Jul ’90)
      title = Startide Rising, author = David Brin, order = F, url = www.sff.net/locus/c5.html
      prefix = <LI><B>, middle = </B> by, suffix = (Pulphouse, Jul ’90)
  Patterns produced by GenPattern()
    www.sff.net/locus/c*                    <LI><B>title</B> by author (
    dns.city-net.com/---/hugos/1984.html    <I>title</I> by author (
    dolphin.upenn.edu/---/sf-award.htm      author || title || (
  New tuples are then found by regular expression matching
Experiment of Finding Books
  Total test data
    24 million web pages totaling 147 gigabytes
  Experimental results (3 iterations)
    Iteration   Books in   Occurrences   Patterns   Books out
    ---------   --------   -----------   --------   ---------
    1           5          199           3          4047
    2           4047       3972          105        9367
    3           9367       9938          346        15257
  Quality of results
    19/20 are bona fide books
    5/20 are not found on Amazon (2.5 million books)
Other Issues
  Theoretical pattern evaluation
    Given a large database D, R is the target relation and R' is an approximation of R
    Coverage: $|R' \cap R| / |R|$
    Error rate: $|R' \setminus R| / |R'|$
    On the Web, a low error rate is more critical than high coverage
  Pattern specificity
    $\mathrm{specificity}(p) = -\log P(X \in M_D(p))$, where X is a random variable over a uniform distribution
    Estimation: $\mathrm{specificity}(p) = |middle| \cdot |urlprefix| \cdot |prefix| \cdot |suffix|$
  Performance
  Scalability
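The evaluation measures on this slide are simple set and length computations. A minimal Python sketch, assuming relations are held as sets of tuples and that the specificity estimate is the product of the four string lengths. The function names are hypothetical.

```python
from typing import Set


def coverage(approx: Set, target: Set) -> float:
    """Fraction of the target relation R that the approximation R' recovers."""
    return len(approx & target) / len(target)


def error_rate(approx: Set, target: Set) -> float:
    """Fraction of the approximation R' that is not in the target relation R."""
    return len(approx - target) / len(approx)


def specificity_estimate(middle: str, urlprefix: str, prefix: str, suffix: str) -> int:
    """Estimated specificity: product of the lengths of the pattern's strings."""
    return len(middle) * len(urlprefix) * len(prefix) * len(suffix)
```

Because the estimate is a product, a pattern with any empty component scores zero. That fits the intent of discarding patterns too general to be trusted.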
Conclusions
  Introduces an approach that iteratively extracts relations from a small seed set of tuples, using the duality of patterns and relations
  Inspired later work on bootstrapping for large-scale information extraction
  Can be applied to other domains
    Movies, music, restaurants
  Mathematical background
    [DDF+90] Scott Deerwester et al., Indexing by Latent Semantic Analysis, 1990
Conclusions
  Extracting information from the Web
    Data source
      Well known: wrapper
      Need to detect: discriminative patterns
    Data format
      Semi-structured: formatting hints, e.g., HTML tags
      Plain text: linguistic hints
    Pattern generation
      Hand-written: accurate but time-consuming
      Auto-generated: error-prone
  Dual Iterative Pattern Relation Expansion
    [Figure: Query, Back-end DB, Pattern/Template]