ICEWEB 2 - High-Quality Web-Based Components for ICE Corpora

ICEWEB 2 - High-Quality Web-Based Components for ICE Corpora
Slide Note
Embed
Share

A new way of compiling high-quality web-based components for ICE Corpora, developed by Martin Weisser. This innovative tool offers improved methods and motivation for creating large-scale web corpora, with enhanced features and user control over data handling. Discover the background, methods, and motivation behind ICEWEB 2.

  • ICEWEB
  • Web Components
  • Corpora
  • Martin Weisser
  • Linguistics

Uploaded on Mar 01, 2025 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. ICEWEB 2 A NEW WAY OF COMPILING HIGH-QUALITY WEB-BASED COMPONENTS FOR ICE CORPORA Martin Weisser Center for Linguistics & Applied Linguistics, Guangdong University of Foreign Studies weissermar@hotmail.com, http://martinweisser.org

  2. OUTLINE motivation methods & issues background ICEweb ver. 1 ICEweb ver. 2 overview details

  3. METHODS & ISSUES creation of large-scale web corpora often done via seed terms and automated downloads advantage easy way to generate masses of corpus data disadvantages relative lack of control over results blind faith need for sophisticated means of identifying duplicates & boilerplate removal for smaller, more qualitative corpora, different (cyclical) approach better de-/re-fining seed terms visual inspection and selection of results download & editing

  4. MOTIVATION new efforts to enhance/augment ICE corpora own plans to carry out automated speech-act analysis on written data larger variety of public and private texts available online web scraping more convenient that other forms of procuring electronic texts, such as converting from word processing formats, PDFs, or scanning some HTML markup useful in identifying structural elements, such as headings, etc., for more advanced corpus markup & processing

  5. BACKGROUND ICEWEB VER. 1 created 2008, released in 2013 colourful ;-), but a few disadvantages not really designed around ICE categories relatively complex way of setting up data structures no support for seeding lack of user control over processing/data handling options

  6. ICEWEB VER. 2 includes original ICE categories, but can be augmented more intuitive way of selecting regions/countries & categories most relevant folders get created automatically new, user configurable, options for setting defaults, etc., via conf file processing steps separated better HTML handling, including boilerplate removal enhanced analysis features (n-grams & concordancing)

  7. DETAILS FINDING SUITABLE PAGES steps choose location & category select search engine choose seed terms launch browser + search engine collect URLs in URL editor save

  8. DETAILS RETRIEVAL steps switch to Retrieval tab click Get web pages ICEweb attempts to retrieve URLS listed in URL file, reporting success or failure creates/appends to CSV file containing original title, URL, local filename, & download time

  9. DETAILS WORKING WITH DATA (1) downloaded HTML can be converted to raw text XML (TART format) tagged text files of a specific type can be listed via relevant folder button edited in built-in editor corresponding HTML viewed in browser

  10. DETAILS WORKING WITH DATA (2) Text Corresponding HTML

  11. DETAILS N-GRAM ANALYSIS options to produce wordlists or n- grams from raw text or XML adjustable norming factor (default 1,000) raw, normed & document frequency output filterable via regex (e.g. ^[A-Z] only n-grams starting with Proper names or sentence-initial) hyperlinked n-gram triggers concordance

  12. DETAILS CONCORDANCING line-based 2 search terms possible adjustable context lines before/after relative position of second term punctuation interpolated when triggered from n-grams tab hyperlinked to editor

  13. QUESTIONS OR SUGGESTIONS?

More Related Content