CS242 Project Fall 2016 Overview
This presentation covers the CS242 Project for Fall 2016: crawling resources, Lucene, Hadoop, web-page parsing with Jsoup, and crawler ethics. It details the deliverables for project Parts A and B (due on different dates), along with strategies, limitations, and solutions, and highlights text-analyzer choices, Lucene/Hadoop integration, and instructions for usage. It also offers guidance on crawling web pages, handling the collected data, and ethical crawling practices.
Presentation Transcript
CS242 Project Fall 2016. Presented by Nhat Le. Additional thanks to Mohiuddin Abdul Qader, Waleed Amjad, Matthew Wiley, Shiwen Cheng, and Eduardo Ruiz for slides.
Agenda
- Project overview
- Crawler resources
- Lucene
- Hadoop
Project Overview: Part A, due 11/02
Crawler:
- 5-10 GB of data; how many items?
- Architecture
- Strategy
- Difficulties & solutions, limitations
- Instructions for usage
Lucene:
- Index fields -> why?
- Text analyzer choices
- Run time of index construction
- Limitations
- Instructions for usage
Project Overview: Part B, due 11/30
Hadoop:
- Description of Hadoop jobs
- Run time of index construction
- Ranking with the Hadoop index
- Difficulties & solutions, limitations
- Instructions for usage
Web UI:
- Minimal: a search box and a results section
- Option for Lucene/Hadoop: integration with the Hadoop index, or Lucene QueryParser integration
- Limitations
- Instructions for usage and a screenshot
Crawling Resources
- Webpages: Jsoup (https://jsoup.org/), or anything else similar, just not a full crawler
- Twitter: the Twitter Streaming API (https://dev.twitter.com/streaming/overview) and Twitter4j (http://twitter4j.org/en/index.html)
Crawling Webpages
The crawl loop runs against a Frontier of URLs (e.g., www.cs.ucr.edu, www.cs.ucr.edu/~vagelis):
1. getNext(): take the next URL from the Frontier and download the contents of the page.
2. Parse the downloaded file to extract the links on the page.
3. Clean and normalize the extracted links.
4. addAll(List<URL>): store the extracted links back in the Frontier.
Crawling Webpages with Jsoup (https://jsoup.org/)
- Parse HTML from a URL, file, or string.
- Find and extract data using DOM traversal or CSS selectors.
- No need to worry about the details of parsing HTML files, such as unclosed tags (e.g., <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>).
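For illustration, here is a minimal sketch of the crawl loop (steps 1-4 above) using Jsoup. The Frontier is just an in-memory queue, the normalization is simplified, and the seed URL, user-agent string, and page limit are arbitrary choices for this sketch, not requirements.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class SimpleCrawler {
    public static void main(String[] args) {
        Queue<String> frontier = new ArrayDeque<>();   // the Frontier from the diagram
        Set<String> seen = new HashSet<>();            // avoid re-crawling the same URL
        frontier.add("http://www.cs.ucr.edu");

        while (!frontier.isEmpty() && seen.size() < 100) {
            String url = frontier.poll();              // 1. getNext(): take the next URL
            if (!seen.add(url)) continue;

            try {
                // 1./2. Download and parse the page in one step
                Document doc = Jsoup.connect(url).userAgent("cs242-crawler").get();

                // TODO: save doc.title() and doc.text() to disk for indexing later

                // 2./3. Extract links, then clean/normalize them to absolute URLs
                for (Element a : doc.select("a[href]")) {
                    String link = a.absUrl("href");    // resolves relative links
                    if (link.startsWith("http") && !seen.contains(link)) {
                        frontier.add(link);            // 4. store back in the Frontier
                    }
                }
            } catch (Exception e) {
                // Skip pages that fail to download or parse
            }
        }
    }
}
```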
Crawler Ethics
Some websites don't want crawlers swarming all over them. Why?
- It increases load on the server
- Private websites
- Dynamic websites
How does the website tell you (the crawler) whether and what is off limits? Two options:
- Site-wide restrictions: robots.txt
- Webpage-specific restrictions: a meta tag
Crawler Ethics: robots.txt
A file called robots.txt in the root directory of the website. Example: http://www.about.com/robots.txt
Format:
User-Agent: <crawler name>
Disallow: <paths not to follow>
Allow: <paths that can be followed>
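As a concrete (hypothetical) example, a robots.txt that keeps everyone out of /private/ but lets a crawler named cs242bot into /public/ could look like this:

```
User-Agent: cs242bot
Allow: /public/
Disallow: /private/

User-Agent: *
Disallow: /private/
```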
Crawler Ethics: Webpage-Specific Meta Tags
Some webpages have one of the following meta-tag entries:
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
<META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
Options: INDEX or NOINDEX, FOLLOW or NOFOLLOW.
Twitter Data Collecting
- Collect through the Twitter Streaming API (not the REST API): https://dev.twitter.com/docs/platform-objects/tweets, where you can check the data schema.
- Rate limit: you get up to 1% of the whole Twitter traffic, so you can collect about 4.3M tweets per day (about 2 GB).
- You need a Twitter account for that; check https://dev.twitter.com/.
Third-Party Library: Twitter4j
- You can also find library support for other languages.
- Well documented, with code examples, e.g., http://twitter4j.org/en/code-examples.html
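As an illustration, a minimal Twitter4j streaming sketch might look like the following; the four OAuth values are placeholders for the keys you create with your own Twitter developer account, and writing the tweets to a file is left as a comment.

```java
import twitter4j.Status;
import twitter4j.StatusAdapter;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;
import twitter4j.conf.ConfigurationBuilder;

public class StreamCollector {
    public static void main(String[] args) {
        ConfigurationBuilder cb = new ConfigurationBuilder()
            .setOAuthConsumerKey("YOUR_CONSUMER_KEY")              // placeholders:
            .setOAuthConsumerSecret("YOUR_CONSUMER_SECRET")        // use your own app keys
            .setOAuthAccessToken("YOUR_ACCESS_TOKEN")
            .setOAuthAccessTokenSecret("YOUR_ACCESS_TOKEN_SECRET");

        TwitterStream stream = new TwitterStreamFactory(cb.build()).getInstance();
        stream.addListener(new StatusAdapter() {
            @Override
            public void onStatus(Status status) {
                // Append the tweet to your data file here (e.g., as JSON or TSV)
                System.out.println(status.getUser().getScreenName() + ": " + status.getText());
            }
        });
        stream.sample();   // ~1% random sample of the public stream
    }
}
```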
Important Fields
At a minimum, you should save the following fields:
- Text
- Timestamp
- Geolocation
- User of the tweet
- Links
- Hashtags
A sketch of extracting these from a tweet follows below.
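Inside the onStatus callback from the previous sketch, those fields can be pulled out of a twitter4j.Status roughly as below; the tab-separated record format is just one possible choice.

```java
// Sketch: extracting the required fields from a twitter4j.Status inside onStatus()
static String toRecord(twitter4j.Status status) {
    StringBuilder links = new StringBuilder();
    for (twitter4j.URLEntity u : status.getURLEntities()) {
        links.append(u.getExpandedURL()).append(' ');
    }
    StringBuilder tags = new StringBuilder();
    for (twitter4j.HashtagEntity h : status.getHashtagEntities()) {
        tags.append(h.getText()).append(' ');
    }
    twitter4j.GeoLocation geo = status.getGeoLocation();        // often null
    return String.join("\t",
        status.getText().replace('\t', ' ').replace('\n', ' '), // text
        String.valueOf(status.getCreatedAt().getTime()),        // timestamp
        geo == null ? "" : geo.getLatitude() + "," + geo.getLongitude(),
        status.getUser().getScreenName(),                       // user
        links.toString().trim(),                                // links
        tags.toString().trim());                                // hashtags
}
```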
Last Words about the Crawler
Don't forget to mention in bold what you have done extra to earn bonus points, e.g.:
- Multi-threading
- Duplicate page handling
- Efficient downloading of the links in tweets
- Etc.
http://lucene.apache.org/core/
http://lucene.apache.org/pylucene/
Note: only use Lucene Core (or PyLucene); Solr is not acceptable.
Lucene
Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
- Multiple indexes
- Ranked searching: best results returned first
- Query parser
- Fast and efficient
- Latest version: 6.2.1
Lucene in a Search System
Indexing side: raw content -> acquire content -> build document -> analyze document -> index document -> index.
Search side: users -> search UI -> build query -> run query (against the index) -> render results.
(Image from a Chris Manning and Pandu Nayak slide.)
Lucene: An Information Retrieval Library
- Documents: e.g., a webpage, a tweet, an image
- Fields:
  - Webpage: title, content, URL, metadata, ...
  - Tweets: content, hashtags, URL, location, time, ...
  - Images: name, alternate text, URL, ...
- Indexing: index documents from the data; call IndexWriter
- Searching: create a query and search with the created index; call IndexSearcher
Build Index / Searching: example code walkthrough (shown on the original slides; minimal sketches are given below).
Creating an Index
- Analyzer: controls tokenization, stemming, and stop words. StandardAnalyzer is sufficient in most cases.
- Directory: where to save the index (RAMDirectory or FSDirectory).
- IndexWriter: add Documents.
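A minimal Lucene 6.2.1 indexing sketch; the index path ("index") and the field names (url, title, content) are example choices, not required ones.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class Indexer {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();           // tokenization, stop words
        Directory dir = FSDirectory.open(Paths.get("index"));         // on-disk index directory
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));

        Document doc = new Document();
        doc.add(new StringField("url", "http://www.cs.ucr.edu", Field.Store.YES)); // not tokenized
        doc.add(new TextField("title", "UCR Computer Science", Field.Store.YES));  // tokenized
        doc.add(new TextField("content", "page text goes here ...", Field.Store.YES));
        writer.addDocument(doc);

        writer.close();
    }
}
```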
Searching
- Read the index
- QueryParser (with an Analyzer)
- IndexSearcher
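A matching search sketch that reads the index created above; it assumes the same "content" field and analyzer used at index time.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class Searcher {
    public static void main(String[] args) throws Exception {
        DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")));
        IndexSearcher searcher = new IndexSearcher(reader);

        // Use the same analyzer at query time as at index time
        QueryParser parser = new QueryParser("content", new StandardAnalyzer());
        Query query = parser.parse("computer science");

        TopDocs results = searcher.search(query, 10);          // top 10 hits, ranked
        for (ScoreDoc hit : results.scoreDocs) {
            Document doc = searcher.doc(hit.doc);
            System.out.println(hit.score + "  " + doc.get("url") + "  " + doc.get("title"));
        }
        reader.close();
    }
}
```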
Dependencies for Running This Example
- Lucene 6.2.1: lucene-core, lucene-analyzers-common, lucene-queryparser
- Java 8 (Java 7 or older did not work for me)
https://lucene.apache.org/core/6_2_1/core/overview-summary.html#overview.description
Query Semantics
- Basic query
- Boolean queries
- Field/term boosting
- Proximity search
- Range search
- Wildcard search
Examples of the QueryParser syntax for each are shown below.
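For reference, here are hypothetical examples of the classic QueryParser syntax for each of these (the field names are made up):

```
title:lucene                          basic field/term query
lucene AND (hadoop OR crawler)        boolean query
title:lucene^4 content:lucene         boosting: title matches weigh 4x more
"information retrieval"~5             proximity: the two terms within 5 positions
timestamp:[20160101 TO 20161231]      range search on a field
luc*ne                                wildcard search ('?' matches one character, '*' many)
```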
Lucene Scoring
The classic Lucene scoring model is a variant of TF-IDF (in Lucene 6.x the default similarity switched to BM25). Lucene also supports the Vector Space Model, BM25, and language models:
http://lucene.apache.org/core/6_2_1/core/org/apache/lucene/search/similarities/package-summary.html
Custom scoring is also available by subclassing the default similarity, e.g., to make IDF unimportant, or to make more matches in the title count for more.
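As a sketch of swapping the scoring model, Lucene 6.x lets you set a Similarity (e.g., BM25Similarity) on both the IndexWriterConfig and the IndexSearcher; for custom scoring along the lines of the slide, you would instead subclass the TF-IDF similarity (DefaultSimilarity in older versions, ClassicSimilarity in 6.x) and override methods such as idf().

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class Bm25Setup {
    public static void main(String[] args) throws Exception {
        // Index time: pick the similarity in the IndexWriterConfig
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        config.setSimilarity(new BM25Similarity());
        // ... open an IndexWriter with this config and add documents as before ...

        // Search time: set the same similarity on the IndexSearcher
        IndexSearcher searcher = new IndexSearcher(
            DirectoryReader.open(FSDirectory.open(Paths.get("index"))));
        searcher.setSimilarity(new BM25Similarity());
    }
}
```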
Other Details in the Lucene API
How to choose: the fields, the analyzer, and the scoring function.
Hadoop@dblab
Several slides authored by Eduardo J. Ruiz.
Hadoop
Fault-tolerant storage for big data: the Hadoop Distributed File System (HDFS)
- Replication on multiple machines (asynchronous)
- Parallel I/O (no RAID)
- I/O balancing
A distributed system for analyzing large quantities of data
- Parallel processing (MapReduce)
- Data-aware job scheduling
- Failure handling (tracking, restarting)
Source: Intro to Hadoop (Eric Wendeil)
Working with HDFS
- Start here: hdfs dfs -ls /users/eruiz009
- Copy to HDFS: hdfs dfs -copyFromLocal tweets.txt /users/eruiz009
- Check the file: hdfs dfs -ls /users/eruiz009/tweets.txt
- Tail: hdfs dfs -tail /users/eruiz009/tweets.txt
- Move to local: hdfs dfs -copyToLocal /users/eruiz009/tweets.txt .
- Remove: hdfs dfs -rm /users/eruiz009/tweets.txt
HDFS Java API
// Get the default file system instance
FileSystem fs = FileSystem.get(new Configuration());
// Or get a file system instance from a URI
fs = FileSystem.get(URI.create(uri), new Configuration());
// Create, open, delete, list, ...
OutputStream out = fs.create(path, ...);
InputStream in = fs.open(path, ...);
boolean isDone = fs.delete(path, recursive);
FileStatus[] fstat = fs.listStatus(path);
A Generic Processing Pattern
- Iterate over a large number of records
- Extract something of interest (Map)
- Shuffle and sort the intermediate results
- Aggregate the intermediate results (Reduce)
- Generate the final output
(Dean and Ghemawat, OSDI 2004)
Our Hadoop Installation
- Dblabrack10-14: DataNode + TaskTracker; exact clones (6 TB, 16 GB)
- Dblabrack15: NameNode + SecondaryNameNode + JobTracker; single node (3 TB, mirrored, 64 GB)
MapReduce by Example
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Creating a MapReduce Job
Three parts:
- Mapper: emits <key, value> pairs
- Reducer: given the pairs grouped by key, performs a computation
- Driver: defines the data types, the mapper, and the reducer
Record-oriented input (e.g., one value per line); <key, value> output.
A condensed WordCount sketch follows below.
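A condensed WordCount sketch showing the three parts; the full, canonical version is in the MapReduce tutorial linked above.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: for each input line, emit <word, 1>
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum the 1s for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: wires together the types, mapper, combiner, reducer, and input/output paths
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // sum is associative and commutative
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```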
Customizing a MapReduce Job
- Data types for keys and values
- Input format
- Output format
- Partitioning of the mapper output
- Combiners: process the mapper output in memory; the reducer can double as the combiner if the operation is associative and commutative
Running
The classpath contains the Hadoop core and its dependencies.
- Local: javac -cp $CLASSPATH WordCount.java, then run it on tweets.txt to produce tweets_counts.
- Cluster: build a jar and ship it (add your program's dependencies): hadoop jar WordCount.jar /users/eruiz009/tweets.txt /users/eruiz009/tweets_counts
See http://wiki.apache.org/hadoop/WordCount for the full code.
Best advice? Google, Google, and Google. Also Stack Overflow.