Distributed Web Crawler and Hadoop-based Research

Explore the Hadoop-based Distributed Web Crawler presented by Zhen-Feng Shi of the Search Engine Group at the Research Center of Intelligent Internet of Things (IIoT), Shanghai Jiaotong University. The presentation covers the motivation behind the project, its current progress, and future work, including more paper sources and automation for an academic search engine. It also describes how Hadoop is used across the processing nodes and why open-source software helps build reliable, scalable distributed computing.

  • Web Crawler
  • Hadoop
  • Distributed Computing
  • Search Engine
  • Academic Research

Presentation Transcript


  1. Hadoop-based Distributed Web Crawler. Zhen-Feng Shi, Search Engine Group, Research Center of Intelligent Internet of Things (IIoT), Department of Electronic Engineering, Shanghai Jiaotong Univ.

  2. Look Ahead
     o Motivation
     o Brief Introduction to Hadoop
     o Distributed Web Crawler
     o Current Work
     o Future Work

  3. Motivation
     An academic search engine needs data. How can that data be fetched from sources such as IEEE? This calls for a systematic crawler.

  4. Motivation: Why not Nutch?
     o Nutch is a well-matured, production-ready web crawler, great for batch processing
     o But two-thirds of the system is designed for a general search engine
     o It does not help with accurate extraction: instead of the HTML title, we need paper titles, authors, abstracts, DOIs, etc. (see the extraction sketch after this slide)
     o Secondary development == complete destruction: adapting Nutch would amount to rewriting most of it
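The following is a minimal sketch, not part of the original slides, of what accurate extraction means here: pulling paper-level metadata out of a publisher page rather than the generic HTML title. It assumes the jsoup HTML parser is available and that the page exposes Highwire-style citation meta tags (citation_title, citation_author, citation_doi), which many publisher sites use; neither assumption is stated in the presentation.

    // Hedged sketch: extract paper metadata with jsoup, assuming
    // Highwire-style citation meta tags are present on the page.
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class PaperMetadataExtractor {

        public static void printMetadata(String html) {
            Document doc = Jsoup.parse(html);

            // Paper-level fields a generic crawler would miss.
            String title = doc.select("meta[name=citation_title]").attr("content");
            String doi = doc.select("meta[name=citation_doi]").attr("content");
            System.out.println("Paper title: " + title);
            System.out.println("DOI: " + doi);
            for (Element author : doc.select("meta[name=citation_author]")) {
                System.out.println("Author: " + author.attr("content"));
            }

            // The plain HTML <title> usually carries site branding, not the
            // clean paper title -- this is what Nutch would index by default.
            System.out.println("HTML title: " + doc.title());
        }
    }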

  5. Hadoop: open-source software for reliable, scalable, distributed computing. Four modules:
     o Hadoop Common
     o Hadoop Distributed File System (HDFS)
     o Hadoop YARN
     o Hadoop MapReduce: Map() handles filtering and sorting, Reduce() performs the summary operation
     A web crawler is not a common application of Hadoop:
     o Zero input, infinite output
     o Most of the work is done in Map() (see the map-only job sketch after this slide)
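As a concrete illustration of "most of the work is done in Map()", here is a minimal sketch of a map-only Hadoop job (reducers disabled) whose mapper fetches each URL it is given. This is an assumption-laden toy, not the group's actual crawler: the line-per-URL input, the raw-HTML output, and the simple java.net.URL fetch are all placeholders.

    // Minimal map-only Hadoop job: each input line is one URL to fetch.
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.util.Scanner;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class FetchJob {

        // Downloads each URL and emits (url, rawHtml); failures are skipped.
        public static class FetchMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String url = value.toString().trim();
                if (url.isEmpty()) return;
                try (InputStream in = new URL(url).openStream();
                     Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
                    context.write(new Text(url), new Text(s.hasNext() ? s.next() : ""));
                } catch (Exception e) {
                    // In a sketch we simply drop URLs that fail to download.
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "fetch-only crawler");
            job.setJarByClass(FetchJob.class);
            job.setMapperClass(FetchMapper.class);
            job.setNumReduceTasks(0);  // map-only: all the work happens in Map()
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // seed URL list
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // fetched pages
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }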

  6. Distributed Web Crawler (architecture). Components: a crawled list; a Watch Dog (job scheduler); Processing Node 1 through Processing Node N running on Hadoop; a data warehouse; and paper sources including IEEE, ACM, MAG, and Arxiv. An illustrative scheduler skeleton follows after this slide.
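The slides do not show code for the Watch Dog, so the skeleton below is only a guess at its role: periodically check whether the crawl job for each source (IEEE, ACM, MAG, Arxiv) is still running and resubmit it if not. The CrawlJob interface and its methods are hypothetical placeholders, not part of the original system.

    // Hypothetical Watch Dog (job scheduler) loop: keep every source's crawl alive.
    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class WatchDog {

        // Placeholder handle for one source's crawl; the real system would wrap
        // whatever Hadoop job-tracking API the scheduler talks to.
        public interface CrawlJob {
            String source();      // e.g. "IEEE", "ACM", "MAG", "Arxiv"
            boolean isRunning();
            void submit();
        }

        // Every five minutes, resubmit any crawl job that has stopped.
        public static void watch(List<CrawlJob> jobs) {
            ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
            timer.scheduleAtFixedRate(() -> {
                for (CrawlJob job : jobs) {
                    if (!job.isRunning()) {
                        System.out.println("Restarting crawl for " + job.source());
                        job.submit();
                    }
                }
            }, 0, 5, TimeUnit.MINUTES);
        }
    }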

  7. Current Work
     Cluster with Hadoop (a minimal configuration sketch follows after this slide):
     o One master
     o Two slaves
     A running crawler on Hadoop.
     Efficiency comparison:
     o 2 cores + multithreading: 1.92 papers/sec
     o Hadoop: 6.5 papers/sec (roughly a 3.4x speedup)
     Production:
     o CiNii Articles: 1 million papers
     o ERIC: 1.3 million papers
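For reference, a one-master, two-slave Hadoop 2.x cluster is typically wired together with a slaves list and a shared filesystem URI roughly like the sketch below. The host names master, slave1, and slave2 and the port are placeholders, not details taken from the presentation.

    etc/hadoop/slaves (on the master, one slave hostname per line):

        slave1
        slave2

    etc/hadoop/core-site.xml (on every node, pointing HDFS at the master):

        <configuration>
          <property>
            <name>fs.defaultFS</name>
            <value>hdfs://master:9000</value>
          </property>
        </configuration>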

  8. Future Work
     More sources:
     o SAO/NASA
     o Europe PMC
     o CiteSeerX
     o ...
     Automation and surveillance:
     o Little manual intervention
     o Always up to date

  9. Thanks. Q&A
