
Web Crawling Challenges and Solutions at a Glance
Explore the intricacies of web crawling, including the role of crawlers, common challenges faced, techniques for maintaining page freshness, and addressing caching problems. Discover ideas to ensure the efficient retrieval and processing of web data in this informative collection.
Presentation Transcript
CS246: Web Crawling
Junghoo John Cho, UCLA
What is a Crawler?
  [Diagram of the crawl loop. Labels: initial urls, init, to visit urls, get next url, web, get page, visited urls, extract urls, web pages]
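The loop in the diagram can be sketched roughly as follows. This is a minimal illustration, not the lecture's implementation: the `crawl` function, the `LinkExtractor` helper, the `max_pages` limit, and the in-memory `FAKE_WEB` dictionary are all assumptions introduced here so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(initial_urls, fetch, max_pages=100):
    """Run the crawl loop; `fetch(url)` returns HTML text or None."""
    to_visit = deque(initial_urls)   # "to visit urls", seeded by "init"
    visited = set()                  # "visited urls"
    pages = {}                       # "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()     # "get next url"
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)            # "get page"
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:    # "extract urls"
            link = urljoin(url, href)
            if link not in visited:
                to_visit.append(link)
    return pages


# Hypothetical in-memory "web" standing in for real HTTP fetches.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a>',
}

pages = crawl(["http://example.com/"], FAKE_WEB.get)
print(sorted(pages))
```

A real crawler would replace `FAKE_WEB.get` with an HTTP client and would need the safeguards discussed on the following slides.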
Challenges
  Q: The process seems straightforward. Anything difficult?
  Is it just a matter of implementation?
  What are the issues?
Crawling Issues
  Load at the site
    Crawler should be unobtrusive to visited sites
  Load at the crawler
    Download billions of Web pages in a short time
  Page selection
    Many pages, limited resources
  Page refresh
    Refresh pages incrementally, not in batch
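One common way to keep a crawler unobtrusive is to enforce a minimum delay between requests to the same host. The sketch below assumes that approach; the `PolitenessGate` class, its method names, and the 2-second delay are hypothetical choices for illustration, not something the slides prescribe. Time is passed in explicitly so the behavior is easy to follow.

```python
from urllib.parse import urlsplit


class PolitenessGate:
    """Tracks when each host may next be contacted."""

    def __init__(self, min_delay_seconds):
        self.min_delay = min_delay_seconds
        self.next_allowed = {}  # host -> earliest time of the next request

    def wait_time(self, url, now):
        """Seconds to wait before fetching `url` at time `now`."""
        host = urlsplit(url).netloc
        return max(0.0, self.next_allowed.get(host, now) - now)

    def record_fetch(self, url, now):
        """Note a fetch at `now` so the host gets its rest period."""
        host = urlsplit(url).netloc
        self.next_allowed[host] = now + self.min_delay


gate = PolitenessGate(min_delay_seconds=2.0)
gate.record_fetch("http://example.com/a", now=10.0)
print(gate.wait_time("http://example.com/b", now=10.5))  # same host: must wait
print(gate.wait_time("http://example.org/", now=10.5))   # other host: no wait
```

In a running crawler, `now` would typically come from `time.monotonic()`, and the crawler would work on URLs from other hosts while one host is resting.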
Page Refresh
  How can we keep cached pages fresh?
  The technique can be useful for web search engines, data warehouses, etc.
  [Diagram labels: Refresh, Source, Copy]
Other Caching Problems
  Disk buffers
    Disk page, memory page buffer
  Memory hierarchy
    1st level cache, 2nd level cache, ...
  Is Web caching any different?
Main Difference
  Origination of changes
    Cache to source
    Source to cache
  Freshness requirement
    Perfect caching
    Stale caching
  Role of a cache
    Transient space: cache replacement policy
    Main data source for application
  Refresh delay
Main Difference
  Limited refresh resources
    Many independent sources
    Network bandwidth
    Computational resources
  Mainly pull model
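To make "pull model" and "limited refresh resources" concrete, here is a small sketch in which the cache must poll each source itself and can afford only a fixed number of re-fetches per round. The round-robin order, the `budget` parameter, and the `refresh_round` and `digest` helpers are assumptions made for illustration; this is a naive baseline, not a policy taken from the lecture.

```python
import hashlib
from collections import deque


def digest(content):
    """Fingerprint page content so copies can be compared cheaply."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def refresh_round(queue, cache_digests, fetch, budget):
    """Re-fetch up to `budget` URLs from the front of `queue`.

    Returns the URLs whose cached copy turned out to be stale.
    """
    changed = []
    for _ in range(min(budget, len(queue))):
        url = queue.popleft()
        new_digest = digest(fetch(url))   # pull: the cache asks the source
        if cache_digests.get(url) != new_digest:
            cache_digests[url] = new_digest
            changed.append(url)
        queue.append(url)                 # back of the line for a later round
    return changed


# Hypothetical sources; the cache starts out fully fresh.
source = {"a": "v1", "b": "v1", "c": "v1"}
cache_digests = {url: digest(body) for url, body in source.items()}
queue = deque(sorted(source))

source["c"] = "v2"  # the source changes without notifying the cache
first = refresh_round(queue, cache_digests, source.__getitem__, budget=2)
second = refresh_round(queue, cache_digests, source.__getitem__, budget=2)
print(first, second)
```

With a budget of two, the first round polls only `a` and `b`, so the change to `c` goes unnoticed until the second round. That gap between a change at the source and its discovery is the cost of pulling under a limited budget.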
Ideas?
  Q: How can we keep pages fresh?
  What ideas can we explore to refresh pages well?
  Topic of our next discussion