Advanced Techniques for Tag Attribute and Value Specific Data Scraping

Learn advanced techniques for data scraping by targeting specific tags with attributes and values in web documents. Find out how to search for specific attributes, attribute values, and combinations to extract the desired information effectively.

  • Data Scraping
  • Web Scraping
  • Tag Attributes
  • Data Extraction


Presentation Transcript


  1. Data Scraping II

  2. Finding Specific Tags

  3. Finding Tags with Specific Attributes

  Previously we showed you how to find all instances of a tag within a document using the find_all() method, but it can do more. Often you don't want all the tags, but rather only the ones with specific attributes or even specific attribute values.

  To find all instances of a tag with a specific, known attribute name, call find_all() in this form:

      tags = soup.find_all('<tag>', <attribute>=True)

  This finds all instances of <tag> that have the <attribute> attribute (regardless of its value) and ignores all others. It returns a list of Tag objects. The following would find all the images with a height attribute:

      tags = soup.find_all('img', height=True)
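
  A minimal, self-contained sketch of this pattern; the HTML snippet here is invented for illustration:

      from bs4 import BeautifulSoup

      html = """
      <html><body>
        <img src="a.png" height="100">
        <img src="b.png">
        <img src="c.png" height="50">
      </body></html>
      """
      soup = BeautifulSoup(html, 'html.parser')

      # Find only the <img> tags that carry a height attribute, whatever its value
      tags = soup.find_all('img', height=True)
      for tag in tags:
          print(tag['src'], tag['height'])  # prints: a.png 100, then c.png 50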

  4. Finding Tags with Specific Attribute Values

  If we want to find tags with specific attributes and specific attribute values, we can use the find_all() function again:

      tags = soup.find_all('<tag>', {'<attribute>':'<value>'})

  The first argument is the tag name. The second argument is a dictionary with the attribute as the key and the required attribute value as the dictionary value. This would find all the 'tr' tags whose 'id' attribute has "data" as its value:

      tags = soup.find_all('tr', {'id':'data'})

  If you're only matching on a single attribute, you can also pass it as a keyword argument:

      tags = soup.find_all('tr', id='data')
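
  A short sketch showing both forms against a made-up snippet (the duplicate id values are just for the demo):

      from bs4 import BeautifulSoup

      html = """
      <table>
        <tr id="header"><th>Year</th></tr>
        <tr id="data"><td>2021</td></tr>
        <tr id="data"><td>2022</td></tr>
      </table>
      """
      soup = BeautifulSoup(html, 'html.parser')

      # Dictionary form: attribute name as the key, required value as the value
      tags = soup.find_all('tr', {'id': 'data'})

      # Keyword-argument form: equivalent when the attribute name is a valid Python name
      tags = soup.find_all('tr', id='data')
      print(len(tags))  # 2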

  5. Searching Multiple Tags

  It's also possible to search for different tags that share the same attribute, or the same attribute and value. Instead of passing a single tag name as the first argument to find_all(), you pass a list of tag names. This finds all the 'p' and 'h1' tags that have an 'id' attribute:

      tags = soup.find_all(['p','h1'], id=True)

  This finds all the 'h1' and 'tr' tags whose 'id' attribute has "header" as its value:

      tags = soup.find_all(['h1','tr'], {'id':'header'})
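
  A sketch of both multi-tag searches, again using an invented snippet:

      from bs4 import BeautifulSoup

      html = """
      <h1 id="header">Degrees</h1>
      <p id="intro">Totals by year.</p>
      <table><tr id="header"><td>...</td></tr></table>
      """
      soup = BeautifulSoup(html, 'html.parser')

      # Any <p> or <h1> that has an id attribute at all
      tags = soup.find_all(['p', 'h1'], id=True)          # matches the h1 and the p

      # Any <h1> or <tr> whose id is exactly "header"
      tags = soup.find_all(['h1', 'tr'], {'id': 'header'})  # matches the h1 and the tr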

  6. Searching for Multiple Attributes and Values

  Since the collection of attributes and values to search for is a dictionary, we can search for additional attribute-value pairs by adding entries to the dictionary:

      tags = soup.find_all(['h1','tr','p'], {'id':'header','data':'index'})

  A tag must have all of the specified attribute-value pairs to match. If you want to allow several possible values for a single attribute, make that attribute's dictionary value a list of the acceptable values:

      tags = soup.find_all(['h1','tr','p'], {'id':['header','start'],'data':'index'})
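
  A small sketch of both variants; the tag and attribute names follow the slide, the HTML is made up:

      from bs4 import BeautifulSoup

      html = """
      <p id="header" data="index">match: has both pairs</p>
      <p id="start" data="index">match: 'start' is in the allowed id list</p>
      <p id="header">no match: missing the data attribute</p>
      """
      soup = BeautifulSoup(html, 'html.parser')

      # Both attribute-value pairs must be present for a tag to match
      tags = soup.find_all(['h1', 'tr', 'p'], {'id': 'header', 'data': 'index'})
      print(len(tags))  # 1

      # A list of values means "any of these" for that attribute
      tags = soup.find_all(['h1', 'tr', 'p'], {'id': ['header', 'start'], 'data': 'index'})
      print(len(tags))  # 2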

  7. Using Regular Expressions

  As mentioned in a previous lecture, you can use regular expressions to select tags, attributes, or values. To do so, you must first compile the regular expression so it can be used. This is done using the re.compile() function:

      re.compile(<regex string>)

  This returns a regular expression object that you can bind to a name and use repeatedly, or you can put the re.compile() expression right where you want the regex to be used.

  8. Using Regular Expressions (Examples)

      data_index = re.compile(r'data\d*')

  This creates a regular expression that matches "data" followed by zero or more digits.

      results = soup.find_all(['td','p'], {'id': re.compile(r'data\d*')})

  This searches for any 'td' or 'p' tags whose 'id' attribute value matches the regular expression.

      results = soup.find_all(['td','p','tr'], {'id':data_index,'title':data_index})

  This uses the same compiled regex multiple times, as the value for different attributes.
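
  Putting these together, a runnable sketch; the HTML and id values are invented, and note that Beautiful Soup applies the regex with search(), so it matches anywhere in the attribute value:

      import re
      from bs4 import BeautifulSoup

      html = """
      <table>
        <tr><td id="data1">6406</td><td id="data22">1128</td></tr>
        <tr><td id="other">n/a</td><td id="data">233</td></tr>
      </table>
      """
      soup = BeautifulSoup(html, 'html.parser')

      # Compile once, then reuse the same object for several attribute filters
      data_index = re.compile(r'data\d*')

      results = soup.find_all('td', {'id': data_index})
      for tag in results:
          print(tag['id'], tag.string)  # data1 6406, data22 1128, data 233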

  9. Data Scraping - Part II

  10. Reading Tables

  11. Reading Tables

  We've looked at reading arbitrary tags; now let's look specifically at reading tabular data on a web page. Imagine a table of degrees granted per year at a university:

      Academic Year    Bachelors    Masters    Doctoral    Total
      2021-2022        6406         1128       233         7767
      2020-2021        6683         959        192         7834
      2019-2020        6684         1033       212         7929
      ...
      1896-1897        1            0          0           1

  We want a list with the data from each column, plus a list of the column headings. How do we read this if it is rendered on a web page?

  12. The Table as HTML

      <table id="degrees" border="1">
        <tr><th>Academic Year</th><th>Bachelors</th>
            <th>Masters</th><th>Doctoral</th><th>Total</th></tr>
        <tr><td>2021-2022</td><td>6406</td>
            <td>1128</td><td>233</td><td>7767</td></tr>
        <tr><td>2020-2021</td><td>6683</td><td>959</td>
            <td>192</td><td>7834</td></tr>
        <tr><td>2019-2020</td><td>6684</td>
            <td>1033</td><td>212</td><td>7929</td></tr>
        ...
        <tr><td>1896-1897</td><td>1</td>
            <td>0</td><td>0</td><td>1</td></tr>
      </table>

  13. Exercise: Read the Table's Data

  • How do we find the table?
  • How do we get the column headers?
  • How do we read the data from each column/row?

  14. Exercise: Read the Table's Data (Solution)

      # soup is assumed to be a BeautifulSoup object for the page (see earlier slides)
      table = soup.find_all('table', {'id':'degrees'})[0]

      # collect the column headings from the <th> tags
      heads = table.find_all('th')
      headers = []
      for item in heads:
          headers.append(item.string)

      data = [[], [], [], [], []]  # make a list of 5 lists, one for each column
      rows = table.find_all('tr')
      for row in rows:
          # the header row has no <td> tags, so it contributes nothing here
          columns = row.find_all('td')
          index = 0
          for col in columns:
              data[index].append(col.string)
              index += 1

      for col in data:
          print(col)
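
  To see the whole exercise run end to end, here is a self-contained version that parses a shortened copy of the slide-12 HTML from a string; in a real scrape, soup would come from a downloaded page:

      from bs4 import BeautifulSoup

      html = """
      <table id="degrees" border="1">
        <tr><th>Academic Year</th><th>Bachelors</th><th>Masters</th>
            <th>Doctoral</th><th>Total</th></tr>
        <tr><td>2021-2022</td><td>6406</td><td>1128</td><td>233</td><td>7767</td></tr>
        <tr><td>2020-2021</td><td>6683</td><td>959</td><td>192</td><td>7834</td></tr>
      </table>
      """
      soup = BeautifulSoup(html, 'html.parser')

      table = soup.find_all('table', {'id': 'degrees'})[0]
      headers = [th.string for th in table.find_all('th')]
      data = [[] for _ in headers]              # one list per column
      for row in table.find_all('tr'):
          for index, col in enumerate(row.find_all('td')):
              data[index].append(col.string)

      print(headers)
      for col in data:
          print(col)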

  15. Handling Images

  16. Reading and Saving Images

  What if you want to save all the images on a page to a local directory? What information do you need to do this?

  • The URL of the image
  • The directory you want to save the image in
  • The output filename

  How do we do this?

  17. Finding the URLs

  How would we find the URLs of all the images on a page?

      images = soup.find_all('img')
      img_srcs = []
      for img in images:
          img_srcs.append(img['src'])

  Remember, these could be relative links, so you'll need to construct the full URL from the current page/domain before you try to access the images.
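
  One way to do that is urljoin() from the standard library's urllib.parse; page_url below is a stand-in for whatever page you actually scraped, and img_srcs is the list built above:

      from urllib.parse import urljoin

      page_url = 'https://www.example.edu/stats/degrees.html'  # hypothetical page

      # urljoin handles relative paths, root-relative paths, and absolute URLs alike
      full_urls = [urljoin(page_url, src) for src in img_srcs]
      # 'logo.png'      -> https://www.example.edu/stats/logo.png
      # '/img/seal.png' -> https://www.example.edu/img/seal.png
      # 'https://cdn.example.edu/a.png' stays as-is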

  18. Requesting the Images

  Once you have the URL for an image, you can just make a GET request to have the server send it to you:

      image_response = requests.get(imageURL)

  The .text attribute on the response object is not going to give us what we need. There is another attribute, .raw, that gives us the raw bytes of the data in the response. Note: to properly use the data via the .raw attribute, your GET request needs to include an additional parameter, stream=True:

      image_response = requests.get(imageURL, stream=True)

  We've got the raw data; what do we do with it?

  19. Saving Binary Data to a File

  We can use Python's copyfileobj() function (from the shutil library) to write the raw file contents directly to disk:

      import shutil

      with open(output_filename, 'wb') as out_file:
          shutil.copyfileobj(image_response.raw, out_file)
      del image_response  # this frees up the memory (optional)

  Here output_filename is the path+filename of the output file. The 'wb' argument says to open the file for writing in binary format. The copyfileobj() function takes a source of binary data and a destination; you could also use it to copy a file: open one file for reading and use it as the source, and a second for writing and use it as the destination. The del statement deletes the named object immediately instead of waiting for Python to do it, which can help to save memory.
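
  Putting slides 17 through 19 together, a minimal end-to-end sketch; the page URL and output directory are made up, and the filename choice is deliberately naive:

      import os
      import shutil
      import requests
      from urllib.parse import urljoin
      from bs4 import BeautifulSoup

      page_url = 'https://www.example.edu/gallery.html'  # hypothetical page
      out_dir = 'images'
      os.makedirs(out_dir, exist_ok=True)

      page = requests.get(page_url)
      soup = BeautifulSoup(page.text, 'html.parser')

      for img in soup.find_all('img', src=True):
          image_url = urljoin(page_url, img['src'])       # resolve relative links
          # naive filename: the last component of the URL path
          filename = os.path.join(out_dir, os.path.basename(image_url))
          response = requests.get(image_url, stream=True)  # stream=True so .raw works
          if response.status_code == 200:
              with open(filename, 'wb') as out_file:
                  shutil.copyfileobj(response.raw, out_file)
          del response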

  20. Using the Image in the Program

  You could also use the image directly in your program if you needed to. You'd access it via the response object's .content attribute, which exposes the content as binary data.

      from PIL import Image   # PIL is the library underneath byuimage
      from io import BytesIO  # a built-in Python library

      image = Image.open(BytesIO(image_response.content))
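
  For instance, once the image is open you can work with it through Pillow's ordinary API; a brief usage sketch, where image_response comes from the request on slide 18:

      from io import BytesIO
      from PIL import Image

      image = Image.open(BytesIO(image_response.content))
      print(image.size, image.format)   # e.g. (640, 480) PNG

      # normal Pillow operations work from here
      thumb = image.resize((100, 100))
      thumb.save('thumbnail.png')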
