Web Scraping with Python - Getting Started
I recently started learning Python and wanted to use what I am learning to build something interesting.
That is where web scraping comes in.
I have been reading about it and how it can be done with Python. My general understanding is that web scraping means browsing the web with software, parsing the pages, and extracting the specific information we are interested in.
How do we do it? One of the most widely used Python libraries for this is "BeautifulSoup".
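To make the idea concrete, here is a minimal sketch of parsing a page with BeautifulSoup. The HTML string is made up for illustration, and the sketch uses Python's built-in "html.parser", so nothing is downloaded and no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page (an assumption for illustration).
sample_html = """
<html>
  <head><title>Tiny Bookshop</title></head>
  <body>
    <a href="/travel">Travel</a>
    <a href="/mystery">Mystery</a>
  </body>
</html>
"""

# Parse the text into a tree we can search.
soup = BeautifulSoup(sample_html, "html.parser")

# Pull out the specific pieces we care about: the page title and every link.
page_title = soup.title.get_text()
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

print(page_title)
print(links)
```

The same two steps, parse and then pick out elements, are what the rest of this post does against a real page.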
Books To Scrape is a very scrape-friendly website, so it is a good place to practice.
I have written the following web scraping code, which can also be found on my GitHub.
As with any Python file, we first import the libraries we need, including "BeautifulSoup".
Import Packages
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tabulate import tabulate
Download the HTML and Parse
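First we take the URL to scrape, download the page, and create a BeautifulSoup object from the HTML text. We then select the category entries in the sidebar and compile a regular expression that strips newlines and the spaces that follow them.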
start_url = 'http://books.toscrape.com/index.html'
downloaded_html = requests.get(start_url)
soup = BeautifulSoup(downloaded_html.text, "lxml")
full_list = soup.select('.side_categories ul li ul li')
regex = re.compile(r'\n[ ]*')
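To see what the selector and the regular expression do without touching the network, the sketch below runs them on a small hand-written snippet. The snippet is only an assumption about the shape of the sidebar markup (nested lists inside a side_categories container), not a copy of the real page, and it uses "html.parser" so that lxml is not required.

```python
import re
from bs4 import BeautifulSoup

# Hand-written stand-in for the sidebar markup (an assumption, not the real page).
sidebar_html = """
<div class="side_categories">
  <ul>
    <li>
      <a href="catalogue/category/books_1/index.html">Books</a>
      <ul>
        <li>
          <a href="catalogue/category/books/travel_2/index.html">
              Travel
          </a>
        </li>
        <li>
          <a href="catalogue/category/books/mystery_3/index.html">
              Mystery
          </a>
        </li>
      </ul>
    </li>
  </ul>
</div>
"""

soup = BeautifulSoup(sidebar_html, "html.parser")

# Only the inner <li> elements match: they sit two list levels deep.
items = soup.select(".side_categories ul li ul li")

# Remove each newline together with the indentation that follows it.
regex = re.compile(r"\n[ ]*")
names = [regex.sub("", li.get_text()) for li in items]

print(len(items))
print(names)
```

The outer "Books" entry is skipped because it is only one list level deep, which is why the selector repeats "ul li" twice.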
We now loop through all the elements in full_list and extract the desired data:
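book_dict = []  # list of {'Category': ..., 'Link': ...} dictionaries
for element in full_list:
    # Strip the newlines and padding around the category name
    link_text = element.get_text()
    link_text = regex.sub('', link_text)
    # Build an absolute URL from the anchor's relative href
    anchor_tag = element.select('a')
    fullbooklink = "http://books.toscrape.com/" + anchor_tag[0]['href']
    # Keep the entry only if both the name and the link are non-empty
    if len(link_text) > 0 and len(fullbooklink) > 0:
        book_dict.append({'Category': link_text,
                          'Link': fullbooklink})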
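The string concatenation above assumes every href is relative to the site root. A slightly more robust alternative, sketched here, is urljoin from the standard library's urllib.parse module, which resolves a relative href against the page it was found on. The href values below are illustrative examples rather than output captured from the site.

```python
from urllib.parse import urljoin

start_url = "http://books.toscrape.com/index.html"

# Illustrative relative href, as it might appear in an anchor tag.
relative_href = "catalogue/category/books/travel_2/index.html"

# Resolve the href against the URL of the page that contained it.
full_link = urljoin(start_url, relative_href)
print(full_link)

# urljoin also handles "../" segments, which plain concatenation would not.
nested_link = urljoin("http://books.toscrape.com/catalogue/page-2.html",
                      "../index.html")
print(nested_link)
```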
We can now save these results to a text file or to a database. Here we open a file named BookCategoryList for writing, loop through the list of dictionaries, and write each item to the file on its own line.
with open('BookCategoryList', 'w') as file:
    for item in book_dict:
        if (len(item) > 0):
            file.write("%s\n" % item)
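The text-file version writes each dictionary's Python representation on its own line. Since a database was mentioned as the other option, here is a minimal sketch using the standard library's sqlite3 module. The table name, the column names, and the two sample rows are assumptions for illustration, and an in-memory database is used so nothing is written to disk.

```python
import sqlite3

# Sample rows in the same shape as book_dict (illustrative values).
book_dict = [
    {"Category": "Travel",
     "Link": "http://books.toscrape.com/catalogue/category/books/travel_2/index.html"},
    {"Category": "Mystery",
     "Link": "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html"},
]

# ":memory:" keeps the database in RAM; pass a file name to persist it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE categories (category TEXT, link TEXT)")

# Named placeholders (:Category, :Link) are filled from each dictionary's keys.
conn.executemany(
    "INSERT INTO categories (category, link) VALUES (:Category, :Link)",
    book_dict,
)
conn.commit()

rows = conn.execute(
    "SELECT category, link FROM categories ORDER BY category"
).fetchall()
for row in rows:
    print(row)

conn.close()
```

Using placeholders instead of building the SQL string by hand keeps odd characters in category names from breaking the statement.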
You can see the sample file here: Book Categories
I hope this helps.
Useful links
BeautifulSoup Documentation
Thank you
Vijaya Malla
@vijayamalla