What is Web Scraping?

Web scraping is the process of collecting and parsing raw data from the web. The Python community has built some powerful tools for it, such as the Beautiful Soup library.

How do I scrape a website in Python?

To scrape a website in Python, we’re going to perform three basic steps:

  1. Extract the HTML content using the requests library.
  2. Analyze the HTML structure and identify the tags which have our content.
  3. Extract the tags using Beautiful Soup and put the data in a Python list.
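The three steps above can be sketched on a small scale as follows. To keep the sketch independent of any live website, step 1 is replaced by a hard-coded HTML string; `SAMPLE_HTML` and the helper `extract_tag_texts` are hypothetical names made up for this illustration, not part of either library.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the content that step 1 would download.
SAMPLE_HTML = """
<html>
  <head><title>Sample News</title></head>
  <body>
    <h2>First headline</h2>
    <h2>Second headline</h2>
    <p>Some unrelated paragraph.</p>
  </body>
</html>
"""


def extract_tag_texts(html, tag_name):
    """Return the stripped text of every <tag_name> element as a list."""
    # Steps 2 and 3: parse the HTML, find the tags, collect their text.
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all(tag_name)]


headlines = extract_tag_texts(SAMPLE_HTML, "h2")
print(headlines)  # ['First headline', 'Second headline']
```

On a real page, the tag name (and usually a class or id) would come from inspecting the page's HTML in the browser, which is step 2.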

Installing the libraries

Let’s first install the libraries we’ll need. The requests library fetches the HTML content from a website. Beautiful Soup parses HTML and converts it to Python objects. To install these for Python 3, run:

pip install requests
pip install beautifulsoup4

Extracting the HTML

For this example, I’ll scrape a page on the Daily Trust website. If you go to that page, you’ll see a list of articles with title, excerpt, and publishing date.

The full URL for the web page is:

https://dailytrust.com/does-tinubu-have-what-it-takes-to-rule-nigeria/

We can get the HTML content from this page using requests:

import requests

url = 'https://dailytrust.com/does-tinubu-have-what-it-takes-to-rule-nigeria/'

data = requests.get(url)
print(data.text)
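A request can fail or return an error page, so in practice it may help to wrap the call in a small helper. The sketch below is one possible way to do that; the function name `fetch_html` and the 10-second default timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML as a string, or None if the request fails."""
    try:
        # Without a timeout, requests can wait indefinitely for a response.
        response = requests.get(url, timeout=timeout)
        # Raise an HTTPError for 4xx/5xx responses instead of returning an error page.
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return response.text
```

Calling `fetch_html(url)` with the URL above should return the same text as `data.text`, while a malformed URL, a network failure, or an error status code results in `None` rather than an unhandled exception.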

Get Website Title using Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It builds a parse tree from the page source, which lets you navigate the document hierarchically and extract data in a readable way.

# importing the modules
import requests
from bs4 import BeautifulSoup

# target url
url = 'https://dailytrust.com/does-tinubu-have-what-it-takes-to-rule-nigeria/'

# sending a GET request to the target url
reqs = requests.get(url)

# using the BeautifulSoup module
soup = BeautifulSoup(reqs.text, 'html.parser')

# displaying the title
for title in soup.find_all('title'):
    print(title.get_text())

Output:

THE BEARING: Does Tinubu Have What It Takes To Rule Nigeria? - Daily Trust
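The same parse tree can be used to pull several related fields out of each item on a page. The sketch below shows the idea on a hard-coded HTML snippet; the markup, the class names `post` and `excerpt`, and the sample values are all invented for illustration and will differ from the real structure of any given site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; a real site's tags and class names will differ.
SAMPLE_HTML = """
<div class="post">
  <h2>First post</h2>
  <p class="excerpt">Short summary one.</p>
  <time datetime="2022-01-15">January 15, 2022</time>
</div>
<div class="post">
  <h2>Second post</h2>
  <p class="excerpt">Short summary two.</p>
  <time datetime="2022-02-01">February 1, 2022</time>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

articles = []
# Find each container first, then search only inside that container.
for post in soup.find_all("div", class_="post"):
    articles.append({
        "title": post.find("h2").get_text(strip=True),
        "excerpt": post.find("p", class_="excerpt").get_text(strip=True),
        "date": post.find("time")["datetime"],
    })

for article in articles:
    print(article["date"], "-", article["title"])
```

Searching within each container, rather than across the whole document, keeps every title paired with its own excerpt and date.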

