Web Scraping with Beautiful Soup and Python

Philipp Neuberger

February 15, 2024
2:24 pm


Web scraping is a technique for collecting information from many online sources. It lets us gather large amounts of useful data, such as product reviews and market trends, which helps us make informed decisions and stay ahead of the competition.

If you’re interested in learning how to use Python to scrape the web, you’re in the right place. You can use this article as a guide to the basics of web scraping with Beautiful Soup and Python. Come with us as we explore how web scraping works and what it can uncover.

Web scraping can play a significant role in data analysis, for example by collecting data from social media sites. The scraped data can then be visualized in formats such as charts and graphs to identify recent trends.

Market Research and Competitor Analysis

Another area where web scraping plays a pivotal role is market research. Data such as pricing, product features, and customer reviews can be really useful. For example, you can use web scraping to compare the price of a product across different websites and collect the customer reviews alongside it.

These are just a few of the applications of web scraping, and already you can see why it matters. Knowing how to scrape can also be a standout skill on your resume if you are trying to build a career in programming.

If you are looking for a basic tutorial on how to scrape a website using Python, this blog is for you. We will be using Beautiful Soup, one of the most popular web scraping libraries that Python has to offer, and covering the basics of web scraping, so make sure to read the tutorial till the end.

Getting Started with BeautifulSoup

Before jumping into how to use BeautifulSoup, let’s start by explaining what it actually is.

In a nutshell, BeautifulSoup is a Python library that parses HTML documents and helps us navigate through them. To start using it, make sure you have Python installed on your device. Head to the official Python website and get the latest version.

Once Python is installed, you can move on to installing BeautifulSoup. So let’s embark on our journey of ‘Web Scraping with Beautiful Soup and Python’.

To install it, we will use pip, the package manager for Python. Open your terminal and type the following command.

				
					pip install beautifulsoup4

We also need a way to send requests to the website. There is a library for that as well, named requests. To install it, run the following command in your terminal.

				
					pip install requests

Now that we have everything we need, let’s start by importing the libraries with the following code.

				
from bs4 import BeautifulSoup
import requests

To start scraping, we need a target website (the one you want to scrape). We then make a GET request to it and retrieve its HTML.

				
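# Send a browser-like User-Agent header with the request
headers = {'User-Agent': "Mozilla/5.0"}
url = "https://example.com"
webpage = requests.get(url, headers=headers)

# Parse the returned HTML with Python's built-in parser
soup = BeautifulSoup(webpage.text, 'html.parser')

# Print the parsed HTML document
print(soup)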

We add the headers to mimic a browser, request the page using get(), and store the response in the webpage variable. We then create a BeautifulSoup object and pass it webpage.text, which contains the entire HTML code of the page.
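A request can fail or return an error page, so it can help to check the response before parsing it. The following is a minimal sketch of that idea, not a definitive implementation. The function name fetch_soup and its get parameter are our own illustrative choices (the parameter only makes the function easy to try without a network connection), and the 10-second timeout is an assumed value.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, get=requests.get, timeout=10):
    """Download a page and return it parsed, or raise if the request failed.

    `get` is a hypothetical hook so a stand-in can replace requests.get.
    """
    headers = {"User-Agent": "Mozilla/5.0"}
    response = get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises an exception for 4xx and 5xx responses
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


# Example usage (performs a real network request):
# soup = fetch_soup("https://example.com")
# print(soup.title)
```

Setting a timeout keeps the script from waiting indefinitely on a server that never answers.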

There are several functions and attributes that you can use to inspect the parsed page. Here are some of them:

find(name, attrs): this returns the first element that matches the parameters you give. For example, you can use it in the following way.

				
result = soup.find('div', class_='container')

This finds the first div element whose class is container. In a very similar fashion, you can use find_all() to find all of the elements matching the parameters.

				
result = soup.find_all('div', class_='container')
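To see the difference between the two calls, here is a small self-contained sketch. The HTML snippet is made up for illustration, so it runs without requesting any website.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration
html = """
<div class="container"><p>First</p></div>
<div class="sidebar"><p>Ignored</p></div>
<div class="container"><p>Second</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns only the first match, or None if nothing matches
first = soup.find("div", class_="container")
missing = soup.find("div", class_="footer")

# find_all() returns a list-like collection of every match
all_matches = soup.find_all("div", class_="container")

print(first.p.text)      # First
print(missing)           # None
print(len(all_matches))  # 2
```

Because find() can return None, it is worth checking the result before accessing anything on it.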

You can look up the other built-in search functions in the Beautiful Soup documentation. In addition to these functions, every element that Beautiful Soup returns, such as the result of find() above, has a range of attributes. Here are some of them:

result.name: this returns the name of the element
result.attrs: this returns the dictionary containing the element’s attributes
result.text: as the name suggests, this returns the text content of the element

There are many more attributes like these, which you can look up and use according to your needs.
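The short sketch below shows these attributes on a single element. The link markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration
html = '<a href="https://example.com" id="home">Example <b>site</b></a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

print(link.name)          # a
print(link.attrs)         # {'href': 'https://example.com', 'id': 'home'}
print(link["href"])       # https://example.com
print(link.get("title"))  # None, because the attribute is absent
print(link.text)          # Example site
```

Using link.get() instead of link["..."] avoids a KeyError when an attribute might be missing.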

Extracting Data from Tables

On many websites, important data is presented in tables, enclosed in the <table> tag. Luckily, BeautifulSoup lets us scrape data from them as well. To extract the data from tables, we can again use the find methods discussed in the sections above (or the CSS-selector-based select methods).

For starters, let’s assume our table looks like this:

				
					<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Age</th>
      <th>Gender</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Alice</td>
      <td>25</td>
      <td>Female</td>
    </tr>
    <tr>
      <td>Bob</td>
      <td>30</td>
      <td>Male</td>
    </tr>
    <tr>
      <td>Charlie</td>
      <td>35</td>
      <td>Male</td>
    </tr>
  </tbody>
</table>

Now, in order to scrape this data, we can use the following code:

				
# Get the table element
table = soup.find("table")

# Get the table header
header = [th.text for th in table.find("thead").find_all("th")]
# ['Name', 'Age', 'Gender']

# Get the table body: one inner list per row
body = [
    [td.text for td in tr.find_all("td")]
    for tr in table.find("tbody").find_all("tr")
]
# [['Alice', '25', 'Female'], ['Bob', '30', 'Male'], ['Charlie', '35', 'Male']]

The find() call locates the first <table> element in our HTML document. The first list comprehension then extracts the text content of every <th> element within the <thead> section of the table. The second extracts the text content of each <td> element in every <tr> of the <tbody>. The result, body, is a list of lists, where each inner list represents the data of one row in the table.
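Once the header and the rows are available, it is often convenient to pair them up so each row becomes a dictionary keyed by column name. This is one possible sketch, using the same table as above embedded as a string. The variable name records is our own.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <thead><tr><th>Name</th><th>Age</th><th>Gender</th></tr></thead>
  <tbody>
    <tr><td>Alice</td><td>25</td><td>Female</td></tr>
    <tr><td>Bob</td><td>30</td><td>Male</td></tr>
    <tr><td>Charlie</td><td>35</td><td>Male</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# get_text(strip=True) also trims surrounding whitespace from each cell
header = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]
body = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find("tbody").find_all("tr")
]

# Pair each row with the header to get one dictionary per row
records = [dict(zip(header, row)) for row in body]
print(records[0])  # {'Name': 'Alice', 'Age': '25', 'Gender': 'Female'}
```

Note that every value is still a string. Converting Age to a number would be a separate step.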

Handling Dynamic Content with Selenium and Other Alternatives

Unfortunately, one downside of Beautiful Soup is that it works well only for static web pages. Many websites nowadays are dynamic, meaning their content changes in response to user interactions or other events.

A prime example is AJAX-based websites that load more data as the user keeps scrolling down. In these cases, the HTML source code that we have relied on so far will not contain the data we want, because that data is loaded by JavaScript after the initial page load.

To deal with dynamic content, we need something that will execute the JavaScript and render the webpage just like a browser would. One well-known tool that does this is Selenium.

Selenium is a library for automating web browsers. It can control several popular browsers, such as Chrome and Firefox, and can perform user actions such as clicking a button, scrolling, and typing. On top of that, Selenium can access the HTML source code as well as the DOM elements of the webpage.

If we combine Selenium with Beautiful Soup, we can build very capable scrapers that work on almost any website. Once you are done with the BeautifulSoup basics, it is recommended that you move on to Selenium.
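As a rough sketch of how the two fit together, the browser loads and renders the page, and Beautiful Soup parses the resulting HTML. The helper name parse_rendered_page is our own. It only relies on the driver’s get() method and page_source attribute, and creating a real driver (shown in the comments) assumes Selenium and a matching browser are installed.

```python
from bs4 import BeautifulSoup


def parse_rendered_page(driver, url):
    """Load `url` in the browser controlled by `driver` and parse the result."""
    driver.get(url)
    # page_source holds the HTML of the page as the browser currently sees it
    return BeautifulSoup(driver.page_source, "html.parser")


# Example usage (assumes Selenium and Chrome are installed):
# from selenium import webdriver
# driver = webdriver.Chrome()
# try:
#     soup = parse_rendered_page(driver, "https://example.com")
#     print(soup.title)
# finally:
#     driver.quit()
```

Content that arrives some time after the page loads may not be in page_source yet, so a real scraper would typically wait for the relevant element to appear before parsing.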

Considerations When Web Scraping with Beautiful Soup and Python

Respectful Scraping and Robots.txt

Before you start scraping your target website, make sure you go through its ‘robots.txt’ file to be aware of its scraping policy, as ignoring those rules can get you blocked or even lead to legal trouble.
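Python’s standard library can check these rules for us. The sketch below parses a made-up robots.txt from a string so it runs offline. The user agent name MyScraper is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch() reports whether the given user agent may request the URL
allowed = parser.can_fetch("MyScraper", "https://example.com/products")
blocked = parser.can_fetch("MyScraper", "https://example.com/private/data")
print(allowed)  # True
print(blocked)  # False

# For a live site, load the real file instead of a string:
# parser.set_url("https://example.com/robots.txt")
# parser.read()
```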

Avoid Overloading Servers

In most cases, we need to scrape multiple pages because data is required in large volumes. If we don’t pause between requests, we can overload the server and cause problems for the site. Always implement delays between your requests to prevent this.
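One simple way to add such delays is to sleep between consecutive requests. The following sketch assumes a fetch function that you supply (for example, one built on requests.get). The name fetch_all and the one-second default are illustrative choices, not recommendations for any particular site.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call `fetch` on each URL in order, pausing between consecutive calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Example usage (performs real network requests):
# import requests
# pages = fetch_all(
#     ["https://example.com/page/1", "https://example.com/page/2"],
#     fetch=lambda url: requests.get(url, timeout=10).text,
#     delay_seconds=2.0,
# )
```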

Use a Reputable Proxy Provider

To work within the rate limits enforced by websites, reduce the risk of bans, and make your scraping smoother, you can use a proxy service. Proxies mask your IP address, which adds an extra layer of anonymity to your scraping scripts. There are also rotating proxies, which dynamically assign you a new IP address so that each request appears to originate from a different device.

There is a long list of proxy services available, and one of the best choices you can make is Petaproxy. Scraping activities constantly require a proxy, and Petaproxy offers a wide range of services to choose from, from mobile proxies to datacenter proxies. They also guarantee high uptime rates and smooth operations, so your tasks continue to run without hindrance.
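With the requests library, a proxy is passed as a dictionary through the proxies argument of requests.get(). The sketch below shows one simple way to rotate through a list of proxies on the client side. The proxy addresses and credentials are placeholders, and the helper name proxy_rotation is our own.

```python
from itertools import cycle

# Placeholder proxy endpoints; replace them with the ones from your provider
PROXY_URLS = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
]


def proxy_rotation(proxy_urls):
    """Yield a requests-style proxies dictionary for each proxy in turn, forever."""
    for proxy_url in cycle(proxy_urls):
        yield {"http": proxy_url, "https": proxy_url}


rotation = proxy_rotation(PROXY_URLS)
first = next(rotation)
second = next(rotation)
third = next(rotation)  # wraps around to the first proxy again

# Example usage (performs a real network request through the proxy):
# import requests
# response = requests.get("https://example.com", proxies=next(rotation), timeout=10)
```

Many providers also offer a single rotating endpoint that changes the IP address for you, in which case one proxies dictionary is enough.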

Conclusion

Through this guide, you have learned the basics of web scraping, HTML parsing, and Beautiful Soup’s features. Together, we’ve looked at how Selenium can handle dynamic content and talked about ethical considerations. With these tips, you can collect useful data from the internet responsibly and with confidence.

Philipp Neuberger

Managing Director

Philipp, a key writer at Petaproxy, specializes in internet security and proxy services. His background in computer science and network security informs his in-depth, yet accessible articles. Known for clarifying complex technical topics, Philipp’s work aids both IT professionals and casual internet users.
