Web scraping is a technique for extracting data from websites so you can use it for purposes such as data analysis and market research. If you are new to web scraping, don't worry: this tutorial covers the main concepts of web scraping with Selenium, starting from the very basics, so make sure to read to the end.
What is Web Scraping, and Why is it Important?
As mentioned above, web scraping is the process of retrieving data from websites. It works by sending requests to web servers, parsing the HTML responses, and filtering out the desired parts.
There are many reasons why web scraping is useful. Here are some of them:
E-commerce Price Monitoring:
Suppose you want to monitor the prices of a product across different online stores in order to compare them. Instead of checking each store manually, you can scrape the websites and save the results in a data-rich format such as CSV.
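As a quick sketch of that idea, suppose we had already scraped prices for one product from a few stores (the store names and prices below are made up for illustration); Python's built-in csv module can save them to a CSV file and pick out the cheapest listing:

```python
import csv

# Hypothetical prices collected for one product across stores
# (illustrative data, not scraped from real sites)
rows = [
    {"store": "StoreA", "product": "Wireless Mouse", "price": 24.99},
    {"store": "StoreB", "product": "Wireless Mouse", "price": 22.50},
    {"store": "StoreC", "product": "Wireless Mouse", "price": 26.00},
]

# Write the collected data to a CSV file
with open("prices.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["store", "product", "price"])
    writer.writeheader()
    writer.writerows(rows)

# Find the cheapest listing
cheapest = min(rows, key=lambda r: r["price"])
print(cheapest["store"], cheapest["price"])
```

In a real pipeline, the `rows` list would be filled by your scraper instead of hard-coded.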
Job Market Insights:
Web scraping can also help when you need to search for job openings. It can gather data such as salaries and required skills and reveal demand across sectors, ultimately helping you pick the best career opportunity.
Social Media Aggregation:
Web scraping can also fetch data from social media, which can be used for tasks like sentiment analysis and identifying audience preferences on a particular topic.
Role of Selenium and Python in Web Scraping
Python is one of the best languages for web scraping tasks, simply because it offers a wide variety of libraries to choose from. Among them, Selenium is one of the most important and capable for web scraping. Selenium is an open-source browser automation framework with Python bindings that lets you drive real browser actions such as navigation, clicking, and typing.
What makes Selenium especially valuable is that it is a true browser automation tool: it controls and interacts with an actual web browser rather than just fetching raw HTML.
Some of the things that make Selenium popular when it comes to web scraping are:
Ability to handle dynamic web pages:
Most websites we come across daily are not static but dynamic, and scraping them can be quite tricky. Selenium, however, can handle pages with dynamic and interactive elements driven by JavaScript and AJAX (which change a page's content without reloading it).
Ability to mimic human interactions:
Dynamic websites often expect human interactions such as mouse movements and button clicks, sometimes deliberately designed to keep bots out. Because Selenium drives a real browser, it can mimic these interactions and overcome such challenges.
Ability to take screenshots:
Selenium can take screenshots of web pages, which can help us to debug or document our web scraping process, or capture visual data, such as images, charts, etc.
Web Scraping with Selenium
Enough with the advantages and introductions; let's jump into the technical details of using Selenium for web scraping.
Installation
The very first step is to install Selenium. Selenium also needs one additional component, a WebDriver, which handles the communication between the Selenium library and the web browser.
To install Selenium, we can use pip (the package manager for Python):
pip install selenium
To install a WebDriver, download the executable that matches your browser and operating system. For instance, the drivers for Chrome and Firefox are different, so download the one you need, and once it is in place, add its location to your system PATH.
Chrome Headless Mode
Chrome Headless is an important optimization feature that lets us run Chrome without opening its GUI. It saves resources and speeds up scraping, since the browser does not have to render a visible window (which is not relevant for our data extraction anyway).
Here's how you can enable headless mode; in this example I'm using the Chrome WebDriver:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Configure Chrome to run headless (no visible window)
options = Options()
options.add_argument("--headless=new")
options.add_argument("--window-size=1920,1200")

driver = webdriver.Chrome(options=options)
driver.get("https://www.nintendo.com/")
print(driver.page_source)
driver.quit()
Using Web Page Driver Properties
We can use the driver object to access and manipulate various properties of the web page, such as the URL, the title, the source code, etc.
# Load the web page
driver.get("http://quotes.toscrape.com/")
# Get the title of the web page
title = driver.title
print(title)
# Get the current URL of the web page
print(driver.current_url)
You can look up the other available properties and use them according to your requirements.
Locating Elements to Extract Desired Information
The core task in web scraping is locating the HTML elements that contain the data we want. Luckily, Selenium provides various methods for locating elements: we can search by ID, name, class, or tag. However, two of the most powerful approaches are XPath and CSS selectors.
XPath
XPath actually stands for XML Path Language, and it is a way to navigate through the hierarchical structure of an XML file, (similar to HTML). You can use different XPath expressions such as operators and functions to select your target elements.
For instance, the expression //div[@class='product'] will select all the <div> elements whose class attribute has the value product.
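To see this kind of expression in action without launching a browser, we can run it against a small HTML fragment using Python's built-in xml.etree.ElementTree module, which supports a limited subset of XPath (enough for this attribute-based query):

```python
import xml.etree.ElementTree as ET

# A small, well-formed HTML fragment for demonstration
html = """
<html><body>
  <div class="product">Laptop</div>
  <div class="banner">Ad</div>
  <div class="product">Phone</div>
</body></html>
"""

root = ET.fromstring(html)
# Select all <div> elements whose class attribute is exactly "product"
products = root.findall(".//div[@class='product']")
print([p.text for p in products])  # ['Laptop', 'Phone']
```

Keep in mind that ElementTree requires well-formed markup and only supports a subset of XPath; against a real page, Selenium's full XPath engine is far more forgiving and powerful.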
CSS Selectors
On the other hand CSS selectors are patterns that can match HTML elements based on their attributes, classes, IDs, pseudo-classes, pseudo-elements, etc. For example, the CSS selector `div.product` will select all the `<div>` elements that have the class `product`.
Selectors in Selenium
Selenium provides the `find_element` method together with locator strategies such as `By.XPATH` and `By.CSS_SELECTOR` to help us locate HTML elements (the older `find_element_by_xpath` and `find_element_by_css_selector` helpers were removed in Selenium 4). `find_element` returns a `WebElement` object that represents a single HTML element on the page. This object has many attributes and methods you can use to interact with it, such as `text`, `click`, and `send_keys`.
Here's how you can use an XPath and a CSS selector:
from selenium import webdriver
from selenium.webdriver.common.by import By

# Create a WebDriver object
driver = webdriver.Chrome()
# Navigate to a web page
driver.get("https://example.com")

# Locate an element using XPath
element = driver.find_element(By.XPATH, "//div[@class='product']")
# Print the text of the element
print(element.text)

# Locate an element using a CSS selector
element = driver.find_element(By.CSS_SELECTOR, "div.product")
# Click on the element
element.click()
Taking Screenshots
Sometimes you may want to capture screenshots of web pages, for instance for documentation purposes. With Selenium this is really easy: just use the `save_screenshot` method. It takes a file name as an argument and saves a screenshot of the page as a PNG file. Here's how you can use it:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://example.com")
# Take a screenshot and save it as example.png
driver.save_screenshot("example.png")
Wait for an Element
Content that loads asynchronously can be challenging to deal with: it may not yet be present in the page when we extract the HTML. If we try to locate and interact with such elements without waiting for them to appear, we may encounter errors or get incorrect results.
In order to avoid such errors, what we can do is use waiting strategies in web scraping. Waiting strategies are methods that can make the WebDriver object wait for a certain condition to be met before proceeding with the next step. Selenium provides two types of waiting strategies: implicit wait and explicit wait.
1. Implicit Wait
If you want the WebDriver object to wait for a specified amount of time, before throwing an exception that the element you are looking for isn’t there, you can use implicit wait. Implicit wait can be set using the `implicitly_wait` method, which takes a time value in seconds as an argument.
For example, the following code will set the implicit wait to 10 seconds:
from selenium import webdriver
# Create a WebDriver object
driver = webdriver.Chrome()
# Set the implicit wait to 10 seconds
driver.implicitly_wait(10)
# Navigate to a web page
driver.get("https://example.com")
2. Explicit Wait
Implicit waits can sometimes feel quite rigid, and explicit waits let you overcome that rigidity. They offer more flexibility by requiring a specific condition to be met before proceeding further.
You can implement explicit wait using the `WebDriverWait` class and the `expected_conditions`.
For example, here's a simple snippet that creates a WebDriverWait object that waits up to 10 seconds and uses the `presence_of_element_located` condition to wait for an element to appear on the page:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://example.com")
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.XPATH, "//div[@class='product']")))
Conclusion
There you go! You now know the essentials of using Selenium to scrape websites. You've covered a lot of ground: the basics of web scraping, its significance in areas like job market insights and e-commerce price monitoring, and how Selenium and Python fit into the process. With Selenium's ability to handle dynamic pages, mimic human interactions, and take screenshots, you can tackle a wide range of scraping problems.
Tip: Web Scraping Plus Proxy For No IP Bans
Web scraping is a great way to extract relevant data from websites, but it comes with some challenges as well, such as IP bans, CAPTCHAs, and dynamic content. One way to overcome these obstacles is to use a proxy service that masks your identity and provides a fast connection.
If you need a mobile proxy, one service you can use is PetaProxy. It offers proxy options ranging from mobile to datacenter. Whether you want to scrape e-commerce, social media, news, or any other websites, with Petaproxy you can choose between HTTP and SOCKS5 proxies designed for different online tasks.