How to Scrape YouTube Easily with Python



YouTube is one of the most influential platforms on the web, with hundreds of hours of new video reportedly uploaded every minute. That makes it a massive source of data you can extract and put to use. But how do you actually get that data out of YouTube? This is where web scraping comes into play. Web scraping lets you extract meaningful data from websites and format it according to your needs, and you can then analyze upcoming trends by running predictive models on the extracted data. In this guide, you'll learn how to scrape YouTube in a few minutes.

Web scraping allows you to automate the extraction of data from different websites, and with some formatting tricks, we can turn unstructured, chaotic YouTube listings into an organized dataset ready for analysis. In this tutorial, we will scrape YouTube using Python, so follow along if you want to build it yourself.

Prepare the Setup to Scrape YouTube


Before starting the scraping process, you must have Python installed on your device. Hop on to the command prompt and type in "python --version" to ensure you have Python installed; if not, go to the official website and download it.
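For example, this should print your installed version (something like Python 3.12.1):

python --version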

We will be using several Python libraries for this tutorial; here is what each one does:

Selenium

It is one of the most popular Python libraries for browser automation. At its core, it lets us simulate how a real user interacts with the browser, which is essential for dynamic content (content that only loads after specific interactions, such as scrolling).

BeautifulSoup

Another powerful Python web scraping library. It parses HTML, letting us search through a page's code and extract specific data elements such as video titles, view counts, and durations.
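As a quick illustration of the idea (using made-up HTML, not YouTube's actual markup):

from bs4 import BeautifulSoup

# Parse a tiny HTML snippet and pull out one element by tag and id
html = '<a id="video-title">My first video</a>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('a', id='video-title').text)  # -> My first video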

XlsxWriter

This library lets us write our scraped data, neatly formatted, to an Excel file that we can use for further analysis.

Installing the Libraries

We will be using the default pip package manager to install the libraries. The script below also needs webdriver-manager (to fetch a matching ChromeDriver) and lxml (the parser BeautifulSoup will use), so open your terminal and type in the following command:

pip install selenium beautifulsoup4 xlsxwriter webdriver-manager lxml

Starting the Code

Just create a new folder named youtube-scraper (or whatever you like), then create a new .py file inside it and type in this code:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Replace with the URL of the YouTube channel you want to analyze
# (channel base URL; the '/videos' path is appended later)
url = 'https://www.youtube.com/@MuseAsia'

# webdriver-manager fetches a matching ChromeDriver; Selenium 4 expects
# the driver path wrapped in a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

Note: You'll need to provide the URL of the specific YouTube channel you wish to scrape.

Data Extraction with Python

Now, let's start extracting the data using Python. Here's how you can pull it:

import time
from selenium import webdriver
from bs4 import BeautifulSoup
import xlsxwriter

# ... (Previous code)

# Open the channel's videos tab, sorted by most popular
driver.get('{}/videos?view=0&sort=p&flow=grid'.format(url))

# Scrolling to load videos
times = 0
while times < 5:
    time.sleep(1)
    driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
    times += 1

# Parsing the page content
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content, 'lxml')
driver.quit()  # page source is captured, so the browser can close now

# Extracting data (these tags and classes may change as YouTube updates its markup)
titles = soup.find_all('a', id='video-title')
views = soup.find_all('span', class_='style-scope ytd-grid-video-renderer')
durations = soup.find_all('span', class_='style-scope ytd-thumbnail-overlay-time-status-renderer')

Here’s how the code works:

Retrieving Page Content

We use Selenium's built-in driver.get() method to navigate to the YouTube channel's videos section.

Loading Dynamic Content

Since YouTube uses dynamic content, it loads more videos as you scroll down. We simulate the scrolling event to make sure all the videos we want end up in the page's source code.
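If a fixed number of scrolls doesn't load everything you need, a common alternative (a sketch, not part of the script above, used in place of the fixed loop) is to keep scrolling until the page height stops growing:

last_height = driver.execute_script("return document.documentElement.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
    time.sleep(2)  # give YouTube time to load the next batch of videos
    new_height = driver.execute_script("return document.documentElement.scrollHeight")
    if new_height == last_height:  # nothing new loaded; we've reached the bottom
        break
    last_height = new_height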

Parsing with BeautifulSoup

Our beloved BeautifulSoup takes the raw HTML (content) and transforms it into a structured format, ready for easy data extraction.

Targeting Data Elements

Then we use BeautifulSoup's find_all method, carefully targeting the specific HTML tags and classes that contain our desired data. Make sure to inspect the page with your browser's developer tools to fish out the essential tags.

Error Handling (Important!)

Like any other dynamic website, YouTube frequently changes its HTML structure. If the code stops finding anything, inspect the page again, note the relevant tag names and classes, and update them in the script.
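One simple safeguard (a sketch) is to fail loudly when the selectors stop matching, instead of silently writing an empty spreadsheet:

# Fail loudly if the selectors no longer match anything
if not titles:
    raise RuntimeError(
        "No video titles found - YouTube's markup may have changed. "
        "Inspect the page and update the tag/class names in the script."
    )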

Organizing and Saving Data

Now that we have scraped YouTube's data, let's store it in an Excel spreadsheet.

Here is how you can use lists to save the data correctly:

# Creating lists to store data
video_titles = []  # For storing video titles
video_views = []   # For storing views of each video
video_durations = []  # For storing durations of each video

# Populating the lists
for title in titles:
    video_titles.append(title.text)

# Matching spans come in pairs per video; keep every other one (the view count)
for i in range(len(views)):
    if i % 2 == 0:
        video_views.append(views[i].text)

for duration in durations:
    video_durations.append(duration.text)

# Creating the Excel file
workbook = xlsxwriter.Workbook('youtube_data.xlsx')
worksheet = workbook.add_worksheet()

# Add column headers
worksheet.write(0, 0, "Title")
worksheet.write(0, 1, "Views")
worksheet.write(0, 2, "Duration")

# Add video data to rows
row = 1
for title, view, duration in zip(video_titles, video_views, video_durations):
    worksheet.write(row, 0, title)
    worksheet.write(row, 1, view)
    worksheet.write(row, 2, duration)
    row += 1

workbook.close()

Explanation

Firstly, we create three empty lists to hold our extracted titles, views, and durations.

Then, we iterate through each collection of elements that BeautifulSoup found (titles, views, durations) and append the text content to the corresponding list.

Lastly, we save it to Excel using XlsxWriter.

  • We create an Excel file named ‘youtube_data.xlsx.’
  • Add a worksheet within the file.
  • Write the headers (“Title”, “Views”, “Duration”) in the first row.
  • Finally, loop through our lists, writing each video’s data into a new row.

Don’t Forget: It’s crucial to close the workbook using workbook.close() to ensure all changes are saved correctly.
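Alternatively, xlsxwriter.Workbook can be used as a context manager, so the file is closed (and saved) for you:

import xlsxwriter

# The workbook is closed automatically when the with-block exits
with xlsxwriter.Workbook('youtube_data.xlsx') as workbook:
    worksheet = workbook.add_worksheet()
    worksheet.write(0, 0, "Title")
    # ... write the remaining headers and rows here, as above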

Start Preprocessing the Data

Why preprocess? That's the question that might pop into your mind, right? Raw scraped data is rarely well organized, so we need to refine it into a consistent format before we can derive meaningful insights from it.

To clean and preprocess it, we can do several things, including:

  • Removing Extra Characters: Unwanted spaces or symbols might cause trouble with analysis.
  • Consistent Formatting: Converting data types (e.g., text views such as “1.2K” to numerical values) ensures compatibility with analytic tools.
  • Categorization: Grouping values (e.g., dividing video durations into ‘short,’ ‘medium,’ etc.) can enhance pattern discovery.


We will be using pandas for preprocessing, so make sure to install it using the pip install pandas command.

After installing, load our Excel file using pandas and tackle these columns:

import pandas as pd

data = pd.read_excel('youtube_data.xlsx')

# Cleaning 'Views' column
data['Views'] = data['Views'].str.replace(" views", "")  # Remove ' views'
new_views = []

for view in data['Views']:
    if view.endswith('K'):
        new_views.append(float(view.replace('K', '')) * 1_000)
    elif view.endswith('M'):
        new_views.append(float(view.replace('M', '')) * 1_000_000)
    else:
        # Assumes the remaining values are plain numbers, e.g. '927'
        new_views.append(float(view.replace(',', '')))

data['Views'] = new_views

# Cleaning 'Duration' column ... (Similar logic as views)

# Categorizing 'Duration' ... (We'll define categories shortly)

You can make changes according to your needs.
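The duration cleanup itself is left to you above; here is one possible approach (a sketch, assuming durations were scraped as 'MM:SS' or 'HH:MM:SS' strings and Shorts are labeled 'SHORTS'):

# Convert 'Duration' strings such as '4:32' or '1:02:15' into total seconds
def to_seconds(duration):
    if duration == 'SHORTS':
        return duration  # leave Shorts untouched for the categorization below
    seconds = 0
    for part in str(duration).strip().split(':'):
        seconds = seconds * 60 + int(part)
    return seconds

data['Duration'] = data['Duration'].apply(to_seconds)

With durations expressed in seconds, the categorization below works on plain numbers: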

				
for i in data['Duration'].index:
    val = data['Duration'].iloc[i]
    if val == 'SHORTS':
        continue
    elif val < 900:  # under 15 minutes
        data.loc[i, 'Duration'] = 'Mini-Videos'
    elif val < 3600:  # under an hour
        data.loc[i, 'Duration'] = 'Long-Videos'
    else:
        data.loc[i, 'Duration'] = 'Very-Long-Videos'
				
			

We categorize videos based on their duration in seconds. Feel free to adjust these categories as you see fit. Remember to save your preprocessed DataFrame back to Excel to keep the cleaned data!
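For example, to write the cleaned data to a new file:

# Save the preprocessed DataFrame to a new Excel file
data.to_excel('youtube_data_clean.xlsx', index=False)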

Why Proxies Matter When You Scrape YouTube

If you keep hammering YouTube with HTTP requests, you'll trigger anti-bot measures that can get your IP address blocked. This is where a proxy service can do you a lot of good.

Here are some of the reasons why proxy services are important:

Circumvention of Rate Limiting

Scraping YouTube data means making a lot of HTTP calls. If too many come from the same IP address in a short window, rate limiting may kick in. With a proxy service, though, you can spread these calls across different IP addresses, which gets around YouTube's rate limits and lets you scrape data quickly without running into problems.

Avoiding IP Bans

To protect its site from abuse, YouTube uses tools to detect and block IP addresses that are scraping. Proxy services let you hide your real IP address and rotate between proxies, which makes it much harder for YouTube to spot your scraping activity and ban you, so you keep uninterrupted access to the data.

Protection of Privacy and Anonymity

Scraping YouTube may require accessing data or performing research anonymously. Proxy services mask your real IP address, protecting your privacy and anonymity while you scrape. By hiding your IP address, you prevent YouTube and other parties from tracking your activity, keeping your scraping private and safe.

Choosing a Reliable Proxy Service

If you are looking for a quality proxy service, you should opt for Petaproxy. It's one of the best proxy providers and offers flexible plans, so you can choose according to your needs.

Petaproxy offers two types of proxies to scrape YouTube:

Mobile Proxies: As defined above, a mobile proxy with IP rotation changes your IP address every few minutes, making it harder for a website's anti-bot measures to sniff you out.

Datacenter Proxies: These can work too, but they're often easier for websites to detect since they come from recognizable IP ranges.
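Here is a minimal sketch of how you might route the Selenium session from earlier through a proxy (the endpoint is a placeholder; your provider supplies the real host and port):

from selenium import webdriver

# Hypothetical proxy endpoint - substitute your provider's host and port.
# Note: Chrome's --proxy-server flag ignores embedded credentials, so
# authenticated proxies usually need an extension or a local forwarder.
PROXY = 'proxy.example.com:8080'

options = webdriver.ChromeOptions()
options.add_argument('--proxy-server=http://{}'.format(PROXY))

driver = webdriver.Chrome(options=options)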

Conclusion

To sum up, web scraping is a powerful way to get useful data from sites like YouTube, helping you understand new trends and how people use these platforms. With tools like Selenium and BeautifulSoup, we can automate the extraction process, turning unorganized web content into structured datasets ready for analysis.

But when scraping, it's important to use proxies so that you don't trigger anti-bot measures and risk getting your IP banned. Proxies hide your IP address, lower your risk of bans, and let you spread requests across multiple IP addresses, all of which keep your scraping anonymous and reliable.

Philipp Neuberger

Managing Director

Philipp, a key writer at Petaproxy, specializes in internet security and proxy services. His background in computer science and network security informs his in-depth, yet accessible articles. Known for clarifying complex technical topics, Philipp’s work aids both IT professionals and casual internet users.
