Vikas Gulia
🕸️ Web Scraping in Python: A Practical Guide for Data Scientists

"Data is the new oil, and web scraping is one of the drills."

Whether you're gathering financial data, tracking competitor prices, or building datasets for machine learning projects, web scraping is a powerful tool to extract information from websites automatically.

In this blog post, we'll explore:

  • What web scraping is
  • How it works
  • Legal and ethical considerations
  • Key Python tools for scraping
  • A complete scraping project using requests, BeautifulSoup, and pandas
  • Bonus: Scraping dynamic websites using Selenium

✅ What is Web Scraping?

Web scraping is the automated process of extracting data from websites. Think of it as teaching Python to browse the web, read pages, and pick out the data you're interested in.


⚖️ Is Web Scraping Legal?

Scraping publicly available data for personal, educational, or research purposes is usually okay. However:

  • Always check the website's robots.txt file (e.g., www.example.com/robots.txt)
  • Read the Terms of Service
  • Avoid overloading servers with too many requests (use time delays)
  • Never scrape private or paywalled content without permission
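Checking robots.txt doesn't have to be manual: Python's standard library ships urllib.robotparser for exactly this. A minimal sketch, parsing a hand-written sample file so it runs offline (the rules and URLs are illustrative, not from a real site):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Normally you'd call rp.set_url(".../robots.txt") and rp.read();
# here we parse a sample file directly so the sketch runs offline.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# can_fetch(user_agent, url): may this agent crawl this URL?
print(rp.can_fetch("*", "http://example.com/page/1/"))    # True
print(rp.can_fetch("*", "http://example.com/private/x"))  # False
```

Run this check before each new site you scrape; if can_fetch returns False, skip the URL.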

🧰 Popular Python Libraries for Web Scraping

Library        Purpose
requests       Sends HTTP requests
BeautifulSoup  Parses and extracts data from HTML
lxml           A fast HTML/XML parser
pandas         Organizes and analyzes scraped data
Selenium       Drives a browser for dynamic, JavaScript-heavy sites
playwright     A modern alternative to Selenium

🧪 Step-by-Step Web Scraping Example

Let's scrape quotes from http://quotes.toscrape.com, a beginner-friendly practice site.

🛠️ Step 1: Install Required Libraries

pip install requests beautifulsoup4 pandas

🧾 Step 2: Send a Request and Parse HTML

import requests
from bs4 import BeautifulSoup

URL = "http://quotes.toscrape.com/page/1/"
response = requests.get(URL)
soup = BeautifulSoup(response.text, "html.parser")

print(soup.title.text)  # Output: Quotes to Scrape

🧮 Step 3: Extract the Quotes and Authors

quotes = []
authors = []

for quote in soup.find_all("div", class_="quote"):
    text = quote.find("span", class_="text").text.strip()
    author = quote.find("small", class_="author").text.strip()

    quotes.append(text)
    authors.append(author)

# Print sample
for i in range(3):
    print(f"{quotes[i]} — {authors[i]}")
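find_all with class_ filters is one way in; BeautifulSoup also accepts CSS selectors via select() and select_one(), which many people find more readable. A sketch on an inline snippet mimicking one quote block (the HTML string here is a stand-in, not fetched from the live site):

```python
from bs4 import BeautifulSoup

# A tiny stand-in for one quote <div> from the page
html = """
<div class="quote">
  <span class="text">"Hello, world."</span>
  <small class="author">Anon</small>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select() takes CSS selectors; "div.quote" matches the same elements
# as find_all("div", class_="quote")
for q in soup.select("div.quote"):
    print(q.select_one("span.text").get_text(strip=True))
    print(q.select_one("small.author").get_text(strip=True))
```

Either style works; pick one and stay consistent across a project.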

📊 Step 4: Store Data Using pandas

import pandas as pd

df = pd.DataFrame({
    "Quote": quotes,
    "Author": authors
})

print(df.head())

# Optional: Save to CSV
df.to_csv("quotes.csv", index=False)

🔁 Scrape Multiple Pages

import time

all_quotes = []
all_authors = []

for page in range(1, 6):  # First 5 pages
    url = f"http://quotes.toscrape.com/page/{page}/"
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")

    for quote in soup.find_all("div", class_="quote"):
        all_quotes.append(quote.find("span", class_="text").text.strip())
        all_authors.append(quote.find("small", class_="author").text.strip())

    time.sleep(1)  # Be polite: pause between requests

df = pd.DataFrame({"Quote": all_quotes, "Author": all_authors})
df.to_csv("all_quotes.csv", index=False)
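Hard-coding five pages works here, but a more general pattern is to follow the site's "Next" link until it disappears. quotes.toscrape.com marks it with an li element of class "next"; the sketch below shows just the link-detection step on an inline snippet standing in for the page footer:

```python
from bs4 import BeautifulSoup

# Stand-in for the pager markup at the bottom of each page
html = '<ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>'
soup = BeautifulSoup(html, "html.parser")

# In a real loop: keep requesting next_link["href"] until this is None
next_link = soup.select_one("li.next > a")
if next_link:
    print("Next page:", next_link["href"])
```

When select_one returns None, there is no next page and the loop can stop, so the scraper adapts if the site grows or shrinks.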

🔄 Bonus: Scraping JavaScript-Rendered Sites using Selenium

Some sites load their data dynamically with JavaScript. requests only sees the initial HTML, so the content you want never appears in the response; you need a real browser.

🛠️ Install Selenium & WebDriver

pip install selenium

Selenium 4.6+ ships with Selenium Manager, which downloads a matching driver automatically. On older versions, download the appropriate ChromeDriver from https://chromedriver.chromium.org/downloads and add it to your system PATH.

🌐 Selenium Example

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import time

service = Service("chromedriver")  # Path to your ChromeDriver
driver = webdriver.Chrome(service=service)

driver.get("https://quotes.toscrape.com/js/")
time.sleep(2)  # Wait for JS to load

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

for quote in soup.find_all("div", class_="quote"):
    print(quote.find("span", class_="text").text.strip())

🧠 Best Practices for Web Scraping

  • ✅ Use headers to mimic a browser:

headers = {"User-Agent": "Mozilla/5.0"}
requests.get(url, headers=headers)
  • ✅ Add delays between requests using time.sleep()
  • ✅ Handle exceptions and errors gracefully
  • ✅ Respect robots.txt and terms of use
  • ✅ Use proxies or rotate IPs for large-scale scraping
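The tips above can be rolled into one helper. Here is a sketch of a polite, fault-tolerant fetcher; the retry count, delay, and User-Agent string are illustrative choices, not requirements:

```python
import time
import requests

def fetch(url, retries=3, delay=1.0):
    """Fetch a URL with a browser-like User-Agent, retries, and a pause."""
    headers = {"User-Agent": "Mozilla/5.0"}
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()  # raise on 4xx/5xx responses
            return response.text
        except requests.RequestException as exc:
            print(f"Attempt {attempt} failed: {exc}")
            time.sleep(delay)  # pause before retrying (be polite)
    return None  # all attempts failed
```

Call html = fetch(url) in your scraping loop and skip the page when it returns None, instead of letting one bad response crash the whole run.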

📦 Real-World Use Cases

  • 📰 News Monitoring (e.g., scraping articles for sentiment analysis)
  • 🛒 E-commerce Price Tracking
  • 📊 Competitor Research
  • 🧠 Training Datasets for NLP/ML projects
  • 🏢 Job Listings and Market Analysis

📌 Final Thoughts

Web scraping is a foundational tool in a data scientist's arsenal. Mastering it opens up endless possibilities, from building custom datasets to powering AI models with real-world information.

"If data is fuel, then web scraping is how you build your own pipeline."
