Vikas Gulia
🕸️ Web Scraping in Python: A Practical Guide for Data Scientists

"Data is the new oil, and web scraping is one of the drills."

Whether you're gathering financial data, tracking competitor prices, or building datasets for machine learning projects, web scraping is a powerful tool to extract information from websites automatically.

In this blog post, we'll explore:

  • What web scraping is
  • How it works
  • Legal and ethical considerations
  • Key Python tools for scraping
  • A complete scraping project using requests, BeautifulSoup, and pandas
  • Bonus: Scraping dynamic websites using Selenium

✅ What is Web Scraping?

Web scraping is the automated process of extracting data from websites. Think of it as teaching Python to browse the web, read pages, and pick out the data you're interested in.


⚖️ Is Web Scraping Legal?

Scraping publicly available data for personal, educational, or research purposes is usually okay. However:

  • Always check the website's robots.txt file (e.g., www.example.com/robots.txt)
  • Read the Terms of Service
  • Avoid overloading servers with too many requests (use time delays)
  • Never scrape private or paywalled content without permission
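Checking robots.txt doesn't have to be manual: Python's standard library ships urllib.robotparser for exactly this. A minimal sketch, parsing a hand-written sample file so it runs offline (the rules and URLs are illustrative, not from a real site):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Normally you'd call rp.set_url(".../robots.txt") and rp.read();
# here we parse a sample file directly so the sketch runs offline.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# can_fetch(user_agent, url): may this agent crawl this URL?
print(rp.can_fetch("*", "http://example.com/page/1/"))    # True
print(rp.can_fetch("*", "http://example.com/private/x"))  # False
```

Run this check before each new site you scrape; if can_fetch returns False, skip the URL.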

🧰 Popular Python Libraries for Web Scraping

Library        Purpose
requests       Sends HTTP requests
BeautifulSoup  Parses and extracts data from HTML
lxml           A fast HTML/XML parser
pandas         Organizes and analyzes scraped data
Selenium       Drives a browser for dynamic, JavaScript-heavy sites
playwright     A modern alternative to Selenium

🧪 Step-by-Step Web Scraping Example

Let's scrape quotes from http://quotes.toscrape.com, a beginner-friendly practice site.

🛠️ Step 1: Install Required Libraries

pip install requests beautifulsoup4 pandas

🧾 Step 2: Send a Request and Parse HTML

import requests
from bs4 import BeautifulSoup

URL = "http://quotes.toscrape.com/page/1/"
response = requests.get(URL)
soup = BeautifulSoup(response.text, "html.parser")

print(soup.title.text)  # Output: Quotes to Scrape

🧮 Step 3: Extract the Quotes and Authors

quotes = []
authors = []

for quote in soup.find_all("div", class_="quote"):
    text = quote.find("span", class_="text").text.strip()
    author = quote.find("small", class_="author").text.strip()

    quotes.append(text)
    authors.append(author)

# Print sample
for i in range(3):
    print(f"{quotes[i]} — {authors[i]}")
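find_all with class_ filters is one way in; BeautifulSoup also accepts CSS selectors via select() and select_one(), which many people find more readable. A sketch on an inline snippet mimicking one quote block (the HTML string here is a stand-in, not fetched from the live site):

```python
from bs4 import BeautifulSoup

# A tiny stand-in for one quote <div> from the page
html = """
<div class="quote">
  <span class="text">"Hello, world."</span>
  <small class="author">Anon</small>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select() takes CSS selectors; "div.quote" matches the same elements
# as find_all("div", class_="quote")
for q in soup.select("div.quote"):
    print(q.select_one("span.text").get_text(strip=True))
    print(q.select_one("small.author").get_text(strip=True))
```

Either style works; pick one and stay consistent across a project.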

📊 Step 4: Store Data Using pandas

import pandas as pd

df = pd.DataFrame({
    "Quote": quotes,
    "Author": authors
})

print(df.head())

# Optional: Save to CSV
df.to_csv("quotes.csv", index=False)

🔁 Scrape Multiple Pages

import time

all_quotes = []
all_authors = []

for page in range(1, 6):  # First 5 pages
    url = f"http://quotes.toscrape.com/page/{page}/"
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")

    for quote in soup.find_all("div", class_="quote"):
        all_quotes.append(quote.find("span", class_="text").text.strip())
        all_authors.append(quote.find("small", class_="author").text.strip())

    time.sleep(1)  # Be polite: pause between requests

df = pd.DataFrame({"Quote": all_quotes, "Author": all_authors})
df.to_csv("all_quotes.csv", index=False)
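Hard-coding five pages works here, but a more general pattern is to follow the site's "Next" link until it disappears. quotes.toscrape.com marks it with an li element of class "next"; the sketch below shows just the link-detection step on an inline snippet standing in for the page footer:

```python
from bs4 import BeautifulSoup

# Stand-in for the pager markup at the bottom of each page
html = '<ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>'
soup = BeautifulSoup(html, "html.parser")

# In a real loop: keep requesting next_link["href"] until this is None
next_link = soup.select_one("li.next > a")
if next_link:
    print("Next page:", next_link["href"])
```

When select_one returns None, there is no next page and the loop can stop, so the scraper adapts if the site grows or shrinks.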

🔄 Bonus: Scraping JavaScript-Rendered Sites using Selenium

Some sites load their data dynamically with JavaScript. requests only sees the initial HTML, so the content you want never appears in the response; you need a real browser.

🛠️ Install Selenium & WebDriver

pip install selenium

Selenium 4.6+ ships with Selenium Manager, which downloads a matching driver automatically. On older versions, download the appropriate ChromeDriver from https://chromedriver.chromium.org/downloads and add it to your system PATH.

🌐 Selenium Example

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import time

service = Service("chromedriver")  # Path to your ChromeDriver
driver = webdriver.Chrome(service=service)

driver.get("https://quotes.toscrape.com/js/")
time.sleep(2)  # Wait for JS to load

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

for quote in soup.find_all("div", class_="quote"):
    print(quote.find("span", class_="text").text.strip())

🧠 Best Practices for Web Scraping

  • ✅ Use headers to mimic a browser:

headers = {"User-Agent": "Mozilla/5.0"}
requests.get(url, headers=headers)
  • ✅ Add delays between requests using time.sleep()
  • ✅ Handle exceptions and errors gracefully
  • ✅ Respect robots.txt and terms of use
  • ✅ Use proxies or rotate IPs for large-scale scraping
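The tips above can be rolled into one helper. Here is a sketch of a polite, fault-tolerant fetcher; the retry count, delay, and User-Agent string are illustrative choices, not requirements:

```python
import time
import requests

def fetch(url, retries=3, delay=1.0):
    """Fetch a URL with a browser-like User-Agent, retries, and a pause."""
    headers = {"User-Agent": "Mozilla/5.0"}
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()  # raise on 4xx/5xx responses
            return response.text
        except requests.RequestException as exc:
            print(f"Attempt {attempt} failed: {exc}")
            time.sleep(delay)  # pause before retrying (be polite)
    return None  # all attempts failed
```

Call html = fetch(url) in your scraping loop and skip the page when it returns None, instead of letting one bad response crash the whole run.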

📦 Real-World Use Cases

  • 📰 News Monitoring (e.g., scraping articles for sentiment analysis)
  • 🛒 E-commerce Price Tracking
  • 📊 Competitor Research
  • 🧠 Training Datasets for NLP/ML projects
  • 🏢 Job Listings and Market Analysis

📌 Final Thoughts

Web scraping is a foundational tool in a data scientist's arsenal. Mastering it opens up endless possibilities, from building custom datasets to powering AI models with real-world information.

"If data is fuel, then web scraping is how you build your own pipeline."
