We recently collected some stats to determine which web scraping API performs better. One key metric was latency, so I want to focus on the testing script here.
To keep things fair, all tests were run on the same machine and around the same time of day.
Table of Contents
- Step 1: Imports & Config
- Step 2: API-Specific Functions
- Step 3: Percentiles & Test Runner
- Step 4: Main Logic & API Keys
- Step 5: Results
Step 1: Imports & Config
Let’s start by installing the required third-party libraries (time and json ship with Python, so there’s nothing to install for them):
pip install requests pandas numpy
Here’s a quick summary of what each library does in this project:
Library | Purpose / Description
---|---
requests | For sending HTTP requests and interacting with APIs
time | Provides time-related functions (e.g. delays, timestamps)
pandas | Data manipulation and analysis using DataFrames
numpy | Numerical computing, array operations
json | Parsing and creating JSON-formatted data
Now, import them into your project:
import requests
import time
import pandas as pd
import numpy as np
import json
Step 2: API-Specific Functions
Make a list of the web scraping APIs you want to test. In our case, we’re comparing the top 3: HasData, Oxylabs, and ScrapingBee.
Let’s set a test URL and define how many times each API will be called:
TEST_URL = "https://httpbin.org/html"
N_REPEATS = 100
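The original script stops at these two settings. If you’re worried about a single hung request skewing a run, you could also define a timeout to pass into each requests call; the constant below is my own addition, not part of the original configuration:

REQUEST_TIMEOUT = 60  # seconds before a single call is abandoned (assumed value, not in the original script)

You’d then pass timeout=REQUEST_TIMEOUT to every requests.post() / requests.get() call shown below.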
Next, create function templates to measure response times:
def test_hasdata(api_key):
    pass

def test_oxylabs(username, password):
    pass

def test_scrapingbee(api_key):
    pass
For example, a function for HasData might look like this:
def test_hasdata(api_key):
    times = []
    for _ in range(N_REPEATS):
        url = "https://api.hasdata.com/scrape/web"
        payload = json.dumps({
            "url": TEST_URL,
            "proxyType": "datacenter",
            "proxyCountry": "US",
        })
        headers = {
            "Content-Type": "application/json",
            "x-api-key": api_key
        }
        start = time.time()
        resp = requests.post(url, headers=headers, data=payload)
        times.append(time.time() - start)
    return times
Now do the same for the other APIs you want to compare.
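For reference, here are rough sketches of the other two. I’m assuming Oxylabs’ realtime Web Scraper API endpoint (https://realtime.oxylabs.io/v1/queries, authenticated with HTTP basic auth) and ScrapingBee’s standard endpoint (https://app.scrapingbee.com/api/v1/, which takes the API key and target URL as query parameters); verify the endpoints and parameters against each provider’s documentation before relying on the numbers.

def test_oxylabs(username, password):
    times = []
    for _ in range(N_REPEATS):
        # Assumed Oxylabs realtime endpoint; "universal" is their generic source for arbitrary URLs
        payload = {"source": "universal", "url": TEST_URL}
        start = time.time()
        resp = requests.post(
            "https://realtime.oxylabs.io/v1/queries",
            auth=(username, password),
            json=payload,
        )
        times.append(time.time() - start)
    return times

def test_scrapingbee(api_key):
    times = []
    for _ in range(N_REPEATS):
        # Assumed ScrapingBee endpoint; key and target URL go in the query string
        params = {"api_key": api_key, "url": TEST_URL}
        start = time.time()
        resp = requests.get("https://app.scrapingbee.com/api/v1/", params=params)
        times.append(time.time() - start)
    return times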
Step 3: Percentiles & Test Runner
We’ll need a function to calculate percentiles:
def calc_percentiles(times):
    return {
        "p50": round(np.percentile(times, 50), 3),
        "p75": round(np.percentile(times, 75), 3),
        "p95": round(np.percentile(times, 95), 3)
    }
This tells us the response time that 50%, 75%, and 95% of requests stayed under: p50 is the median, while p95 captures the slow tail.
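As a quick sanity check, here’s what the function returns for a handful of made-up timings (these numbers are purely illustrative, not measurements):

sample = [0.8, 0.9, 1.0, 1.1, 2.5]  # invented timings in seconds
print(calc_percentiles(sample))     # -> p50: 1.0, p75: 1.1, p95: 2.22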
Then write a function to run the tests:
def run_all_tests(credentials):
    results = {}
    results["HasData"] = calc_percentiles(test_hasdata(credentials["HasData"]))
    results["Oxylabs"] = calc_percentiles(test_oxylabs(
        credentials["Oxylabs"]["username"],
        credentials["Oxylabs"]["password"]
    ))
    results["ScrapingBee"] = calc_percentiles(test_scrapingbee(credentials["ScrapingBee"]))
    return pd.DataFrame(results).T
Step 4: Main Logic & API Keys
Now, let’s put everything together in a main function and add credentials for the APIs you’re testing:
if __name__ == "__main__":
    credentials = {
        "HasData": "YOUR-API-KEY",
        "Oxylabs": {"username": "YOUR-USERNAME", "password": "YOUR-PASSWORD"},
        "ScrapingBee": "YOUR-API-KEY"
    }
    df = run_all_tests(credentials)
    print(df)
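Optionally (this isn’t in the original script), you can sort the table by median latency before printing it and save a copy of the raw numbers for later:

df = df.sort_values("p50")        # fastest median response time first
df.to_csv("latency_results.csv")  # keep the raw numbers for the follow-up comparison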
Step 5: Results
After running the script, you’ll get real performance data for your selected APIs: a table of p50, p75, and p95 latencies for each service.
Of course, judging web scraping APIs by speed alone isn't fair. In the next post, we’ll dig deeper into how these three services compare overall.
You can also check out the full article where we compared the 7 best web scraping APIs; this script was originally built for that comparison.
Extra Resources:
- Best Web Scraping APIs in 2025
- Google SERP APIs Ranked by Speed, Cost, and Pain Points
- Join our Discord
What do you look for when choosing a web scraping API?