1. Introduction: The Need for Speed in Web Scraping
In the world of data, speed is often synonymous with value. Whether you're monitoring competitor prices, gathering real-time news, aggregating market data, or building vast datasets for machine learning, the efficiency of your web scrapers directly impacts the freshness, completeness, and utility of the information you acquire (Data Engineering Weekly, 2025).
However, traditional web scrapers often operate in a sequential manner. Imagine your scraper as a single cashier at a grocery store, serving each customer in full before turning to the next. In the context of web scraping, this translates to:
- Send a request to a website.
- Wait for the server to respond.
- Receive the response.
- Process the data.
- Repeat the entire process for the next URL.
The most significant bottleneck here is the waiting time. Network latency, server response times, and even the time it takes for a browser to render a page (as seen with SPAs) can add significant delays (Mozilla Developer Network, 2025). For a single request, this delay is negligible. But when you need to scrape hundreds, thousands, or even millions of pages, these accumulated wait times can stretch a scraping job from minutes to hours, or even days. This sluggishness can render your data obsolete before it's even fully collected, especially in dynamic environments.
This inherent inefficiency of sequential execution creates a critical need for faster, more sophisticated scraping techniques. The solution lies in fundamentally changing how your scraper handles tasks: by performing multiple operations concurrently or in parallel.
This article will turbocharge your web scraping capabilities by diving deep into asynchronous and parallel programming in Python. We'll demystify the core concepts, explore the powerful Python libraries `asyncio` for concurrent network operations and `concurrent.futures` for parallel processing, and provide practical examples. Furthermore, we'll discuss the crucial best practices that accompany high-speed scraping, such as managing rate limits and avoiding IP blocks. By the end, you'll be equipped to build highly efficient and robust web scrapers that can tackle even the most demanding data acquisition challenges.
2. Concurrency vs. Parallelism: Demystifying the Concepts
Before we dive into specific Python tools, it's crucial to understand the fundamental difference between concurrency and parallelism. These terms are often used interchangeably, but they describe distinct ways of managing tasks (Stack Overflow, 2025). Understanding this distinction is key to choosing the right approach for your scraper's performance needs.
2.1. Concurrency: Dealing with Many Things at Once
Concurrency means managing multiple tasks seemingly at the same time. A system is concurrent if it can make progress on multiple tasks, even if only one task is truly executing at any given instant. Think of a single chef (a single CPU core) juggling multiple cooking tasks:
- The chef puts pasta on to boil.
- While the water heats, the chef chops vegetables for the sauce.
- Then, the chef stirs the pasta.
- While the pasta cooks, the chef prepares a salad.
The chef isn't doing all tasks simultaneously, but by switching between them during natural "wait times" (e.g., while the water heats or the pasta cooks), they complete everything faster than if they did each task sequentially from start to finish.
In programming, concurrency is typically achieved through:
- Multithreading: A single process runs multiple threads, which share the same memory space. The operating system rapidly switches between threads, giving the illusion of simultaneous execution.
- Asynchronous I/O (Event Loops): A single thread manages multiple I/O operations (like network requests). When one operation is waiting (e.g., for a web server to respond), the program switches to another operation that is ready to proceed.
Concurrency is ideal for I/O-bound tasks, where the program spends most of its time waiting for external operations (like network requests, disk I/O, or database queries).
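To make the idea concrete, here is a minimal, self-contained sketch (not part of the scraper itself) showing a single thread overlapping two simulated waits with `asyncio`; the two-second and one-second "requests" finish in roughly two seconds total rather than three:
import asyncio
import time

async def wait_task(name: str, seconds: float) -> None:
    # Simulates an I/O wait (e.g., a network response) without blocking the thread.
    print(f"{name}: started")
    await asyncio.sleep(seconds)
    print(f"{name}: finished after {seconds}s")

async def main() -> None:
    start = time.time()
    # Both waits overlap, so total elapsed time is ~2s, not ~3s.
    await asyncio.gather(wait_task("task-1", 2.0), wait_task("task-2", 1.0))
    print(f"Elapsed: {time.time() - start:.1f}s")

if __name__ == "__main__":
    asyncio.run(main())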
2.2. Parallelism: Doing Many Things at Once
Parallelism means actually executing multiple tasks simultaneously. This requires multiple independent processing units (e.g., multiple CPU cores, multiple machines). Think of multiple chefs (multiple CPU cores) working in the same kitchen:
- Chef 1 boils the pasta.
- Chef 2 chops the vegetables.
- Chef 3 prepares the salad.
All three tasks are truly happening at the exact same moment.
In programming, parallelism is primarily achieved through:
- Multiprocessing: A program creates multiple separate processes, each with its own memory space and Python interpreter. These processes can then run on different CPU cores simultaneously.
Parallelism is ideal for CPU-bound tasks, where the program spends most of its time performing intensive calculations or computations.
2.3. Python's Global Interpreter Lock (GIL)
Understanding concurrency and parallelism in Python requires acknowledging the Global Interpreter Lock (GIL) (Python Documentation: Global Interpreter Lock, 2025). The GIL is a mutex that protects access to Python objects, preventing multiple native threads from executing Python bytecodes at once.
- Impact on Multithreading: Due to the GIL, even if you use multithreading in Python, your threads cannot execute Python bytecode in parallel on multiple CPU cores. Only one thread can hold the GIL at a time.
- When Multithreading Is Useful: The GIL is released when a thread is performing an I/O operation (e.g., waiting for a network response, reading from a disk). This means that for I/O-bound tasks, multithreading can still offer performance benefits because while one thread is waiting for I/O, another thread can acquire the GIL and execute Python code.
- When Multiprocessing Is Needed: For truly CPU-bound tasks that require simultaneous execution on multiple cores, you must use multiprocessing (creating separate processes) to bypass the GIL.
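The impact of the GIL is easy to observe with a small experiment. The sketch below is illustrative only (absolute timings depend on your machine); it runs the same pure-Python, CPU-bound loop in a thread pool and in a process pool. The thread version gains little because only one thread can hold the GIL at a time, while the process version spreads the work across cores:
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def count_down(n: int) -> int:
    # Pure-Python loop: CPU-bound, holds the GIL the whole time it runs.
    while n > 0:
        n -= 1
    return n

def timed(executor_cls, workers: int = 4, n: int = 5_000_000) -> float:
    start = time.time()
    with executor_cls(max_workers=workers) as executor:
        # Run the same CPU-bound job once per worker.
        list(executor.map(count_down, [n] * workers))
    return time.time() - start

if __name__ == "__main__":
    print(f"Threads:   {timed(ThreadPoolExecutor):.2f}s")   # Serialized by the GIL
    print(f"Processes: {timed(ProcessPoolExecutor):.2f}s")  # True parallelism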
2.4. Relevance to Web Scraping
Web scraping is predominantly an I/O-bound task. Your scraper spends the vast majority of its time waiting for:
- Network requests to travel to the web server.
- The web server to process the request and generate a response.
- The response data to travel back across the network.
- A browser (if using Selenium/Playwright) to load and render content.
Very little time is spent on intensive Python computations (unless you're doing complex data parsing or heavy local processing after fetching).
Therefore, for web scraping, concurrency is generally the more impactful strategy for performance gains. This means:
- Asynchronous I/O (`asyncio`): The most efficient way to manage concurrent network requests within a single thread.
- Multithreading (`ThreadPoolExecutor`): Can also be used for concurrency in I/O-bound tasks, as threads release the GIL during network waits.

Parallelism (`ProcessPoolExecutor`) becomes relevant if you have significant CPU-bound post-processing of the scraped data (e.g., applying complex regex, running ML models for sentiment analysis on text, heavy data cleaning), or if you want to run entirely independent scraping jobs on separate cores.
With this distinction clear, we can now explore the Python tools designed to leverage these concepts for turbocharging your scrapers.
3. Asynchronous Scraping with `asyncio` for I/O-Bound Tasks
As established, web scraping is primarily I/O-bound. This means your program spends most of its time waiting for network responses. Python's `asyncio` library, introduced in Python 3.4, is the standard choice for writing concurrent code using the `async`/`await` syntax added in Python 3.5 (Python Documentation: Asyncio, 2025). It allows a single thread to manage many concurrent I/O operations efficiently, without blocking the entire program while waiting.
3.1. `asyncio` Fundamentals: `async`, `await`, and the Event Loop
At the heart of `asyncio` are a few key concepts:
- Coroutines (`async def`): Functions defined with `async def` are coroutines. They are special functions that can be "paused" during their execution to allow other tasks to run, and then "resumed" later. They don't block the execution of other parts of your program.
- `await`: The `await` keyword can only be used inside an `async def` function. It tells the event loop to "pause" the current coroutine and switch to another task until the `await`ed operation (typically an I/O operation like a network request) is complete.
- Event Loop: This is the core of `asyncio`. It's a loop that monitors coroutines, identifies when one is waiting for an I/O operation, and then switches execution to another coroutine that is ready to run. When the waiting operation finishes, the event loop resumes the paused coroutine.
This allows your scraper to initiate multiple web requests almost simultaneously. While the first request is waiting for a response, your scraper can send the second, then the third, and so on. When any response comes back, the event loop processes it, then moves on to the next available task.
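As a quick, self-contained illustration of these pieces (using `asyncio.sleep` as a stand-in for a network wait), note that calling a coroutine function only creates a coroutine object; wrapping it in a task schedules it on the event loop, and `await` is where control is handed back:
import asyncio

async def fetch_stub(name: str, delay: float) -> str:
    # Stand-in for a network call: awaiting sleep yields control to the event loop.
    print(f"{name}: request sent")
    await asyncio.sleep(delay)
    print(f"{name}: response received")
    return f"{name}-body"

async def main() -> None:
    # create_task() schedules both coroutines immediately; they run interleaved.
    task_a = asyncio.create_task(fetch_stub("a", 1.0))
    task_b = asyncio.create_task(fetch_stub("b", 0.5))
    results = await asyncio.gather(task_a, task_b)
    print(results)  # ['a-body', 'b-body']

asyncio.run(main())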
3.2. `aiohttp` for Asynchronous HTTP Requests
While `asyncio` provides the framework for asynchronous programming, you need an asynchronous HTTP client to make the actual web requests. `aiohttp` is a popular and robust library for this purpose (aiohttp Documentation, 2025). It's built on top of `asyncio` and provides an intuitive API for making `GET`, `POST`, and other HTTP requests.
First, ensure you have `aiohttp` installed:
pip install aiohttp
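As a minimal sanity check that the library works before building the full scraper (the URL is just the demo site used throughout this article), a single asynchronous request looks like this:
import asyncio
import aiohttp

async def main() -> None:
    # One ClientSession per scraper run; it handles connection pooling for you.
    async with aiohttp.ClientSession() as session:
        async with session.get("http://books.toscrape.com/") as response:
            response.raise_for_status()
            html = await response.text()
            print(f"Status: {response.status}, characters received: {len(html)}")

asyncio.run(main())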
3.3. Implementation Example: Concurrent Scraping
Let's adapt our simple book scraper (from previous articles) to use `asyncio` and `aiohttp` to fetch multiple URLs concurrently.

`async_scraper.py`:
import asyncio
import aiohttp
from bs4 import BeautifulSoup
import json
import time
import os
# List of URLs to scrape concurrently (using different categories from books.toscrape.com)
URLS = [
"http://books.toscrape.com/catalogue/category/books/travel_2/index.html",
"http://books.toscrape.com/catalogue/category/books/mystery_3/index.html",
"http://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html",
"http://books.toscrape.com/catalogue/category/books/classics_6/index.html",
"http://books.toscrape.com/catalogue/category/books/poetry_23/index.html",
"http://books.toscrape.com/catalogue/category/books/science_22/index.html",
"http://books.toscrape.com/catalogue/category/books/programming_33/index.html",
"http://books.toscrape.com/catalogue/category/books/fiction_10/index.html",
"http://books.toscrape.com/catalogue/category/books/childrens_11/index.html",
"http://books.toscrape.com/catalogue/category/books/humor_30/index.html",
]
async def fetch_page(session: aiohttp.ClientSession, url: str) -> str | None:
"""Asynchronously fetches the content of a single URL.
Args:
session (aiohttp.ClientSession): The aiohttp client session to use for the request.
url (str): The URL to fetch.
Returns:
str | None: The HTML content of the page if successful, None otherwise.
    Note:
        Network errors (aiohttp.ClientError) and timeouts (asyncio.TimeoutError)
        are caught and logged, and None is returned in those cases.
"""
try:
async with session.get(url, timeout=10) as response:
response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)
print(f"Fetched {url} - Status: {response.status}")
return await response.text()
except aiohttp.ClientError as e:
print(f"Error fetching {url}: {e}")
return None
except asyncio.TimeoutError:
print(f"Timeout fetching {url}")
return None
except Exception as e:
print(f"An unexpected error occurred while fetching {url}: {e}")
return None
def parse_books(html_content: str) -> list[dict]:
"""Parses book titles and prices from HTML content.
This function is synchronous as parsing is typically CPU-bound.
Args:
html_content (str): The HTML content of the page as a string.
Returns:
list[dict]: A list of dictionaries, where each dictionary represents a book
with 'title' and 'price' keys. Returns an empty list if parsing fails
or no books are found.
"""
if not html_content:
return []
soup = BeautifulSoup(html_content, 'html.parser')
books_data = []
# Find all book articles
articles = soup.find_all('article', class_='product_pod')
for article in articles:
try:
title = article.h3.a['title']
price = article.find('p', class_='price_color').text.strip()
books_data.append({'title': title, 'price': price})
except AttributeError as e:
print(f"Skipping a book due to missing element: {e}")
continue
return books_data
async def scrape_all_urls(urls: list[str]) -> list[dict]:
"""Manages fetching and parsing multiple URLs concurrently.
Args:
urls (list[str]): A list of URLs to scrape.
Returns:
list[dict]: A consolidated list of all scraped books from all successful URLs.
"""
all_scraped_books = []
# Use a single aiohttp ClientSession for efficiency.
# This manages connection pooling and cookies effectively.
async with aiohttp.ClientSession() as session:
# Create a list of coroutine tasks for fetching each page.
tasks = [fetch_page(session, url) for url in urls]
# Run all fetch tasks concurrently and wait for them to complete.
# return_exceptions=True allows other tasks to complete even if one fails,
# collecting the exception instead of stopping the whole gather.
html_contents = await asyncio.gather(*tasks, return_exceptions=True)
# Process the results after all fetches are attempted.
for i, content in enumerate(html_contents):
url = urls[i]
if isinstance(content, str): # Check if content is string (successful fetch)
print(f"Parsing content from {url}...")
parsed_data = parse_books(content)
all_scraped_books.extend(parsed_data)
else:
# If content is an exception object or None, it means the fetch failed.
print(f"Skipping parsing for {url} due to error/timeout.")
return all_scraped_books
if __name__ == "__main__":
"""Main execution block for the asynchronous scraper."""
start_time = time.time()
# Run the main asynchronous function using asyncio.run().
scraped_data = asyncio.run(scrape_all_urls(URLS))
end_time = time.time()
if scraped_data:
# Define output directory and file for scraped data.
output_dir = os.getenv('OUTPUT_DIR', '.') # Default to current directory
os.makedirs(output_dir, exist_ok=True) # Ensure directory exists
output_file_path = os.path.join(output_dir, 'scraped_books_async.json')
with open(output_file_path, 'w', encoding='utf-8') as f:
json.dump(scraped_data, f, ensure_ascii=False, indent=4)
print(f"\nSuccessfully scraped {len(scraped_data)} books. Data saved to {output_file_path}")
print(f"Total time taken: {end_time - start_time:.2f} seconds.")
else:
print("No books scraped or an error occurred during scraping.")
print(f"Total time taken: {end_time - start_time:.2f} seconds.")
Explanation:
- `async def fetch_page(session, url):` This coroutine performs the actual HTTP `GET` request using `aiohttp`. The `await response.text()` line is where the coroutine will "pause" and yield control back to the event loop while it waits for the network response.
- `async with aiohttp.ClientSession() as session:` It's crucial to create a single `ClientSession` object and reuse it for all requests. This manages connection pooling and cookies efficiently, which is vital for performance.
- `tasks = [fetch_page(session, url) for url in urls]`: This creates a list of coroutine objects (tasks), but they are not yet running.
- `await asyncio.gather(*tasks, return_exceptions=True)`: This is the magic. `asyncio.gather` takes multiple coroutines and runs them concurrently on the event loop. It waits until all of them are complete. `return_exceptions=True` is a good practice: if one task fails, `gather` won't immediately stop; it will collect the exception and allow other tasks to finish.
- `asyncio.run(scrape_all_urls(URLS))`: This is the entry point for running an `asyncio` application. It creates a new event loop, runs the `scrape_all_urls` coroutine until it completes, and then closes the loop.
When you run `python async_scraper.py`, you'll notice that the URLs are fetched almost simultaneously, leading to a significant speedup compared to fetching them one by one. This demonstrates the power of `asyncio` for I/O-bound web scraping.
4. Parallel Scraping with `concurrent.futures` (Threads and Processes)
While `asyncio` excels at managing concurrent I/O operations within a single thread, Python's `concurrent.futures` module provides a higher-level interface for asynchronously executing callables (Python Documentation: concurrent.futures, 2025). It abstracts away much of the complexity of managing threads and processes, offering `ThreadPoolExecutor` and `ProcessPoolExecutor`.
4.1. `ThreadPoolExecutor` for I/O-Bound Concurrency
`ThreadPoolExecutor` manages a pool of worker threads. As discussed in Chapter 2, due to Python's GIL, threads don't achieve true parallel execution of Python bytecode. However, for I/O-bound tasks like web scraping, threads do provide a benefit because the GIL is released during network waits (Python Documentation: Global Interpreter Lock, 2025). This means that while one thread is waiting for a website's response, another thread can acquire the GIL and send its own request.
`ThreadPoolExecutor` can sometimes be simpler to implement for concurrent HTTP requests than `asyncio`, especially if your existing scraper code is predominantly synchronous.
`thread_scraper.py`:
import requests
from bs4 import BeautifulSoup
import json
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
import os
# List of URLs to scrape concurrently
URLS = [
"http://books.toscrape.com/catalogue/category/books/travel_2/index.html",
"http://books.toscrape.com/catalogue/category/books/mystery_3/index.html",
"http://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html",
"http://books.toscrape.com/catalogue/category/books/classics_6/index.html",
"http://books.toscrape.com/catalogue/category/books/poetry_23/index.html",
"http://books.toscrape.com/catalogue/category/books/science_22/index.html",
"http://books.toscrape.com/catalogue/category/books/programming_33/index.html",
"http://books.toscrape.com/catalogue/category/books/fiction_10/index.html",
"http://books.toscrape.com/catalogue/category/books/childrens_11/index.html",
"http://books.toscrape.com/catalogue/category/books/humor_30/index.html",
]
def fetch_page(url: str) -> str | None:
"""Synchronously fetches the content of a single URL.
This function will be run by threads in the ThreadPoolExecutor.
Args:
url (str): The URL to fetch.
Returns:
str | None: The HTML content of the page if successful, None otherwise.
"""
try:
response = requests.get(url, timeout=10)
response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)
print(f"Fetched {url} - Status: {response.status}")
return response.text
except requests.exceptions.RequestException as e:
print(f"Error fetching {url}: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred while fetching {url}: {e}")
return None
def parse_books(html_content: str) -> list[dict]:
"""Parses book titles and prices from HTML content.
Args:
html_content (str): The HTML content of the page as a string.
Returns:
list[dict]: A list of dictionaries, where each dictionary represents a book
with 'title' and 'price' keys.
"""
if not html_content:
return []
soup = BeautifulSoup(html_content, 'html.parser')
books_data = []
articles = soup.find_all('article', class_='product_pod')
for article in articles:
try:
title = article.h3.a['title']
price = article.find('p', class_='price_color').text.strip()
books_data.append({'title': title, 'price': price})
except AttributeError as e:
# print(f"Skipping a book due to missing element: {e}") # Uncomment for debugging
continue
return books_data
def scrape_with_threads(urls: list[str], max_workers: int = 5) -> list[dict]:
"""Scrapes multiple URLs concurrently using ThreadPoolExecutor.
Args:
urls (list[str]): A list of URLs to scrape.
max_workers (int): The maximum number of threads to use.
Returns:
list[dict]: A consolidated list of all scraped books from all successful URLs.
"""
all_scraped_books = []
# Use ThreadPoolExecutor to run fetch_page on multiple URLs concurrently
with ThreadPoolExecutor(max_workers=max_workers) as executor:
# Submit tasks to the executor
future_to_url = {executor.submit(fetch_page, url): url for url in urls}
# Iterate over completed futures as they finish
for future in as_completed(future_to_url):
url = future_to_url[future]
try:
html_content = future.result() # Get the result of the task
if html_content:
print(f"Parsing content from {url}...")
parsed_data = parse_books(html_content)
all_scraped_books.extend(parsed_data)
else:
print(f"Skipping parsing for {url} due to failed fetch.")
except Exception as exc:
print(f"{url} generated an exception: {exc}")
return all_scraped_books
if __name__ == "__main__":
"""Main execution block for the multi-threaded scraper."""
start_time = time.time()
# Run the scraper with threads (e.g., 8 concurrent threads)
scraped_data = scrape_with_threads(URLS, max_workers=8)
end_time = time.time()
if scraped_data:
output_dir = os.getenv('OUTPUT_DIR', '.')
os.makedirs(output_dir, exist_ok=True)
output_file_path = os.path.join(output_dir, 'scraped_books_threads.json')
with open(output_file_path, 'w', encoding='utf-8') as f:
json.dump(scraped_data, f, ensure_ascii=False, indent=4)
print(f"\nSuccessfully scraped {len(scraped_data)} books. Data saved to {output_file_path}")
print(f"Total time taken: {end_time - start_time:.2f} seconds.")
else:
print("No books scraped or an error occurred during scraping.")
print(f"Total time taken: {end_time - start_time:.2f} seconds.")
4.2. `ProcessPoolExecutor` for True Parallelism (CPU-Bound Tasks)
`ProcessPoolExecutor` creates separate processes for each task. Because each process has its own Python interpreter and memory space, they can truly execute in parallel on multiple CPU cores, bypassing the GIL (Python Documentation: Global Interpreter Lock, 2025).
`ProcessPoolExecutor` is ideal for:
- CPU-bound post-processing: If your scraper performs heavy calculations on the downloaded data (e.g., complex NLP, image processing, machine learning inference).
- Running independent scraping jobs: If you have multiple, distinct scraping tasks that don't share memory and can truly run in parallel.
- Bypassing GIL limitations: When the benefits of true parallelism outweigh the overhead of inter-process communication.
`process_scraper.py` (illustrative, focusing on parsing as a CPU-bound task):
To demonstrate, we'll modify the previous example to show how `ProcessPoolExecutor` might be used for the parsing step, assuming parsing is now very CPU-intensive. In a real-world scenario, the `fetch_page` part would still be concurrent (e.g., with `asyncio` or `ThreadPoolExecutor`), and then the `parse_books` calls would be offloaded to a `ProcessPoolExecutor`.
import requests
from bs4 import BeautifulSoup
import json
import time
from concurrent.futures import ProcessPoolExecutor, as_completed
import os
# Using a subset of URLs for demonstration, as fetching is still sequential here for clarity
URLS_SUBSET = [
"http://books.toscrape.com/catalogue/category/books/travel_2/index.html",
"http://books.toscrape.com/catalogue/category/books/mystery_3/index.html",
"http://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html",
]
def fetch_and_parse_single_url(url: str) -> list[dict]:
"""Synchronously fetches and parses a single URL.
This entire function will be run in a separate process.
Args:
url (str): The URL to fetch and parse.
Returns:
list[dict]: A list of scraped book dictionaries.
"""
try:
response = requests.get(url, timeout=10)
response.raise_for_status()
html_content = response.text
print(f"Fetched and parsing {url} in process {os.getpid()}")
# Simulate CPU-bound work in parsing (e.g., complex regex, data cleaning)
# For demonstration, let's add a small delay
time.sleep(0.1)
soup = BeautifulSoup(html_content, 'html.parser')
books_data = []
articles = soup.find_all('article', class_='product_pod')
for article in articles:
try:
title = article.h3.a['title']
price = article.find('p', class_='price_color').text.strip()
books_data.append({'title': title, 'price': price})
except AttributeError:
continue
return books_data
except requests.exceptions.RequestException as e:
print(f"Error processing {url} in process {os.getpid()}: {e}")
return []
except Exception as e:
print(f"An unexpected error occurred for {url} in process {os.getpid()}: {e}")
return []
def scrape_with_processes(urls: list[str], max_workers: int | None = None) -> list[dict]:
"""Scrapes and parses multiple URLs in parallel using ProcessPoolExecutor.
Args:
urls (list[str]): A list of URLs to scrape and parse.
max_workers (int, optional): The maximum number of processes to use.
Defaults to os.cpu_count().
Returns:
list[dict]: A consolidated list of all scraped books.
"""
all_scraped_books = []
# Use ProcessPoolExecutor for true parallelism
# max_workers defaults to os.cpu_count() if not specified
with ProcessPoolExecutor(max_workers=max_workers) as executor:
future_to_url = {executor.submit(fetch_and_parse_single_url, url): url for url in urls}
for future in as_completed(future_to_url):
url = future_to_url[future]
try:
books = future.result()
all_scraped_books.extend(books)
except Exception as exc:
print(f"{url} generated an exception: {exc}")
return all_scraped_books
if __name__ == "__main__":
"""Main execution block for the multi-process scraper."""
start_time = time.time()
# Run the scraper with processes (defaults to number of CPU cores)
scraped_data = scrape_with_processes(URLS_SUBSET)
end_time = time.time()
if scraped_data:
output_dir = os.getenv('OUTPUT_DIR', '.')
os.makedirs(output_dir, exist_ok=True)
output_file_path = os.path.join(output_dir, 'scraped_books_processes.json')
with open(output_file_path, 'w', encoding='utf-8') as f:
json.dump(scraped_data, f, ensure_ascii=False, indent=4)
print(f"\nSuccessfully scraped {len(scraped_data)} books. Data saved to {output_file_path}")
print(f"Total time taken: {end_time - start_time:.2f} seconds.")
else:
print("No books scraped or an error occurred during scraping.")
print(f"Total time taken: {end_time - start_time:.2f} seconds.")
4.3. Choosing Between `ThreadPoolExecutor`, `ProcessPoolExecutor`, and `asyncio`
The choice depends on the nature of your scraper's bottlenecks:
- `asyncio` (and `aiohttp`):
  - Best For: Purely I/O-bound tasks (most web scraping fetching).
  - Pros: Highly efficient single-threaded concurrency, minimal overhead (no GIL contention or inter-process communication), scales well to thousands of concurrent connections.
  - Cons: Requires `async`/`await` syntax (can be a learning curve), needs compatible async libraries (e.g., `aiohttp` instead of `requests`).
- `ThreadPoolExecutor`:
  - Best For: I/O-bound tasks when `asyncio` is not feasible or desired (e.g., using existing synchronous libraries, simpler code for moderate concurrency).
  - Pros: Easier to integrate with existing synchronous code, simpler mental model for some.
  - Cons: Still limited by the GIL for CPU-bound parts, potential for race conditions if not careful with shared resources.
- `ProcessPoolExecutor`:
  - Best For: CPU-bound tasks (e.g., heavy parsing, complex data processing, ML inference on scraped data) where true parallelism is needed. Also useful for running independent scraping jobs that don't share resources.
  - Pros: Bypasses the GIL, truly utilizes multiple CPU cores.
  - Cons: Higher overhead due to process creation and inter-process communication (IPC), no shared memory (data must be pickled/unpickled), not suitable for tasks that require frequent shared state.

In web scraping, `asyncio` is often the go-to for maximizing fetch speed. `ProcessPoolExecutor` is then considered for heavy post-processing. `ThreadPoolExecutor` serves as a middle ground or a simpler alternative to `asyncio` for fetching when the scale isn't extreme. Understanding these nuances allows you to architect the most efficient scraper for your specific needs.
5. Advanced Strategies for High-Performance Scraping
Turbocharging your scrapers with asynchronous and parallel techniques is powerful, but with great power comes great responsibility. Sending too many requests too quickly can overwhelm a server, leading to temporary (or even permanent) IP bans, captchas, or other anti-scraping measures. To maintain a good relationship with the websites you scrape and ensure the long-term success of your operations, you need to implement advanced strategies that combine speed with politeness and resilience.
5.1. Rate Limiting and Delays
While speed is the goal, hammering a server with thousands of requests per second is a surefire way to get blocked. Rate limiting is the practice of controlling the pace of your requests.
- Fixed Delays: The simplest approach is to introduce a fixed delay between requests.
  - For `asyncio`: use `await asyncio.sleep(delay_seconds)`.
  - For `concurrent.futures` (threads/processes): use `time.sleep(delay_seconds)`.
# Asyncio example with a fixed delay
import asyncio
import aiohttp
async def fetch_with_delay(session, url, delay=0.5):
await asyncio.sleep(delay) # Wait before sending the request
async with session.get(url) as response:
return await response.text()
# ThreadPoolExecutor example with a fixed delay (inside the function)
import requests
import time
def fetch_with_delay_sync(url, delay=0.5):
time.sleep(delay) # Wait before sending the request
response = requests.get(url)
return response.text
- Dynamic Delays / Concurrency Limits: A more sophisticated approach involves dynamically adjusting delays based on server response or limiting the number of concurrent requests.
  - Semaphore for Concurrency Control: `asyncio.Semaphore` (for `asyncio`) or external queues can limit how many tasks can run simultaneously, preventing you from overwhelming the target server, even if you have many URLs (Real Python, 2025).
# Asyncio Semaphore example
import asyncio
import aiohttp
# Limit to 5 concurrent requests
CONCURRENT_REQUESTS_LIMIT = 5
semaphore = asyncio.Semaphore(CONCURRENT_REQUESTS_LIMIT)
async def fetch_page_limited(session, url):
async with semaphore: # Acquire a semaphore slot
print(f"Fetching {url} (active: {CONCURRENT_REQUESTS_LIMIT - semaphore._value})")
async with session.get(url, timeout=10) as response:
await asyncio.sleep(0.1) # Small politeness delay
response.raise_for_status()
return await response.text()
async def main_limited_scraper(urls):
async with aiohttp.ClientSession() as session:
tasks = [fetch_page_limited(session, url) for url in urls]
await asyncio.gather(*tasks)
# asyncio.run(main_limited_scraper(URLS))
5.2. Exponential Backoff for Retries
Network requests are inherently unreliable. Connections can drop, servers can momentarily glitch, or a temporary rate limit might be imposed. Instead of failing immediately, implementing an exponential backoff strategy for retries makes your scraper much more robust (Google Developers, 2025).
The idea is simple: if a request fails, retry it after a small delay. If it fails again, double the delay. This continues up to a maximum number of retries or a maximum delay, preventing you from hammering a consistently failing endpoint.
# Exponential backoff logic (async and sync variants)
import asyncio
import random  # For adding jitter
import time

import aiohttp
import requests
async def fetch_with_backoff(session, url, retries=3, initial_delay=1):
for i in range(retries):
try:
async with session.get(url, timeout=15) as response:
response.raise_for_status()
return await response.text()
except (aiohttp.ClientError, asyncio.TimeoutError) as e:
print(f"Attempt {i+1} failed for {url}: {e}")
if i < retries - 1:
delay = initial_delay * (2 ** i) + random.uniform(0, 0.5) # Exponential + jitter
print(f"Retrying {url} in {delay:.2f} seconds...")
await asyncio.sleep(delay)
else:
print(f"All retries failed for {url}.")
return None
return None
# For synchronous requests (requests library)
def fetch_with_backoff_sync(url, retries=3, initial_delay=1):
for i in range(retries):
try:
response = requests.get(url, timeout=15)
response.raise_for_status()
return response.text
except requests.exceptions.RequestException as e:
print(f"Attempt {i+1} failed for {url}: {e}")
if i < retries - 1:
delay = initial_delay * (2 ** i) + random.uniform(0, 0.5)
print(f"Retrying {url} in {delay:.2f} seconds...")
time.sleep(delay)
else:
print(f"All retries failed for {url}.")
return None
return None
5.3. Proxy Management for Concurrent Scrapers
When you make a high volume of requests from a single IP address, websites quickly identify this as bot activity and block you. Proxy servers act as intermediaries, routing your requests through different IP addresses. A proxy pool rotates these IP addresses with each request (or after a certain number of requests), making your activity appear to come from many different users (Scrapy, 2025).
Integrating proxies into asynchronous and parallel scrapers is crucial:
- `aiohttp` with Proxies: `aiohttp` allows you to specify a `proxy` argument in your `session.get()` calls.
- `requests` with Proxies: The `requests` library uses a `proxies` dictionary.
- Rotation Logic: You'll need logic to select a proxy from your pool (e.g., round-robin, random, or based on proxy health). This logic should be thread/process-safe if using `concurrent.futures`.
# Pseudo-code for proxy integration
import random

import aiohttp
import requests

PROXIES = [
'http://user:pass@proxy1.com:port',
'http://user:pass@proxy2.com:port',
# ... more proxies
]
def get_random_proxy():
return random.choice(PROXIES)
# Asyncio example
async def fetch_page_with_proxy(session, url):
proxy = get_random_proxy()
try:
        async with session.get(url, proxy=proxy, timeout=10) as response:
            response.raise_for_status()
            return await response.text()  # ... handle response as needed
except aiohttp.ClientProxyConnectionError as e:
print(f"Proxy connection error for {url} via {proxy}: {e}")
# Mark proxy as bad or remove it from pool
return None
# Synchronous example
def fetch_page_with_proxy_sync(url):
proxy = get_random_proxy()
proxies_dict = {"http": proxy, "https": proxy}
try:
        response = requests.get(url, proxies=proxies_dict, timeout=10)
        response.raise_for_status()
        return response.text  # ... handle response as needed
except requests.exceptions.ProxyError as e:
print(f"Proxy connection error for {url} via {proxy}: {e}")
return None
5.4. User-Agent Rotation
Along with IP addresses, websites also track User-Agent strings, which identify the browser or client making the request. Many requests from the same User-Agent (especially if it's a generic Python one) can flag your scraper as a bot. Maintaining a list of common browser User-Agents and rotating them with each request adds another layer of stealth.
import random  # Needed for random.choice() below

# Example User-Agents
USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0",
# ... more User-Agents
]
def get_random_user_agent():
return random.choice(USER_AGENTS)
# Usage in aiohttp
async def fetch_page_with_ua(session, url):
headers = {"User-Agent": get_random_user_agent()}
    async with session.get(url, headers=headers) as response:
        return await response.text()  # ... handle the response as needed
# Usage in requests
def fetch_page_with_ua_sync(url):
headers = {"User-Agent": get_random_user_agent()}
    response = requests.get(url, headers=headers)
    return response.text  # ... handle the response as needed
5.5. Comprehensive Error Handling and Robustness
Building high-performance scrapers means acknowledging that failures will happen. Websites change, networks drop, and proxies fail. Your scraper needs to be built with a high degree of resilience.
- Specific Exception Handling: Catch specific exceptions (`aiohttp.ClientError`, `requests.exceptions.RequestException`, `asyncio.TimeoutError`, `AttributeError` during parsing) rather than broad `Exception` catches.
- Logging: Implement robust logging to track successes, failures, and their reasons. This is invaluable for debugging and monitoring long-running scraping jobs (see the sketch after this list).
- Skipping Failed Items: Don't let one bad URL stop the entire scrape. Log the error and move on to the next.
- Data Validation: After parsing, validate the structure and content of your scraped data. Missing fields or unexpected formats should be flagged.
- Connection Management: Ensure `aiohttp.ClientSession` and `requests.Session` (for `ThreadPoolExecutor`, if used) are properly closed to prevent resource leaks. Use `async with` and `with` statements.
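As a starting point for the logging bullet above, a minimal sketch (the handler configuration, format string, and helper name are illustrative rather than a prescribed setup) might look like this:
import logging

# One-time configuration: timestamps, level, and logger name in every record.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("scraper")

def log_fetch_result(url: str, status: int | None, error: Exception | None = None) -> None:
    # One line per fetch attempt makes long-running jobs auditable after the fact.
    if error is not None:
        logger.error("Fetch failed for %s: %s", url, error)
    elif status is not None and status >= 400:
        logger.warning("Fetch for %s returned status %s", url, status)
    else:
        logger.info("Fetched %s (status %s)", url, status)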
By combining the speed of asynchronous and parallel execution with these advanced strategies for politeness, stealth, and resilience, you can build scrapers that are not only fast but also reliable and sustainable for long-term data acquisition.
6. Combining Asynchronous and Parallel Approaches (Briefly)
So far, we've explored `asyncio` for I/O-bound concurrency (fetching web pages) and `concurrent.futures` (specifically `ProcessPoolExecutor`) for CPU-bound parallelism (heavy data processing). In many real-world, high-performance scraping and data engineering scenarios, your workflow isn't exclusively I/O-bound or CPU-bound; it's often a mix. This is where a hybrid approach becomes incredibly powerful.
6.1. Hybrid Models: Leveraging Strengths
The most common hybrid model in web scraping combines `asyncio` for the rapid, concurrent fetching of web pages with `ProcessPoolExecutor` for the potentially CPU-intensive parsing or post-processing of the downloaded HTML/data.
This strategy plays to the strengths of each tool:
- `asyncio` (for Fetching): Excels at handling thousands of simultaneous, non-blocking network requests. It efficiently manages the "waiting" time that dominates the fetching phase, minimizing idle CPU time.
- `ProcessPoolExecutor` (for Parsing/Processing): Bypasses Python's GIL by distributing CPU-bound tasks across multiple CPU cores (separate processes). This is ideal if your parsing logic is complex (e.g., using intricate regex, XPath, or even running lightweight ML models on text) or if you need to perform heavy data cleaning or transformation after the initial scrape.
The synergy is that `asyncio` quickly gathers a large batch of raw data, which is then fed into a pool of processes that can work in parallel to crunch through that data without bottlenecking due to the GIL.
6.2. Conceptual Workflow Example
Let's illustrate this with a conceptual data flow for a hybrid scraper:
- URL Queue: Maintain a queue of URLs to be scraped.
- Asynchronous Fetching Pool (Main Process with `asyncio`):
  - The main Python process (running the `asyncio` event loop) uses `aiohttp` to concurrently fetch HTML content from many URLs from the queue.
  - Each `fetch_page` coroutine (potentially with rate limiting, retries, and proxy rotation) is responsible for downloading raw HTML.
  - As HTML content is downloaded, it's not immediately parsed if parsing is CPU-intensive. Instead, it's passed to the next stage.
- Parsing/Processing Pool (`ProcessPoolExecutor`):
  - The raw HTML content (and perhaps the original URL for context) is sent to a `ProcessPoolExecutor`.
  - Each worker process in this pool receives an HTML chunk and performs the CPU-intensive parsing (e.g., `BeautifulSoup` parsing, complex data extraction, normalization).
  - These processes run in true parallel on different CPU cores.
- Data Storage/Output:
  - Once a process finishes parsing a piece of HTML, the structured data (e.g., a dictionary for a book, a product, etc.) is sent back to the main process.
  - The main process then collects these structured data points and saves them to a database, JSON file, CSV, or another persistent storage.
# Pseudo-code demonstrating the hybrid workflow
import asyncio
import aiohttp
from concurrent.futures import ProcessPoolExecutor
from bs4 import BeautifulSoup # Assuming this parsing is CPU-bound
import os
import time
# --- Part 1: Asynchronous Fetching ---
async def fetch_page_async(session, url):
"""Fetches a page asynchronously."""
# (Includes error handling, retries, proxy rotation, etc. from Chapter 5)
try:
async with session.get(url, timeout=15) as response:
response.raise_for_status()
print(f"Async fetched: {url}")
return await response.text()
except Exception as e:
print(f"Async fetch failed for {url}: {e}")
return None
# --- Part 2: Synchronous (CPU-bound) Parsing for Processes ---
def parse_html_sync(html_content, url_context):
"""Parses HTML content synchronously.
This function simulates a CPU-bound parsing task that will run in a separate process.
"""
if not html_content:
return None
# Simulate heavy CPU-bound work
time.sleep(0.05) # Simulate actual parsing time
try:
soup = BeautifulSoup(html_content, 'html.parser')
title_element = soup.find('h1') # Example extraction
if title_element:
title = title_element.text.strip()
# Add more complex parsing here as needed
print(f"Parsed in process {os.getpid()}: {title} from {url_context}")
return {'url': url_context, 'title': title, 'status': 'success'}
return {'url': url_context, 'title': 'N/A', 'status': 'no_title_found'}
except Exception as e:
print(f"Parsing error for {url_context} in process {os.getpid()}: {e}")
return {'url': url_context, 'title': 'N/A', 'status': 'parsing_error'}
# --- Main Hybrid Scraper Logic ---
async def hybrid_scraper(urls_to_scrape, num_fetch_workers=10, num_parse_workers=None):
scraped_results = []
    # Create a single ProcessPoolExecutor and reuse it for all parsing tasks.
    # It's crucial to manage process pools carefully in async contexts.
with ProcessPoolExecutor(max_workers=num_parse_workers) as parse_executor:
async with aiohttp.ClientSession() as session:
fetch_tasks = []
for url in urls_to_scrape:
fetch_tasks.append(fetch_page_async(session, url))
# Concurrently fetch all pages
html_contents_and_urls = []
for i, result in enumerate(await asyncio.gather(*fetch_tasks, return_exceptions=True)):
url = urls_to_scrape[i]
if isinstance(result, str): # Successful fetch
html_contents_and_urls.append((result, url))
else: # Failed fetch (exception or None)
print(f"Skipping parsing for {url} due to fetch issue.")
# Submit parsing tasks to the process pool
parse_futures = []
loop = asyncio.get_running_loop() # Get the current event loop
for html, url in html_contents_and_urls:
# Use loop.run_in_executor to offload a synchronous function to a separate thread/process
# Here, we pass it to our parse_executor (ProcessPoolExecutor)
future = loop.run_in_executor(parse_executor, parse_html_sync, html, url)
parse_futures.append(future)
# Wait for all parsing tasks to complete
for completed_future in asyncio.as_completed(parse_futures):
result = await completed_future # Await the result from the executor
if result:
scraped_results.append(result)
return scraped_results
# Example Usage
if __name__ == "__main__":
example_urls = [f"http://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 4)] # More pages
example_urls.extend([f"http://books.toscrape.com/catalogue/category/books/travel_2/page-{i}.html" for i in range(1,3)]) # Mix categories
start_time = time.time()
# Run the hybrid scraper (e.g., 20 concurrent fetches, 4 parallel parsing processes)
# Note: max_workers for ProcessPoolExecutor often defaults to os.cpu_count()
final_data = asyncio.run(hybrid_scraper(example_urls, num_fetch_workers=20, num_parse_workers=os.cpu_count()))
end_time = time.time()
print(f"\n--- Hybrid Scraping Results ---")
print(f"Total items scraped: {len(final_data)}")
print(f"Total time taken: {end_time - start_time:.2f} seconds.")
# for item in final_data:
# print(item) # Uncomment to see individual items
Explanation of the Hybrid Workflow:
- `fetch_page_async`: Remains largely the same as in Chapter 3, using `aiohttp` for non-blocking network I/O.
- `parse_html_sync`: This is a standard synchronous function. Crucially, it's designed to be executed by a `ProcessPoolExecutor`. We've added a `time.sleep` to simulate CPU-intensive work, making the benefit of parallelism more obvious.
- `hybrid_scraper`:
  - It first uses `asyncio.gather` to concurrently run all `fetch_page_async` coroutines. This quickly downloads all the raw HTML.
  - Then, it retrieves the `asyncio` event loop using `asyncio.get_running_loop()`.
  - `loop.run_in_executor(parse_executor, parse_html_sync, html, url)`: This is the key bridge. It allows you to run a synchronous function (`parse_html_sync`) in a separate executor (our `ProcessPoolExecutor`) without blocking the `asyncio` event loop. The result of this call is an `asyncio.Future` object.
  - `asyncio.as_completed(parse_futures)`: This efficiently awaits the results from the parsing processes as they complete.
This hybrid model ensures that your scraper is always busy: while some coroutines are waiting for network responses, others are actively sending requests, and separate processes are simultaneously crunching through already downloaded data. This maximizes resource utilization and significantly boosts overall scraping throughput.
7. Best Practices and Considerations
Building powerful, high-performance scrapers is only half the battle. To ensure they run efficiently, reliably, and ethically in the long term, several best practices and considerations must be kept in mind. These range from technical monitoring to responsible behavior.
7.1. Resource Management: Monitoring Your Scrapers
High-performance scrapers, especially those leveraging concurrency and parallelism, can consume significant system resources. Without proper monitoring, you might inadvertently:
- Overwhelm your own machine: Leading to crashes, slow performance, or resource exhaustion.
- Waste cloud resources: If running on cloud instances, inefficient scrapers can incur unnecessary costs.
- Get throttled or blocked: If your network usage pattern is too aggressive.
Key resources to monitor:
- CPU Usage:
ProcessPoolExecutor
will use multiple CPU cores. Ensure your system can handle the load. Excessive CPU usage might indicate inefficient parsing or processing. - Memory Usage: Storing large amounts of scraped data in memory before saving, or holding many concurrent connections, can lead to high memory consumption. Look for memory leaks.
- Network I/O: Monitor outbound requests and inbound data. This is crucial for identifying if you're hitting network bottlenecks or if your proxy rotation isn't distributing traffic effectively.
- File Descriptors: Concurrent operations can open many network connections and files. Ensure your operating system's limits for file descriptors are sufficient.
Tools for Monitoring:
- Python's
resource
module (Unix-like systems): For basic process-level resource usage. psutil
library: A cross-platform library for retrieving information on running processes and system utilization (psutil Documentation, 2025).- System tools:
htop
,top
,glances
(Linux); Activity Monitor (macOS); Task Manager (Windows). - Cloud provider monitoring: AWS CloudWatch, Google Cloud Monitoring, Azure Monitor for cloud-based deployments.
7.2. Debugging Challenges in Concurrent/Parallel Code
Debugging sequential code is straightforward; debugging asynchronous and parallel code is significantly more complex.
- Non-deterministic Behavior: The exact order of execution can vary with each run, making it hard to reproduce bugs.
- Race Conditions: Multiple threads/processes trying to access or modify shared resources simultaneously can lead to unpredictable results (e.g., incorrect data, crashes). While
asyncio
is single-threaded andProcessPoolExecutor
uses separate memory spaces, shared resources (like external files or databases) still require careful synchronization. - Deadlocks: Threads or processes waiting indefinitely for each other to release resources can cause your scraper to freeze.
Best Practices for Debugging:
- Extensive Logging: This is your most powerful tool. Log what's happening at every critical step:
- Request sent (URL, proxy used).
- Response received (status code, initial headers).
- Errors and exceptions (full tracebacks).
- Data parsing success/failure.
- Resource usage (if monitored).
- Use Python's
logging
module effectively (Python Documentation: logging, 2025).
- Smaller, Focused Functions: Break down complex scraping logic into smaller, testable, synchronous functions that can be called by your concurrent/parallel executors.
- Isolation: Try to isolate issues by disabling concurrency/parallelism temporarily or running problematic URLs sequentially.
- Visual Debuggers: Some IDEs (like PyCharm) offer good debugging support for
asyncio
andconcurrent.futures
, allowing you to inspect states at different points.
7.3. Website Politeness and Ethical Considerations
High-performance scraping requires a heightened sense of responsibility. Being "polite" is not just about avoiding blocks; it's about ethical behavior and respecting website resources.
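One habit worth building in from the start is checking `robots.txt` programmatically before fetching, which the first point below expands on. A minimal sketch using Python's built-in `urllib.robotparser` (the URL and User-Agent string here are placeholders):
from urllib import robotparser
from urllib.parse import urlparse

def is_allowed(url: str, user_agent: str = "MyScraper/1.0") -> bool:
    # Fetch and parse the site's robots.txt, then ask whether this URL may be crawled.
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

if __name__ == "__main__":
    print(is_allowed("http://books.toscrape.com/catalogue/page-1.html"))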
- Check
robots.txt
: Always check therobots.txt
file (e.g., https://example.com/robots.txt) before scraping. This file specifies rules about which parts of a site crawlers should or should not access. While not legally binding in most cases, ignoring it is a clear sign of disrespect and can lead to legal issues in some jurisdictions (Web Robots Pages, 2025). - Respect Rate Limits: Some websites explicitly state their rate limits in their
robots.txt
or terms of service. Adhere to them. Even without explicit limits, observe server behavior. If a site becomes slow or returns error codes, back off. - Identify Your Scraper: Use a descriptive
User-Agent
string that includes your contact information (e.g.,MyCompanyName-Scraper/1.0 (contact@mycompany.com))
. This allows website administrators to contact you if there's an issue, potentially avoiding a blanket ban. - Legal and Ethical Boundaries: Understand that just because data is public doesn't mean you can scrape it freely. Laws regarding data privacy (e.g., GDPR, CCPA), copyright, and terms of service vary by region and website. Consult legal counsel for critical projects (Wired, 2020).
7.4. Scalability Beyond a Single Machine
While asynchronous and parallel Python can significantly boost performance on a single machine, there comes a point where a single machine's resources (CPU, RAM, network bandwidth) become the limiting factor. For truly massive-scale scraping, you'll need to scale out:
- Containerization (Docker): As discussed in previous articles, Docker allows you to package your scraper and its dependencies into isolated, portable units. This is fundamental for consistent deployment across multiple machines or cloud environments (Docker Documentation, 2025).
- Orchestration (Kubernetes, Docker Swarm): For managing many Docker containers across a cluster of machines. These tools automate deployment, scaling, load balancing, and self-healing of your scraping infrastructure.
- Cloud Computing (AWS, GCP, Azure): Leverage cloud services like:
  - Virtual Machines (EC2, Compute Engine): Spin up more powerful instances or multiple smaller instances.
  - Serverless Functions (AWS Lambda, Google Cloud Functions): For event-driven, small-scale scrapes that run on demand.
  - Managed Services (AWS Fargate, Google Cloud Run): For running containers without managing underlying servers.
  - Queueing Systems (SQS, Pub/Sub): To manage large lists of URLs, allowing different workers to pick up tasks independently.
  - Distributed Storage: Store scraped data directly into cloud storage like S3 or GCS.
- Distributed Scraping Frameworks: Tools like Scrapy (though primarily single-machine focused, it has extensions for distribution) or custom distributed systems can coordinate multiple scraper instances across a cluster.
By understanding and applying these best practices, your high-performance scrapers will not only be fast but also reliable, maintainable, and considerate members of the internet ecosystem.
8. Conclusion
We've reached the end of our deep dive into turbocharging your web scrapers. We began by identifying the fundamental bottleneck in web scraping: the inherent slowness of sequential operations due to network I/O. This led us to explore the powerful concepts of concurrency and parallelism, understanding how they allow us to perform multiple tasks seemingly simultaneously or truly simultaneously, respectively.
We then dove into Python's primary tools for these paradigms:
- `asyncio` with `aiohttp` emerged as the champion for I/O-bound concurrency, proving incredibly efficient for making a high volume of non-blocking network requests. Its event-loop architecture allows your scraper to keep busy while waiting for web servers to respond.
- `concurrent.futures`, providing `ThreadPoolExecutor` and `ProcessPoolExecutor`, offered flexible ways to manage tasks. `ThreadPoolExecutor` serves well for I/O-bound tasks where `asyncio` might be overkill or when integrating with existing synchronous code, benefiting from the GIL's release during I/O waits. `ProcessPoolExecutor`, by spawning separate processes, allowed us to truly bypass the GIL for CPU-bound parallelism, making it ideal for heavy data parsing or complex post-processing.
- We also saw how to combine these approaches into a hybrid model, leveraging `asyncio` for rapid fetching and `ProcessPoolExecutor` for efficient, parallel data processing.
Beyond mere speed, we emphasized crucial advanced strategies that ensure your high-performance scrapers are not just fast, but also ethical, robust, and sustainable. This included implementing rate limiting and exponential backoff for politeness and resilience, as well as integrating proxy management and User-Agent rotation to avoid IP blocks and mimic legitimate user behavior. Finally, we touched upon essential resource management and debugging considerations, alongside how to think about scaling beyond a single machine for truly massive projects.
By integrating asynchronous and parallel programming into your web scraping toolkit, you gain:
- Significant Performance Gains: Drastically reduce the time it takes to collect large datasets.
- Enhanced Efficiency: Maximize the utilization of your system's resources.
- Increased Robustness: Build scrapers that can gracefully handle network errors and website changes.
- Scalability: Lay the groundwork for distributing your scraping operations across multiple machines or cloud environments.
The modern web is dynamic and vast, and effective data acquisition demands sophisticated approaches. Armed with the knowledge and practical examples from this article, you are now well-equipped to turbocharge your scrapers, making them faster, more reliable, and ultimately, more valuable assets in your data engineering and development efforts.
Go forth and scrape efficiently!
Used Sources:
- aiohttp Documentation. (2025). aiohttp. Retrieved from https://docs.aiohttp.org/en/stable/
- Data Engineering Weekly. (2025). Why Data Freshness Matters. Retrieved from https://dataengineeringweekly.com/
- Docker Documentation. (2025). What is Docker?. Retrieved from https://docs.docker.com/get-started/overview/
- Google Developers. (2025). Exponential Backoff. Retrieved from https://cloud.google.com/storage/docs/exponential-backoff
- Mozilla Developer Network. (2025). HTTP response status codes. Retrieved from https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
- psutil Documentation. (2025). psutil. Retrieved from https://psutil.readthedocs.io/en/latest/
- Python Documentation. (2025). asyncio — Asynchronous I/O. Retrieved from https://docs.python.org/3/library/asyncio.html
- Python Documentation. (2025). concurrent.futures — Launching parallel tasks. Retrieved from https://docs.python.org/3/library/concurrent.futures.html
- Python Documentation. (2025). Global Interpreter Lock. Retrieved from https://docs.python.org/3/glossary.html#term-global-interpreter-lock
- Python Documentation. (2025). logging — Logging facility for Python. Retrieved from https://docs.python.org/3/library/logging.html
- Real Python. (2025). Async IO in Python: A Complete Walkthrough. Retrieved from https://realpython.com/async-io-python/
- Scrapy. (2025). Scrapy architecture overview (Proxy Middleware). Retrieved from https://docs.scrapy.org/en/latest/topics/architecture.html#proxy-middleware
- Stack Overflow. (2025). What is the difference between concurrency and parallelism?. Retrieved from https://stackoverflow.com/questions/thread/1054087/what-is-the-difference-between-concurrency-and-parallelism
- Web Robots Pages. (2025). The Web Robots Pages (robots.txt). Retrieved from https://www.robotstxt.org/
- Wired. (2020). The Legal Battle Over Web Scraping Heats Up. Retrieved from https://www.wired.com/story/legal-battle-web-scraping-heats-up/