RequestQueue

A request queue is a storage for managing HTTP requests.

The request queue class serves as a high-level interface for organizing and managing HTTP requests during web crawling. It provides methods for adding, retrieving, and manipulating requests throughout the crawling lifecycle, abstracting away the underlying storage implementation details.

The request queue maintains the state of each URL to be crawled, tracking whether it has been processed, is currently being handled, or is waiting in the queue. Each request in the queue is identified by its unique_key property, which prevents duplicate processing unless explicitly configured otherwise.

The class supports both breadth-first and depth-first crawling strategies through its forefront parameter when adding requests. It also provides mechanisms for error handling and request reclamation when processing fails.
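
To see why a single flag is enough to switch traversal strategy, here is a toy model built on collections.deque. It is only an illustration, not Crawlee's implementation: crawl_order, the links graph, and the seen set (which stands in for deduplication by unique_key) are hypothetical names introduced for this sketch.

```python
from collections import deque


def crawl_order(links: dict[str, list[str]], start: str, *, forefront: bool) -> list[str]:
    """Return the order in which pages are visited in this toy model."""
    queue: deque[str] = deque([start])
    seen = {start}  # stands in for deduplication by unique_key
    order: list[str] = []
    while queue:
        url = queue.popleft()  # always take the request at the front
        order.append(url)
        for child in links.get(url, []):
            if child in seen:
                continue  # duplicates are never enqueued twice
            seen.add(child)
            if forefront:
                queue.appendleft(child)  # newest links first: depth-first
            else:
                queue.append(child)  # oldest links first: breadth-first
    return order


# Hypothetical link graph: A links to B and C, B to D and E, C to F.
links = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"]}
bfs_order = crawl_order(links, "A", forefront=False)
dfs_order = crawl_order(links, "A", forefront=True)
print(bfs_order)  # ['A', 'B', 'C', 'D', 'E', 'F']
print(dfs_order)  # ['A', 'C', 'F', 'B', 'E', 'D']
```

Adding to the end of the queue visits pages level by level, while adding to the front follows the most recently discovered link first.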

You can open a request queue using the open class method, specifying either a name or ID to identify the queue. The underlying storage implementation is determined by the configured storage client.

Usage

import asyncio

from crawlee.storages import RequestQueue


async def main() -> None:
    # Open a request queue
    rq = await RequestQueue.open(name='my_queue')

    # Add a request
    await rq.add_request('https://example.com')

    # Process requests
    request = await rq.fetch_next_request()
    if request:
        try:
            # Process the request
            # ...
            await rq.mark_request_as_handled(request)
        except Exception:
            # Return the request to the queue so it can be retried
            await rq.reclaim_request(request)


asyncio.run(main())

Hierarchy

Storage
RequestManager
- RequestQueue

Methods

__init__

__init__(client, id, name): None

Initialize a new instance.

Preferably use the RequestQueue.open constructor to create a new instance.
Parameters
- client: RequestQueueClient
  An instance of a storage client.
- id: str
  The unique identifier of the storage.
- name: str | None
  The name of the storage, if available.
Returns None

add_request

async add_request(request, *, forefront): ProcessedRequest

Overrides RequestManager.add_request
Add a single request to the manager and store it in the underlying resource client.
Parameters
- request: str | Request
  The request object (or its string representation) to be added to the manager.
- forefront: bool = False (optional, keyword-only)
  Determines whether the request should be added to the beginning (if True) or the end (if False) of the manager.
Returns ProcessedRequest

add_requests

async add_requests(requests, *, forefront, batch_size, wait_time_between_batches, wait_for_all_requests_to_be_added, wait_for_all_requests_to_be_added_timeout): None

Overrides RequestManager.add_requests
Add requests to the manager in batches.
Parameters
- requests: Sequence[str | Request]
  Requests to enqueue.
- forefront: bool = False (optional, keyword-only)
  If True, add requests to the beginning of the queue.
- batch_size: int = 1000 (optional, keyword-only)
  The number of requests to add in one batch.
- wait_time_between_batches: timedelta = timedelta(seconds=1) (optional, keyword-only)
  Time to wait between adding batches.
- wait_for_all_requests_to_be_added: bool = False (optional, keyword-only)
  If True, wait for all requests to be added before returning.
- wait_for_all_requests_to_be_added_timeout: timedelta | None = None (optional, keyword-only)
  Timeout for waiting for all requests to be added.
Returns None
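
As a rough mental model of how batch_size and wait_time_between_batches interact, the sketch below splits a list of URLs into batches and pauses between them. It is a conceptual illustration only and makes no claim about how add_requests is implemented internally; split_into_batches and add_in_batches are hypothetical helpers, and the "storage" is just a Python list.

```python
import asyncio
from datetime import timedelta
from typing import Iterator, Sequence


def split_into_batches(requests: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size requests."""
    for start in range(0, len(requests), batch_size):
        yield requests[start:start + batch_size]


async def add_in_batches(
    requests: Sequence[str],
    *,
    batch_size: int = 1000,
    wait_time_between_batches: timedelta = timedelta(seconds=1),
) -> list[list[str]]:
    """Store requests batch by batch, sleeping between consecutive batches."""
    stored: list[list[str]] = []
    for index, batch in enumerate(split_into_batches(requests, batch_size)):
        if index > 0:
            # Pause before every batch except the first one.
            await asyncio.sleep(wait_time_between_batches.total_seconds())
        stored.append(list(batch))
    return stored


urls = [f"https://example.com/page/{i}" for i in range(5)]
batches = asyncio.run(
    add_in_batches(urls, batch_size=2, wait_time_between_batches=timedelta(milliseconds=10))
)
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)  # [2, 2, 1]
```

A smaller batch_size spreads the load on the storage backend over more, shorter writes, at the cost of more pauses.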

drop

async drop(): None

Overrides RequestManager.drop
Drop the storage, removing it from the underlying storage client and clearing the cache.
Returns None

fetch_next_request

async fetch_next_request(): Request | None

Overrides RequestManager.fetch_next_request
Return the next request in the queue to be processed.

Once you have successfully finished processing the request, call RequestQueue.mark_request_as_handled to mark it as handled in the queue. If processing failed, call RequestQueue.reclaim_request instead, so that the queue returns the request to a consumer in a later call to fetch_next_request.

Note that a None return value does not mean that queue processing has finished; it only means that there are currently no pending requests. To check whether all requests in the queue have been handled, use RequestQueue.is_finished instead.
Returns Request | None
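
The fetch, handle, and reclaim cycle described above can be sketched as a loop that keeps going until is_finished reports True, treating a None result only as "nothing pending right now". So that the sketch runs without a storage backend, it uses InMemoryQueue, a hypothetical stand-in that mimics only the four methods involved. The process_all helper and its idle_delay parameter are likewise assumptions made for this example.

```python
import asyncio
from collections import deque


class InMemoryQueue:
    """Hypothetical stand-in exposing only the methods used below."""

    def __init__(self, urls):
        self._pending = deque(urls)
        self._in_progress = set()
        self.handled = []

    async def fetch_next_request(self):
        if not self._pending:
            return None
        request = self._pending.popleft()
        self._in_progress.add(request)
        return request

    async def mark_request_as_handled(self, request):
        self._in_progress.discard(request)
        self.handled.append(request)

    async def reclaim_request(self, request):
        self._in_progress.discard(request)
        self._pending.append(request)

    async def is_finished(self):
        return not self._pending and not self._in_progress


async def process_all(queue, handler, idle_delay=0.01):
    """Process requests until the queue reports that it is finished."""
    while not await queue.is_finished():
        request = await queue.fetch_next_request()
        if request is None:
            # Nothing pending right now, but the queue is not finished yet.
            await asyncio.sleep(idle_delay)
            continue
        try:
            await handler(request)
        except Exception:
            await queue.reclaim_request(request)  # give it another chance
        else:
            await queue.mark_request_as_handled(request)


async def main():
    attempts = {}

    async def handler(url):
        attempts[url] = attempts.get(url, 0) + 1
        if url.endswith("/flaky") and attempts[url] == 1:
            raise RuntimeError("temporary failure")  # fails on first try only

    queue = InMemoryQueue(["https://example.com/", "https://example.com/flaky"])
    await process_all(queue, handler)
    return queue.handled, attempts


handled, attempts = asyncio.run(main())
print(handled)
```

With a real RequestQueue, the same process_all function should work unchanged as long as the handler accepts a Request object instead of a plain string.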

get_handled_count

async get_handled_count(): int

Overrides RequestManager.get_handled_count
Get the number of requests in the loader that have been handled.
Returns int

get_metadata

async get_metadata(): (DatasetMetadata | KeyValueStoreMetadata) | RequestQueueMetadata

Overrides Storage.get_metadata
Get the storage metadata.
Returns (DatasetMetadata | KeyValueStoreMetadata) | RequestQueueMetadata

get_request

async get_request(request_id): Request | None

Retrieve a specific request from the queue by its ID.
Parameters
- request_id: str
  The ID of the request to retrieve.
Returns Request | None

get_total_count

async get_total_count(): int

Overrides RequestManager.get_total_count
Get an offline approximation of the total number of requests in the loader (i.e. pending + handled).
Returns int
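
Together with get_handled_count, this value can be used for simple progress reporting. The sketch below is one possible way to do that; progress_summary and the FixedCounts stand-in are hypothetical names introduced here, and because the total is only an approximation, the derived numbers are approximate as well.

```python
import asyncio


async def progress_summary(queue) -> str:
    """Summarize progress from the handled and total counters."""
    handled = await queue.get_handled_count()
    total = await queue.get_total_count()
    pending = max(total - handled, 0)  # total is pending + handled
    percent = 100.0 * handled / total if total else 0.0
    return f"{handled}/{total} handled ({percent:.1f}%), {pending} pending"


class FixedCounts:
    """Hypothetical stand-in that returns fixed counters."""

    async def get_handled_count(self) -> int:
        return 3

    async def get_total_count(self) -> int:
        return 12


summary = asyncio.run(progress_summary(FixedCounts()))
print(summary)  # 3/12 handled (25.0%), 9 pending
```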

is_empty

async is_empty(): bool

Overrides RequestManager.is_empty
Check if the request queue is empty.

An empty queue means that there are no requests currently in the queue, either pending or being processed. However, this does not necessarily mean that the crawling operation is finished, as there still might be tasks that could add additional requests to the queue.
Returns bool

is_finished

async is_finished(): bool

Overrides RequestManager.is_finished
Check if the request queue is finished.

A finished queue means that all requests in the queue have been processed (the queue is empty) and there are no more tasks that could add additional requests to the queue. This is the definitive way to check if a crawling operation is complete.
Returns bool

mark_request_as_handled

async mark_request_as_handled(request): ProcessedRequest | None

Overrides RequestManager.mark_request_as_handled
Mark a request as handled after successful processing.

This method should be called after a request has been successfully processed. Once marked as handled, the request is removed from the queue and will not be returned by subsequent calls to the fetch_next_request method.
Parameters
- request: Request
  The request to mark as handled.
Returns ProcessedRequest | None

open

async open(*, id, name, configuration, storage_client): Storage

Overrides Storage.open
Open a storage, either restoring an existing one or creating a new one.
Parameters
- id: str | None = None (optional, keyword-only)
  The storage ID.
- name: str | None = None (optional, keyword-only)
  The storage name.
- configuration: Configuration | None = None (optional, keyword-only)
  Configuration object used during the storage creation or restoration process.
- storage_client: StorageClient | None = None (optional, keyword-only)
  Underlying storage client to use. If not provided, the default global storage client from the service locator will be used.
Returns Storage

purge

async purge(): None

Overrides Storage.purge
Purge the storage, removing all items from the underlying storage client.

This method does not remove the storage itself (for example, its metadata is kept), but it clears all items within it.
Returns None

reclaim_request

async reclaim_request(request, *, forefront): ProcessedRequest | None

Overrides RequestManager.reclaim_request
Reclaim a failed request back to the queue for later processing.

If a request fails during processing, this method can be used to return it to the queue. The request will be returned for processing again in a subsequent call to RequestQueue.fetch_next_request.
Parameters
- request: Request
  The request to return to the queue.
- forefront: bool = False (optional, keyword-only)
  If True, the request will be added to the beginning of the queue. Otherwise, it will be added to the end.
Returns ProcessedRequest | None
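
Reclaiming a request unconditionally can retry a permanently broken URL forever, so callers often cap the number of attempts. The following sketch shows one way to do that by counting attempts per unique_key in a local dictionary. FakeRequest, FakeQueue, run_with_retries, and max_attempts are all hypothetical names for this illustration, and marking an exhausted request as handled is just one possible policy for giving up.

```python
import asyncio
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeRequest:
    """Hypothetical request carrying only the fields used here."""
    url: str
    unique_key: str


class FakeQueue:
    """Hypothetical stand-in that mimics the three methods used below."""

    def __init__(self, requests):
        self._items = deque(requests)
        self.handled = []

    async def fetch_next_request(self):
        return self._items.popleft() if self._items else None

    async def mark_request_as_handled(self, request):
        self.handled.append(request.unique_key)

    async def reclaim_request(self, request, *, forefront=False):
        if forefront:
            self._items.appendleft(request)
        else:
            self._items.append(request)


async def run_with_retries(queue, handler, max_attempts=3):
    """Process requests, reclaiming failures until max_attempts is reached."""
    attempts = {}
    abandoned = []
    # In this single-consumer stand-in, None means nothing is left;
    # with a real queue, check is_finished before stopping.
    while (request := await queue.fetch_next_request()) is not None:
        key = request.unique_key
        attempts[key] = attempts.get(key, 0) + 1
        try:
            await handler(request)
        except Exception:
            if attempts[key] < max_attempts:
                await queue.reclaim_request(request, forefront=False)
            else:
                # Give up: mark as handled so it is not returned again.
                await queue.mark_request_as_handled(request)
                abandoned.append(key)
        else:
            await queue.mark_request_as_handled(request)
    return abandoned


async def main():
    async def handler(request):
        if "broken" in request.url:
            raise RuntimeError("always fails")

    queue = FakeQueue([
        FakeRequest("https://example.com/ok", "ok"),
        FakeRequest("https://example.com/broken", "broken"),
    ])
    abandoned = await run_with_retries(queue, handler, max_attempts=2)
    return abandoned, queue.handled


abandoned, handled_keys = asyncio.run(main())
print(abandoned, handled_keys)
```

Reclaiming with forefront=False sends the failed request to the back of the queue, which gives a temporarily failing site some time before the retry; forefront=True would retry it immediately.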

to_tandem

async to_tandem(request_manager): RequestManagerTandem

Inherited from RequestLoader.to_tandem
Combine the loader with a request manager to support adding and reclaiming requests.
Parameters
- request_manager: RequestManager | None = None (optional)
  Request manager to combine the loader with. If None is given, the default request queue is used.
Returns RequestManagerTandem

Properties

id

id: str

Get the storage ID.

name

name: str | None

Get the storage name.
