RequestQueue
Hierarchy
- Storage
- RequestManager
- RequestQueue
Methods
__init__
Initialize a new instance.
Preferably use the RequestQueue.open constructor to create a new instance.
Parameters
client: RequestQueueClient
An instance of a storage client.
id: str
The unique identifier of the storage.
name: str | None
The name of the storage, if available.
Returns None
add_request
Add a single request to the manager and store it in the underlying resource client.
Parameters
request: str | Request
The request object (or its string representation) to be added to the manager.
forefront: bool = False (optional, keyword-only)
Determines whether the request should be added to the beginning (if True) or the end (if False) of the manager.
Returns ProcessedRequest
add_requests
Add requests to the manager in batches.
Parameters
requests: Sequence[str | Request]
Requests to enqueue.
forefront: bool = False (optional, keyword-only)
If True, add requests to the beginning of the queue.
batch_size: int = 1000 (optional, keyword-only)
The number of requests to add in one batch.
wait_time_between_batches: timedelta = timedelta(seconds=1) (optional, keyword-only)
Time to wait between adding batches.
wait_for_all_requests_to_be_added: bool = False (optional, keyword-only)
If True, wait for all requests to be added before returning.
wait_for_all_requests_to_be_added_timeout: timedelta | None = None (optional, keyword-only)
Timeout for waiting for all requests to be added.
Returns None
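The forefront flag described above decides which end of the queue new requests go to. The following is a minimal sketch of enqueueing a batch of ordinary URLs plus one urgent URL. It assumes the methods are coroutines and that `queue` is a RequestQueue or any object with the same `add_request` and `add_requests` signatures. `QueueLike` and `enqueue_with_priority` are hypothetical names used only for illustration.

```python
from collections.abc import Sequence
from typing import Any, Protocol


class QueueLike(Protocol):
    """Hypothetical structural type: the two methods this sketch relies on."""

    async def add_request(self, request: Any, *, forefront: bool = False) -> Any: ...

    async def add_requests(self, requests: Sequence[Any], *, forefront: bool = False) -> None: ...


async def enqueue_with_priority(
    queue: QueueLike,
    urls: Sequence[str],
    urgent_url: str,
) -> None:
    """Add ordinary URLs to the end of the queue and one urgent URL to the beginning."""
    # forefront defaults to False, so these are added to the end of the queue.
    await queue.add_requests(urls)
    # forefront=True adds this request to the beginning of the queue.
    await queue.add_request(urgent_url, forefront=True)
```

Because `add_requests` works in batches and by default does not wait for every request to be added, a caller that depends on strict ordering may want to pass `wait_for_all_requests_to_be_added=True`.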
drop
Drop the storage, removing it from the underlying storage client and clearing the cache.
Returns None
fetch_next_request
Return the next request in the queue to be processed.
Once you have successfully processed the request, call RequestQueue.mark_request_as_handled to mark it as handled in the queue. If processing failed, call RequestQueue.reclaim_request instead, so that the queue gives the request to another consumer in a later call to the fetch_next_request method.
Note that a None return value does not mean that queue processing has finished; it only means that there are currently no pending requests. To check whether all requests in the queue have been finished, use RequestQueue.is_finished instead.
Returns Request | None
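The fetch, handle, and reclaim cycle described for fetch_next_request can be sketched as a single consumer loop. This is an illustration under stated assumptions, not the library's own crawler logic: the methods are assumed to be coroutines, and `QueueLike`, `drain`, `handler`, and `idle_delay` are hypothetical names introduced here.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any, Optional, Protocol


class QueueLike(Protocol):
    """Hypothetical structural type covering the methods used below."""

    async def fetch_next_request(self) -> Optional[Any]: ...

    async def mark_request_as_handled(self, request: Any) -> Any: ...

    async def reclaim_request(self, request: Any, *, forefront: bool = False) -> Any: ...

    async def is_finished(self) -> bool: ...


async def drain(
    queue: QueueLike,
    handler: Callable[[Any], Awaitable[None]],
    idle_delay: float = 0.1,
) -> int:
    """Process requests until the queue reports it is finished; return the handled count."""
    handled = 0
    while True:
        request = await queue.fetch_next_request()
        if request is None:
            # None only means "nothing pending right now", so ask is_finished.
            if await queue.is_finished():
                return handled
            await asyncio.sleep(idle_delay)
            continue
        try:
            await handler(request)
        except Exception:
            # Processing failed: hand the request back so it is served again later.
            await queue.reclaim_request(request)
        else:
            await queue.mark_request_as_handled(request)
            handled += 1
```

The sketch does not cap retries, so a request whose handler always fails would be reclaimed indefinitely; a real consumer would normally track attempts and give up after a limit.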
get_handled_count
Get the number of requests in the loader that have been handled.
Returns int
get_metadata
Get the storage metadata.
Returns (DatasetMetadata | KeyValueStoreMetadata) | RequestQueueMetadata
get_request
Retrieve a specific request from the queue by its ID.
Parameters
request_id: str
The ID of the request to retrieve.
Returns Request | None
get_total_count
Get an offline approximation of the total number of requests in the loader (i.e. pending + handled).
Returns int
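The two counters above can be combined into a rough progress figure. A minimal sketch, assuming both methods are coroutines; `CountsLike` and `progress_ratio` are hypothetical names, and the result is only as accurate as the approximate total.

```python
from typing import Protocol


class CountsLike(Protocol):
    """Hypothetical structural type covering the two counters used below."""

    async def get_handled_count(self) -> int: ...

    async def get_total_count(self) -> int: ...


async def progress_ratio(queue: CountsLike) -> float:
    """Return handled / total as a value between 0.0 and 1.0."""
    handled = await queue.get_handled_count()
    total = await queue.get_total_count()
    if total <= 0:
        # Nothing has been enqueued yet, so there is no progress to report.
        return 0.0
    # The total is an approximation, so clamp the ratio to 1.0.
    return min(handled / total, 1.0)
```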
is_empty
Check if the request queue is empty.
An empty queue means that there are no requests currently in the queue, either pending or being processed. However, this does not necessarily mean that the crawling operation is finished, as there still might be tasks that could add additional requests to the queue.
Returns bool
is_finished
Check if the request queue is finished.
A finished queue means that all requests in the queue have been processed (the queue is empty) and there are no more tasks that could add additional requests to the queue. This is the definitive way to check if a crawling operation is complete.
Returns bool
mark_request_as_handled
Mark a request as handled after successful processing.
This method should be called after a request has been successfully processed. Once marked as handled, the request is removed from the queue and will not be returned by subsequent calls to the fetch_next_request method.
Parameters
request: Request
The request to mark as handled.
Returns ProcessedRequest | None
open
Open a storage, either restoring an existing one or creating a new one.
Parameters
id: str | None = None (optional, keyword-only)
The storage ID.
name: str | None = None (optional, keyword-only)
The storage name.
configuration: Configuration | None = None (optional, keyword-only)
Configuration object used during the storage creation or restoration process.
storage_client: StorageClient | None = None (optional, keyword-only)
Underlying storage client to use. If not provided, the default global storage client from the service locator will be used.
Returns Storage
purge
Purge the storage, removing all items from the underlying storage client.
This method does not remove the storage itself (its metadata, for example, is kept); it only clears all items within it.
Returns None
reclaim_request
Reclaim a failed request back to the queue for later processing.
If a request fails during processing, this method can be used to return it to the queue. The request will be returned for processing again in a subsequent call to RequestQueue.fetch_next_request.
Parameters
request: Request
The request to return to the queue.
forefront: bool = False (optional, keyword-only)
If True, the request will be added to the beginning of the queue. Otherwise, it will be added to the end.
Returns ProcessedRequest | None
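One way to use the forefront flag of reclaim_request is to retry some failures sooner than others. The sketch below assumes the methods are coroutines; `TransientError`, `QueueLike`, and `process_or_reclaim` are hypothetical names, and the split between "transient" and other errors is an illustrative policy, not something the library prescribes.

```python
from collections.abc import Awaitable, Callable
from typing import Any, Protocol


class TransientError(Exception):
    """Hypothetical error type for failures worth retrying right away."""


class QueueLike(Protocol):
    """Hypothetical structural type covering the two methods used below."""

    async def mark_request_as_handled(self, request: Any) -> Any: ...

    async def reclaim_request(self, request: Any, *, forefront: bool = False) -> Any: ...


async def process_or_reclaim(
    queue: QueueLike,
    request: Any,
    handler: Callable[[Any], Awaitable[None]],
) -> bool:
    """Run handler on request; return True if it was marked as handled."""
    try:
        await handler(request)
    except TransientError:
        # Worth a quick retry: return the request to the beginning of the queue.
        await queue.reclaim_request(request, forefront=True)
        return False
    except Exception:
        # Any other failure: return the request to the end of the queue (the default).
        await queue.reclaim_request(request)
        return False
    await queue.mark_request_as_handled(request)
    return True
```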
to_tandem
Combine the loader with a request manager to support adding and reclaiming requests.
Parameters
request_manager: RequestManager | None = None (optional)
Request manager to combine the loader with. If None is given, the default request queue is used.
Returns RequestManagerTandem
Properties
id
Get the storage ID.
name
Get the storage name.
Request queue is a storage for managing HTTP requests.
The request queue class serves as a high-level interface for organizing and managing HTTP requests during web crawling. It provides methods for adding, retrieving, and manipulating requests throughout the crawling lifecycle, abstracting away the underlying storage implementation details.
Request queue maintains the state of each URL to be crawled, tracking whether it has been processed, is currently being handled, or is waiting in the queue. Each URL in the queue is uniquely identified by a unique_key property, which prevents duplicate processing unless explicitly configured otherwise.
The class supports both breadth-first and depth-first crawling strategies through its forefront parameter when adding requests. It also provides mechanisms for error handling and request reclamation when processing fails.
You can open a request queue using the open class method, specifying either a name or ID to identify the queue. The underlying storage implementation is determined by the configured storage client.
Usage
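A minimal usage sketch, assuming the crawlee package is installed, that RequestQueue is importable from `crawlee.storages`, that its methods are coroutines, and that each Request exposes a `url` attribute. The `crawl` function and its `extract_links` callback are hypothetical names used to show how the forefront parameter switches between breadth-first and depth-first ordering.

```python
from collections.abc import Awaitable, Callable


async def crawl(
    seed_url: str,
    extract_links: Callable[[str], Awaitable[list[str]]],
    *,
    depth_first: bool = False,
) -> list[str]:
    """Visit seed_url and every URL extract_links discovers; return URLs in visit order."""
    # Imported here so the snippet can be loaded where crawlee is not installed.
    from crawlee.storages import RequestQueue

    # With neither id nor name given, the default queue is opened.
    queue = await RequestQueue.open()
    await queue.add_request(seed_url)

    visited: list[str] = []
    # Single consumer that enqueues new links before marking the current request
    # as handled, so a None result here means nothing is left to fetch.
    while (request := await queue.fetch_next_request()) is not None:
        visited.append(request.url)
        links = await extract_links(request.url)
        if links:
            # forefront=True adds new links to the beginning (depth-first);
            # forefront=False adds them to the end (breadth-first).
            await queue.add_requests(
                links,
                forefront=depth_first,
                wait_for_all_requests_to_be_added=True,
            )
        await queue.mark_request_as_handled(request)
    return visited
```

It would be run with something like `asyncio.run(crawl('https://example.com', my_extractor))`, where `my_extractor` is a coroutine function you supply. With several concurrent consumers, the loop should check `is_finished` rather than stop at the first None.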