Downloader

Document Processing Orchestrator Downloader Package.

Modules:

Name Description
BaseDownloader

Abstract base class for document downloaders.

DirectFileURLDownloader

Downloader for direct file URLs.

GoogleDriveDownloader

Downloader for Google Drive files.

SmartCrawlDownloader

Downloader for Smart Crawl data API.

BaseDownloader

Bases: ABC

Abstract base class for document downloaders.

download(source, output, **kwargs) abstractmethod

Download source to the output directory.

Parameters:

Name Type Description Default
source str

The source to be downloaded.

required
output str

The output directory where the downloaded source will be saved.

required
**kwargs Any

Additional keyword arguments.

{}

Returns:

Type Description
list[str] | None

list[str] | None: A list of file paths of successfully downloaded files. If no files are downloaded, an empty list should be returned. Returning None is only for backward compatibility and should be avoided in new implementations.
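The contract above can be illustrated with a minimal sketch of a custom downloader. The `BaseDownloader` defined here is a local stand-in that only mirrors the documented signature, because the package's import path is not shown on this page; `LocalCopyDownloader` is a hypothetical subclass invented for illustration. It returns an empty list rather than None when nothing is downloaded, as recommended for new implementations.

```python
from __future__ import annotations

import shutil
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any


class BaseDownloader(ABC):
    """Local stand-in mirroring the documented interface (not the package's class)."""

    @abstractmethod
    def download(self, source: str, output: str, **kwargs: Any) -> list[str] | None:
        """Download source to the output directory."""


class LocalCopyDownloader(BaseDownloader):
    """Hypothetical downloader that 'downloads' by copying a local file."""

    def download(self, source: str, output: str, **kwargs: Any) -> list[str]:
        src = Path(source)
        if not src.is_file():
            # Nothing was downloaded: return an empty list, not None.
            return []
        out_dir = Path(output)
        out_dir.mkdir(parents=True, exist_ok=True)
        target = out_dir / src.name
        shutil.copy2(src, target)
        # Report the paths of the files that were actually written.
        return [str(target)]
```

Callers can then treat the result uniformly as a list of paths, without a separate None check.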

DirectFileURLDownloader(stream_buffer_size=65536, retry_config=None, max_retries=DEFAULT_MAX_RETRIES, timeout=None)

Bases: BaseDownloader

A class for downloading files from a direct file URL to the defined output directory.

Initialize the DirectFileURLDownloader.

Parameters:

Name Type Description Default
stream_buffer_size int

The size of the buffer for streaming downloads in bytes. Defaults to 64KB (65536 bytes).

65536
retry_config RetryConfig | dict[str, Any] | None

Retry configuration. When a dict, it is validated into RetryConfig. When None, a default RetryConfig is built using max_retries and timeout. Defaults to None. Note: retry_on_exceptions cannot be customized; it is always NETWORK_RETRY_EXCEPTIONS.

None
max_retries int

The maximum number of retries for failed downloads. Defaults to 3. Deprecated: Use retry_config instead.

DEFAULT_MAX_RETRIES
timeout int | None

The timeout for the download request in seconds. Defaults to None. Deprecated: Use retry_config instead.

None

download(source, output, **kwargs)

Download source to the output directory.

Parameters:

Name Type Description Default
source str

The source to be downloaded.

required
output str

The output directory where the downloaded source will be saved.

required
**kwargs Any

Additional keyword arguments.

{}
Kwargs

ca_certs_path (str, optional): The path to the CA certificates file. Defaults to None.

extension (str, optional): The extension of the file to be downloaded. If not provided, the extension is detected from the response headers or the content MIME type.

Returns:

Type Description
list[str]

list[str]: A list of file paths of successfully downloaded files.
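When no extension is passed, the downloader is described as detecting it from the response headers or content MIME type. How it does so internally is not documented here. The sketch below only illustrates the general idea using the standard library, and `guess_extension_from_content_type` is a hypothetical helper, not part of the package.

```python
from __future__ import annotations

import mimetypes


def guess_extension_from_content_type(content_type: str | None) -> str | None:
    """Map a Content-Type header value to a file extension, or None if unknown."""
    if not content_type:
        return None
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    mime_type = content_type.split(";", 1)[0].strip().lower()
    return mimetypes.guess_extension(mime_type)
```

If the server sends a missing or generic Content-Type, passing the extension keyword argument explicitly avoids relying on detection.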

GoogleDriveDownloader(api_key, identifier, secret, api_base_url='https://api.bosa.id')

Bases: BaseDownloader

A class for downloading files from Google Drive using GL Connectors for Google Drive integration.

Initialize the GoogleDriveDownloader.

Parameters:

Name Type Description Default
api_key str

The API key for the GL Connectors API.

required
identifier str

The identifier for the GL Connectors user.

required
secret str

The secret for the GL Connectors user.

required
api_base_url str

The base URL for the GL Connectors API. Defaults to "https://api.bosa.id".

'https://api.bosa.id'

download(source, output, **kwargs)

Download a file from Google Drive to the output directory.

Parameters:

Name Type Description Default
source str

The Google Drive file ID or URL.

required
output str

The output directory where the downloaded file will be saved.

required
**kwargs Any

Additional keyword arguments.

{}
Kwargs

export_format (str, optional): The export format for the file.

Returns:

Type Description
list[str]

list[str]: A list containing the path(s) to the successfully downloaded file(s).

Raises:

Type Description
ValueError

If file ID cannot be extracted or no files are returned from Google Drive.
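Since source may be either a file ID or a URL, and a ValueError is raised when no ID can be extracted, it can help to see what such an extraction might look like. The following is a rough sketch and not the library's implementation: `extract_drive_file_id` is a hypothetical helper, and the two URL shapes it handles (`/d/<id>/` in the path and `?id=<id>` in the query) are assumptions about common Google Drive links.

```python
import re
from urllib.parse import parse_qs, urlparse

# Characters assumed to be valid in a Drive file ID (illustrative only).
_ID_PATTERN = r"[A-Za-z0-9_-]+"
_PATH_ID = re.compile(r"/d/(" + _ID_PATTERN + r")")


def extract_drive_file_id(source: str) -> str:
    """Return the file ID from a bare ID or a Drive-style URL, else raise ValueError."""
    source = source.strip()
    parsed = urlparse(source)
    if not parsed.scheme:
        # No URL scheme: treat the input as a bare file ID.
        if re.fullmatch(_ID_PATTERN, source):
            return source
        raise ValueError(f"Cannot extract file ID from: {source!r}")
    # Links of the form .../file/d/<id>/view
    match = _PATH_ID.search(parsed.path)
    if match:
        return match.group(1)
    # Links of the form .../open?id=<id>
    ids = parse_qs(parsed.query).get("id")
    if ids:
        return ids[0]
    raise ValueError(f"Cannot extract file ID from: {source!r}")
```

Callers of download can wrap the call in a try/except for ValueError to handle malformed sources or empty results in one place.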

SmartCrawlDownloader(endpoint_url='https://stag-smart-crawl.obrol.id/api/v1/data', retry_config=None)

Bases: BaseDownloader

A downloader for retrieving crawled records from Smart Crawl.

Initialize the SmartCrawlDownloader.

Parameters:

Name Type Description Default
endpoint_url str

The URL of the Smart Crawl API endpoint. Defaults to "https://stag-smart-crawl.obrol.id/api/v1/data".

'https://stag-smart-crawl.obrol.id/api/v1/data'
retry_config RetryConfig | dict[str, Any] | None

Retry configuration. When a dict, it is validated into RetryConfig. When None, a default RetryConfig is built. Note: retry_on_exceptions cannot be customized; it is always NETWORK_RETRY_EXCEPTIONS. Defaults to None.

None

download(source, output, **kwargs)

Download the data from the Smart Crawl API.

Parameters:

Name Type Description Default
source str

The Smart Crawl domains to download, as a comma-separated string.

required
output str

The output directory where the downloaded data will be saved.

required
**kwargs Any

Additional keyword arguments.

{}
Kwargs

start_date (str): Start datetime in ISO 8601 format with timezone.

end_date (str): End datetime in ISO 8601 format with timezone.

queries (str, optional): Comma-separated list of search queries.

schema (str, optional): Comma-separated list of fields to include in the response.

page (int, optional): The page number to download.

page_size (int, optional): The number of items to download per page.

after_timestamp (str, optional): The ISO 8601 timestamp to use as the cursor for the next page.

Returns:

Type Description
list[str]

list[str]: A list of file paths of successfully downloaded files.
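The date arguments must be timezone-aware ISO 8601 strings, and the domains are passed as one comma-separated string. A small sketch of preparing these inputs is shown below. `build_smart_crawl_kwargs` is a hypothetical helper, and the domain names, the seven-day window, and the page size are illustrative values only. The final call is left as a comment because the package's import path is not shown on this page.

```python
from datetime import datetime, timedelta, timezone


def build_smart_crawl_kwargs(end: datetime, days: int = 7, page_size: int = 50) -> dict:
    """Build keyword arguments covering the `days` before `end` (hypothetical helper)."""
    if end.tzinfo is None:
        # Naive datetimes would serialise without a timezone offset.
        raise ValueError("end must be timezone-aware")
    start = end - timedelta(days=days)
    return {
        "start_date": start.isoformat(),
        "end_date": end.isoformat(),
        "page_size": page_size,
    }


# Domains are joined into the single comma-separated string expected as source.
source = ",".join(["example.com", "example.org"])
kwargs = build_smart_crawl_kwargs(datetime(2024, 1, 8, tzinfo=timezone.utc))

# Assumed usage, given an already imported SmartCrawlDownloader:
# paths = SmartCrawlDownloader().download(source, "./output", **kwargs)
```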