Cloud Storage

ProgridPy uses S3Handler for all AWS S3 interactions, providing concurrent uploads and downloads with progress bars, automatic retries, and server-side encryption. The handler is designed to be used as a context manager.

S3Handler

Basic Usage

from progridpy.aws.s3 import S3Handler, S3ObjectRef
from pathlib import Path

with S3Handler() as s3:
    downloaded, skipped, failed = s3.download_objects(
        bucket="my-bucket",
        objects=[
            S3ObjectRef(key="data/file1.parquet", local_path=Path("./file1.parquet")),
            S3ObjectRef(key="data/file2.parquet", local_path=Path("./file2.parquet")),
        ],
    )

S3Handler must be used as a context manager. The __enter__ method creates a boto3.Session and S3 client; __exit__ tears them down.

Constructor Parameters

S3Handler(config: S3TransferConfig | None = None, verbose: bool = False)
| Parameter | Type | Description |
| --- | --- | --- |
| config | S3TransferConfig \| None | Transfer configuration. Uses defaults if None. |
| verbose | bool | Enable verbose logging for debugging. |

S3TransferConfig

Fine-tune transfer behavior with a frozen dataclass:

from progridpy.aws.s3 import S3TransferConfig

config = S3TransferConfig(
    region="us-west-2",                  # AWS region
    max_pool_connections=50,             # HTTP connection pool size
    max_concurrency=20,                  # Parallel transfer threads
    multipart_threshold=8 * 1024 * 1024, # 8 MB -- switch to multipart above this
    multipart_chunksize=8 * 1024 * 1024, # 8 MB per part
    retry_attempts=3,                    # Automatic retries on failure
    enable_encryption=True,              # Server-side encryption
    encryption_type="AES256",            # Encryption algorithm
    existence_check_threshold=256,       # HEAD vs LIST threshold for skip checks
)
| Field | Default | Description |
| --- | --- | --- |
| region | "us-west-2" | AWS region for the S3 client |
| max_pool_connections | 50 | Maximum HTTP connections in the pool |
| max_concurrency | 20 | Maximum parallel transfer threads |
| multipart_threshold | 8 MB | File size above which multipart transfer is used |
| multipart_chunksize | 8 MB | Size of each multipart chunk |
| retry_attempts | 3 | Number of retry attempts with adaptive backoff |
| enable_encryption | True | Enable server-side encryption on uploads |
| encryption_type | "AES256" | Server-side encryption algorithm |
| existence_check_threshold | 256 | Below this count, use HEAD per object; above, use a LIST prefix query |
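Because the dataclass is frozen, you derive variants with dataclasses.replace rather than mutating fields in place. A minimal sketch using a local stand-in class with two illustrative fields (the real class is progridpy's S3TransferConfig):

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TransferConfig:
    # Local stand-in mirroring two S3TransferConfig fields, for illustration only
    region: str = "us-west-2"
    max_concurrency: int = 20


base = TransferConfig()
fast = replace(base, max_concurrency=50)  # returns a new instance; base is unchanged
```

The same replace() call works on the real S3TransferConfig, since dataclasses.replace is the standard way to produce a modified copy of any frozen dataclass.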

S3ObjectRef

A frozen dataclass that pairs an S3 key with a local file path:

from progridpy.aws.s3 import S3ObjectRef
from pathlib import Path

ref = S3ObjectRef(
    key="iso=spp/dataset=nodal/year=2026/month=01/day=15/data.parquet",
    local_path=Path("./processed/year=2026/month=01/day=15/data.parquet"),
)

Downloading Objects

def download_objects(
    self,
    bucket: str,
    objects: list[S3ObjectRef],
    overwrite: bool = False,
    description: str | None = None,
) -> tuple[list[Path], list[Path], list[Path]]:

Returns a tuple of (downloaded, skipped, failed) path lists.

from progridpy.aws.s3 import S3Handler, S3ObjectRef
from pathlib import Path

refs = [
    S3ObjectRef(key=f"data/day={d}/data.parquet", local_path=Path(f"./data/day={d}/data.parquet"))
    for d in range(1, 32)
]

with S3Handler() as s3:
    downloaded, skipped, failed = s3.download_objects(
        bucket="progrid-datalake",
        objects=refs,
        overwrite=False,
        description="Downloading January data",
    )

print(f"Downloaded: {len(downloaded)}, Skipped: {len(skipped)}, Failed: {len(failed)}")

Skip behavior

When overwrite=False (the default), files that already exist locally are added to the skipped list without making any network requests.
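The skip rule can be sketched as a pre-filtering pass (hypothetical code, not the library's internals), using a namedtuple as a stand-in for S3ObjectRef:

```python
from collections import namedtuple
from pathlib import Path

Ref = namedtuple("Ref", ["key", "local_path"])  # stand-in for S3ObjectRef


def partition_by_existing(refs, overwrite=False):
    # With overwrite=False, refs whose local file already exists are
    # moved to the skipped list before any network request is made.
    to_fetch, skipped = [], []
    for ref in refs:
        if not overwrite and ref.local_path.exists():
            skipped.append(ref.local_path)
        else:
            to_fetch.append(ref)
    return to_fetch, skipped
```

Passing overwrite=True bypasses the local existence check entirely, so every ref is transferred again.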

Uploading Objects

def upload_objects(
    self,
    bucket: str,
    objects: list[S3ObjectRef],
    overwrite: bool = False,
    description: str | None = None,
) -> tuple[list[str], list[str], list[Path]]:

Returns a tuple of (uploaded_keys, skipped_keys, failed_local_paths).

from progridpy.aws.s3 import S3Handler, S3ObjectRef
from pathlib import Path

refs = [
    S3ObjectRef(
        key="iso=miso/dataset=nodal/year=2026/month=01/day=15/data.parquet",
        local_path=Path("./processed/year=2026/month=01/day=15/data.parquet"),
    ),
]

with S3Handler() as s3:
    uploaded, skipped, failed = s3.upload_objects(
        bucket="progrid-datalake",
        objects=refs,
        description="Uploading MISO processed data",
    )

When overwrite=False, the handler checks whether each key already exists in S3 before uploading. For small batches (under existence_check_threshold), it uses individual HEAD requests; for larger batches, it uses LIST prefix queries for efficiency.
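The trade-off behind the threshold: HEAD costs one request per object, while a prefix LIST returns up to 1,000 keys per request. The decision can be sketched as a simple heuristic (hypothetical code, not the library's implementation):

```python
def existence_check_plan(n_objects: int, threshold: int = 256) -> str:
    # Below the threshold, per-object HEAD requests are cheap and exact;
    # at or above it, prefix LIST queries (up to 1000 keys per response)
    # need far fewer round trips for large batches.
    return "head" if n_objects < threshold else "list"
```

Tune existence_check_threshold if your buckets have very dense prefixes, where a LIST may enumerate many keys you are not uploading.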

Uploads use server-side encryption by default (AES256). Disable with:

config = S3TransferConfig(enable_encryption=False)
with S3Handler(config=config) as s3:
    ...

Concurrent Transfers

Both download_objects() and upload_objects() execute transfers concurrently using a ThreadPoolExecutor with max_concurrency workers. A rich progress bar displays real-time transfer speed, percentage, and ETA.

To increase throughput for large batch operations:

config = S3TransferConfig(
    max_pool_connections=100,
    max_concurrency=50,
)

with S3Handler(config=config) as s3:
    s3.download_objects(bucket="my-bucket", objects=large_object_list)
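The concurrency model can be sketched generically (hypothetical code, not the library's internals): a bounded thread pool runs one transfer per object and separates successes from failures, which is what yields the (done, skipped, failed) result shape above:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def transfer_all(items, transfer_one, max_concurrency=20):
    # Bounded thread pool: at most max_concurrency transfers in flight.
    # Failed items are collected rather than aborting the whole batch.
    done, failed = [], []
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        futures = {pool.submit(transfer_one, item): item for item in items}
        for fut in as_completed(futures):
            try:
                done.append(fut.result())
            except Exception:
                failed.append(futures[fut])
    return done, failed
```

Because each S3 transfer is I/O-bound, threads (rather than processes) are the idiomatic choice; raising max_concurrency mainly increases the number of simultaneous HTTP requests, which is why max_pool_connections should be raised alongside it.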

Hive Path Date Extraction

Extract a datetime from a Hive-partitioned S3 key:

from progridpy.aws.s3 import extract_date_from_hive_path

dt = extract_date_from_hive_path("iso=spp/dataset=nodal/year=2026/month=01/day=15/data.parquet")
# datetime(2026, 1, 15, 0, 0)

dt = extract_date_from_hive_path("some/other/path.csv")
# None -- returns None when no Hive partition pattern is found

The function matches the pattern year=YYYY/month=MM/day=DD anywhere in the path string.

AWS Credential Configuration

S3Handler reads credentials from the standard AWS credential chain. Set credentials via environment variables:

export AWS_PROFILE=your-profile          # Named profile from ~/.aws/credentials
export AWS_REGION=us-west-2              # Override the default region
export AWS_ACCESS_KEY_ID=...             # Direct credential injection
export AWS_SECRET_ACCESS_KEY=...

The handler respects both AWS_PROFILE and AWS_DEFAULT_PROFILE environment variables for named profile selection.