Parquet metadata read ignores storage_options, breaking typed reads from non-AWS S3 stores #352

@mattijsdp

Summary

Schema.read_parquet / read_from (and scan_parquet, collection reads) fail
against S3-compatible stores reached via storage_options (lakeFS gateway, MinIO,
R2, Tigris, …). The data read gets storage_options, but the embedded-schema
metadata read does not, so polars falls back to credential_provider="auto" and
the default AWS S3 endpoint, hitting the wrong endpoint and credentials.

Repro

import polars as pl

# MySchema stands for any dataframely Schema subclass matching the file.
opts = {  # lakeFS S3 gateway; any non-AWS S3 store reproduces this
    "aws_access_key_id": "…", "aws_secret_access_key": "…",
    "aws_endpoint_url": "https://lakefs.example.com",
    "aws_virtual_hosted_style_request": "false",
}
MySchema.read_parquet("s3://repo/branch/table.parquet", storage_options=opts)
# -> botocore TokenRetrievalError / AccessDenied: the call resolves the default
#    AWS chain + real AWS endpoint, not the store described by storage_options.
df = pl.read_parquet("s3://repo/branch/table.parquet", storage_options=opts)  # works

Cause

_storage/parquet.py: storage_options arrives in **kwargs and is passed to
pl.read_parquet / pl.scan_parquet, but the metadata helpers call
pl.read_parquet_metadata(path) with no options:

def read_frame(self, **kwargs):
    source = kwargs.pop("source")
    df = pl.read_parquet(source, **kwargs)        # storage_options
    metadata = _read_serialized_schema(source)    # dropped

pl.read_parquet_metadata accepts storage_options=, but the helpers never forward it.

Fix

Thread kwargs.get("storage_options") into the metadata helpers in
dataframely/_storage/parquet.py:

  • _read_serialized_schema (L253–255) and _read_serialized_collection (L248–250):
    add storage_options=None param → pl.read_parquet_metadata(path, storage_options=storage_options)
  • call sites: read_frame (L62), scan_frame (L56), _collection_from_parquet
    (L162/L167), scan_failure_info (L237)

(Threading credential_provider too would also fix auto-resolution.)
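
A minimal sketch of the forwarding described above, not the actual dataframely code. The names connection_kwargs, read_serialized_schema, and SCHEMA_METADATA_KEY are hypothetical, and the metadata reader is passed in as a parameter so the snippet runs without polars or a real store; in the library it would presumably be pl.read_parquet_metadata.

```python
from __future__ import annotations

from typing import Any, Callable, Optional

# Hypothetical key; the issue does not say under which key the schema is stored.
SCHEMA_METADATA_KEY = "dataframely_schema"

# Keyword arguments that describe how to reach the store rather than how to
# decode the data. Only these are forwarded to the metadata read.
_CONNECTION_KEYS = ("storage_options", "credential_provider")


def connection_kwargs(kwargs: dict[str, Any]) -> dict[str, Any]:
    """Return the connection-related entries the caller actually passed."""
    return {key: kwargs[key] for key in _CONNECTION_KEYS if key in kwargs}


def read_serialized_schema(
    path: str,
    read_metadata: Callable[..., dict[str, str]],
    **connection: Any,
) -> Optional[str]:
    """Read the embedded schema, forwarding connection settings to the metadata read."""
    metadata = read_metadata(path, **connection)
    return metadata.get(SCHEMA_METADATA_KEY)
```

At a call site such as read_frame, this would look roughly like read_serialized_schema(source, pl.read_parquet_metadata, **connection_kwargs(kwargs)). Forwarding only the keys that were passed leaves the polars defaults untouched for callers that never set storage_options.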

Version

dataframely 2.10.1; polars version not stated. Happy to open a PR.
