Summary
Schema.read_parquet / read_from (and scan_parquet, collection reads) fail
against S3-compatible stores configured through storage_options (lakeFS gateway,
MinIO, R2, Tigris, …). The data read receives storage_options, but the read of
the embedded schema metadata does not, so polars falls back to
credential_provider="auto" and the default AWS S3 endpoint, and hits the wrong
endpoint with the wrong credentials.
Repro
import polars as pl

# MySchema is any dataframely schema (definition omitted).
opts = {  # lakeFS S3 gateway; any non-AWS S3 store reproduces this
    "aws_access_key_id": "…",
    "aws_secret_access_key": "…",
    "aws_endpoint_url": "https://lakefs.example.com",
    "aws_virtual_hosted_style_request": "false",
}
MySchema.read_parquet("s3://repo/branch/table.parquet", storage_options=opts)
# -> botocore TokenRetrievalError / AccessDenied: the call resolves the default
#    AWS chain + real AWS endpoint, not the store described by storage_options.
df = pl.read_parquet("s3://repo/branch/table.parquet", storage_options=opts)  # works
Cause
_storage/parquet.py: storage_options arrives in **kwargs and is passed to
pl.read_parquet / pl.scan_parquet, but the metadata helpers call
pl.read_parquet_metadata(path) with no options:
def read_frame(self, **kwargs):
    source = kwargs.pop("source")
    df = pl.read_parquet(source, **kwargs)      # storage_options forwarded
    metadata = _read_serialized_schema(source)  # storage_options dropped
pl.read_parquet_metadata accepts storage_options= — it's just not forwarded.
Fix
Thread kwargs.get("storage_options") into the metadata helpers in
dataframely/_storage/parquet.py:
- helpers: _read_serialized_schema (L253–255) and _read_serialized_collection
  (L248–250): add a storage_options=None param and call
  pl.read_parquet_metadata(path, storage_options=storage_options)
- call sites: read_frame (L62), scan_frame (L56), _collection_from_parquet
  (L162/L167), scan_failure_info (L237)
(Threading credential_provider through as well would fix auto-resolution.)
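A minimal sketch of the proposed change, not the actual dataframely code: the metadata key name, the function names, and the injectable read_metadata / read_data parameters are hypothetical and exist only so the forwarding can be shown (and checked) without a live object store. By default the helper calls pl.read_parquet_metadata, which accepts storage_options= as noted above.

```python
from typing import Any, Callable, Dict, Optional, Tuple

# Assumed key name for the embedded schema; the real key may differ.
SCHEMA_METADATA_KEY = "dataframely_schema"

MetadataReader = Callable[..., Dict[str, str]]


def _default_metadata_reader() -> MetadataReader:
    # Imported lazily so the sketch can be exercised without polars installed.
    import polars as pl

    return pl.read_parquet_metadata


def read_serialized_schema(
    source: str,
    storage_options: Optional[Dict[str, Any]] = None,
    *,
    read_metadata: Optional[MetadataReader] = None,
) -> Optional[str]:
    """Return the embedded schema string, or None if the file carries none."""
    reader = read_metadata or _default_metadata_reader()
    # The essential change: forward storage_options instead of dropping it.
    metadata = reader(source, storage_options=storage_options)
    return metadata.get(SCHEMA_METADATA_KEY)


def read_frame(
    read_data: Callable[..., Any],
    *,
    read_metadata: Optional[MetadataReader] = None,
    **kwargs: Any,
) -> Tuple[Any, Optional[str]]:
    """Call-site sketch: data read and metadata read share the same options."""
    source = kwargs.pop("source")
    df = read_data(source, **kwargs)
    schema = read_serialized_schema(
        source,
        kwargs.get("storage_options"),
        read_metadata=read_metadata,
    )
    return df, schema
```

With this shape, a regression test can pass a fake reader and assert that the options given to the data read are the same ones that reach the metadata read.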
Version
dataframely 2.10.1, polars . Happy to open a PR.