
Antalya 26.3: support external paths in Iceberg tables #1812

Open
zvonand wants to merge 3 commits into antalya-26.3 from feature/antalya-26.3/ClickHouse-ClickHouse-pr-90740


@zvonand zvonand commented May 19, 2026

Cherry-pick of upstream PR ClickHouse#90740 ("Read iceberg from various paths") to antalya-26.3.

The upstream PR is rebased on top of master that includes ClickHouse#100420 (IcebergPath / path_resolver / Spark-Azure interop refactor). That refactor brings ~1500 LoC unrelated to the feature itself, so it is not ported here — instead, PR 90740 is adapted to use raw String paths.

The same approach was used for the antalya-26.1 backport (#1461, single commit 0520e2ee3b9 "Allow to read iceberg table data from any location"), which is the structural reference for this port.

What's kept from 90740:

  • SecondaryStorages infrastructure (thread-safe map<string, ObjectStoragePtr>) and resolveObjectStorageForPath helper — the actual feature.
  • New protocol version DBMS_CLUSTER_PROCESSING_PROTOCOL_VERSION_WITH_ICEBERG_ABSOLUTE_PATH = 7 and the new data_object_file_metadata_path / requires_external_storage fields on IcebergObjectSerializableInfo.
  • New enable_url_encoding parameter on S3::URI ctor.
  • New tests/integration/test_storage_iceberg_multistorage/.
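
As a rough illustration of the first bullet, the sketch below shows how a thread-safe map of secondary storages and a path-resolving helper might fit together. It is not the code from the PR: `ObjectStorage` is a hypothetical stand-in for the real interface, `splitSchemeAuthority` and `getOrCreate` are names invented here, and the rule "same scheme and authority as the table location means primary storage" is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for the real object storage interface.
struct ObjectStorage
{
    std::string endpoint; // e.g. "s3://bucket-a"
};
using ObjectStoragePtr = std::shared_ptr<ObjectStorage>;

// Thread-safe map from "scheme://authority" to a lazily created storage.
class SecondaryStorages
{
public:
    ObjectStoragePtr getOrCreate(const std::string & scheme_authority)
    {
        std::lock_guard<std::mutex> lock(mutex);
        auto & slot = storages[scheme_authority];
        if (!slot)
            slot = std::make_shared<ObjectStorage>(ObjectStorage{scheme_authority});
        return slot;
    }

    std::size_t size() const
    {
        std::lock_guard<std::mutex> lock(mutex);
        return storages.size();
    }

private:
    mutable std::mutex mutex;
    std::map<std::string, ObjectStoragePtr> storages;
};

// Splits "s3://bucket/a/b" into {"s3://bucket", "a/b"}.
// A path without "://" is returned as {"", path}.
inline std::pair<std::string, std::string> splitSchemeAuthority(const std::string & path)
{
    const auto scheme_end = path.find("://");
    if (scheme_end == std::string::npos)
        return {std::string(), path};
    const auto key_start = path.find('/', scheme_end + 3);
    if (key_start == std::string::npos)
        return {path, std::string()};
    return {path.substr(0, key_start), path.substr(key_start + 1)};
}

// Maps a metadata path to a (storage, key) pair.
// Assumption: paths on the table's own scheme/authority (or without one)
// use the primary storage; everything else goes to a secondary storage.
inline std::pair<ObjectStoragePtr, std::string> resolveObjectStorageForPath(
    const std::string & table_location,
    const std::string & path,
    const ObjectStoragePtr & primary,
    SecondaryStorages & secondary_storages)
{
    assert(primary && "primary storage must be set");
    const auto [path_authority, key] = splitSchemeAuthority(path);
    const auto table_authority = splitSchemeAuthority(table_location).first;
    if (path_authority.empty() || path_authority == table_authority)
        return {primary, key};
    return {secondary_storages.getOrCreate(path_authority), key};
}
```

With a table at `s3://bucket-a/warehouse/tbl`, a data file under `s3://bucket-a/...` would resolve to the primary storage, while one under `s3://bucket-b/...` would create (once) and reuse a secondary storage for that bucket.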

What's adapted:

  • IcebergPathFromMetadata → String; no .serialize(), no IcebergPathFromMetadata::deserialize(...) wrapping.
  • const IcebergPathResolver & path_resolver parameters → const String & table_location; path_resolver.resolve(x) becomes x.
  • S3UriStyle uri_style ctor parameter dropped on S3::URI (type does not exist on antalya-26.3).

What's dropped (depends on upstream commits not on antalya-26.3):

  • ExpireSnapshotsExecute.{cpp,h}, RemoveOrphanFilesExecute.{cpp,h}, SnapshotFilesTraversal.{cpp,h} — extracted per-command EXECUTE handlers from an unrelated refactor; PR 90740 only threads secondary_storages into them. The antalya-26.3 Iceberg::expireSnapshots + expireSnapshotsResultToPipe path is kept unchanged in IcebergMetadata::executeCommand.
  • The PR's executeExpireSnapshots / executeRemoveOrphanFiles switch branches.
  • All non-90740 changes brought in by ClickHouse/ClickHouse#100420 ("Resolve problems with paths and compatibility problems with Spark in Azure (v2)"): Spark/Azure interop test fixtures, Glue catalog tweaks, etc.

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Support Iceberg tables that have data files outside the table location or on a different object storage. Cherry-picked from ClickHouse#90740 (by @zvonand).

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@zvonand zvonand added the releasy (Created/managed by RelEasy), antalya-26.3, ai-resolved (Port conflict auto-resolved by Claude), and auto-prereq-added (Combined PR includes auto-added prerequisite PR(s)) labels May 19, 2026
github-actions Bot commented May 19, 2026

Workflow [PR], commit [ccad2ef]

zvonand added a commit to Altinity/RelEasy that referenced this pull request May 19, 2026
`commit_cherry_pick_conflict_as_is` and `commit_conflict_markers`
were doing `git add --all` before committing the with-conflict-
markers checkpoint. That sweeps everything in the working tree that
isn't gitignored — and real C++ repos accumulate plenty outside
.gitignore: ClickHouse leaves server runtime data under
`tmp/server_data*/store/<uuid>/<part>/...cmrk2`, build pipelines
spit out generated headers, autosaves, etc.

bug seen 2026-05-19: Altinity/ClickHouse#1812 ended up with
**696 429 additions across 19 683 files** because tmp/server_data*
was tracked-modified at the time of cherry-pick and got swept in.

new helper `_stage_unmerged_paths` uses `git diff --name-only
--diff-filter=U` to stage exactly the conflict-marked files. The
clean parts of the cherry-pick are already staged by git
automatically — only the unmerged paths (whose textual content is
the markers themselves) need explicit staging.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@zvonand zvonand force-pushed the feature/antalya-26.3/ClickHouse-ClickHouse-pr-90740 branch from 5636ee3 to 90519de Compare May 19, 2026 22:45

@zvonand zvonand left a comment


The title should be changed to that of 90740 (as this is the main port). Same for the changelog entry.

@zvonand zvonand added the port-antalya (PRs to be ported to all new Antalya releases) label May 20, 2026
@zvonand zvonand changed the title from "Antalya 26.3: Resolve problems with paths and compatibility problems with Spark in Azure (v2)" to "Antalya 26.3: support external paths in Iceberg tables" May 20, 2026
zvonand and others added 2 commits May 20, 2026 22:28
Adapted PR ClickHouse#90740 (Read iceberg from various paths) to antalya-26.3
without applying the prerequisite upstream PR ClickHouse#100420 (IcebergPath /
path_resolver refactor). The refactor is dropped; raw `String` paths are
used instead.

Adaptations from PR 90740 to antalya-26.3:

- `IcebergPathFromMetadata` references → plain `String` (no `.serialize()`,
  no `IcebergPathFromMetadata::deserialize` wrapping).
- `IcebergPathResolver & path_resolver` parameters → `const String &
  table_location`. Calls like `path_resolver.resolve(x)` become `x`.
- `SecondaryStorages` infrastructure kept: thread-safe map of secondary
  object storages plus a `resolveObjectStorageForPath` helper that maps
  a metadata path to a (storage, key) pair. The IcebergPath-aware
  overload of `resolveObjectStorageForPath` was removed.
- New protocol version `DBMS_CLUSTER_PROCESSING_PROTOCOL_VERSION_WITH_ICEBERG_ABSOLUTE_PATH = 7`
  used in `IcebergObjectSerializableInfo::{serializeForClusterFunctionProtocol,
  deserializeForClusterFunctionProtocol}` to gate the new
  `data_object_file_metadata_path` field and `requires_external_storage`
  check; `_path` for delete files goes through `SchemeAuthorityKey` on
  older protocols.
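
The version gate described in the last item can be sketched roughly as follows. This is a simplified illustration, not the PR's code: the struct is a hypothetical two-field reduction of `IcebergObjectSerializableInfo`, the length-prefixed string encoding is an assumption made for the example, and the `requires_external_storage` check and the `SchemeAuthorityKey` handling for delete files are left out. Only the constant's name and value come from the commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Name and value as stated in the commit message.
constexpr uint64_t DBMS_CLUSTER_PROCESSING_PROTOCOL_VERSION_WITH_ICEBERG_ABSOLUTE_PATH = 7;

// Hypothetical reduced form of IcebergObjectSerializableInfo.
struct IcebergObjectInfoSketch
{
    std::string data_object_file_path;
    std::string data_object_file_metadata_path; // only sent on protocol >= 7
};

// Assumed wire format for the sketch: 8-byte length followed by the bytes.
inline void writeString(std::ostream & out, const std::string & s)
{
    const uint64_t size = s.size();
    out.write(reinterpret_cast<const char *>(&size), sizeof(size));
    out.write(s.data(), static_cast<std::streamsize>(size));
}

inline std::string readString(std::istream & in)
{
    uint64_t size = 0;
    in.read(reinterpret_cast<char *>(&size), sizeof(size));
    std::string s(size, '\0');
    in.read(s.data(), static_cast<std::streamsize>(size));
    assert(static_cast<bool>(in) && "truncated input");
    return s;
}

// The new field is written only when the peer speaks protocol 7 or newer,
// so older peers keep seeing the layout they expect.
inline void serialize(const IcebergObjectInfoSketch & info, std::ostream & out, uint64_t protocol_version)
{
    writeString(out, info.data_object_file_path);
    if (protocol_version >= DBMS_CLUSTER_PROCESSING_PROTOCOL_VERSION_WITH_ICEBERG_ABSOLUTE_PATH)
        writeString(out, info.data_object_file_metadata_path);
}

// Mirrors serialize(): reads the new field only on protocol 7 or newer
// and leaves it empty otherwise.
inline IcebergObjectInfoSketch deserialize(std::istream & in, uint64_t protocol_version)
{
    IcebergObjectInfoSketch info;
    info.data_object_file_path = readString(in);
    if (protocol_version >= DBMS_CLUSTER_PROCESSING_PROTOCOL_VERSION_WITH_ICEBERG_ABSOLUTE_PATH)
        info.data_object_file_metadata_path = readString(in);
    return info;
}
```

Both sides must apply the same version check; a writer that emits the field while the reader skips it (or the reverse) would desynchronize the stream.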

Dropped (depend on upstream commits not on antalya-26.3):

- `ExpireSnapshotsExecute.{cpp,h}`, `RemoveOrphanFilesExecute.{cpp,h}`,
  `SnapshotFilesTraversal.{cpp,h}` — extracted EXECUTE handlers from
  upstream PR introducing per-command refactor. PR 90740 only threads
  `secondary_storages` into these; the underlying refactor is a
  separate dependency. The antalya-26.3 `Iceberg::expireSnapshots` path
  is kept unchanged in `IcebergMetadata::executeCommand`.
- `executeExpireSnapshots` / `executeRemoveOrphanFiles` dispatch in
  `IcebergMetadata::executeCommand` — depends on the dropped files.

References:

- Upstream PR: ClickHouse#90740
- antalya-26.1 backport (used as a structural reference for the
  no-IcebergPath adaptation): 0520e2e
  ("Allow to read iceberg table data from any location")

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- src/Storages/ObjectStorage/Utils.cpp: drop leftover `S3UriStyle::AUTO`
  arguments from two `S3::URI` ctor calls (S3UriStyle does not exist on
  antalya-26.3; the S3UriStyle parameter was already dropped from the
  ctor signature in the cherry-pick).
- src/Storages/ObjectStorage/DataLakes/Iceberg/Mutations.cpp:
  `collectRetainedFiles` and `collectExpiredFiles` now pass a local
  empty `SecondaryStorages` to `getManifestList` /
  `getManifestFileEntriesHandle` so the new mandatory parameter
  compiles without threading external storages through the
  `Iceberg::expireSnapshots` dispatcher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zvonand zvonand force-pushed the feature/antalya-26.3/ClickHouse-ClickHouse-pr-90740 branch from b7609fa to 4cd644b Compare May 20, 2026 21:16

Labels

ai-resolved (Port conflict auto-resolved by Claude), antalya, antalya-26.3, antalya-26.3.10.20001, auto-prereq-added (Combined PR includes auto-added prerequisite PR(s)), port-antalya (PRs to be ported to all new Antalya releases), releasy (Created/managed by RelEasy)
