[core] Add file_name fields to files system table by heye1005 · Pull Request #7667 · apache/paimon

heye1005 · 2026-04-18T11:38:24Z

Purpose

Closes #7283.

Added file_name and clustering_columns to the $files system table.

file_name: just the file name, without the path prefix
clustering_columns: the clustering columns from table schema options, comma-separated. NULL if not configured.
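As a rough illustration of what the two columns described above hold, the sketch below derives both values from plain inputs. It is not the code in this PR: the class and method names are hypothetical, the "/" path separator and the "," delimiter are assumptions, and the sample path is made up.

```java
import java.util.Arrays;
import java.util.List;

final class FilesTableColumnsSketch {

    /** Returns the part of the path after the last '/', or the whole string if there is none. */
    static String fileNameOf(String filePath) {
        int slash = filePath.lastIndexOf('/');
        return slash < 0 ? filePath : filePath.substring(slash + 1);
    }

    /** Joins the configured columns with commas; null stands for "not configured". */
    static String clusteringColumnsValue(List<String> columns) {
        if (columns == null || columns.isEmpty()) {
            return null;
        }
        return String.join(",", columns);
    }

    public static void main(String[] args) {
        // Hypothetical inputs, only to show the shape of the two values.
        System.out.println(fileNameOf("bucket-0/data-0.parquet"));
        System.out.println(clusteringColumnsValue(Arrays.asList("a", "b")));
        System.out.println(clusteringColumnsValue(null));
    }
}
```

Treating an empty list the same as an unset option is a choice made for this sketch only; the PR description only states that the field is NULL when clustering is not configured.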

The issue title says "clustering column and file name system table", so I'm not sure whether the intent was a standalone $clustering_columns table or just adding fields to an existing one. I went with adding them to $files since it's only two fields and that didn't seem to warrant a separate table. Happy to change this if a standalone table is preferred.

Tests

Updated existing FilesTableTest expected results to include the new fields
Added testFileNameAndClusteringColumns: creates a table with clustering.columns set, verifies both fields
Added testClusteringColumnsNull: verifies the field is NULL when clustering is not configured

JingsongLi

Review: Add file_name and clustering_columns fields to files system table

Overall this is a clean, well-scoped change. The tests cover both the configured and unconfigured clustering cases, and the documentation is updated. A few observations:

1. Caching pattern inconsistency

The existing keyConverters function (line ~364 on master) uses computeIfAbsent:

return keyConverterMap.computeIfAbsent(schemaId, k -> { ... });

The new clusteringColumnsGetter uses the manual containsKey + get + put pattern:

if (cache.containsKey(schemaId)) {
    return cache.get(schemaId);
}
// ...
cache.put(schemaId, cols);
return cols;

For consistency and brevity, consider using computeIfAbsent here as well:

return cache.computeIfAbsent(schemaId, id -> {
    TableSchema dataSchema = schemaManager.schema(id);
    CoreOptions options = new CoreOptions(dataSchema.options());
    return options.clusteringColumns();
});
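One behavioral difference between the two patterns is worth keeping in mind when the cached value can be null, as clustering_columns can be when clustering is not configured: Map.computeIfAbsent does not record a mapping when the function returns null, so the loader runs again on every lookup, whereas the containsKey + put pattern caches the null. The sketch below shows this with a plain HashMap; the class, the helper names, and the counting loader are hypothetical and stand in for the schema lookup.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class SchemaCacheSketch {

    /** Manual pattern: stores whatever the loader returns, including null. */
    static <V> V manualGet(Map<Long, V> cache, long schemaId, Function<Long, V> loader) {
        if (cache.containsKey(schemaId)) {
            return cache.get(schemaId);
        }
        V value = loader.apply(schemaId);
        cache.put(schemaId, value);
        return value;
    }

    /** computeIfAbsent pattern: a null result is returned but no mapping is stored. */
    static <V> V computeGet(Map<Long, V> cache, long schemaId, Function<Long, V> loader) {
        return cache.computeIfAbsent(schemaId, loader);
    }

    /** Counts how often the loader runs across two lookups of the same schema id. */
    static int loaderCalls(boolean useComputeIfAbsent, String loadedValue) {
        Map<Long, String> cache = new HashMap<>();
        AtomicInteger calls = new AtomicInteger();
        Function<Long, String> loader = id -> {
            calls.incrementAndGet();
            return loadedValue; // stand-in for reading options from a schema
        };
        for (int i = 0; i < 2; i++) {
            if (useComputeIfAbsent) {
                computeGet(cache, 1L, loader);
            } else {
                manualGet(cache, 1L, loader);
            }
        }
        return calls.get();
    }

    public static void main(String[] args) {
        System.out.println("manual, null value: " + loaderCalls(false, null));
        System.out.println("computeIfAbsent, null value: " + loaderCalls(true, null));
        System.out.println("computeIfAbsent, non-null value: " + loaderCalls(true, "a,b"));
    }
}
```

With a null value the manual pattern invokes the loader once and computeIfAbsent invokes it twice; with a non-null value both invoke it once. Whether this matters for the suggestion above depends on what options.clusteringColumns() returns for an unconfigured table, which the thread does not state.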

2. `clustering_columns` is table-level metadata repeated per file

The clustering columns come from the schema options and will be identical for every file sharing the same schemaId. This is technically redundant per-row, but since it can vary across schema versions (schema evolution) and the lazy evaluation avoids upfront cost, this is acceptable. Just wanted to confirm this design is intentional rather than something that would be better served by $options or $schemas system tables.

3. Test helper `getExpectedResult` field ordering

The new fields in getExpectedResult are added as:

file.firstRowId(),
null,                                        // write_cols
BinaryString.fromString(file.fileName()),    // file_name
null));                                      // clustering_columns

This matches the schema field ordering (indices 19, 20, 21). Looks correct.

4. Minor: `file_name` non-nullable is correct

Defining file_name as SerializationUtils.newStringType(false) (NOT NULL) is the right choice since every data file must have a name. Good.

Summary

The PR is well-implemented with good test coverage. The main suggestion is to align the caching pattern with the existing computeIfAbsent style used by keyConverters in the same method.

JingsongLi

-1 to clustering_columns; it is strange to show information from the schema file here.


heye1005 · 2026-05-24T14:07:42Z

Thanks for the thorough review, that makes sense. Dropped clustering_columns and kept only file_name. Tests and docs are adjusted, and the branch is rebased on master. PTAL.


heye1005 mentioned this pull request Apr 18, 2026

[Feature] Add clustering column and file name system table #7283

Open

2 tasks

heye1005 force-pushed the add-clustering-to-files-table branch from 37cf0e4 to f066dcf Compare April 18, 2026 14:44

JingsongLi reviewed May 23, 2026

JingsongLi requested changes May 23, 2026

[core] Add file_name and clustering_columns fields to files system ta…

36a33e2

…ble (apache#7283)

heye1005 force-pushed the add-clustering-to-files-table branch from f066dcf to 36a33e2 Compare May 24, 2026 14:03

heye1005 changed the title ~~[core] Add file_name and clustering_columns fields to files system table~~ [core] Add file_name fields to files system table May 24, 2026

heye1005 wants to merge 1 commit into apache:master from heye1005:add-clustering-to-files-table
