Skip to content

[core] Add file_name fields to files system table#7667

Open
heye1005 wants to merge 1 commit into
apache:masterfrom
heye1005:add-clustering-to-files-table
Open

[core] Add file_name fields to files system table#7667
heye1005 wants to merge 1 commit into
apache:masterfrom
heye1005:add-clustering-to-files-table

Conversation

@heye1005
Copy link
Copy Markdown
Contributor

Purpose

Closes #7283.

Added file_name and clustering_columns to the $files system table.

  • file_name: just the file name, without the path prefix
  • clustering_columns: the clustering columns from table schema options, comma-separated. NULL if not configured.

the issue title says "clustering column and file name system table" — not sure if the intent was a standalone $clustering_columns table or just adding fields to an existing one. I went with adding them to $files since it's only two fields and didn't feel like it warranted a separate table. Happy to change if a standalone table is preferred.

Tests

  • Updated existing FilesTableTest expected results to include the new fields
  • Added testFileNameAndClusteringColumns: creates a table with clustering.columns set, verifies both fields
  • Added testClusteringColumnsNull: verifies the field is NULL when clustering is not configured

Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Add file_name and clustering_columns fields to files system table

Overall this is a clean, well-scoped change. The tests cover both the configured and unconfigured clustering cases, and the documentation is updated. A few observations:

1. Caching pattern inconsistency

The existing keyConverters function (line ~364 on master) uses computeIfAbsent:

return keyConverterMap.computeIfAbsent(schemaId, k -> { ... });

The new clusteringColumnsGetter uses the manual containsKey + get + put pattern:

if (cache.containsKey(schemaId)) {
    return cache.get(schemaId);
}
// ...
cache.put(schemaId, cols);
return cols;

For consistency and brevity, consider using computeIfAbsent here as well:

return cache.computeIfAbsent(schemaId, id -> {
    TableSchema dataSchema = schemaManager.schema(id);
    CoreOptions options = new CoreOptions(dataSchema.options());
    return options.clusteringColumns();
});

2. clustering_columns is table-level metadata repeated per file

The clustering columns come from the schema options and will be identical for every file sharing the same schemaId. This is technically redundant per-row, but since it can vary across schema versions (schema evolution) and the lazy evaluation avoids upfront cost, this is acceptable. Just wanted to confirm this design is intentional rather than something that would be better served by $options or $schemas system tables.

3. Test helper getExpectedResult field ordering

The new fields in getExpectedResult are added as:

file.firstRowId(),
null,                                        // write_cols
BinaryString.fromString(file.fileName()),    // file_name
null));                                      // clustering_columns

This matches the schema field ordering (indices 19, 20, 21). Looks correct.

4. Minor: file_name non-nullable is correct

Defining file_name as SerializationUtils.newStringType(false) (NOT NULL) is the right choice since every data file must have a name. Good.

Summary

The PR is well-implemented with good test coverage. The main suggestion is to align the caching pattern with the existing computeIfAbsent style used by keyConverters in the same method.

Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 to clustering_columns, this is strange to show information from schema file.

@heye1005 heye1005 force-pushed the add-clustering-to-files-table branch from f066dcf to 36a33e2 Compare May 24, 2026 14:03
@heye1005 heye1005 changed the title [core] Add file_name and clustering_columns fields to files system table [core] Add file_name fields to files system table May 24, 2026
@heye1005
Copy link
Copy Markdown
Contributor Author

Thanks for thorough review, makes sense. Dropped clustering_columns, kept only file_name. Tests and docs adjusted, rebased on master. PTAL.

-1 to clustering_columns, this is strange to show information from schema file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Add clustering column and file name system table

2 participants