[core] Add file_name fields to files system table#7667
Conversation
37cf0e4 to
f066dcf
Compare
JingsongLi
left a comment
There was a problem hiding this comment.
Review: Add file_name and clustering_columns fields to files system table
Overall this is a clean, well-scoped change. The tests cover both the configured and unconfigured clustering cases, and the documentation is updated. A few observations:
1. Caching pattern inconsistency
The existing keyConverters function (line ~364 on master) uses computeIfAbsent:
return keyConverterMap.computeIfAbsent(schemaId, k -> { ... });The new clusteringColumnsGetter uses the manual containsKey + get + put pattern:
if (cache.containsKey(schemaId)) {
return cache.get(schemaId);
}
// ...
cache.put(schemaId, cols);
return cols;For consistency and brevity, consider using computeIfAbsent here as well:
return cache.computeIfAbsent(schemaId, id -> {
TableSchema dataSchema = schemaManager.schema(id);
CoreOptions options = new CoreOptions(dataSchema.options());
return options.clusteringColumns();
});2. clustering_columns is table-level metadata repeated per file
The clustering columns come from the schema options and will be identical for every file sharing the same schemaId. This is technically redundant per-row, but since it can vary across schema versions (schema evolution) and the lazy evaluation avoids upfront cost, this is acceptable. Just wanted to confirm this design is intentional rather than something that would be better served by $options or $schemas system tables.
3. Test helper getExpectedResult field ordering
The new fields in getExpectedResult are added as:
file.firstRowId(),
null, // write_cols
BinaryString.fromString(file.fileName()), // file_name
null)); // clustering_columnsThis matches the schema field ordering (indices 19, 20, 21). Looks correct.
4. Minor: file_name non-nullable is correct
Defining file_name as SerializationUtils.newStringType(false) (NOT NULL) is the right choice since every data file must have a name. Good.
Summary
The PR is well-implemented with good test coverage. The main suggestion is to align the caching pattern with the existing computeIfAbsent style used by keyConverters in the same method.
JingsongLi
left a comment
There was a problem hiding this comment.
-1 to clustering_columns, this is strange to show information from schema file.
f066dcf to
36a33e2
Compare
|
Thanks for thorough review, makes sense. Dropped clustering_columns, kept only file_name. Tests and docs adjusted, rebased on master. PTAL.
|
Purpose
Closes #7283.
Added
file_nameandclustering_columnsto the$filessystem table.file_name: just the file name, without the path prefixclustering_columns: the clustering columns from table schema options, comma-separated. NULL if not configured.the issue title says "clustering column and file name system table" — not sure if the intent was a standalone
$clustering_columnstable or just adding fields to an existing one. I went with adding them to$filessince it's only two fields and didn't feel like it warranted a separate table. Happy to change if a standalone table is preferred.Tests
FilesTableTestexpected results to include the new fieldstestFileNameAndClusteringColumns: creates a table withclustering.columnsset, verifies both fieldstestClusteringColumnsNull: verifies the field is NULL when clustering is not configured