We'd like to discuss two friction points in the repo layout, both about the public include surface. Filing as a discussion because any change here would break the public API and warrant a major version bump; better to agree on the destination before anyone writes a patch.
Problem 1: Generic top-level header names collide across libraries
Public headers sit at shallow paths with generic filenames, installed flat under a single directory:
hll/include/hll.hpp
theta/include/theta_sketch.hpp
common/include/serde.hpp
common/include/optional.hpp
# ...installed via `DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}/DataSketches"`
Consumers write #include <hll.hpp>. Nothing in the path identifies the library, and names like hll.hpp, serde.hpp, optional.hpp, and tdigest.hpp are plausible choices for any number of unrelated projects. A consumer linking against datasketches alongside another library that also ships an hll.hpp ends up with two headers competing for the same include path, with no clean way to disambiguate.
The fix is to namespace the include path under the library name. Two shapes are worth considering.
Option A: Nested per-sketch directories, drop the filename prefix
include/datasketches/
hll/
sketch.hpp # was hll.hpp
detail/...
theta/
sketch.hpp # was theta_sketch.hpp
union.hpp # was theta_union.hpp
intersection.hpp
a_not_b.hpp
jaccard_similarity.hpp
detail/...
cpc/
sketch.hpp
detail/...
...
Consumer: #include <datasketches/theta/union.hpp>
The directory does the namespacing work; the filename drops the now-redundant theta_ prefix. Matches Boost (<boost/graph/adjacency_list.hpp>).
Option B: Flat namespace, keep descriptive filenames
include/datasketches/
hll_sketch.hpp # was hll.hpp
theta_sketch.hpp
theta_union.hpp
theta_intersection.hpp
theta_a_not_b.hpp
theta_jaccard_similarity.hpp
cpc_sketch.hpp
...
detail/
hll/...
theta/...
Consumer: #include <datasketches/theta_union.hpp>
The filename prefix carries the grouping, and the public surface stays on a single flat level. Matches Abseil (<absl/container/flat_hash_map.h>).
Either way, the install path becomes ${CMAKE_INSTALL_INCLUDEDIR}/datasketches/.
Problem 2: Public and internal headers live in the same directory
hll/include/ mixes the public hll.hpp with roughly 30 implementation headers (e.g. HllSketchImpl.hpp, AuxHashMap.hpp) plus another 15-odd *-internal.hpp files. All of them get installed by hll/CMakeLists.txt. From a consumer standpoint, this seems valid:
#include <DataSketches/HllSketchImpl.hpp>
#include <DataSketches/AuxHashMap-internal.hpp>
There is no clear API boundary between public and internal headers.
The fix is a per-sketch detail/ subdirectory, a convention Boost has used since the early 2000s and that Abseil mirrors with its internal/ directories. Implementation headers move there, and the contract for consumers is straightforward: anything inside detail/ is internal, and reaching in is at your own risk.
Low-priority items
- Header guards. Two styles coexist:
hll/include/HllArray.hpp uses _HLLARRAY_HPP_ (leading underscore, reserved by the standard); theta/include/theta_sketch.hpp uses THETA_SKETCH_HPP_. Switching all headers to #pragma once removes the inconsistency, the reserved-identifier risk, and a class of copy-paste mistakes.
- No
.clang-format. Style is enforced by convention only. Committing a single file would replace formatting back-and-forth in code review.
- Pick one casing convention. File names and code currently mix CamelCase and snake_case. Low-priority but easy on the eyes.
- Tests scattered per-sketch. Each
<sketch>/test/CMakeLists.txt repeats the Catch2 wiring and test-discovery setup. Consolidating into a single test/ tree (with per-sketch subdirs) would declare the test framework once, give shared fixtures (random key generators, serde helpers) an obvious home. The current setup works; this is cleanup, not a fix.
We'd like to discuss two friction points in the repo layout, both about the public include surface. Filing as a discussion because any change here would break the public API and warrant a major version bump; better to agree on the destination before anyone writes a patch.
Problem 1: Generic top-level header names collide across libraries
Public headers sit at shallow paths with generic filenames, installed flat under a single directory:
Consumers write
#include <hll.hpp>. Nothing in the path identifies the library, and names likehll.hpp,serde.hpp,optional.hpp, andtdigest.hppare plausible choices for any number of unrelated projects. A consumer linking against datasketches alongside another library that also ships anhll.hppends up with two headers competing for the same include path, with no clean way to disambiguate.The fix is to namespace the include path under the library name. Two shapes are worth considering.
Option A: Nested per-sketch directories, drop the filename prefix
Consumer:
#include <datasketches/theta/union.hpp>The directory does the namespacing work; the filename drops the now-redundant
theta_prefix. Matches Boost (<boost/graph/adjacency_list.hpp>).Option B: Flat namespace, keep descriptive filenames
Consumer:
#include <datasketches/theta_union.hpp>The filename prefix carries the grouping, and the public surface stays on a single flat level. Matches Abseil (
<absl/container/flat_hash_map.h>).Either way, the install path becomes
${CMAKE_INSTALL_INCLUDEDIR}/datasketches/.Problem 2: Public and internal headers live in the same directory
hll/include/mixes the publichll.hppwith roughly 30 implementation headers (e.g.HllSketchImpl.hpp,AuxHashMap.hpp) plus another 15-odd*-internal.hppfiles. All of them get installed byhll/CMakeLists.txt. From a consumer standpoint, this seems valid:There is no clear API boundary between public and internal headers.
The fix is a per-sketch
detail/subdirectory, a convention Boost has used since the early 2000s and that Abseil mirrors with itsinternal/directories. Implementation headers move there, and the contract for consumers is straightforward: anything insidedetail/is internal, and reaching in is at your own risk.Low-priority items
hll/include/HllArray.hppuses_HLLARRAY_HPP_(leading underscore, reserved by the standard);theta/include/theta_sketch.hppusesTHETA_SKETCH_HPP_. Switching all headers to#pragma onceremoves the inconsistency, the reserved-identifier risk, and a class of copy-paste mistakes..clang-format. Style is enforced by convention only. Committing a single file would replace formatting back-and-forth in code review.<sketch>/test/CMakeLists.txtrepeats the Catch2 wiring and test-discovery setup. Consolidating into a singletest/tree (with per-sketch subdirs) would declare the test framework once, give shared fixtures (random key generators, serde helpers) an obvious home. The current setup works; this is cleanup, not a fix.