Skip to content

Repository structure modernization #502

@sleeepyjack

Description

@sleeepyjack

We'd like to discuss two friction points in the repo layout, both about the public include surface. Filing as a discussion because any change here would break the public API and warrant a major version bump; better to agree on the destination before anyone writes a patch.

Problem 1: Generic top-level header names collide across libraries

Public headers sit at shallow paths with generic filenames, installed flat under a single directory:

hll/include/hll.hpp
theta/include/theta_sketch.hpp
common/include/serde.hpp
common/include/optional.hpp
# ...installed via `DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}/DataSketches"`

Consumers write #include <hll.hpp>. Nothing in the path identifies the library, and names like hll.hpp, serde.hpp, optional.hpp, and tdigest.hpp are plausible choices for any number of unrelated projects. A consumer linking against datasketches alongside another library that also ships an hll.hpp ends up with two headers competing for the same include path, with no clean way to disambiguate.

The fix is to namespace the include path under the library name. Two shapes are worth considering.

Option A: Nested per-sketch directories, drop the filename prefix

include/datasketches/
  hll/
    sketch.hpp                 # was hll.hpp
    detail/...
  theta/
    sketch.hpp                 # was theta_sketch.hpp
    union.hpp                  # was theta_union.hpp
    intersection.hpp
    a_not_b.hpp
    jaccard_similarity.hpp
    detail/...
  cpc/
    sketch.hpp
    detail/...
  ...

Consumer: #include <datasketches/theta/union.hpp>

The directory does the namespacing work; the filename drops the now-redundant theta_ prefix. Matches Boost (<boost/graph/adjacency_list.hpp>).

Option B: Flat namespace, keep descriptive filenames

include/datasketches/
  hll_sketch.hpp                # was hll.hpp
  theta_sketch.hpp
  theta_union.hpp
  theta_intersection.hpp
  theta_a_not_b.hpp
  theta_jaccard_similarity.hpp
  cpc_sketch.hpp
  ...
  detail/
    hll/...
    theta/...

Consumer: #include <datasketches/theta_union.hpp>

The filename prefix carries the grouping, and the public surface stays on a single flat level. Matches Abseil (<absl/container/flat_hash_map.h>).

Either way, the install path becomes ${CMAKE_INSTALL_INCLUDEDIR}/datasketches/.

Problem 2: Public and internal headers live in the same directory

hll/include/ mixes the public hll.hpp with roughly 30 implementation headers (e.g. HllSketchImpl.hpp, AuxHashMap.hpp) plus another 15-odd *-internal.hpp files. All of them get installed by hll/CMakeLists.txt. From a consumer standpoint, this seems valid:

#include <DataSketches/HllSketchImpl.hpp>
#include <DataSketches/AuxHashMap-internal.hpp>

There is no clear API boundary between public and internal headers.

The fix is a per-sketch detail/ subdirectory, a convention Boost has used since the early 2000s and that Abseil mirrors with its internal/ directories. Implementation headers move there, and the contract for consumers is straightforward: anything inside detail/ is internal, and reaching in is at your own risk.

Low-priority items

  • Header guards. Two styles coexist: hll/include/HllArray.hpp uses _HLLARRAY_HPP_ (leading underscore, reserved by the standard); theta/include/theta_sketch.hpp uses THETA_SKETCH_HPP_. Switching all headers to #pragma once removes the inconsistency, the reserved-identifier risk, and a class of copy-paste mistakes.
  • No .clang-format. Style is enforced by convention only. Committing a single file would replace formatting back-and-forth in code review.
  • Pick one casing convention. File names and code currently mix CamelCase and snake_case. Low-priority but easy on the eyes.
  • Tests scattered per-sketch. Each <sketch>/test/CMakeLists.txt repeats the Catch2 wiring and test-discovery setup. Consolidating into a single test/ tree (with per-sketch subdirs) would declare the test framework once, give shared fixtures (random key generators, serde helpers) an obvious home. The current setup works; this is cleanup, not a fix.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions