github · aneubeck · May 6, 2026 · May 7, 2026 · May 7, 2026 · May 7, 2026
@@ -4,6 +4,7 @@ members = [
     "crates/*",
     "crates/bpe/benchmarks",
     "crates/bpe/tests",
+    "crates/hash-sorted-map/benchmarks",
 ]
 resolver = "2"
 

@@ -8,3 +8,5 @@ repository = "https://github.com/github/rust-gems"
 license = "MIT"
 keywords = ["hashmap", "sorted", "merge", "simd"]
 categories = ["algorithms", "data-structures"]
+
+[dependencies]
@@ -4,8 +4,8 @@
 
 `HashSortedMap` is a Swiss-table-inspired hash map that uses **overflow
 chaining** (instead of open addressing), **SIMD group scanning** (NEON/SSE2),
-a **slot-hint fast path**, and an **optimized growth strategy**. It is generic
-over key type, value type, and hash builder.
+and an **optimized growth strategy**. It is generic over key type, value type,
+and hash builder.
 
 This document analyzes the design trade-offs versus
 [hashbrown](https://github.com/rust-lang/hashbrown) and records the
@@ -38,7 +38,6 @@ experimental results that guided the current design.
 │  • Overflow chaining (linked groups)                             │
 │  • 8-byte groups with NEON/SSE2/scalar SIMD scan                 │
 │  • EMPTY / FULL tag states only (insertion-only, no deletion)    │
-│  • Slot-hint fast path                                           │
 └──────────────────────────────────────────────────────────────────┘
 ```
 
@@ -106,17 +105,33 @@ the overflow path.
 SIMD version** by pessimizing NEON code generation. Removed from the SIMD
 implementation, kept in the scalar version.
 
-### 7. Slot Hint Fast Path (Unique to HashSortedMap)
+### 7. Slot Hint Fast Path ❌ Removed
 
-HashSortedMap checks a preferred slot before scanning the group:
+Originally, HashSortedMap checked a preferred slot before scanning the group:
 ```rust
 let hint = slot_hint(hash);  // 3 bits from hash → slot index
 if ctrl[hint] == EMPTY { /* direct insert */ }
 if ctrl[hint] == tag && keys[hint] == key { /* direct hit */ }
 ```
 
-hashbrown does **not** have this optimization — it always does a full SIMD
-group scan. The reason why the performance is different is probably due to the different overflow strategies and the different load factors.
+**Experimental finding**: This scalar check **hurts performance** on random
+workloads. Random keys map to random slots, so the hint check becomes a
+roughly 50/50 branch that the branch predictor cannot learn. SIMD-only
+scanning (`match_tag` + `match_empty`) is uniformly fast regardless of key
+distribution.
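To illustrate the branch-free group scan this finding refers to, here is a minimal scalar (SWAR) sketch over one 8-byte control word. It is not the crate's NEON/SSE2 code: the `EMPTY == 0x00` encoding, the helper `zero_bytes`, and `first_slot` are assumptions made for this example, and only the names `match_tag` and `match_empty` come from the text above.

```rust
// Assumption for this sketch: EMPTY is encoded as 0x00 and slot 0 is the
// least significant byte of the control word.
const LO7: u64 = 0x7f7f_7f7f_7f7f_7f7f;

/// Returns a mask with 0x80 set in every byte of `x` that is zero.
/// Exact (no false positives) and free of data-dependent branches.
fn zero_bytes(x: u64) -> u64 {
    !(((x & LO7) + LO7) | x | LO7)
}

/// Mask of slots whose control byte equals `tag`.
fn match_tag(ctrl: u64, tag: u8) -> u64 {
    let repeated = u64::from_le_bytes([tag; 8]);
    zero_bytes(ctrl ^ repeated)
}

/// Mask of empty slots.
fn match_empty(ctrl: u64) -> u64 {
    zero_bytes(ctrl)
}

/// Index of the lowest matching slot, if any (hypothetical helper).
fn first_slot(mask: u64) -> Option<usize> {
    if mask == 0 {
        None
    } else {
        Some(mask.trailing_zeros() as usize / 8)
    }
}
```

Every slot in the group is compared at once, so the cost does not depend on which slot a key happens to land in.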
+
+**Structural benefit of removal**: Without the slot hint, inserts always
+append to the first empty slot. This guarantees that occupied slots are
+**packed contiguously from the beginning** of each group (no gaps). This
+invariant enables:
+- `count_occupied()`: a single `leading_zeros()` on the ctrl word replaces
+  bitmask scanning to find the next free slot or count entries
+- Simpler `insert_for_grow()`: just write at position `count_occupied()`
+- Simpler iteration: occupied slots are always `0..count_occupied()`
+- Simpler `sort_by_hash()`: no need to compact gaps before sorting
+
+**Current state**: Slot hint is fully removed. All paths use SIMD group
+scanning for lookups and `count_occupied()` for finding the insertion point.
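The contiguous-packing invariant can be sketched as follows. This is an illustration under stated assumptions, not the crate's implementation: the control bytes are modelled as a plain `[u8; 8]`, `EMPTY` is assumed to be `0x00`, occupied tags are assumed non-zero, and `push_tag` is a hypothetical helper standing in for the insertion path.

```rust
// Assumed encoding for this sketch: EMPTY is 0x00, occupied tags are non-zero.
const EMPTY: u8 = 0x00;
const GROUP_SIZE: usize = 8;

/// Number of occupied slots, assuming they are packed from slot 0 upward.
fn count_occupied(ctrl: [u8; GROUP_SIZE]) -> usize {
    // A little-endian load puts slot 0 in the least significant byte, so the
    // empty slots form the run of zero bytes at the top of the word.
    let word = u64::from_le_bytes(ctrl);
    GROUP_SIZE - (word.leading_zeros() as usize) / 8
}

/// Writes `tag` at the first free slot and returns its index,
/// or `None` when the group is full (hypothetical helper).
fn push_tag(ctrl: &mut [u8; GROUP_SIZE], tag: u8) -> Option<usize> {
    debug_assert_ne!(tag, EMPTY);
    let n = count_occupied(*ctrl);
    if n == GROUP_SIZE {
        return None;
    }
    ctrl[n] = tag;
    Some(n)
}
```

Because occupied slots always form the prefix `0..count_occupied()`, the same value serves as both the entry count and the next insertion index.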
 
 ### 8. Overflow Reserve Sizing ✅ Validated
 
@@ -159,13 +174,93 @@ entropy in both halves. Also changed trigram generation to use
 
 ## Summary of Impact
 
-| Change                     | Effect on insert time        |
-|----------------------------|------------------------------|
-| Capacity sizing fix        | **−50%** (biggest win)       |
-| Optimized growth path      | **−10%** on growth scenarios |
-| SIMD group scanning        | **−5%**                      |
-| Branch hints (scalar only) | **−2–6%**                    |
-| IdentityHasher fix         | Enabled fair comparison      |
+| Change                          | Effect                              |
+|---------------------------------|-------------------------------------|
+| Capacity sizing fix             | **−50%** insert time (biggest win)  |
+| Optimized growth path           | **2× faster** growth than hashbrown on aarch64; ~3% slower in the x86_64 snapshot below |
+| SIMD group scanning             | **−5%** insert time                 |
+| Slot hint removal               | **−25%** merge latency, contiguous packing |
+| Branch hints (scalar only)      | **−2–6%**                           |
+| IdentityHasher fix              | Enabled fair comparison             |
 
-The current HashSortedMap **matches hashbrown+FxHash** on pre-sized inserts,
-**beats all hashbrown variants** on overwrites, and has **2× faster growth**.
+---
+
+## Benchmark Results (local x86_64 snapshot)
+
+Hardware used for the current local snapshot:
+
+- CPU: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
+- Architecture: x86_64
+- Topology: 1 socket, 1 core, 2 threads
+- CPU frequency range: 800 MHz to 2800 MHz
+- Memory: 7.8 GiB RAM
+
+### Insert (1000 trigrams, pre-sized)
+
+| Implementation       | Time (µs) | vs hashbrown |
+|----------------------|-----------|--------------|
+| FoldHashMap          | 13.88     | −4%          |
+| FxHashMap            | 14.60     | +1%          |
+| hashbrown+Identity   | 14.44     | baseline     |
+| hashbrown::HashMap   | 14.55     | +1%          |
+| std::HashMap+FNV     | 15.55     | +8%          |
+| AHashMap             | 15.59     | +8%          |
+| **HashSortedMap**    | **9.40**  | **−35%**     |
+| std::HashMap         | 25.26     | +75%         |
+
+### Reinsert (1000 trigrams, all keys exist)
+
+| Implementation       | Time (µs) |
+|----------------------|-----------|
+| **HashSortedMap**    | **6.59**  |
+| hashbrown+Identity   | 6.95      |
+
+### Growth (128 → 1000 trigrams, 3 resize rounds)
+
+| Implementation       | Time (µs) |
+|----------------------|-----------|
+| hashbrown+Identity   | 26.66     |
+| **HashSortedMap**    | **27.50** |
+
+### Count (4000 trigrams, mixed insert/update)
+
+| Implementation                   | Time (µs) |
+|----------------------------------|-----------|
+| hashbrown+Identity entry()       | 15.49     |
+| **HashSortedMap get_or_default** | **15.88** |
+| **HashSortedMap entry().or_default()** | **16.15** |
+
+### Iteration (1000 trigrams)
+
+| Implementation                | Time (µs) |
+|-------------------------------|-----------|
+| **HashSortedMap iter()**      | **3.02**  |
+| hashbrown+Identity iter()     | 3.04      |
+| **HashSortedMap into_iter()** | **3.03**  |
+| hashbrown+Identity into_iter()| 3.56      |
+
+### Sort (100K trigrams)
+
+| Implementation              | Time (ms) |
+|-----------------------------|-----------|
+| **HashSortedMap sort_by_hash** | **1.66** |
+| Vec::sort_unstable          | 2.20      |
+
+### Merge (100 maps × 100K keys each → sorted output)
+
+| Implementation                    | Time (ms) | vs HSM merge+sort |
+|-----------------------------------|-----------|--------------------|
+| hashbrown merge presized          | 160.79    | +6%               |
+| **HashSortedMap merge presized**  | **117.01**| **−23%**          |
+| **HashSortedMap merge (no sort)** | **141.57**| **−7%**           |
+| hashbrown merge                   | 163.59    | +7%               |
+| **HashSortedMap merge + sort**    | **152.34**| **baseline**      |
+| hashbrown merge + Vec sort        | 193.37    | +27%              |
+| k-way merge sorted vecs           | 445       | +192%             |
+
+**Key takeaways:**
+- Pre-sized insert is **~35% faster** than hashbrown+Identity
+- Reinsert is **~5% faster** than hashbrown+Identity; `iter()` is at parity
+  and `into_iter()` is **~15% faster**
+- Growth path is currently **~3% slower** than hashbrown+Identity
+- sort_by_hash is **~24% faster** than Vec::sort_unstable
+- merge + sort is **~21% faster** than hashbrown merge + Vec sort
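The relative figures in the tables and takeaways above appear to follow the usual (time − baseline) / baseline convention. A small sketch of that arithmetic, using a hypothetical helper name, makes it easy to recheck a row when the snapshot is refreshed:

```rust
/// Relative difference of `time` against `baseline`, in percent.
/// Negative means faster than the baseline (lower time is better).
/// Hypothetical helper, not part of the crate or its benchmarks.
fn vs_baseline_percent(time: f64, baseline: f64) -> f64 {
    (time - baseline) / baseline * 100.0
}
```

For example, the pre-sized insert row compares 9.40 µs against the 14.44 µs hashbrown+Identity baseline, and the merge + sort takeaway compares 152.34 ms against 193.37 ms.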
@@ -29,8 +29,8 @@ keys, which means:
 
 - **Overflow chaining** instead of open addressing — groups that fill up link
   to overflow groups rather than probing into neighbours.
-- **Slot hint** — a preferred slot index derived from the hash, checked before
-  scanning the group. Gives a direct hit on most inserts at low load.
+- **Contiguous packing** — occupied slots are always packed from position 0
+  with no gaps, enabling a single `leading_zeros()` to find the next free slot.
 - **SIMD group scanning** — uses NEON on aarch64, SSE2 on x86\_64, and a
   scalar fallback elsewhere to scan 8–16 control bytes in parallel.
 - **AoS group layout** — each group stores its control bytes, keys, and values
@@ -42,45 +42,32 @@ keys, which means:
 
 ## Benchmark results
 
-All benchmarks insert 1000 random trigram hashes (scrambled with
-`folded_multiply`) into maps with various configurations. Measured on Apple
-M-series (aarch64).
-
-### Insert 1000 trigrams — pre-sized, no growth
-
-| Rank | Map | Time (µs) | vs best |
-|------|-----|-----------|---------|
-| 🥇 | FoldHashMap | 2.44 | — |
-| 🥈 | FxHashMap | 2.61 | +7% |
-| 🥉 | hashbrown::HashMap | 2.67 | +9% |
-| 4 | **HashSortedMap** | **2.71** | +11% |
-| 5 | hashbrown+Identity | 2.74 | +12% |
-| 6 | std::HashMap+FNV | 3.27 | +34% |
-| 7 | AHashMap | 3.22 | +32% |
-| 8 | std::HashMap | 8.49 | +248% |
-
-### Re-insert same keys (all overwrites)
-
-| Map | Time (µs) |
-|-----|-----------|
-| **HashSortedMap** | **2.36** ✅ |
-| hashbrown+Identity | 2.58 |
-
-### Growth from small (`with_capacity(128)`, 3 resize rounds)
-
-| Map | Time (µs) | Growth penalty |
-|-----|-----------|----------------|
-| **HashSortedMap** | **4.85** | +2.14 |
-| hashbrown+Identity | 9.77 | +7.03 |
-
-### Key takeaways
-
-- **HashSortedMap matches the fastest hashbrown configurations** on pre-sized
-  first-time inserts and is **the fastest for overwrites**.
-- **Growth is ~2× faster** than hashbrown thanks to the optimized
-  `insert_for_grow` path that skips duplicate checking and uses raw copies.
-- The remaining gap to FoldHashMap (~11%) comes from foldhash's extremely
-  efficient hash function that pipelines well with hashbrown's SIMD scan.
+Latest local Criterion snapshot, taken from this repository's
+`target/criterion` outputs (lower is better).
+
+Hardware used for this snapshot:
+
+- CPU: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
+- Architecture: x86_64
+- Topology: 1 socket, 1 core, 2 threads
+- CPU frequency range: 800 MHz to 2800 MHz
+- Memory: 7.8 GiB RAM
+
+| Scenario                                     | HashSortedMap | Comparison                             | Result      |
+| :------------------------------------------- | ------------: | :------------------------------------- | :---------- |
+| Insert 1000 trigrams (pre-sized)             |       9.40 µs | hashbrown::HashMap: 14.55 µs           | ~35% faster |
+| Grow from capacity 128                       |      27.50 µs | hashbrown+Identity: 26.66 µs           | ~3% slower  |
+| Count 4000 trigrams (`entry().or_default()`) |      16.15 µs | hashbrown+Identity `entry()`: 15.49 µs | ~4% slower  |
+| Iterate 1000 trigrams (`iter()`)             |       3.02 µs | hashbrown+Identity `iter()`: 3.04 µs   | ~1% faster  |
+| Sort 100000 trigrams by hash                 |       1.66 ms | `Vec::sort_unstable`: 2.20 ms          | ~24% faster |
+| Merge 100 sorted maps + final sort           |     152.34 ms | hashbrown merge + vec sort: 193.37 ms  | ~21% faster |
+
+Key takeaways:
+
+- Pre-sized inserts, sorting, and merge+sort remain the strongest paths.
+- Iteration is now roughly on par with `hashbrown+Identity`.
+- Growth and count/update workloads are currently slightly slower than
+  `hashbrown+Identity` in this run.
 
 ## Running
 

@@ -21,3 +21,4 @@ ahash = "0.8"
 hashbrown = "0.15"
 foldhash = "0.1"
 fnv = "1"
+itertools = "0.14"