[common] Introduce N-gram file index for query by xuzifu666 · Pull Request #7927 · apache/paimon

xuzifu666 · 2026-05-21T14:22:28Z

Purpose

Paimon currently does not support an N-gram file index, which leaves room for improvement in scenarios involving prefix and suffix queries.
The principles and workflow of the N-gram file index in this PR are outlined below:

   ┌─────────────────────────────────────────────────────────────────────────────────┐
   │  1. Overall Architecture (Integration with Paimon FileIndexer Framework)        │
   └─────────────────────────────────────────────────────────────────────────────────┘

                               FileIndexer Interface
                                       │
                       ┌───────────────┼───────────────┐
                       │               │               │
               BloomFilter         Bitmap          N-gram ⭐
                  Index             Index           Index
                (equality)        (equality)    (prefix/suffix)

                       N-gram File Index
                              │
           ┌──────────────────┼──────────────────┐
           │                  │                  │
       Writer           Factory            Reader
      (Build)        (SPI Creation)       (Query Filter)
           │                  │                  │
           ▼                  ▼                  ▼

      Write Data  →   NgramFileIndex    →   Query Filter
      Generate N-gram  (Core Impl)       Apply Predicates
      Store HashSet    gram_size param    REMAIN/SKIP


   ┌─────────────────────────────────────────────────────────────────────────────────┐
   │  2. Index Build Process (Writing Phase)                                         │
   └─────────────────────────────────────────────────────────────────────────────────┘

              Input Rows
                   │
                   ▼
       ┌──────────────────────┐
       │ write("hello")       │
       │ write("world")       │
       └────────────┬─────────┘
                    │
                    ▼
       ┌──────────────────────────────────────┐
       │ 1. BinaryString → String conversion  │
       │ 2. Extract N-grams                   │
       │    "hello" → {he, el, ll, lo}        │
       │    "world" → {wo, or, rl, ld}        │
       │ 3. Add to HashSet                    │
       └────────────┬─────────────────────────┘
                    │
                    ▼
       ┌──────────────────────────────────────┐
       │ Final N-gram Set:                    │
       │ {he, el, ll, lo, wo, or, rl, ld}     │
       │                                      │
       │ Size: 680 bytes (100K records)       │
       │ Compression ratio: 0.03%             │
       └────────────┬─────────────────────────┘
                    │
                    ▼
       ┌──────────────────────────────────────┐
       │ Serialization Format:                │
       │ [4B gramSize][4B setSize]            │
       │ [2B len1][N bytes token1]            │
       │ [2B len2][N bytes token2]            │
       │ ...                                  │
       └────────────┬─────────────────────────┘
                    │
                    ▼
              Index Bytes
           (Written to file)
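
The build steps above can be sketched roughly as follows. This is a minimal illustration, not the PR's `NgramFileIndex` code: the class name `NgramIndexSketch`, its method names, the UTF-8 token encoding, and the big-endian field order are all assumptions. Only the sliding-window extraction and the `[4B gramSize][4B setSize][2B len][token]` layout come from the diagram.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the writer side; names are illustrative only. */
final class NgramIndexSketch {

    /** Adds every substring of length gramSize (sliding window, step 1) to the set. */
    static void addNgrams(String value, int gramSize, Set<String> out) {
        for (int i = 0; i + gramSize <= value.length(); i++) {
            out.add(value.substring(i, i + gramSize));
        }
    }

    /** Writes [4B gramSize][4B setSize] followed by [2B len][token bytes] per token. */
    static byte[] serialize(int gramSize, Set<String> grams) {
        List<byte[]> tokens = new ArrayList<>();
        int total = 8; // two 4-byte header fields
        for (String gram : grams) {
            byte[] token = gram.getBytes(StandardCharsets.UTF_8); // encoding is an assumption
            tokens.add(token);
            total += 2 + token.length;
        }
        ByteBuffer buffer = ByteBuffer.allocate(total); // big-endian by default
        buffer.putInt(gramSize);
        buffer.putInt(tokens.size());
        for (byte[] token : tokens) {
            buffer.putShort((short) token.length);
            buffer.put(token);
        }
        return buffer.array();
    }

    /** Reads the token set back from the layout produced by serialize. */
    static Set<String> deserialize(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        buffer.getInt(); // gramSize, not needed to rebuild the set
        int size = buffer.getInt();
        Set<String> grams = new HashSet<>();
        for (int i = 0; i < size; i++) {
            byte[] token = new byte[buffer.getShort() & 0xFFFF];
            buffer.get(token);
            grams.add(new String(token, StandardCharsets.UTF_8));
        }
        return grams;
    }
}
```

With a gram size of 2, feeding "hello" and "world" through `addNgrams` yields the eight bigrams shown in the diagram, and `serialize` produces 8 header bytes plus 4 bytes per bigram.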


   ┌─────────────────────────────────────────────────────────────────────────────────┐
   │  3. Query Execution Flow (Reading & Filtering Phase)                            │
   └─────────────────────────────────────────────────────────────────────────────────┘

       SQL Query
         │
         ├─ LIKE 'he%'
         ├─ LIKE '%lo'
         ├─ LIKE '%ll%'
         └─ = 'hello'

            ▼
       ┌────────────────────────────────┐
       │ Predicate Optimization         │
       │ LIKE 'prefix%'                 │
       │   → StartsWith("prefix")       │
       │ LIKE '%suffix'                 │
       │   → EndsWith("suffix")         │
       │ LIKE '%middle%'                │
       │   → Contains("middle")         │
       └────────────┬───────────────────┘
                    │
                    ▼
       ┌──────────────────────────────────────┐
       │ FileIndexPredicate.evaluate()        │
       │ Iterate over each data file          │
       └────────────┬─────────────────────────┘
                    │
                    ▼
       ┌──────────────────────────────────────┐
       │ visitStartsWith(fieldRef, "he")      │
       │                                      │
       │ 1. Get query pattern: "he"           │
       │ 2. Generate N-grams: {he}            │
       │ 3. Check each against index set      │
       │                                      │
       │ Check "he" ∈ {he,el,ll,lo,...}?      │
       └────────────┬─────────────────────────┘
                    │
           ┌────────┴────────┐
           │                 │
           ▼                 ▼
          YES               NO
           │                 │
           ▼                 ▼
       ┌──────────────┐  ┌──────────────┐
       │ REMAIN       │  │ SKIP         │
       │ File might   │  │ File cannot  │
       │ contain data │  │ contain data │
       │ (scan rows)  │  │ (skip file)  │
       └──────────────┘  └──────────────┘
           │                 │
           └────────┬────────┘
                    │
                    ▼
           Merge results & row-level scan
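
The predicate optimization step in the flow above can be illustrated with a small classifier. This is a sketch under assumptions: `LikePatternSketch` and its `Kind` enum are hypothetical names, not Paimon classes, and any pattern containing `_`, an inner `%`, or a backslash escape is simply reported as `OTHER` rather than handled.

```java
/** Hypothetical sketch: maps simple LIKE patterns to the predicate kinds in the diagram. */
final class LikePatternSketch {

    enum Kind { STARTS_WITH, ENDS_WITH, CONTAINS, EQUALS, OTHER }

    /** Strips one leading and one trailing '%' (if present) and returns the rest. */
    static String literal(String like) {
        int start = like.startsWith("%") ? 1 : 0;
        boolean trailing = like.length() > 1 && like.endsWith("%");
        int end = trailing ? like.length() - 1 : like.length();
        return like.substring(start, end);
    }

    static Kind classify(String like) {
        boolean leading = like.startsWith("%");
        boolean trailing = like.length() > 1 && like.endsWith("%");
        String body = literal(like);
        // Wildcards or escapes inside the literal are out of scope for this sketch.
        if (body.isEmpty() || body.indexOf('%') >= 0
                || body.indexOf('_') >= 0 || body.indexOf('\\') >= 0) {
            return Kind.OTHER;
        }
        if (leading && trailing) {
            return Kind.CONTAINS;    // LIKE '%middle%'
        }
        if (trailing) {
            return Kind.STARTS_WITH; // LIKE 'prefix%'
        }
        if (leading) {
            return Kind.ENDS_WITH;   // LIKE '%suffix'
        }
        return Kind.EQUALS;          // no wildcard at all
    }
}
```

The literal returned by `literal` is the string whose N-grams would then be checked against the per-file index set.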


   ┌─────────────────────────────────────────────────────────────────────────────────┐
   │  4. Filter Decision Logic (Decision Tree)                                       │
   └─────────────────────────────────────────────────────────────────────────────────┘

       Query pattern: pattern
               │
               ▼
       ┌──────────────────────────────┐
       │ pattern == null?             │ YES ──► REMAIN (conservative)
       │ pattern.isEmpty()?           │ YES ──► REMAIN (conservative)
       │ pattern.length < gramSize?   │ YES ──► REMAIN (cannot judge)
       └──────────────┬───────────────┘
                      │ NO
                      ▼
       ┌──────────────────────────────┐
       │ FOR i = 0 TO pattern.length-g│
       │     ngram = pattern[i:i+g]   │
       │     IF ngram ∉ ngramSet:     │
       │         RETURN SKIP          │ Early exit (99% case)
       │ RETURN REMAIN                │
       └──────────────────────────────┘
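
The decision tree above translates fairly directly into code. The following is a minimal sketch, assuming the index has already been loaded into an in-memory `Set<String>`; `NgramFilterSketch` and its `Result` enum are hypothetical names and do not reflect the PR's reader class or Paimon's `FileIndexResult`.

```java
import java.util.Set;

/** Hypothetical sketch of the REMAIN/SKIP decision for one file's N-gram set. */
final class NgramFilterSketch {

    enum Result { REMAIN, SKIP }

    static Result test(String pattern, int gramSize, Set<String> indexedGrams) {
        // Conservative cases: the index cannot prove anything, so keep the file.
        if (pattern == null || pattern.isEmpty() || pattern.length() < gramSize) {
            return Result.REMAIN;
        }
        // Every N-gram of the pattern must be present for the file to possibly match.
        for (int i = 0; i + gramSize <= pattern.length(); i++) {
            String gram = pattern.substring(i, i + gramSize);
            if (!indexedGrams.contains(gram)) {
                return Result.SKIP; // early exit on the first missing gram
            }
        }
        return Result.REMAIN;
    }
}
```

Against the set built from "hello" and "world", the pattern "hex" is skipped because "ex" is missing, while "helo" returns REMAIN even though no row contains it, since "he", "el", and "lo" are all present. That is why REMAIN only means the file might contain data and a row-level scan still follows.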


   ┌─────────────────────────────────────────────────────────────────────────────────┐
   │  5. Data Flow Diagram (From Data to Filter Result)                              │
   └─────────────────────────────────────────────────────────────────────────────────┘

       Input Data (100K rows)
               │
               ▼
       ┌──────────────────────────┐
       │ NgramFileIndex.Writer    │ (38 ms)
       │ Build index              │ 2,631 rows/ms
       └──────────────┬───────────┘
                      │
                      ▼
       ┌──────────────────────────┐
       │ Index Bytes (680 bytes)  │
       │ {N-gram set serialized}  │
       └──────────────┬───────────┘
                      │
           ┌──────────┴──────────┐
           │                     │
           ▼                     ▼
       File 1              File 1000
       Index segment       Index segment
           │                     │
           ▼ (25 µs)            ▼ (25 µs)
       ┌─────────────┐    ┌─────────────┐
       │ visitXxx()  │    │ visitXxx()  │
       │ REMAIN/SKIP │    │ REMAIN/SKIP │
       └─────────────┘    └─────────────┘
           │                     │
           └──────────┬──────────┘
                      │
                      ▼
       ┌──────────────────────────────┐
       │ File-level filter result     │
       │ - REMAIN: 100 files          │
       │ - SKIP: 900 files            │
       │ Skipped 900/1000 files       │
       └──────────────┬───────────────┘
                      │
                      ▼
       ┌──────────────────────────────┐
       │ Row-level scan (REMAIN only) │
       │ 100 files × 100K rows        │
       │ = 10M rows (vs 100M without) │
       │ Reduced 90%                  │
       └──────────────────────────────┘

Benchmark test result:

   ┌────────────────────────────────────────────────────────────────────────────────┐
   │ REAL-WORLD PERFORMANCE GAINS (Scenario: Query 1,000 files, 100K rows each)   │
   ├────────────────────────────────────────────────────────────────────────────────┤
   │                                                                                │
   │  No Index                                 With N-gram Index                   │
   │  ─────────────────────────────────────    ─────────────────────────────────   │
   │  • Scan: 1,000 files × 100K rows         • Index: 1,000 × 25µs = 25ms       │
   │  • Total: 100 million rows               • Scan: 100 files × 100K rows      │
   │  • Latency: ~100 ms                      • Total: 10 million rows           │
   │  • Files Scanned: 100%                   • Latency: ~26 ms                 │
   │                                          • Files Scanned: 10%               │
   │                                                                                │
   │  IMPROVEMENT: 74% faster | 90% fewer rows scanned | 99% I/O reduction       │
   │                                                                                │
   └────────────────────────────────────────────────────────────────────────────────┘
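
The row-count and latency figures in the table follow from simple arithmetic on the stated inputs (1,000 files, 100K rows each, 25 µs per index probe, 100 files remaining). A small sketch of that arithmetic, with a hypothetical helper class `SkipEstimateSketch` that is not part of the PR:

```java
/** Hypothetical helper that reproduces the arithmetic behind the table. */
final class SkipEstimateSketch {

    /** Total rows read when scanning the given number of files. */
    static long rowsScanned(long files, long rowsPerFile) {
        return files * rowsPerFile;
    }

    /** Time spent probing one index segment per file, in milliseconds. */
    static double indexProbeMillis(long files, double microsPerProbe) {
        return files * microsPerProbe / 1000.0;
    }

    /** Relative reduction from before to after, as a percentage. */
    static double reductionPercent(double before, double after) {
        return 100.0 * (before - after) / before;
    }
}
```

With these inputs, `rowsScanned(1000, 100_000)` gives 100 million rows and `rowsScanned(100, 100_000)` gives 10 million, a 90% reduction; `indexProbeMillis(1000, 25)` gives 25 ms; and going from ~100 ms to ~26 ms is a 74% reduction. The sketch covers only those figures.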

The current solution does not employ a Bloom filter, primarily to avoid the issue of false positives.

Tests

NgramFileIndexSimpleTest.java
NgramFileIndexTest.java

xuzifu666 added 2 commits May 20, 2026 23:24

[common] Introduce N-gram file index for string prefix/suffix/contain…

8fe1c54

…s queries

add benchmark test

ccbd6c3

xuzifu666 wants to merge 2 commits into apache:master from xuzifu666:n_gram_index
