Skip to content

[common] Introduce N-gram file index for query#7927

Open
xuzifu666 wants to merge 2 commits into
apache:masterfrom
xuzifu666:n_gram_index
Open

[common] Introduce N-gram file index for query#7927
xuzifu666 wants to merge 2 commits into
apache:masterfrom
xuzifu666:n_gram_index

Conversation

@xuzifu666
Copy link
Copy Markdown
Member

Purpose

Currently Paimon not support N-gram file index, so there is room for improvement in scenarios involving prefix and suffix queries.
Let me briefly explain the principles and workflow of the n-gram file index within this PR:

┌─────────────────────────────────────────────────────────────────────────────────┐
   │  1. Overall Architecture (Integration with Paimon FileIndexer Framework)        │
   └─────────────────────────────────────────────────────────────────────────────────┘

                               FileIndexer Interface
                                       │
                       ┌───────────────┼───────────────┐
                       │               │               │
               BloomFilter         Bitmap          N-gram ⭐
                  Index             Index           Index
                (equality)        (equality)    (prefix/suffix)

                       N-gram File Index
                              │
           ┌──────────────────┼──────────────────┐
           │                  │                  │
       Writer           Factory            Reader
      (Build)        (SPI Creation)       (Query Filter)
           │                  │                  │
           ▼                  ▼                  ▼

      Write Data  →   NgramFileIndex    →   Query Filter
      Generate N-gram  (Core Impl)       Apply Predicates
      Store HashSet    gram_size param    REMAIN/SKIP


   ┌─────────────────────────────────────────────────────────────────────────────────┐
   │  2. Index Build Process (Writing Phase)                                         │
   └─────────────────────────────────────────────────────────────────────────────────┘

              Input Rows
                   │
                   ▼
       ┌──────────────────────┐
       │ write("hello")       │
       │ write("world")       │
       └────────────┬─────────┘
                    │
                    ▼
       ┌──────────────────────────────────────┐
       │ 1. BinaryString → String conversion  │
       │ 2. Extract N-grams                   │
       │    "hello" → {he, el, ll, lo}        │
       │    "world" → {wo, or, rl, ld}        │
       │ 3. Add to HashSet                    │
       └────────────┬─────────────────────────┘
                    │
                    ▼
       ┌──────────────────────────────────────┐
       │ Final N-gram Set:                    │
       │ {he, el, ll, lo, wo, or, rl, ld}     │
       │                                      │
       │ Size: 680 bytes (100K records)       │
       │ Compression ratio: 0.03%             │
       └────────────┬─────────────────────────┘
                    │
                    ▼
       ┌──────────────────────────────────────┐
       │ Serialization Format:                │
       │ [4B gramSize][4B setSize]            │
       │ [2B len1][N bytes token1]            │
       │ [2B len2][N bytes token2]            │
       │ ...                                  │
       └────────────┬─────────────────────────┘
                    │
                    ▼
              Index Bytes
           (Written to file)


   ┌─────────────────────────────────────────────────────────────────────────────────┐
   │  3. Query Execution Flow (Reading & Filtering Phase)                            │
   └─────────────────────────────────────────────────────────────────────────────────┘

       SQL Query
         │
         ├─ LIKE 'he%'
         ├─ LIKE '%lo'
         ├─ LIKE '%ll%'
         └─ = 'hello'

            ▼
       ┌────────────────────────────────┐
       │ Predicate Optimization         │
       │ LIKE 'prefix%'                 │
       │   → StartsWith("prefix")       │
       │ LIKE '%suffix'                 │
       │   → EndsWith("suffix")         │
       │ LIKE '%middle%'                │
       │   → Contains("middle")         │
       └────────────┬───────────────────┘
                    │
                    ▼
       ┌──────────────────────────────────────┐
       │ FileIndexPredicate.evaluate()        │
       │ Iterate over each data file          │
       └────────────┬─────────────────────--──┘
                    │
                    ▼
       ┌──────────────────────────────────────┐
       │ visitStartsWith(fieldRef, "he")      │
       │                                      │
       │ 1. Get query pattern: "he"           │
       │ 2. Generate N-grams: {he}            │
       │ 3. Check each against index set      │
       │                                      │
       │ Check "he" ∈ {he,el,ll,lo,...}?      │
       └────────────┬─────────────────────--──┘
                    │
           ┌────────┴────────┐
           │                 │
           ▼                 ▼
          YES               NO
           │                 │
           ▼                 ▼
       ┌──────────────┐  ┌──────────────┐
       │ REMAIN       │  │ SKIP         │
       │ File might   │  │ File cannot  │
       │ contain data │  │ contain data │
       │ (scan rows)  │  │ (skip file)  │
       └──────────────┘  └──────────────┘
           │                 │
           └────────┬────────┘
                    │
                    ▼
           Merge results & row-level scan


   ┌─────────────────────────────────────────────────────────────────────────────────┐
   │  4. Filter Decision Logic (Decision Tree)                                       │
   └─────────────────────────────────────────────────────────────────────────────────┘

       Query pattern: pattern
               │
               ▼
       ┌──────────────────────────────┐
       │ pattern == null?             │ YES ──► REMAIN (conservative)
       │ pattern.isEmpty()?           │ YES ──► REMAIN (conservative)
       │ pattern.length < gramSize?   │ YES ──► REMAIN (cannot judge)
       └──────────────┬───────────────┘
                      │ NO
                      ▼
       ┌──────────────────────────────┐
       │ FOR i = 0 TO pattern.length  │
       │     ngram = pattern[i:i+g]   │
       │     IF ngram ∉ ngramSet:     │
       │         RETURN SKIP          │ Early exit (99% case)
       │ RETURN REMAIN               │
       └──────────────────────────────┘


   ┌─────────────────────────────────────────────────────────────────────────────────┐
   │  5. Data Flow Diagram (From Data to Filter Result)                              │
   └─────────────────────────────────────────────────────────────────────────────────┘

       Input Data (100K rows)
               │
               ▼
       ┌──────────────────────────┐
       │ NgramFileIndex.Writer    │ (38 ms)
       │ Build index              │ 2,631 rows/ms
       └──────────────┬───────────┘
                      │
                      ▼
       ┌──────────────────────────┐
       │ Index Bytes (680 bytes)  │
       │ {N-gram set serialized}  │
       └──────────────┬───────────┘
                      │
           ┌──────────┴──────────┐
           │                     │
           ▼                     ▼
       File 1              File 1000
       Index segment       Index segment
           │                     │
           ▼ (25 µs)            ▼ (25 µs)
       ┌─────────────┐    ┌─────────────┐
       │ visitXxx()  │    │ visitXxx()  │
       │ REMAIN/SKIP │    │ REMAIN/SKIP │
       └─────────────┘    └─────────────┘
           │                     │
           └──────────┬──────────┘
                      │
                      ▼
       ┌──────────────────────────────┐
       │ File-level filter result     │
       │ - REMAIN: 100 files          │
       │ - SKIP: 900 files            │
       │ Skipped 900/1000 files       │
       └──────────────┬───────────────┘
                      │
                      ▼
       ┌──────────────────────────────┐
       │ Row-level scan (REMAIN only) │
       │ 100 files × 100K rows        │
       │ = 10M rows (vs 100M without) │
       │ Reduced 90%                  │
       └──────────────────────────────┘

benchmark test result:

   ┌────────────────────────────────────────────────────────────────────────────────┐
   │ REAL-WORLD PERFORMANCE GAINS (Scenario: Query 1,000 files, 100K rows each)   │
   ├────────────────────────────────────────────────────────────────────────────────┤
   │                                                                                │
   │  No Index                                 With N-gram Index                   │
   │  ─────────────────────────────────────    ─────────────────────────────────   │
   │  • Scan: 1,000 files × 100K rows         • Index: 1,000 × 25µs = 25ms       │
   │  • Total: 100 million rows               • Scan: 100 files × 100K rows      │
   │  • Latency: ~100 ms                      • Total: 10 million rows           │
   │  • Files Scanned: 100%                   • Latency: ~26 ms                 │
   │                                          • Files Scanned: 10%               │
   │                                                                                │
   │  IMPROVEMENT: 74% faster | 90% fewer rows scanned | 99% I/O reduction       │
   │                                                                                │
   └────────────────────────────────────────────────────────────────────────────────┘

The current solution does not employ a Bloom filter, primarily to avoid the issue of false positives.

Tests

NgramFileIndexSimpleTest.java
NgramFileIndexTest.java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant