Add a HybridReader for use in write constrained databases by jimhester · Pull Request #423 · posit-dev/ggsql

jimhester · 2026-05-01T20:54:17Z

Summary

Adds HybridReader, a Reader that composes any primary reader (the
"data" side) with an in-process DuckDBReader (the "staging" side).
register() writes go to staging; execute_sql routes queries that
mention any registered name to staging, and everything else to the
primary.

Behind the existing duckdb feature flag — no new feature, no new
dependencies.

Companion design comment with the broader sequencing context:
#341 (comment). Related to #422, but the implementation is distinct.

Motivation

Some data sources are read-only by nature (Flight SQL servers, anonymous
Trino) or expensive to write to repeatedly during visualization
iteration (Snowflake). HybridReader composes a primary reader (the
remote data source) with a local DuckDB instance (staging). register()
writes to staging — sidestepping read-only or auth restrictions — while
execute_sql routes queries to the right side based on which tables
they reference. Same Reader interface; no caller-visible difference.

The design also pairs naturally with a query-result cache (PR3) that
memoizes remote query results in the staging DuckDB. The cache isn't in
this PR, but the staging plumbing it relies on is.

Design

HybridReader owns:

data: Box<dyn Reader + Send> — the primary backend.
staging: DuckDBReader — an in-process DuckDB instance.
staged_names: RefCell<HashSet<String>> — the names register() has
put into staging.
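
The ownership layout and the register/route flow described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `Reader` trait here is a simplified stand-in with only two methods, `LabelReader` is a hypothetical toy backend used in place of both the real primary reader and `DuckDBReader`, rows are modeled as `Vec<i64>`, and the routing check is a naive token split rather than the actual `references_staged_name` scanner.

```rust
use std::cell::RefCell;
use std::collections::HashSet;

/// Simplified stand-in for the crate's Reader trait (assumption).
pub trait Reader {
    fn register(&self, name: &str, rows: Vec<i64>) -> Result<(), String>;
    fn execute_sql(&self, sql: &str) -> Result<String, String>;
}

/// Hypothetical toy backend that prefixes every result with its own label,
/// so a caller can see which side answered a query.
pub struct LabelReader(pub &'static str);

impl Reader for LabelReader {
    fn register(&self, _name: &str, _rows: Vec<i64>) -> Result<(), String> {
        Ok(())
    }
    fn execute_sql(&self, sql: &str) -> Result<String, String> {
        Ok(format!("{}: {}", self.0, sql))
    }
}

pub struct HybridReader {
    /// The primary backend.
    data: Box<dyn Reader + Send>,
    /// Stand-in for the in-process DuckDBReader.
    staging: LabelReader,
    /// Names that register() has put into staging.
    staged_names: RefCell<HashSet<String>>,
}

impl HybridReader {
    pub fn new(data: Box<dyn Reader + Send>, staging: LabelReader) -> Self {
        HybridReader {
            data,
            staging,
            staged_names: RefCell::new(HashSet::new()),
        }
    }
}

impl Reader for HybridReader {
    /// Writes always go to staging, and the name is tracked for routing.
    fn register(&self, name: &str, rows: Vec<i64>) -> Result<(), String> {
        self.staging.register(name, rows)?;
        self.staged_names.borrow_mut().insert(name.to_string());
        Ok(())
    }

    /// The whole query goes to staging if any token is a staged name,
    /// otherwise to the primary. Naive tokenization, for illustration only.
    fn execute_sql(&self, sql: &str) -> Result<String, String> {
        let names = self.staged_names.borrow();
        let mentions_staged = sql
            .split(|c: char| !(c.is_alphanumeric() || c == '_'))
            .any(|token| names.contains(token));
        if mentions_staged {
            self.staging.execute_sql(sql)
        } else {
            self.data.execute_sql(sql)
        }
    }
}
```

With this toy setup, a query against `orders` is answered by the data side until `register("orders", ...)` is called, after which the same query is answered by staging. The `RefCell` is what lets `register` record a name through `&self`.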

The routing predicate references_staged_name is a lightweight SQL
scanner — not a full parser. It checks whether any registered name
appears as a SQL identifier (with identifier-boundary respect, qualified
references like catalog.schema.name, double-quoted identifiers, and
single-quoted-string-literal exclusion). Comments are not currently
parsed: a stray identifier inside a -- comment could route a
primary-data query to staging, where it would fail with a clear error
rather than succeeding against the primary backend.
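
A scanner with the properties listed above might look roughly like the sketch below. It is an independent illustration under stated assumptions, not the PR's `references_staged_name`: the signature is a guess, matching is case-sensitive, and each dotted component of a qualified reference is checked on its own. Like the scanner described, it does not parse comments.

```rust
use std::collections::HashSet;

/// Returns true when `sql` mentions any name in `staged` as an identifier.
/// Hypothetical re-implementation for illustration only.
pub fn references_staged_name(sql: &str, staged: &HashSet<String>) -> bool {
    if staged.is_empty() {
        return false;
    }
    let chars: Vec<char> = sql.chars().collect();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c == '\'' {
            // Skip a single-quoted string literal; '' is an escaped quote.
            i += 1;
            while i < chars.len() {
                if chars[i] == '\'' {
                    if i + 1 < chars.len() && chars[i + 1] == '\'' {
                        i += 2;
                        continue;
                    }
                    break;
                }
                i += 1;
            }
            i += 1; // step past the closing quote
        } else if c == '"' {
            // A double-quoted identifier is compared verbatim.
            let start = i + 1;
            let mut end = start;
            while end < chars.len() && chars[end] != '"' {
                end += 1;
            }
            let ident: String = chars[start..end].iter().collect();
            if staged.contains(&ident) {
                return true;
            }
            i = end + 1;
        } else if c.is_alphabetic() || c == '_' {
            // Consume the whole bare identifier so `orders` never matches
            // inside `orders_detail`. A '.' ends the identifier, so every
            // part of `catalog.schema.orders` is checked separately.
            let start = i;
            while i < chars.len() && (chars[i].is_alphanumeric() || chars[i] == '_') {
                i += 1;
            }
            let ident: String = chars[start..i].iter().collect();
            if staged.contains(&ident) {
                return true;
            }
        } else {
            i += 1;
        }
    }
    false
}
```

Because comments are scanned like any other text, `SELECT 1 -- orders` reports a match when `orders` is staged, which is the false-positive case noted above.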

Reader::dialect() returns the staging DuckDB dialect, because all
internally-generated SQL (stat transforms, layer filters, temp-table
DDL) targets staging. Callers that need the primary's dialect (e.g.
schema introspection of the remote catalog) get it via the inherent
HybridReader::data_dialect() method.
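
The split between `dialect()` and `data_dialect()` can be illustrated with a toy dialect trait. Everything here other than the method names `dialect`, `data_dialect`, and `sql_greatest` is hypothetical: the `SqlDialect` trait, the two dialect structs, the `HybridDialects` holder, and the exact CASE expression are assumptions made for the sketch.

```rust
/// Hypothetical dialect trait with one dialect-dependent helper.
pub trait SqlDialect {
    fn sql_greatest(&self, a: &str, b: &str) -> String;
}

/// ANSI-style fallback: no GREATEST function, so use a CASE expression.
pub struct AnsiDialect;

/// DuckDB-style dialect: GREATEST is available directly.
pub struct DuckDbDialect;

impl SqlDialect for AnsiDialect {
    fn sql_greatest(&self, a: &str, b: &str) -> String {
        format!("CASE WHEN {0} >= {1} THEN {0} ELSE {1} END", a, b)
    }
}

impl SqlDialect for DuckDbDialect {
    fn sql_greatest(&self, a: &str, b: &str) -> String {
        format!("GREATEST({}, {})", a, b)
    }
}

/// Holds one dialect per side, mirroring the two accessors described above.
pub struct HybridDialects {
    pub data: Box<dyn SqlDialect>,
    pub staging: Box<dyn SqlDialect>,
}

impl HybridDialects {
    /// Internally generated SQL targets staging, so this is the default.
    pub fn dialect(&self) -> &dyn SqlDialect {
        &*self.staging
    }

    /// Explicit opt-in for callers that talk to the primary backend.
    pub fn data_dialect(&self) -> &dyn SqlDialect {
        &*self.data
    }
}
```

With an ANSI-style data side and a DuckDB-style staging side, `dialect().sql_greatest("a", "b")` yields the `GREATEST(a, b)` form, while `data_dialect()` yields the CASE form. Using two different dialects is what makes a mix-up between the accessors visible.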

Limitations (documented)

A single SQL statement cannot reference both staged names and
primary-data tables. Queries are dispatched whole; cross-backend joins
are unsupported. Materialize one side into staging first if you need to
combine them. There is a regression test pinning this behavior.
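
The whole-query dispatch and the materialize-first workaround can be modeled with a toy in which each backend is just a map from table name to rows. All names below (`Tables`, `run_on`, `dispatch`, `materialize`) are hypothetical, and the row-count result merely stands in for a real query result.

```rust
use std::collections::HashMap;

/// Toy backend: table name mapped to rows (assumption, not the PR's types).
pub type Tables = HashMap<String, Vec<i64>>;

/// Resolve every table a query needs against a single backend, failing on
/// the first table that backend does not hold.
pub fn run_on(backend: &Tables, needed: &[&str]) -> Result<usize, String> {
    let mut total = 0;
    for name in needed {
        match backend.get(*name) {
            Some(rows) => total += rows.len(),
            None => return Err(format!("table not found: {}", name)),
        }
    }
    Ok(total)
}

/// Dispatch the query whole: if any needed table is staged, it runs on
/// staging only; otherwise it runs on the primary only.
pub fn dispatch(primary: &Tables, staging: &Tables, needed: &[&str]) -> Result<usize, String> {
    if needed.iter().any(|n| staging.contains_key(*n)) {
        run_on(staging, needed)
    } else {
        run_on(primary, needed)
    }
}

/// Workaround: copy a primary table into staging before combining.
pub fn materialize(primary: &Tables, staging: &mut Tables, name: &str) -> Result<(), String> {
    let rows = primary
        .get(name)
        .ok_or_else(|| format!("table not found: {}", name))?;
    staging.insert(name.to_string(), rows.clone());
    Ok(())
}
```

In this model, a query needing both a staged table and a primary-only table fails on the staging side with a "table not found" error. After `materialize` copies the primary table across, the same query succeeds entirely within staging.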

Staged data lives in the in-process DuckDB instance and is released
when the HybridReader is dropped — no spill-to-disk, no shared cache.

Testing

All tests are offline, no external setup:

Routing scanner (9 tests): empty registered-name set, no match,
single match, rejection of longer-identifier overlap (orders should
not match orders_detail), rejection of identifier-prefix overlap
(col should not match col_id), rejection of single-quoted-string
contents, match of double-quoted identifiers, match of qualified
references (catalog.schema.orders), and SQL-standard '' escape
inside a string literal.
Reader behavior (5 tests): register delegates to staging and
tracks the name; execute_sql routes a registered name to staging;
execute_sql routes an unregistered name to data; unregister
delegates and untracks; dialect() returns the staging dialect with
a discriminating SQLite-on-the-data-side setup.
Cross-side limitation (1 test): a query referencing both staged
and primary-only names routes wholly to staging and surfaces a
staging-side error rather than silently joining. The setup
discriminates correct routing from a wrong-route that would
otherwise succeed.

The dialect-discrimination test uses a SqliteReader for the data
side (Ansi CASE-form sql_greatest) against a DuckDB staging
(GREATEST(a, b)), so a regression that returned the data dialect
instead of staging's would fail visibly. Gated on the sqlite feature,
which is in upstream's default feature set.

What's next

Per the design comment, a follow-up PR adds:

PR3: A query-result cache in the staging DuckDB
(hybrid_cache.rs), a Reader::clear_cache() trait default,
Vega-Lite v5+v6 mime emission in the Jupyter kernel, and the
-- @uncache Jupyter meta-command.

The cache makes the iterate-on-remote case sub-millisecond on cache
hits while keeping the same Reader interface; it's gated by an env
var and fronted by a public CacheConfig for callers that want to
tune TTL or the byte budget.

Wraps any Reader (the data side) with an in-process DuckDBReader (the staging side). register() writes to staging; execute_sql routes whole queries to staging or the primary based on whether they reference any registered name. Behind the existing 'duckdb' feature. Tests cover the routing scanner (identifier-boundary checks, qualified references, double-quoted identifiers, string-literal exclusion), register/unregister name tracking, dialect dispatch, and the documented cross-side limitation.

Per code review: the original tests for routing direction and dialect selection used identical setup on both sides, so they passed regardless of impl correctness. The dialect test now uses a SqliteReader on the data side (SQLite dialect) so the staging-vs-data distinction surfaces in sql_greatest output, and the cross-side test now registers staged_only in both data and staging with different values so a wrong-route would succeed silently rather than erroring for the same reason as the correct route. Also corrects an inverted "false-negative" label and softens the misleading "comments are harmless" note in the references_staged_name doc-comment.

Decode embedded parquet bytes via Arrow and register through the `arrow` virtual table function instead of writing a temp file and calling `read_parquet`. The latter triggers DuckDB's autoloadable parquet extension, which fails in offline or network-restricted environments (observed as a flaky CI failure on `test_ribbon_transposed_vegalite_encoding`). Mirrors the loader path SqliteReader already uses.

jimhester · 2026-05-21T16:28:43Z

Any updates on this? I'd like to open a PR adding caching on the viz side, but it requires this PR. I can stack it on top of this PR if you would like to see how it works; let me know.

georgestagg · 2026-05-22T08:52:51Z

Hi Jim, sorry about the delayed review. It's my fault, I've been busy with other projects.

I don't feel equipped to make the decision alone on this, because the introduced primary/staging mechanism is such a fundamentally different way to perform the computations required for visualisation. I am meeting with the team next week to discuss it further.

I am not saying "no". In fact I am sure we will have some form of this tiered approach go in. As you say, it's a requirement for local caching and will be a requirement especially when it comes to interactivity. However, I'm not yet convinced that the Reader is the right place for this mechanism to live long term, and there are some other questions I'd like to hash out with the team first.

Either way, I'll keep you updated, the PR has not been forgotten.

jimhester added 5 commits May 1, 2026 16:01

feat(reader): export HybridReader behind 'duckdb' feature

27a4b8b

style(hybrid): apply cargo fmt + clippy fixes

4df07d2

docs(changelog): announce HybridReader

138ddee

thomasp85 requested a review from georgestagg May 4, 2026 08:51

jimhester added 2 commits May 4, 2026 11:43

Merge remote-tracking branch 'upstream/main' into pr2-hybrid-reader

7aa76ae

# Conflicts: # CHANGELOG.md

jimhester wants to merge 7 commits into posit-dev:main from jimhester:pr2-hybrid-reader

2 participants
