Add a HybridReader for use in write constrained databases#423
Add a HybridReader for use in write constrained databases#423jimhester wants to merge 7 commits into
Conversation
Wraps any Reader (the data side) with an in-process DuckDBReader (the staging side). register() writes to staging; execute_sql routes whole queries to staging or the primary based on whether they reference any registered name. Behind the existing 'duckdb' feature. Tests cover the routing scanner (identifier-boundary checks, qualified references, double-quoted identifiers, string-literal exclusion), register/unregister name tracking, dialect dispatch, and the documented cross-side limitation.
Per code review: the original tests for routing direction and dialect selection used identical setup on both sides, so they passed regardless of impl correctness. The dialect test now uses a SqliteReader on the data side (SQLite dialect) so the staging-vs-data distinction surfaces in sql_greatest output, and the cross-side test now registers staged_only in both data and staging with different values so a wrong-route would succeed silently rather than erroring for the same reason as the correct route. Also corrects an inverted "false-negative" label and softens the misleading "comments are harmless" note in the references_staged_name doc-comment.
Decode embedded parquet bytes via Arrow and register through the `arrow` virtual table function instead of writing a temp file and calling `read_parquet`. The latter triggers DuckDB's autoloadable parquet extension, which fails in offline or network-restricted environments (observed as a flaky CI failure on `test_ribbon_transposed_vegalite_encoding`). Mirrors the loader path SqliteReader already uses.
# Conflicts: # CHANGELOG.md
|
Any updates around this? I'd like to open a PR adding caching of the viz side, but it requires this PR. I can stack it on top of this PR if would would want to see how it works, let me know. |
|
Hi Jim, sorry about the delayed review. It's my fault, I've been busy with other projects. I don't feel equipped to make the decision alone on this, because the introduced primary/staging mechanism such a fundamentally different way to perform the computations required for visualisation. I am meeting with the team next week to discuss it further. I am not saying "no". In fact I am sure we will have some form of this tiered approach go in. As you say, it's a requirement for local caching and will be a requirement especially when it comes to interactivity. However, I'm not yet convinced that the Reader is the right place for this mechanism to live long term, and there are some other questions I'd like to hash out with the team first. Either way, I'll keep you updated, the PR has not been forgotten. |
Summary
Adds
HybridReader, aReaderthat composes any primary reader (the"data" side) with an in-process
DuckDBReader(the "staging" side).register()writes go to staging;execute_sqlroutes queries thatmention any registered name to staging, and everything else to the
primary.
Behind the existing
duckdbfeature flag — no new feature, no newdependencies.
Companion design comment with the broader sequencing context:
#341 (comment). Related to but implementation distinct from #422
Motivation
Some data sources are read-only by nature (Flight SQL servers, anonymous
Trino) or expensive to write to repeatedly during visualization
iteration (Snowflake).
HybridReadercomposes a primary reader (theremote data source) with a local DuckDB instance (staging).
register()writes to staging — sidestepping read-only or auth restrictions — while
execute_sqlroutes queries to the right side based on which tablesthey reference. Same
Readerinterface; no caller-visible difference.The design also pairs naturally with a query-result cache (PR3) that
memoizes remote query results in the staging DuckDB. The cache isn't in
this PR, but the staging plumbing it relies on is.
Design
HybridReaderowns:data: Box<dyn Reader + Send>— the primary backend.staging: DuckDBReader— an in-process DuckDB instance.staged_names: RefCell<HashSet<String>>— the namesregister()hasput into staging.
The routing predicate
references_staged_nameis a lightweight SQLscanner — not a full parser. It checks whether any registered name
appears as a SQL identifier (with identifier-boundary respect, qualified
references like
catalog.schema.name, double-quoted identifiers, andsingle-quoted-string-literal exclusion). Comments are not currently
parsed: a stray identifier inside a
-- commentcould route aprimary-data query to staging, where it would fail with a clear error
rather than succeeding against the primary backend.
Reader::dialect()returns the staging DuckDB dialect, because allinternally-generated SQL (stat transforms, layer filters, temp-table
DDL) targets staging. Callers that need the primary's dialect (e.g.
schema introspection of the remote catalog) get it via the inherent
HybridReader::data_dialect()method.Limitations (documented)
A single SQL statement cannot reference both staged names and
primary-data tables. Queries are dispatched whole; cross-backend joins
are unsupported. Materialize one side into staging first if you need to
combine them. There is a regression test pinning this behavior.
Staged data lives in the in-process DuckDB instance and is released
when the
HybridReaderis dropped — no spill-to-disk, no shared cache.Testing
All tests are offline, no external setup:
single match, rejection of longer-identifier overlap (
ordersshouldnot match
orders_detail), rejection of identifier-prefix overlap(
colshould not matchcol_id), rejection of single-quoted-stringcontents, match of double-quoted identifiers, match of qualified
references (
catalog.schema.orders), and SQL-standard''escapeinside a string literal.
registerdelegates to staging andtracks the name;
execute_sqlroutes a registered name to staging;execute_sqlroutes an unregistered name to data;unregisterdelegates and untracks;
dialect()returns the staging dialect witha discriminating SQLite-on-the-data-side setup.
and primary-only names routes wholly to staging and surfaces a
staging-side error rather than silently joining. The setup
discriminates correct routing from a wrong-route that would
otherwise succeed.
The dialect-discrimination test uses a
SqliteReaderfor the dataside (Ansi
CASE-formsql_greatest) against a DuckDB staging(
GREATEST(a, b)), so a regression that returned the data dialectinstead of staging's would fail visibly. Gated on the
sqlitefeature,which is in upstream's default feature set.
What's next
Per the design comment, a follow-up PR adds:
(
hybrid_cache.rs), aReader::clear_cache()trait default,Vega-Lite v5+v6 mime emission in the Jupyter kernel, and the
-- @uncacheJupyter meta-command.The cache makes the iterate-on-remote case sub-millisecond on cache
hits while keeping the same
Readerinterface; it's gated by an envvar and fronted by a public
CacheConfigfor callers that want totune TTL or the byte budget.