feat(backend/kernel): add use_kernel=True flag — route through the Rust kernel via PyO3 #787
vikrantpuppala wants to merge 9 commits into
Conversation
Phase 2 of the PySQL × kernel integration plan (databricks-sql-kernel/docs/designs/pysql-kernel-integration.md). Wires `use_sea=True` to a new `backend/kernel/` module that delegates to the Rust kernel via the `databricks_sql_kernel` PyO3 extension (kernel PR #13).

New module: `src/databricks/sql/backend/kernel/`

- `client.py` — `KernelDatabricksClient(DatabricksClient)`. Lazy-imports `databricks_sql_kernel` so a connector install without the kernel wheel doesn't `ImportError` at startup; only `use_sea=True` surfaces the missing-extra message. Implements open/close_session, sync + async execute_command (async_op=True goes through `Statement.submit()` and stashes the handle in a dict keyed on `CommandId`), cancel/close_command, get_query_state, get_execution_result, and the metadata calls (catalogs / schemas / tables / columns) via `Session.metadata().list_*`. Real server-issued session and statement IDs flow through (no synthetic UUIDs).
- `auth_bridge.py` — translates the connector's `AuthProvider` into kernel `Session` kwargs. PAT (including federation-wrapped PAT — `get_python_sql_connector_auth_provider` always wraps the base in `TokenFederationProvider`, so a naive isinstance check never matches) routes through `auth_type="pat"`. Everything else routes through `auth_type="external"` with a callback that delegates to `auth_provider.add_headers({})`. (External today is rejected by the kernel at `build_auth_provider`; the separate kernel-side enablement PR will flip it on.)
- `result_set.py` — `KernelResultSet(ResultSet)`. Duck-typed over `databricks_sql_kernel.ExecutedStatement` (sync execute) and `ResultStream` (metadata + async await_result), since both expose `arrow_schema()` / `fetch_next_batch()` / `fetch_all_arrow()` / `close()`. Same FIFO batch buffer the prior ADBC POC used, so `fetchmany(n)` for n smaller than the kernel's natural batch size doesn't re-fetch.
- `type_mapping.py` — Arrow → PEP 249 description-string mapper. Lifted from the prior ADBC POC; centralised here so future kernel-result wrappers reuse the same mapping.

Kernel errors → PEP 249 exceptions: `KernelError.code` is mapped in a single table to `ProgrammingError` / `OperationalError` / `DatabaseError`. The structured fields (`sql_state`, `error_code`, `query_id`, …) are copied onto the re-raised exception so callers can branch on them without reaching through `__cause__`.

Routing: `Session._create_backend` flips the `use_sea=True` branch to instantiate `KernelDatabricksClient` instead of the native `SeaDatabricksClient`. The native `backend/sea/` module is left in place (no users on `use_sea=True` after this PR; its long-term fate is out of scope here).

Packaging: `[tool.poetry.extras] kernel = ["databricks-sql-kernel"]`. `pip install 'databricks-sql-connector[kernel]'` pulls in the kernel wheel; `use_sea=True` without the extra raises a pointed ImportError telling the user how to install it.

Known gaps (acknowledged, will be follow-ups):

- Parameter binding (`execute_command(parameters=[...])`) raises NotSupportedError — PyO3 `Statement.bind_param` lands in a follow-up.
- Statement-level `query_tags` raises NotSupportedError.
- `get_tables(table_types=[...])` returns unfiltered rows (the native SEA backend's filter is keyed on `SeaResultSet`; needs a small port to operate on `KernelResultSet`).
- External-auth end-to-end is blocked on the kernel-side `AuthConfig::External` enablement PR.
- Volume PUT/GET (staging operations): the kernel has no Volume API.
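For concreteness, a minimal usage sketch of the routing described above, as of this revision of the PR (the flag is later renamed — see the review discussion below). Hostname, warehouse path, and token are placeholders; the `connect()` parameters are the connector's standard ones.

```python
from databricks import sql

# Placeholders — substitute real workspace values.
with sql.connect(
    server_hostname="example.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/abc123",
    access_token="dapi-...",  # PAT routes through auth_type="pat"
    use_sea=True,             # after this PR: KernelDatabricksClient
) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        print(cur.fetchall())
```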
Test plan:

- Unit: 37 new tests across `tests/unit/test_kernel_auth_bridge.py` (auth provider → kwargs mapping, including federation-wrapped PAT and the External trampoline call-counter check), `tests/unit/test_kernel_type_mapping.py` (Arrow type mapping + description shape), and `tests/unit/test_kernel_result_set.py` (buffer semantics, fetchmany across batch boundaries, idempotent close, close() swallowing handle-close failures). All pass.
- Full unit suite: 600 pre-existing tests still pass; one pre-existing failure (`test_useragent_header` — agent detection adds `agent/claude-code` in this env) was already failing on main, unrelated to this change.
- Live e2e against dogfood with `use_sea=True`: SELECT 1, `range(10000)`, `fetchmany` pacing, `fetchall_arrow`, all four metadata calls (returned 75 catalogs / 144 schemas in main / 47 tables in `system.information_schema` / 15 columns), `session_configuration={'ANSI_MODE': 'false'}` round-trips, bad SQL surfaces as DatabaseError with `code='SqlError'` and `sql_state='42P01'` on the exception. All checks pass.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
The earlier auth_bridge routed OAuth/MSAL/federation through the kernel's External token-provider trampoline (a Python callable the kernel invoked per HTTP request). Removing that for now.

Why: routing OAuth into the kernel inherently requires per-request token resolution to keep refresh working during a long-running session. There are two viable mechanisms (kernel-native OAuth, or the External callback); both have costs (duplicate OAuth flows vs GIL-per-request). Punting the decision until there's actual demand on use_sea=True.

Today: the bridge accepts PAT (including TokenFederationProvider-wrapped PAT, which is how `get_python_sql_connector_auth_provider` always shapes it). Any non-PAT auth_provider raises a clear NotSupportedError pointing the user at use_sea=False (Thrift). This shrinks the auth_bridge to ~50 lines and means the kernel-side External enablement PR is no longer on the connector's critical path — there's no kernel-side prerequisite for shipping use_sea=True for PAT users.

Unit tests updated:
- TokenFederationProvider-wrapped PAT still routes to PAT (kept).
- Generic OAuth provider raises NotSupportedError (new).
- ExternalAuthProvider raises NotSupportedError (new).
- Silent non-PAT provider raises NotSupportedError (new) — reject the type itself rather than trying to extract a token we already know we can't use.

Live e2e against dogfood with use_sea=True (PAT): all checks still pass (SELECT 1, range(10000), fetchmany pacing, four metadata calls, session_configuration round-trip, structured DatabaseError on bad SQL).

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
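A minimal sketch of the PAT-only bridge described above, assuming the connector's real provider classes (`AccessTokenAuthProvider`, `TokenFederationProvider` with its wrapped provider on `.external_provider`); the shipped `auth_bridge.py` may differ in detail.

```python
from databricks.sql.auth.authenticators import AccessTokenAuthProvider
from databricks.sql.exc import NotSupportedError


def kernel_auth_kwargs(auth_provider) -> dict:
    """Translate a connector AuthProvider into kernel Session auth kwargs."""
    # Look through the federation wrapper: get_python_sql_connector_auth_provider
    # always wraps the base provider in TokenFederationProvider, so check the
    # wrapped provider rather than the wrapper itself.
    inner = getattr(auth_provider, "external_provider", auth_provider)
    if isinstance(inner, AccessTokenAuthProvider):
        headers: dict = {}
        inner.add_headers(headers)  # writes "Authorization: Bearer <pat>"
        return {
            "auth_type": "pat",
            "access_token": headers["Authorization"].removeprefix("Bearer "),
        }
    # Reject the provider type itself rather than extracting a token the
    # kernel can't refresh (OAuth / MSAL / external).
    raise NotSupportedError(
        f"{type(auth_provider).__name__} is not supported on use_sea=True; "
        "use a personal access token or use_sea=False (Thrift)."
    )
```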
Moves the previously-ad-hoc /tmp/connector_smoke.py into the repo
as a real pytest module under tests/e2e/ — same convention as the
rest of the e2e suite. Uses the existing session-scoped
`connection_details` fixture from the top-level conftest so it
shares the credential surface with every other live test.
11 tests cover:
- connect() with use_sea=True opens a session.
- SELECT 1: rows + description shape (column name + dbapi type slug).
- SELECT * FROM range(10000): multi-batch drain.
- fetchmany() pacing across the buffer boundary.
- fetchall_arrow() returns a pyarrow Table.
- All four metadata methods (catalogs / schemas / tables / columns).
- session_configuration={'ANSI_MODE': 'false'} round-trips.
- Bad SQL surfaces as DatabaseError with `code='SqlError'` and
`sql_state='42P01'` attached as exception attributes.
Module-level skips:
- `databricks_sql_kernel` not importable → whole module skipped via
pytest.importorskip (the wheel hasn't been installed).
- Live creds missing → fixture-level skip with a pointed message.
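The module-level gate is the stock pytest pattern — a sketch of the top of the file:

```python
import pytest

# Skips the entire module at collection time when the PyO3 wheel is absent.
databricks_sql_kernel = pytest.importorskip("databricks_sql_kernel")
```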
Run: `pytest tests/e2e/test_kernel_backend.py -v`. All 11 pass
against dogfood in ~20s.
Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
Two updates since the initial PR:

1. Dropped External auth → PAT-only on the kernel backend (25723627).
2. Live e2e tests moved into the repo (6b308156), replacing the previous ad-hoc /tmp/connector_smoke.py script.

The auth_bridge unit tests are updated: OAuth providers / ExternalAuthProvider now assert NotSupportedError.
CI is failing across all jobs at `poetry lock` time:

    Because databricks-sql-connector depends on databricks-sql-kernel
    (^0.1.0) which doesn't match any versions, version solving failed.

The kernel wheel isn't yet published to PyPI — we verified the name is available via the Databricks proxy, but the package itself hasn't been built and uploaded yet. Declaring it as a poetry dep (even an optional one inside an extra) requires the version to be resolvable, and `poetry lock` runs as the setup step for every CI job: unit tests, linting, type checks, all of them.

Fix: drop the `databricks-sql-kernel` dep declaration and the `[kernel]` extra from pyproject.toml until the wheel is on PyPI. The lazy import in `backend/kernel/client.py` still raises a clear ImportError pointing at `pip install databricks-sql-kernel` (or local maturin) when use_sea=True is invoked without the kernel present.
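A sketch of that lazy-import guard — the helper name `_load_kernel` and the exact message wording are illustrative, not the shipped code:

```python
def _load_kernel():
    """Import the PyO3 extension only when use_sea=True is actually used."""
    try:
        import databricks_sql_kernel
    except ImportError as exc:
        raise ImportError(
            "use_sea=True requires the databricks-sql-kernel extension. "
            "Install it with `pip install databricks-sql-kernel` or build "
            "it locally with maturin, then retry."
        ) from exc
    return databricks_sql_kernel
```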
When the kernel is published, a small follow-up will add back:

```toml
databricks-sql-kernel = {version = "^0.1.0", optional = true}

[tool.poetry.extras]
kernel = ["databricks-sql-kernel"]
```
A pointed comment in pyproject.toml documents the deferred change.
Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
Three CI failures after the poetry-lock fix uncovered three real issues:

1. pyarrow is optional in the connector. The default-deps CI test job installs without it; the +PyArrow job installs with. The kernel backend's result_set.py + type_mapping.py import pyarrow eagerly (the kernel always returns pyarrow), and the unit tests import the backend at collection time — which crashes the default-deps job at ModuleNotFoundError. Fix: gate the three kernel unit tests on `pytest.importorskip("pyarrow")` so they skip on default-deps and run on +PyArrow. Verified locally: 39 pass with pyarrow, 3 skipped without. No change to the backend module itself — nothing imports it until use_sea=True is invoked, and pyarrow is on the kernel wheel's runtime dep list so use_sea=True can't hit this either.
2. mypy: KernelDatabricksClient.open_session returns self._session_id, which mypy types as Optional[SessionId] because the field starts as None. Fix: bind the new id to a local non-Optional variable, assign to the field, return the local (sketched below). CI's check-types runs cleanly on backend/kernel/ now; pre-existing mypy noise elsewhere isn't mine.
3. black --check: black 22.12.0 (the version CI pins) wants reformatting on result_set.py / type_mapping.py / client.py. Applied. Verified locally with the same black version.

All 39 kernel unit tests + 619 pre-existing unit tests pass.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
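The narrowing fix in item 2 is a standard mypy pattern; a self-contained toy (types simplified from the real `SessionId`):

```python
from typing import Optional


class Client:
    def __init__(self) -> None:
        self._session_id: Optional[str] = None  # starts as None

    def open_session(self) -> str:
        # mypy types the attribute as Optional[str]; bind the new id to a
        # non-Optional local, assign it to the field, return the local.
        session_id = "server-issued-id"  # stand-in for the kernel call
        self._session_id = session_id
        return session_id
```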
The +PyArrow CI matrix installs pyarrow but not the databricks-sql-kernel wheel (the wheel isn't on PyPI yet, and the [kernel] extra is deferred — see commit 31ca581). The previous fix gated unit tests on `pytest.importorskip("pyarrow")`, but test_kernel_auth_bridge.py was still pulled into a kernel-wheel ImportError because:

    src/databricks/sql/backend/kernel/__init__.py
      -> from databricks.sql.backend.kernel.client import KernelDatabricksClient
        -> import databricks_sql_kernel  # ImportError on +PyArrow CI

The eager re-export from `__init__.py` was a convenience that broke every consumer that only needed a submodule (type_mapping, result_set, auth_bridge) — they all triggered the kernel-wheel import for no reason.

Fix:
- Drop the eager re-export from `kernel/__init__.py`. A comment documents why and points callers (= session.py::_create_backend, already this shape) at the direct `from .client import ...`.
- Drop the no-longer-needed `pytest.importorskip("pyarrow")` / `importorskip("databricks_sql_kernel")` from test_kernel_auth_bridge.py — auth_bridge.py itself has neither dep, so the test now runs on every CI matrix variant.
- test_kernel_result_set.py and test_kernel_type_mapping.py keep the pyarrow importorskip because they themselves use pyarrow.

Verified locally across the three matrix shapes:
- both pyarrow + kernel installed: 39 pass.
- pyarrow only (no kernel wheel — the +PyArrow CI shape): 39 pass.
- neither: 9 pass (auth_bridge only), 2 modules skip (the others use pyarrow).

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
The connector's coverage CI job runs the full e2e suite, several of
whose test classes parametrize ``extra_params`` over ``{}`` and
``{"use_sea": True}``. With ``use_sea=True`` now routing through
the Rust kernel via PyO3, those cases die at ``connect()`` with our
pointed ImportError because the ``databricks-sql-kernel`` wheel
isn't yet on PyPI — and that CI job (sensibly) doesn't try to
build it from a sibling repo.
Fix: ``pytest_collection_modifyitems`` hook in the top-level
``conftest.py`` that adds a ``skip`` marker to any parametrize case
with ``extra_params={"use_sea": True, ...}`` when
``importlib.util.find_spec("databricks_sql_kernel")`` returns
``None``. Behavior change is CI-only — local dev with the kernel
wheel installed (via ``maturin develop`` from the kernel repo)
runs those cases as before.
Once the kernel wheel is published, the [kernel] extra in
pyproject.toml gets enabled (see comment block there) and the
default-deps CI matrix will install it; the skip then becomes a
no-op.
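A sketch of the hook; it assumes each parametrized case exposes `extra_params` via `item.callspec` (true for `@pytest.mark.parametrize`):

```python
import importlib.util

import pytest


def pytest_collection_modifyitems(config, items):
    if importlib.util.find_spec("databricks_sql_kernel") is not None:
        return  # kernel wheel importable — run everything as before
    skip = pytest.mark.skip(reason="databricks-sql-kernel wheel not installed")
    for item in items:
        params = getattr(getattr(item, "callspec", None), "params", {}) or {}
        extra = params.get("extra_params")
        if isinstance(extra, dict) and extra.get("use_sea") is True:
            item.add_marker(skip)
```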
Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
CI status, final: 33 / 34 checks pass. Failure breakdown of the one failing check:

My PR contributions to this job (all working as intended).

I don't believe my PR should be responsible for fixing the 7 pre-existing failures — they need their own fixes (server-side investigation for MST metadata; switching …).
| logger.debug("Creating Thrift backend client") | ||
| databricks_client_class = ThriftDatabricksClient | ||
| # `use_sea=True` now routes through the Rust kernel via | ||
| # PyO3. The native pure-Python SEA backend |
There was a problem hiding this comment.
High severity — multi-reviewer consensus (architecture, agent-compat, devils-advocate, ops, test)
use_sea=True previously routed to SeaDatabricksClient (the pure-Python SEA REST backend). After this PR, the same kwarg routes to a Rust PyO3 wheel that isn't on PyPI yet. Side effects on existing use_sea=True users:
- ImportError on upgrade unless they separately install `databricks-sql-kernel`
- OAuth / federation / external auth → `NotSupportedError` (auth_bridge.py)
- Parameter binding → `NotSupportedError` (client.py:476-484)
- `query_tags` → `NotSupportedError`
- Volume PUT/GET → unsupported
- Telemetry mis-reports kernel sessions as `DatabricksClientType.SEA`
- The native SEA backend (`backend/sea/`, ~700 LOC) is now zombie code: still in the tree, no longer reachable through any documented entry point.
The module docstring at backend/kernel/__init__.py even says the module's identity is deliberately decoupled from SEA REST (the kernel may switch transport SEA REST → SEA gRPC → …). That contradicts the flag name.
Recommend: introduce use_kernel=True as a new explicit flag. Leave use_sea=True routing to SeaDatabricksClient for now. Deprecate use_sea on a published timeline once the kernel reaches feature parity. Bundles cleanly with the related issues: docstring update at client.py:117 ("Use the SEA backend instead of the Thrift backend") is now factually wrong; CHANGELOG.md has no entry for this behavior change; no version bump; telemetry still reports DatabricksClientType.SEA for kernel sessions.
Addressed in 24e9a5c — introduced a dedicated use_kernel=True flag. use_sea=True once again routes to the native pure-Python SeaDatabricksClient (unchanged); the new flag is opt-in and mutually exclusive. PR title + description updated to match.
```python
logger.debug("Creating kernel-backed client for use_sea=True")
return KernelDatabricksClient(
    server_hostname=server_hostname,
```
High severity — flagged by ops, devils-advocate, architecture, maintainability, language
The kernel branch hardcodes 8 kwargs into KernelDatabricksClient(...). The Thrift branch (below) splats **kwargs. Silently dropped on use_sea=True:
_socket_timeout- All
_retry_*knobs (_retry_stop_after_attempts_count,_retry_delay_*, …) _tls_no_verify,_tls_verify_hostname,_tls_trusted_ca_filepool_maxsizeuse_cloud_fetch,use_hybrid_disposition,enable_query_result_lz4_compressionstaging_allowed_local_pathquery_tags- User-agent extras
_use_arrow_native_complex_types
KernelDatabricksClient.__init__ accepts **kwargs and never references them — accept-and-ignore with zero log line.
The most dangerous case: a user setting _tls_no_verify=True on Thrift (for an on-prem proxy / self-signed cert) gets that honored. On use_sea=True it silently no-ops and the kernel's own TLS stack will verify the cert. The operator believes verification is disabled when it isn't. Same for custom CA bundle (_tls_trusted_ca_file).
Worse: DriverConnectionParameters in telemetry (client.py:396-398) continues reporting socket_timeout=kwargs.get("_socket_timeout", None) — so dashboards lie about what's actually applied.
Recommend: at minimum, log a single WARNING at session-open enumerating which kwargs the kernel backend cannot honor (one-shot per process). Long-term, plumb retry/timeout/proxy/TLS through to the kernel, or refuse to start when unsupported knobs are set.
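A sketch of the recommended one-shot warning; the names `_KERNEL_UNSUPPORTED` and `warn_ignored_kwargs` are hypothetical, not from the PR:

```python
import logging

logger = logging.getLogger(__name__)

_KERNEL_UNSUPPORTED = {
    "_socket_timeout", "_tls_no_verify", "_tls_verify_hostname",
    "_tls_trusted_ca_file", "pool_maxsize", "use_cloud_fetch",
    "staging_allowed_local_path",
}
_warned = False


def warn_ignored_kwargs(kwargs: dict) -> None:
    """Log once per process which kwargs the kernel backend cannot honor."""
    global _warned
    ignored = sorted(_KERNEL_UNSUPPORTED & kwargs.keys())
    if ignored and not _warned:
        _warned = True
        logger.warning(
            "Kernel backend ignores unsupported connection kwargs: %s",
            ", ".join(ignored),
        )
```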
Deferred — called out as a known gap in the updated PR description. The kernel manages its own HTTP stack today (TLS, retry, timeout, pool); we'll plumb a per-knob bridge as those surfaces appear kernel-side. Switching to use_kernel=True (24e9a5c) means existing use_sea=True callers — who relied on ssl_options / http_headers / retry knobs being honored on the SEA backend — are unaffected.
```python
logger.warning("Error closing kernel handle: %s", exc)
self._buffer.clear()
self._kernel_handle = None
self._exhausted = True
```
High severity — flagged by architecture, ops, performance
The base ResultSet.close() (src/databricks/sql/result_set.py:166-190) calls self.backend.close_command(self.command_id) to free server-side state and to drop client-side tracking. KernelResultSet.close() overrides the base entirely and calls only self._kernel_handle.close().
For the sync execute path this is OK (the executed handle owns the server statement; closing it releases server state). For the async path (execute_command(async_op=True)), the handle is also tracked in KernelDatabricksClient._async_handles. Result:
- User calls `cursor.execute(..., async_op=True)` → handle stored in `_async_handles`.
- User calls `cursor.fetchall()` → result set built.
- User calls `cursor.close()` → `KernelResultSet.close()` calls `_kernel_handle.close()` directly.
- `_async_handles[command_id.guid]` still holds the now-closed handle.
- Later, `close_session()` iterates `_async_handles.values()` and calls `.close()` again on the dead handle.
The kernel's close() is idempotent per the PR's docstring, so this isn't a crash — but the bookkeeping is inconsistent and leaks an entry for every async-submitted statement that closes through the result-set path. Long-lived connections accumulate dead entries.
Recommend: KernelResultSet.close() should either (a) call super().close() (which calls backend.close_command), or (b) explicitly self.backend._async_handles.pop(self.command_id.guid, None) after closing the handle.
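A sketch of option (b), the shape the author ultimately adopts (later also taken under the `_async_handles` lock); attribute names follow the finding's description:

```python
def close(self) -> None:  # method body sketch, on KernelResultSet
    try:
        if self._kernel_handle is not None:
            self._kernel_handle.close()
    finally:
        # Drop the async-tracking entry so close_session() doesn't re-close
        # a dead handle later; no-op for sync-execute and metadata paths.
        self.backend._async_handles.pop(self.command_id.guid, None)
        self._kernel_handle = None
```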
Fixed in 24e9a5c. KernelResultSet.close() now pops the entry from backend._async_handles (under the new _async_handles_lock). No-op for sync-execute and metadata paths, which never register there.
```python
self.has_more_rows = False
self.status = CommandState.SUCCEEDED
return False
if batch.num_rows > 0:
```
High severity — performance, empirically verified
```python
def _buffered_rows(self) -> int:
    if not self._buffer:
        return 0
    first = self._buffer[0].num_rows - self._buffer_offset
    rest = sum(b.num_rows for b in list(self._buffer)[1:])  # allocates + O(M)
    return first + rest
```

Two issues:

- `list(self._buffer)[1:]` allocates a fresh list on every call — gratuitous. Use `itertools.islice(self._buffer, 1, None)` or iterate the deque.
- `_ensure_buffered` calls `_buffered_rows()` in a loop, once per pulled batch → O(M²) in batch count.
Empirically verified with a synthetic harness:
- 5,000 single-row batches → ~447ms just in the row-count loop (Python-side, no pyarrow).
Hot path: fetchmany(small_N) / fetchone() repeatedly when the kernel returns many small batches, or fetchall() over a deep stream.
Fix (~10 lines): track self._buffered_count: int as a running counter — += batch.num_rows in _pull_one_batch, -= take in _take_buffered, recompute on _drain. _buffered_rows() becomes O(1); _ensure_buffered becomes O(M).
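A self-contained toy with the same bookkeeping as the proposed fix; the real `KernelResultSet` pulls batches from the kernel handle rather than an iterator:

```python
from collections import deque


class BufferedBatches:
    def __init__(self, batches):
        self._source = iter(batches)   # stand-in for fetch_next_batch()
        self._buffer = deque()
        self._buffered_count = 0       # running row counter

    def _pull_one_batch(self) -> bool:
        batch = next(self._source, None)
        if batch is None:
            return False
        self._buffer.append(batch)
        self._buffered_count += batch.num_rows  # += on pull
        return True

    def _take_buffered(self, take: int) -> None:
        self._buffered_count -= take            # -= on take
        # (deque slicing elided — unchanged from the original)

    def _buffered_rows(self) -> int:
        return self._buffered_count             # O(1), no list() copy
```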
Fixed in 37fa544. Replaced _buffered_rows with a running counter _buffered_count maintained by _pull_one_batch / _take_buffered / _drain. _buffered_rows is now O(1); _ensure_buffered is O(M) in batch count instead of O(M²).
```python
@pytest.fixture(scope="session")
def host():
    return os.getenv("DATABRICKS_SERVER_HOSTNAME")
```
High severity — flagged by test, ops
The new pytest_collection_modifyitems skips every extra_params={"use_sea": True} case when the kernel wheel isn't importable. CI installs --all-extras (.github/workflows/code-coverage.yml:39), but pyproject explicitly does NOT declare a [kernel] extra (per the PR's own comment at pyproject.toml:55-68). Result: every one of those cases is reported SKIPPED on every CI run.
Verified via grep against tests/e2e/test_driver.py — there are ~14 distinct @pytest.mark.parametrize("extra_params", [{}, {"use_sea": True}]) functions (test_query_with_large_wide_result_set, test_long_running_query, test_execute_async__*, test_unicode, test_fetchone, test_fetchall, test_fetchmany_*, test_iterator_api, test_multi_timestamps_arrow, …) plus 4 SEA-parametrized retry tests in tests/e2e/common/retry_test_mixins.py.
Before this PR they ran against SeaDatabricksClient. After this PR they vanish from the CI matrix entirely. This is a silent capability loss across the whole e2e suite, not just within the new code. Combined with F9 (no unit tests for the 511-LOC client.py), coverage of the kernel backend in CI may also fail the 85% threshold gate at .github/workflows/code-coverage.yml:80-81.
Recommend: (a) install the kernel wheel in CI before running e2e, OR (b) document this regression in the PR description so reviewers know use_sea=True e2e coverage in CI is currently 0, AND (c) file a follow-up to land kernel-wheel CI coverage.
Addressed in 24e9a5c — removed the conftest collection hook entirely. With use_sea=True back on the native SEA backend, the existing extra_params=[{}, {"use_sea": True}] parametrized cases run as they did before this PR (no skip needed). The kernel backend is now opt-in via use_kernel=True and doesn't intercept existing e2e parametrizations.
```diff
@@ -0,0 +1,511 @@
+"""``DatabricksClient`` backed by the Rust kernel via PyO3.
```
High severity — flagged by test, devils-advocate
auth_bridge, result_set, and type_mapping each get unit coverage; the largest, most behavior-rich file in the PR has none. Only the e2e file exercises it — and that file skips silently when creds OR the kernel wheel are missing (which is the CI default, see F8).
Concretely uncovered by any non-live test:

- `_CODE_TO_EXCEPTION` mapping (14 entries) — trivially testable with a fake `_kernel.KernelError`; a typo on `"Unauthenticated"` collapses silently to `DatabaseError`.
- `_reraise_kernel_error` attribute forwarding — copies 7 structured fields (`code`, `sql_state`, `error_code`, `vendor_code`, `http_status`, `retryable`, `query_id`). E2E only verifies `code` + `sql_state`.
- `open_session` double-open guard — `raise InterfaceError(...)` not exercised.
- All 5 `InterfaceError("no open session")` guards — `execute_command`, `get_catalogs`, `get_schemas`, `get_tables`, `get_columns`.
- `get_columns` `catalog_name`-required check — e2e always passes a catalog.
- `execute_command` parameters / query_tags `NotSupportedError` — regression risk that accepting `parameters` would silently dispatch to a kernel `Statement` with no `bind_param`.
- `cancel_command` / `close_command` no-handle tolerant path.
- `close_session` cleanup of `_async_handles`, including swallow-on-KernelError.
- `get_query_state` sync-path SUCCEEDED shortcut and Failed-state re-raise.
- `_STATE_TO_COMMAND_STATE` mapping (6 entries).
- `max_download_threads` property.
- `get_tables` `table_types` warning behavior.
All achievable with a MagicMock _kernel module via monkeypatch.setattr. The auth-bridge and result-set tests demonstrate the pattern. The absence of tests/unit/test_kernel_client.py is the single biggest test-quality issue in this PR.
Recommend: add tests/unit/test_kernel_client.py — ~150 LOC covers the items above.
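A sketch of the fake-module pattern the finding recommends (and the reply below adopts): install a stub `databricks_sql_kernel` into `sys.modules` before the backend is imported. The stub's class and attribute shapes here are illustrative.

```python
import sys
import types

import pytest


@pytest.fixture
def fake_kernel(monkeypatch):
    stub = types.ModuleType("databricks_sql_kernel")

    class KernelError(Exception):
        def __init__(self, code, sql_state=None):
            super().__init__(code)
            self.code = code
            self.sql_state = sql_state

    stub.KernelError = KernelError
    monkeypatch.setitem(sys.modules, "databricks_sql_kernel", stub)
    return stub
```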
Addressed in 24e9a5c. Added tests/unit/test_kernel_client.py with 38 cases covering: full _CODE_TO_EXCEPTION (14-entry parametrize), _reraise_kernel_error attribute forwarding, full _STATE_TO_COMMAND_STATE (6-entry parametrize), all 5 no-open-session guards, open_session double-open, parameters and query_tags rejection, get_columns catalog-required, cancel_command / close_command tolerance, get_query_state sync-path SUCCEEDED and Failed-state re-raise, synthetic CommandId UUID shape, and close_session cleanup-on-failure. Uses a fake databricks_sql_kernel module installed into sys.modules so it runs without the Rust extension. 77/77 kernel unit tests pass locally.
```python
federated.add_headers = base.add_headers
kwargs = kernel_auth_kwargs(federated)
assert kwargs == {"auth_type": "pat", "access_token": "dapi-abc"}
```
Medium severity — flagged by language, maintainability, devils-advocate, test, security
```python
federated = TokenFederationProvider.__new__(TokenFederationProvider)
federated.external_provider = base
federated.add_headers = base.add_headers
```

This bypasses `__init__` (which requires `http_client` and normalizes hostname) AND monkey-patches over the real `TokenFederationProvider.add_headers` — the one containing all the token-exchange logic. The assertion `kwargs == {"auth_type": "pat", "access_token": "dapi-abc"}` therefore says nothing about whether the bridge handles the real federation flow. It passes "by accident" because a dapi-… token isn't a JWT, so the real path's `_should_exchange_token` happens to return False.
Any future change to TokenFederationProvider.add_headers (added telemetry, new refresh trigger, eager exchange) silently breaks the bridge while this test stays green. With a real-init federated provider, add_headers writes the federation-exchanged token (not the original PAT) — the bridge would extract that token, not the underlying PAT. The test name test_federation_wrapped_pat_routes_to_kernel_pat overstates what's verified.
tests/unit/test_token_federation.py:31-36 already demonstrates clean construction with a MagicMock http_client.
Recommend:
```python
federated = TokenFederationProvider(
    hostname="https://example.cloud.databricks.com",
    external_provider=base,
    http_client=Mock(),
)
```
Fixed in 37fa544. The test now constructs a real TokenFederationProvider(http_client=Mock()) and exercises its actual add_headers path; for a plain dapi-… PAT _should_exchange_token returns False (not a JWT) so no exchange fires and the mock http_client is never invoked.
```python
def get_catalogs(
    self,
    session_id: SessionId,
    max_rows: int,
```
Medium severity — flagged by architecture, devils-advocate, ops
_async_handles is a single dict per KernelDatabricksClient. Multiple cursors on the same connection share it via self.backend. Mutations happen in execute_command (insert), close_command (pop), cancel_command (get), close_session (iterate-then-clear) — all unlocked.
Two threads issuing async statements concurrently are safe in CPython by GIL accident but not by design. Worse, close_session does for handle in list(self._async_handles.values()): ...; self._async_handles.clear() — a thread mid-execute_command(async_op=True) could add a new handle after the iterator copy is taken but before clear(). That handle is dropped on the floor with no .close() called — kernel-side state leaks.
The connector explicitly documents thread-safety per cursor; this regresses below that bar for shared async tracking.
Recommend: wrap mutations in a threading.RLock, or document non-thread-safety in the class docstring with an explicit warning.
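A sketch of the snapshot-under-lock / close-outside-lock pattern the reply below describes; `_async_handles` maps command guid → kernel handle:

```python
import threading


class KernelClientSketch:
    def __init__(self):
        self._async_handles = {}
        self._async_handles_lock = threading.RLock()

    def close_session(self):
        with self._async_handles_lock:
            # Snapshot, then drop only what was snapshotted; a handle added
            # concurrently stays in the dict for the next sweep instead of
            # being dropped without close().
            handles = list(self._async_handles.items())
            for guid, _ in handles:
                self._async_handles.pop(guid, None)
        for _, handle in handles:
            handle.close()  # close outside the lock
```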
Fixed in 24e9a5c. Added self._async_handles_lock = threading.RLock() and wrapped every read/mutation site (execute_command insert, cancel_command / close_command / get_query_state / get_execution_result reads, close_session iterate+clear). The close_session pattern is now snapshot-under-lock then close-outside-lock — newly-added handles after the snapshot stay in the dict for the next sweep instead of being dropped on the floor.
```python
        None,
    )
    for field in schema
]
```
Medium severity — flagged by agent-compat, maintainability
return str(arrow_type) for unrecognized types produces strings like "fixed_size_binary[16]" or "timestamp[ns, tz=UTC]" in cursor.description[i][1]. Code (and LLM agents) branching on description type strings — a common pattern, e.g., if col_type == "timestamp": — silently miss these cases.
The unit test only verifies pa.null() falls through to "null"; the visually ugly cases aren't covered.
pa.timestamp("us") and pa.timestamp("ns", tz="UTC") both pass is_timestamp and map to "timestamp" (fine), but other parametrized types (fixed-size, dictionary-encoded, union) fall to str(arrow_type).
Recommend: either (a) document the fallback shape in the module docstring so callers know to handle parameterized type strings, (b) lowercase + strip parameters before returning, or (c) extend the explicit list to cover the parametrized variants the kernel actually returns.
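Illustrative only (the PR defers this): option (b) would normalize the `str(arrow_type)` fallback by stripping type parameters before returning.

```python
import pyarrow as pa


def _fallback_type_name(arrow_type: pa.DataType) -> str:
    # "timestamp[ns, tz=UTC]" -> "timestamp"
    # "fixed_size_binary[16]" -> "fixed_size_binary"
    return str(arrow_type).split("[", 1)[0].lower()
```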
Deferred. Documented as a follow-up — the kernel will eventually return a richer Arrow type surface, and the right shape is to expand the explicit table when kernel adds parameterized types we care about, not to lossily lowercase/strip on the connector side.
```python
    return
try:
    handle.close()
except _kernel.KernelError as exc:
```
Medium severity — flagged by language
Signature is (exc: BaseException) -> "Error" with # type: ignore[return-value] on the non-KernelError passthrough. The ignore is masking a real type issue: this function's contract is "either re-raise as Error or pass through."
Every caller in the file does raise _reraise_kernel_error(exc) after an isinstance(exc, KernelError) check (12 callers), so the non-KernelError passthrough is unreachable in practice.
Recommend: delete the dead branch and tighten the signature to (exc: _kernel.KernelError) -> Error (called only after an isinstance check). Or, if the passthrough is intentional, change return-type to Union[BaseException, Error] and drop the ignore. Deleting is simpler.
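A sketch of the post-fix shape the reply below describes; `Error` / `DatabaseError` are the connector's PEP 249 classes, `_CODE_TO_EXCEPTION` the module-level table, `_kernel` the lazily imported extension — all assumed from context here.

```python
_FORWARDED = ("code", "sql_state", "error_code", "vendor_code",
              "http_status", "retryable", "query_id")


def _reraise_kernel_error(exc):  # exc: _kernel.KernelError
    new = _CODE_TO_EXCEPTION.get(exc.code, DatabaseError)(str(exc))
    for attr in _FORWARDED:
        # Plain setattr is safe: none of the PEP 249 classes use __slots__.
        setattr(new, attr, getattr(exc, attr, None))
    return new  # callers do: raise _reraise_kernel_error(exc)
```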
Fixed in 37fa544. Tightened the signature to (exc: _kernel.KernelError) -> Error, dropped the unreachable passthrough branch and the # type: ignore[return-value], and replaced the defensive setattr try/except with a plain setattr(new, attr, getattr(exc, attr, None)) since none of the PEP 249 exception classes use __slots__.
```python
    buffer_size_bytes=cursor.buffer_size_bytes,
)

# ── Metadata ───────────────────────────────────────────────────
```
Medium severity — flagged by maintainability, language
_use_arrow_native_complex_types: Optional[bool] = True is accepted in __init__ but never read. Not passed by session.py::_create_backend either. Drop entirely.
Also: Optional[bool] = True is essentially Union[bool, None] defaulted to True — the Optional is misleading because None and True are both acceptable for what would be a "use default" sentinel. If kept, restrict to bool = True.
Fixed in 37fa544. Removed _use_arrow_native_complex_types from KernelDatabricksClient.__init__ — it was accepted but never read, and session.py::_create_backend doesn't pass it for the kernel branch.
```python
batch size; ``fetchall`` drains the whole stream.
"""

from __future__ import annotations
```
Medium severity — flagged by test, language
The class docstring claims the duck-typed kernel_handle must implement arrow_schema() / fetch_next_batch() / fetch_all_arrow() / close(). The production code never calls fetch_all_arrow — KernelResultSet streams via fetch_next_batch() + a custom _drain. The _FakeKernelHandle test double also doesn't implement fetch_all_arrow.
If the kernel adds a required method (e.g., fetch_next_batch(timeout=...) becomes mandatory), unit tests still pass and the regression lands in e2e — which is silently skipped per F8.
Recommend:

- Drop `fetch_all_arrow` from the docstring contract, OR
- Add a contract test that (when `databricks_sql_kernel` is importable) asserts the duck-typed methods are actually exposed.
Fixed in 37fa544. Dropped `fetch_all_arrow` from the docstring contract — the duck-typed kernel-handle surface is now documented as `arrow_schema()` / `fetch_next_batch()` / `close()`, matching what the code actually calls.
```python
    arraysize: int,
    buffer_size_bytes: int,
):
    schema = kernel_handle.arrow_schema()
```
Low severity — flagged by performance
buffer_size_bytes is accepted by the constructor and forwarded to the base ResultSet, but never consulted by the kernel backend. The kernel currently caps buffer by rows-pulled, not bytes.
Recommend: document the no-op (a comment in the class docstring or constructor) so callers tuning buffer_size_bytes for memory ceilings on Thrift know it doesn't apply on use_sea=True.
Fixed in 37fa544. Added a paragraph to the KernelResultSet class docstring explicitly documenting that buffer_size_bytes is accepted for base-class contract compatibility but is not consulted — kernel currently caps by rows pulled, not bytes.
```python
lz4_compressed=False,
arrow_schema_bytes=None,
)
self._kernel_handle = kernel_handle
```
Low severity — flagged by maintainability
The base ResultSet expects a results_queue with a next_n_rows / remaining_rows / close interface — both ThriftResultSet and SeaResultSet use it. KernelResultSet passes results_queue=None and duplicates _pull_one_batch / _ensure_buffered / _take_buffered / _drain (~80 lines) plus its own fetch overrides.
The docstring even acknowledges the duplication: "Buffer shape mirrors the prior ADBC POC's AdbcResultSet."
Recommend (follow-up): extract BufferedArrowQueue(results_queue) wrapping any handle implementing arrow_schema() / fetch_next_batch() / close(). Both KernelResultSet and any future AdbcResultSet become 15-line constructor-only subclasses, and the base ResultSet.fetchmany_arrow / fetchall_arrow works unchanged.
Deferred — this is a cross-cutting refactor that would touch both kernel and SEA backends. Tracked as a follow-up; appropriate to land alongside the ADBC POC if/when it gets revived.
```python
    buffer_size_bytes=cursor.buffer_size_bytes,
)

def _synthetic_command_id(self) -> CommandId:
```
Low severity — flagged by security
self._auth_kwargs = {"auth_type": "pat", "access_token": <raw_token>} is stored on the KernelDatabricksClient instance for its life. Thrift / native-SEA materialize the token only inside per-request add_headers calls. The kernel client elevates the cleartext token to a long-lived attribute on a connector object — at risk of accidental pickling, debugger dumps, or telemetry capture.
Recommend: clear self._auth_kwargs (or just self._auth_kwargs["access_token"]) immediately after _kernel.Session(...) returns in open_session. Or move it into a closure rather than an instance attribute.
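A sketch of the fix the reply below describes; it assumes `_kernel` is the lazily imported `databricks_sql_kernel` module and `_auth_kwargs` the dict built by the auth bridge:

```python
def open_session(self):  # method body sketch, on KernelDatabricksClient
    try:
        self._session = _kernel.Session(**self._auth_kwargs)
    finally:
        # Kernel owns the credential from here on; drop the cleartext
        # copy from the long-lived connector object.
        self._auth_kwargs.pop("access_token", None)
```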
Fixed in 37fa544. Added a finally block to open_session that pops access_token from self._auth_kwargs after the kernel Session is constructed (or failed). Kernel owns the credential from then on; no cleartext copy stays on the long-lived connector object.
| kernel side.""" | ||
| with conn.cursor() as cur: | ||
| cur.execute("SELECT * FROM range(10000)") | ||
| rows = cur.fetchall() |
There was a problem hiding this comment.
Low severity — flagged by test
Docstring says SELECT * FROM range(10000) "exercises the CloudFetch / multi-batch path on the kernel side". 10000 BIGINT rows is ~80 KB — almost certainly a single inline chunk on a typical warehouse. Existing CloudFetch-aimed tests (tests/e2e/test_driver.py:145-180) use 100 MB / 12.5M rows.
The comment overstates scope. A future reader may believe CloudFetch is covered when it isn't.
Recommend: drop the misleading claim, or scale to range(2_000_000) to actually cross a chunk boundary.
Fixed in 37fa544. Replaced the misleading 'exercises CloudFetch / multi-batch path' claim with a note that it covers end-of-stream drain over multiple fetch_next_batch calls and isn't large enough for CloudFetch — pointing to test_driver for CloudFetch coverage.
```python
from __future__ import annotations

from unittest.mock import MagicMock
```
Low severity — flagged by test
from unittest.mock import MagicMock is imported but never used in this file. Dead import.
Fixed in 37fa544. Replaced MagicMock import with Mock since the federation test now needs the latter; no longer unused.
Code Review Squad — Failed Inline Comments

Could not post inline comments for: F2, F4, F7, F10, F11, F13, F14, F19, F22, F25 — see body below.

F2 — Federated-PAT token refresh is dead — single snapshot at construct time

High severity — flagged by security + devils-advocate
The bridge captures only the first exchanged token and never re-extracts. Long-running kernel sessions outlive the exchanged token's TTL and start failing. The bridge's own docstring acknowledges this failure mode while justifying OAuth rejection: "routing OAuth through PAT would silently break token refresh during long-running sessions." That exact failure mode applies to the federated-PAT case the bridge accepts.

Recommend: either (a) propagate a refresh callback into the kernel …

F4 — …
Cleanup pass on the kernel-backend PR addressing reviewer feedback that doesn't change observable behaviour:

- result_set.py: replace O(M²) `_buffered_rows` with running counter `_buffered_count` maintained by pull/take/drain (perf F6).
- result_set.py: docstring corrections — drop nonexistent `fetch_all_arrow` from kernel-handle contract (F20); document `buffer_size_bytes` as no-op on the kernel backend (F21).
- client.py: tighten `_reraise_kernel_error` signature to `_kernel.KernelError` only; drop dead passthrough branch and the defensive setattr try/except (F17).
- client.py: drop unused `_use_arrow_native_complex_types` kwarg (F18).
- client.py: collapse three `KernelResultSet(...)` construction sites through `_make_result_set` (renamed from `_metadata_result`) (F19).
- client.py: drop `metadata-` prefix from synthetic CommandId; use a plain `uuid.uuid4().hex` so anything reading `cursor.query_id` downstream sees a UUID-shaped string (F14).
- client.py: clear the raw access token from `_auth_kwargs` after the kernel session is constructed — kernel owns the credential from then on, no need to retain a cleartext copy on the connector instance (F24).
- auth_bridge.py: reject bearer tokens containing ASCII control characters at extraction time (defense-in-depth against header injection if a misbehaving HTTP stack ever places the token back into a header without scrubbing) (F25).
- tests/unit/test_kernel_auth_bridge.py: construct a real `TokenFederationProvider(http_client=Mock())` instead of bypassing `__init__` with `__new__` + monkey-patching `add_headers`. Exercises the real federation passthrough path the bridge sees in production (F12). Drop unused `MagicMock` import (F27).
- tests/e2e/test_kernel_backend.py: drop misleading CloudFetch claim on `test_drain_large_range_to_arrow` — 10000 BIGINT rows is ~80 KB, single inline chunk on a typical warehouse (F26).

All 39 existing kernel unit tests pass.

Co-authored-by: Isaac
Major change: route the kernel backend through a new ``use_kernel=True``
connection kwarg instead of repurposing ``use_sea=True``. ``use_sea=True``
once again routes to the native pure-Python SEA backend (no behaviour
change); ``use_kernel=True`` routes to the Rust kernel via PyO3. The
two flags are mutually exclusive.
This addresses the largest reviewer concern from the multi-agent
review: silently hijacking a documented public flag broke OAuth /
federation / parameter-binding callers on ``use_sea=True`` who had no
opt-out. With the new flag, the kernel backend is fully opt-in and
existing ``use_sea=True`` users continue to get the native SEA backend
they signed up for.
Other substantive fixes:
- session.py: restore ``SeaDatabricksClient`` import + routing. Reject
``use_kernel=True`` + ``use_sea=True`` together with a clear
``ValueError``.
- client.py (kernel ``Cursor.columns``): update docstring to flag the
``catalog_name=None`` divergence — kernel requires a catalog,
Thrift / native SEA do not (F13).
- conftest.py: drop the collection-time ``pytest_collection_modifyitems``
hook that was skipping ``extra_params={"use_sea": True}`` cases. With
``use_sea=True`` back on the native SEA backend, those cases run as
they did before this PR (F8).
- kernel/client.py: ``get_tables`` now applies the ``table_types``
filter client-side using ``ResultSetFilter._filter_arrow_table``
(the same helper the native SEA backend uses), wrapped in a tiny
``_StaticArrowHandle`` that flows the filtered table back through
the normal ``KernelResultSet`` path. Replaces the previous
"log a warning and return unfiltered" behaviour (F4).
- kernel/client.py: guard ``_async_handles`` with ``threading.RLock``
so concurrent cursors on the same connection don't race on
submit / close / close-session (F15).
- kernel/result_set.py: ``KernelResultSet.close()`` now drops the
entry from ``backend._async_handles`` so async-submitted statements
don't leave stale references behind (F5).
- kernel/{__init__,client,auth_bridge}.py, tests/e2e/test_kernel_backend.py:
update docstrings, error messages, and the e2e fixture to refer to
``use_kernel=True`` instead of ``use_sea=True``.
- client.py (``Connection`` docstring): document the new
``use_kernel`` kwarg + its Phase-1 limitations.
New tests:
- tests/unit/test_kernel_client.py (38 cases): cover the 14-entry
``_CODE_TO_EXCEPTION`` table, ``_reraise_kernel_error`` attribute
forwarding, the 6-entry ``_STATE_TO_COMMAND_STATE`` table, the
no-open-session guards on every method, ``open_session`` double-open,
``parameters`` / ``query_tags`` rejection, ``get_columns``'
catalog-required check, ``cancel_command`` / ``close_command``
no-handle tolerance, ``get_query_state`` sync-path SUCCEEDED, the
Failed-state re-raise, the synthetic-command-id UUID shape, and
``close_session`` cleanup even when per-handle close errors fire.
Uses a fake ``databricks_sql_kernel`` module installed into
``sys.modules`` so the test runs with no Rust extension dependency
(F9).
77/77 kernel unit tests pass.
Co-authored-by: Isaac
Code-review responses — summary-only findings

Replies to the findings that landed in the summary comment (couldn't be posted as inline replies because the cited line was outside the diff hunk):
77/77 kernel unit tests passing locally on the new branch tip.
Summary
Phase 2 of the PySQL × kernel integration plan (design doc). Adds a new opt-in `use_kernel=True` connection flag that routes through a new `backend/kernel/` module delegating to the Rust kernel via the `databricks_sql_kernel` PyO3 extension (kernel PR #13).

This replaces the previous ADBC POC branches (`backend/adbc/` and `backend/adbc_dm/` on `adbc-rust-backend-via-dm`, which were never merged) with a clean port that uses the kernel's v0 Databricks-native API directly instead of layering through ADBC.

Flag semantics
| Flag | Backend |
| --- | --- |
| `use_kernel=True` | `KernelDatabricksClient` (Rust kernel via PyO3) |
| `use_sea=True` | `SeaDatabricksClient` (pure-Python SEA) |
| neither (default) | `ThriftDatabricksClient` (Thrift) |

`use_kernel=True` and `use_sea=True` are mutually exclusive — passing both raises `ValueError`. Existing `use_sea=True` callers are unaffected.

What `use_kernel=True` does

New module layout
- `src/databricks/sql/backend/kernel/client.py` — `KernelDatabricksClient(DatabricksClient)`. Lazy-imports `databricks_sql_kernel` so a connector install without the kernel wheel doesn't fail at startup — only `use_kernel=True` surfaces the missing-extra ImportError.
- `src/databricks/sql/backend/kernel/auth_bridge.py` — translates the connector's `AuthProvider` to kernel `Session` auth kwargs. PAT (including `TokenFederationProvider`-wrapped PAT — every provider is wrapped, so the naive `isinstance(AccessTokenAuthProvider)` check has to look through the wrapper) routes through `auth_type='pat'`. Anything else raises `NotSupportedError` until the kernel exposes a full external-auth surface.
- `src/databricks/sql/backend/kernel/result_set.py` — `KernelResultSet(ResultSet)`. Duck-typed over the kernel's `ExecutedStatement` (sync exec) and `ResultStream` (metadata + async `await_result`); both expose `arrow_schema / fetch_next_batch / close`. FIFO batch buffer for `fetchmany(n)` semantics, with O(1) buffered-row accounting via a running counter.
- `src/databricks/sql/backend/kernel/type_mapping.py` — Arrow → PEP 249 description-string mapper.

Error mapping
`KernelError.code` → PEP 249 exception class, in a single table in `client.py`. Structured fields (`sql_state`, `error_code`, `query_id`, `http_status`, `retryable`, `vendor_code`) are copied onto the re-raised exception so callers can branch on `err.code` / `err.sql_state` directly. Live e2e verified: bad SQL on `use_kernel=True` surfaces as `DatabaseError` (`code='SqlError'`, `sql_state='42P01'`).

Packaging
Without the kernel wheel, `use_kernel=True` raises a pointed `ImportError` telling the user to `pip install databricks-sql-kernel` (or build locally with maturin).

Local dev: `cd databricks-sql-kernel/pyo3 && maturin develop --release` into the connector's venv. (The `[kernel]` extra is intentionally not declared in `pyproject.toml` yet — `databricks-sql-kernel` isn't on PyPI, and declaring an unpublished dep breaks `poetry lock` for every CI job. The extra will land once the wheel is on PyPI.)

Known gaps (Phase 1)

| Gap | Behaviour on `use_kernel=True` |
| --- | --- |
| Parameter binding | `execute_command(parameters=[...])` raises `NotSupportedError`; PyO3 `Statement.bind_param` lands in a small follow-up PR on the kernel repo |
| `query_tags` | `NotSupportedError` (pending kernel `statement_conf`) |
| `get_columns(catalog_name=None)` | `ProgrammingError` (kernel's `SHOW COLUMNS` cannot span catalogs); documented in the `Cursor.columns()` docstring |
| Non-PAT auth | `NotSupportedError` from the auth bridge |
| Telemetry | `execution_result` / `retry_count` / chunk-level latency under-report on `use_kernel=True` |
| HTTP knobs | `ssl_options` / `http_headers` / `http_client` are accepted-and-ignored |

Code review feedback addressed in this revision
Multi-reviewer (architecture, security, ops, performance, test, maintainability, agent-compat, language, devil's advocate) review surfaced several issues; the highest-impact ones are addressed in commits `37fa5446` (mechanical) and `24e9a5c2` (substantive):

- Flag moved from `use_sea=True` to a dedicated `use_kernel=True`; native SEA routing is unchanged. (Was the largest reviewer concern.)
- `table_types` filter in `get_tables` applied client-side using the SEA backend's `_filter_arrow_table` helper, replacing the previous "log a warning and return unfiltered" behaviour.
- O(1) buffered-row accounting in `KernelResultSet` with a running counter (`_buffered_count`).
- `KernelResultSet.close()` now drops the entry from `backend._async_handles` to avoid stale references.
- `threading.RLock` around `_async_handles` mutations / reads for concurrent-cursor safety.
- Synthetic `CommandId`s use plain `uuid.uuid4().hex` (no `metadata-` prefix) so `cursor.query_id` stays parseable downstream.
- The raw access token is cleared from `KernelDatabricksClient._auth_kwargs` after the kernel `Session` is constructed.
- Dropped the unused `_use_arrow_native_complex_types` kwarg; tightened `_reraise_kernel_error` signature and dropped dead branch; collapsed three `KernelResultSet(...)` construction sites through one `_make_result_set` helper.
- `Cursor.columns()` docstring now documents the `catalog_name=None` divergence on `use_kernel=True`.
- Auth-bridge federation test constructs a real `TokenFederationProvider(http_client=Mock())` instead of bypassing `__init__`.

Test plan
- `tests/unit/test_kernel_client.py` (new in this revision, 38 cases) — covers `_CODE_TO_EXCEPTION` (14 entries), `_reraise_kernel_error` attribute forwarding, `_STATE_TO_COMMAND_STATE` (6 entries), all no-open-session guards, `open_session` double-open, parameters / query_tags rejection, `get_columns` catalog-required, `cancel_command` / `close_command` tolerance, `get_query_state` sync-path SUCCEEDED and Failed-state re-raise, synthetic CommandId shape, `close_session` cleanup on partial close failures. Uses a fake `databricks_sql_kernel` module so the test runs with no Rust extension dependency.
- `tests/unit/test_kernel_auth_bridge.py` — PAT, federation-wrapped PAT (now via real `TokenFederationProvider(http_client=Mock())`), non-PAT rejection paths.
- `tests/unit/test_kernel_type_mapping.py` — Arrow type mapping per type, description-tuple shape, fallback to `str()` for unknowns.
- `tests/unit/test_kernel_result_set.py` — buffer semantics, `fetchmany` slicing within batch + across batch boundaries, idempotent close, close() swallowing handle-close failures, empty stream.
- Full unit suite passes; the one pre-existing failure (`test_useragent_header` — agent detection adds `agent/claude-code` in this env, fails on `main` too) is unrelated to this change.
- Live e2e against dogfood with `use_kernel=True` (PAT): SELECT 1, `SELECT * FROM range(10000)`, `fetchmany` pacing, `fetchall_arrow`, all four metadata calls (catalogs / schemas / tables / columns), `session_configuration={'ANSI_MODE': 'false'}` round-trips, bad SQL surfaces as DatabaseError with `code='SqlError'` and `sql_state='42P01'`.

This pull request and its description were written by Isaac.