[python] Support time travel by timestamp and watermark in TimeTravelUtil#7899
Conversation
df56ffc to
c149409
Compare
c149409 to
1eac751
Compare
Add scan.timestamp-millis, scan.timestamp, and scan.watermark options to CoreOptions and extend TimeTravelUtil to resolve snapshots using these parameters, aligning with Java-side time travel capabilities. - Add SCAN_TIMESTAMP_MILLIS, SCAN_TIMESTAMP, SCAN_WATERMARK config options and accessor methods to CoreOptions - Extend TimeTravelUtil.try_travel_to_snapshot() to handle timestamp and watermark based time travel - Add SnapshotManager.later_or_equal_watermark() for watermark lookup - Parse scan.timestamp using local timezone (matching Java behavior) - Unify all point-in-time scan options in TableScan through TimeTravelUtil, ensuring mutual-exclusion validation and consistent schema/data resolution - Reject conflicting incremental-between-timestamp and point-in-time scan options
1eac751 to
a76c0ba
Compare
JingsongLi
left a comment
There was a problem hiding this comment.
Review: [python] Support time travel by timestamp and watermark in TimeTravelUtil
Overall well-structured. The refactoring of table_scan.py to delegate all time-travel logic to TimeTravelUtil is a clean separation of concerns.
Correctness
-
later_or_equal_watermark binary search relies on monotonically non-decreasing watermarks. If a snapshot has watermark = None, the code does
earliest = mid + 1(treats it as "too small"). This is correct under Paimon's guarantee, but a brief comment noting the invariant assumption would help future readers. -
Gap handling degrades binary search to O(n) in worst case. When
get_snapshot_by_id(mid)returns None, the linear scan from mid+1 to latest could be expensive if many consecutive snapshot IDs have been expired. Consistent with existingearlier_or_equal_time_millsbut worth noting. -
Dead-code defensiveness in table_scan.py. The check "if snapshot is None: raise ValueError(...)" after
try_travel_to_snapshotis unreachable whenhas_time_travelis True, because all paths inside either return a valid snapshot or raise. A comment would clarify intent.
Minor / Style
-
Error message formatting --
"point-in-time scan options: %s" % SCAN_KEYSrenders the list with Python repr syntax. Consider", ".join(SCAN_KEYS)for cleaner output. -
No direct unit test for binary search edge cases -- The tests use
_StubSnapshotManagerwith simple linear implementation. Consider a focused test forlater_or_equal_watermarkwith gaps in snapshot IDs and None watermarks.
Design
- Good alignment with Java side. The
_parse_timestamp_to_millisusing naive datetime is consistent with Java'sTimeZone.getDefault()semantics.
Nice work overall -- test cases are thorough and code is readable.
|
+1 |
Purpose
Currently pypaimon's TimeTravelUtil only supports time travel by scan.tag-name and scan.snapshot-id. This PR adds support for three additional time travel modes that are available on the Java side: