Skip to content

Wait for MARKER_RECORDED to fire version callback on replay#2821

Open
eamsden wants to merge 16 commits into
masterfrom
eamsden/interleaved-UpdateCompleted-getversion
Open

Wait for MARKER_RECORDED to fire version callback on replay#2821
eamsden wants to merge 16 commits into
masterfrom
eamsden/interleaved-UpdateCompleted-getversion

Conversation

@eamsden
Copy link
Copy Markdown
Contributor

@eamsden eamsden commented Mar 30, 2026

What was changed

  • Changed VersionStateMachine replay behavior behind a new SDK flag VERSION_WAIT_FOR_MARKER so getVersion() no longer resumes workflow code when the fake RECORD_MARKER command is created during replay.
  • Replay now waits until the real MARKER_RECORDED event is matched before firing the version callback.

Why?

Replay was resuming getVersion() too early relative to update completion protocol handling. In the interleaved update history from issue #2796, that allowed replay to queue the second version marker before the update completion work had been matched, which produced a replay-only nondeterminism error:

[TMPRL1100] getVersion call before the existing version marker event

Delaying the replay callback until the actual marker event is matched makes version replay ordering consistent with replayed side effects and fixes replay for the recorded history.

Checklist

  1. Closes Replay fails: UpdateCompleted events between version markers breaks replay #2796 but does not fix replay of existing histories.

  2. How was this tested:

  • ./gradlew --offline :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingRemoveTest" --tests "io.temporal.workflow.versionTests.GetVersionRemovedInReplayTest" --tests "io.temporal.workflow.versionTests.GetVersionWithoutCommandEventTest" --tests "io.temporal.workflow.versionTests.GetVersionAndTimerTest" --tests "io.temporal.workflow.versionTests.GetVersionMultipleCallsTest" --tests "io.temporal.workflow.versionTests.GetVersionInSignalTest" --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingTest" --tests "io.temporal.workflow.versionTests.GetVersionInterleavedUpdateReplayTest" --tests "io.temporal.internal.replay.GetVersionInterleavedUpdateReplayTaskHandlerTest"
  • ./gradlew --offline :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionInterleavedUpdateReplayTest" --tests "io.temporal.internal.replay.GetVersionInterleavedUpdateReplayTaskHandlerTest"
  • Verified that the original repro history now replays successfully through both WorkflowReplayer and the lower-level direct-query replay task handler.
  1. Any docs updates needed?
  • No docs updates needed. This is an internal replay-ordering fix with regression tests.

@eamsden eamsden changed the title Eamsden/interleaved update completed getversion Wait for MARKER_RECORDED to fire version callback on replay Mar 30, 2026
@eamsden eamsden marked this pull request as ready for review March 30, 2026 17:51
@eamsden eamsden requested a review from a team as a code owner March 30, 2026 17:51
@eamsden eamsden force-pushed the eamsden/interleaved-UpdateCompleted-getversion branch from ff59a05 to 43ab9df Compare March 31, 2026 17:08
* Changes replay behavior of GetVersion to wait for the matching marker event before executing
* the callback.
*/
VERSION_WAIT_FOR_MARKER(5),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're getting more of these. We should think ahead at how to make sure this remains maintainable in the future.

Would you mind adding some comments on this flag and others above, indicating:

  • Which release introduced support for the flag (assume next patch release for new flags)
  • Which release turned it on (or "pending" if not yet turned on)
  • Ticket to the bug that the flag resolves.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have done this.

@eamsden eamsden force-pushed the eamsden/interleaved-UpdateCompleted-getversion branch 2 times, most recently from 31e4195 to 316a40b Compare April 1, 2026 21:42
@eamsden eamsden marked this pull request as draft April 1, 2026 21:52
@eamsden
Copy link
Copy Markdown
Contributor Author

eamsden commented Apr 1, 2026

Converting to draft until at least the shut-off of SKIP_YIELD_ON_VERSION is reverted.

eamsden added 12 commits June 3, 2026 16:24
Add a Java replay test that mirrors the Kotlin GreetingWorkflow sample and replays the provided workflow history fixture.

Assert that replay fails with the embedded TMPRL1100 NonDeterministicException message so the reproducer stays pinned to the reported failure mode.
Run a constrained experiment for the interleaved update replay bug by changing VersionStateMachine replay timing only for histories with SKIP_YIELD_ON_VERSION set. In that path, getVersion still returns synchronously, but the replay callback is no longer fired at fake RECORD_MARKER command creation and is instead delayed until the real MARKER_RECORDED event is matched.

The goal of the experiment was to verify that flagged histories do not depend on the current early replay callback or its extra eventLoop scheduling. The legacy interleaved update repro history does not have SKIP_YIELD_ON_VERSION, so it continues to fail unchanged and serves as the control case.

Verified with:
./gradlew --offline :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingRemoveTest" --tests "io.temporal.workflow.versionTests.GetVersionRemovedInReplayTest" --tests "io.temporal.workflow.versionTests.GetVersionWithoutCommandEventTest" --tests "io.temporal.workflow.versionTests.GetVersionAndTimerTest" --tests "io.temporal.workflow.versionTests.GetVersionMultipleCallsTest" --tests "io.temporal.workflow.versionTests.GetVersionInSignalTest" --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingTest" --tests "io.temporal.workflow.versionTests.GetVersionInterleavedUpdateReplayTest" --tests "io.temporal.internal.replay.GetVersionInterleavedUpdateReplayTaskHandlerTest"
./gradlew --offline :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionRemovedInReplayTest" --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingRemoveTest" --tests "io.temporal.workflow.versionTests.GetVersionMultipleCallsTest" --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingTest"
Change VersionStateMachine replay semantics so getVersion no longer resumes workflow code when the fake RECORD_MARKER command is created. Replay now waits until the real MARKER_RECORDED event is matched before firing the version callback, which makes version-marker ordering consistent with replayed side effects.

This fixes the interleaved update replay bug reproduced by testGetVersionInterleavedUpdateReplayHistory.json. That history previously failed replay with [TMPRL1100] because the second getVersion callback ran ahead of update completion protocol handling. After this change, the same recorded history replays successfully through both WorkflowReplayer and the lower-level direct-query replay task handler.

The earlier flag-gated experiment showed that delaying the callback was already safe for histories with SKIP_YIELD_ON_VERSION. This commit removes that temporary gating and applies the same replay ordering to all histories.

Verified with:
./gradlew --offline :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingRemoveTest" --tests "io.temporal.workflow.versionTests.GetVersionRemovedInReplayTest" --tests "io.temporal.workflow.versionTests.GetVersionWithoutCommandEventTest" --tests "io.temporal.workflow.versionTests.GetVersionAndTimerTest" --tests "io.temporal.workflow.versionTests.GetVersionMultipleCallsTest" --tests "io.temporal.workflow.versionTests.GetVersionInSignalTest" --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingTest" --tests "io.temporal.workflow.versionTests.GetVersionInterleavedUpdateReplayTest" --tests "io.temporal.internal.replay.GetVersionInterleavedUpdateReplayTaskHandlerTest"
./gradlew --offline :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionInterleavedUpdateReplayTest" --tests "io.temporal.internal.replay.GetVersionInterleavedUpdateReplayTaskHandlerTest"
have elected not to fix that specific history. Make sure there is a
run-then-replay regression test which confirms new histories won't be
broken that way.
@eamsden eamsden force-pushed the eamsden/interleaved-UpdateCompleted-getversion branch from 316a40b to 648d583 Compare June 3, 2026 21:31
@eamsden eamsden marked this pull request as ready for review June 3, 2026 21:33
@eamsden eamsden enabled auto-merge (squash) June 3, 2026 22:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Replay fails: UpdateCompleted events between version markers breaks replay

2 participants