fix: stop FDv2DataSource.Conditions from leaking on healthy primary by kinyoklion · Pull Request #163 · launchdarkly/java-core

kinyoklion · 2026-05-15T21:07:04Z

Summary

FDv2DataSource.Conditions.getFuture() returned the same shared CompletableFuture<Object> instance to every caller. The run loop does CompletableFuture.anyOf(getFuture(), synchronizer.next()).get() on every iteration, which attaches a new OrRelay Completion to the shared future's stack each time. CompletableFuture has no deregister path for the loser of an anyOf race, so those Completion nodes stay on the stack until the shared future itself completes.

On a healthy primary streaming ChangeSets without ever firing fallback/recovery, the shared future never completes — the stack grows monotonically for the synchronizer's entire tenure (effectively the SDK's uptime on a stable server).

Per-iteration cost: ~200 B (OrRelay + anyOf result CF + chain references).
At 10 ChangeSets/sec sustained: ~150 MB/day per active synchronizer.

The fix

A single permanent whenComplete listener on the underlying aggregate fans out completion to every fresh future handed out by getFuture(). Pending fresh futures are tracked via WeakReference, so a fresh future whose only strong references were the caller's local variables (typical lifetime: one loop iteration) becomes garbage-collectable once that iteration ends. Pending entries whose referent has been collected are pruned opportunistically on each getFuture() call and on close().

Conditions is now package-private (was private) so direct unit tests can reach it. A test-only pendingSize() helper is added.

Test plan

Adds FDv2DataSourceConditionsAggregateTest with five tests:

getFutureReturnsDistinctInstancesPerCall — bug-prover. Fails on the pre-fix shared-instance behavior, passes after the fix.
getFutureReturnsDistinctInstancesEvenWithNoConditions — bug-prover. Covers the empty-conditions case (single-synchronizer configuration), which is exactly where per-iteration accumulation would be most damaging.
allFreshFuturesCompleteWhenAggregateFires — verifies fan-out via the single permanent listener actually delivers to multiple fresh futures handed out before the aggregate fires.
getFutureAfterAggregateFiresReturnsCompletedFuture — verifies the fast path: callers arriving after completion get an already-completed future synchronously.
pendingListDoesNotGrowUnboundedlyWhenFreshFuturesAreDropped — 10k-iteration soak test that simulates the run-loop pattern (race a fresh future against a fast-resolving sibling, drop the result) and asserts the pending list stays bounded via GC + opportunistic pruning. Caveat in the test docstring about System.gc() not being guaranteed — if it ever flakes on CI we can migrate to -XX:+UseSerialGC or relax the ceiling.

Verified bug-proving discipline: the two distinctness tests fail on the pre-fix shared-instance behavior and pass after the fix. The full server-sdk test suite (1857 tests across 109 classes) is clean.

Context

This was identified during a multi-agent review of the analogous cpp-sdks PR (launchdarkly/cpp-sdks#531), which mirrors this Java implementation's Conditions design. The cpp version has the same structural leak; this Java fix shape is what was prototyped there. Filing here first since the runtime impact on a long-running JVM-based server SDK is more pronounced.

Note

Medium Risk
Touches the FDv2 synchronizer condition-aggregation logic used in the main run loop; mistakes could cause missed fallback/recovery signals or incorrect exceptional completion behavior, though changes are localized and covered by new unit tests.

Overview
Prevents a long-lived memory leak in FDv2DataSource.Conditions by changing getFuture() to return a fresh CompletableFuture per call until the underlying condition aggregate completes, rather than returning the same shared pending future each iteration.

Adds a single whenComplete fan-out from the aggregate to complete all outstanding per-call futures (and to propagate exceptional completion), tracks pending futures in a WeakHashMap-backed set for GC cleanup, and makes Conditions package-private to allow direct testing.

Introduces FDv2DataSourceConditionsAggregateTest to assert per-call distinctness, correct completion fan-out, and correct behavior on exceptional and post-completion paths.

^{Reviewed by Cursor Bugbot for commit a6c7c36. Bugbot is set up for automated code reviews on this repo. Configure here.}

FDv2DataSource's run loop calls CompletableFuture.anyOf(conditions.getFuture(), synchronizer.next()).get() on every iteration. Before this change, getFuture() returned the same shared CompletableFuture<Object> instance to every caller. Each anyOf call attaches an OrRelay Completion node to the shared instance's stack; CompletableFuture has no deregister path for the loser of a race, so the OrRelay stays on the stack until the shared future completes. The shared future only completes when fallback or recovery fires. On a healthy primary streaming ChangeSets, fallback is never armed and recovery is suppressed (only-available-synchronizer / single-prime configurations). The future never completes; the stack grows monotonically for the synchronizer's full tenure -- effectively the SDK's uptime on a healthy server. Per-iteration cost ~200 B: an OrRelay Completion plus the anyOf result CompletableFuture plus the chain references back to the inputs. At 10 ChangeSets/sec sustained that is ~150 MB/day per active synchronizer. The fix: a single permanent whenComplete listener on the underlying aggregate fans out completion to every fresh future handed out by getFuture(). Pending fresh futures are tracked via WeakReference, so a fresh future whose only strong references were the caller's local variables (typical lifetime: one loop iteration) becomes garbage-collectable once that iteration ends. Pending entries whose referent has been collected are pruned opportunistically on each getFuture() call and on close(). Conditions is now package-private rather than private so the new direct unit tests can reach it. Adds a test-only pendingSize() helper. Verified bug-proving discipline: two of the new tests (getFutureReturnsDistinctInstancesPerCall, getFutureReturnsDistinctInstancesEvenWithNoConditions) fail on the pre-fix shared-instance behavior and pass after the fix. Full server-sdk test suite (1857 tests) is clean.

The pendingListDoesNotGrowUnboundedlyWhenFreshFuturesAreDropped test needed System.gc() + Thread.sleep to encourage reclamation, which is brittle. The two distinctness tests are sufficient bug-provers for the shared-instance behavior they fail on, so drop the soak test and the test-only pendingSize() helper that supported it.

cursor

Cursor Bugbot has reviewed your changes using default mode and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 536176e. Configure here.}

Bugbot pointed out that completedValue (volatile Object) used null as a sentinel for "not yet fired", and the whenComplete listener also stored null on exceptional completion. After exceptional completion a subsequent getFuture() entered the lock, saw pending == null (drained by the listener), and returned CompletableFuture.completedFuture(null) -- silently converting the exception into a null result. The run loop would then NPE on res.getClass(). Replace the null-sentinel pattern with an explicit volatile boolean isFired plus separate firedResult/firedThrowable fields. A new makeCompletedFuture() helper builds a fresh completed future mirroring whichever terminal state the aggregate reached. Adds a bug-proving test (getFutureFailsExceptionallyWhenAggregateFails- Exceptionally) that drives a manually-controlled condition to completeExceptionally and asserts both pre- and post-firing getFuture() results throw ExecutionException with the original cause. Verified the test fails on the pre-fix null-sentinel behavior ("expected ExecutionException, got normal completion").

kinyoklion · 2026-05-15T21:43:57Z

launchdarkly/cpp-sdks#531 (comment)

beekld

I really like this solution to the problem, but have a few comments about how it might be simplified / improved.

(Unfortunately, this will be more complicated in C++, since weak references aren't really built into the language.)

beekld · 2026-05-15T22:27:57Z

+         * true. Written under {@code lock}; readable without the lock once
+         * {@code isFired} has been observed true (volatile happens-before).
+         */
+        private Object firedResult;


Do we really need firedResult and firedThrowable? It seems like if isFired = true, then we can read aggregate.getNow(null) to get the result/throwable. It's safe to call that without a lock if the future is complete, and it means we don't have to worry about getting the order of setting them correct.

Actually, wouldn't it even be safe for makeCompletedFuture to just return aggregate if isFired == true? It won't grow unbounded, because whatever continuations get added to it will be fired and removed immediately.

Done in a6c7c36. Returning aggregate directly when aggregate.isDone() == true -- continuations registered on an already-completed CompletableFuture fire synchronously at registration and cleanStack removes them immediately, so the original per-iteration accumulation can't re-occur in the post-completion path. Dropped isFired, firedResult, firedThrowable, and makeCompletedFuture().

beekld · 2026-05-15T22:35:14Z

+         * (and all pending entries have been drained). Mutated only under
+         * {@code lock}.
+         */
+        private List<WeakReference<CompletableFuture<Object>>> pending = new ArrayList<>();


I guess it's an improvement to store an empty WeakReference<CompletableFuture<Object>> instead of an unresolved CompletableFuture<Object>, but won't this list still grow unbounded? Maybe we should use a WeakHashMap instead, so the collection itself gets automatically pruned?

I can see if that would work better.

Right now it doesn't grow unbounded because it prunes the old entries when its adding the new entry.

while (it.hasNext()) { if (it.next().get() == null) { it.remove(); } }

Ah, missed that. Either way then.

Switched to a WeakHashMap-backed Set in a6c7c36 -- entries auto-prune when GC reclaims their keys, no more manual loop. Listener snapshots new ArrayList<>(pending) under the lock so entries surviving GC to that point get strong refs and are completed.

@beekld

Two simplifications suggested by @beekld: 1. Return aggregate directly when it has already completed. Continuations registered on an already-completed CompletableFuture fire synchronously at registration time and are removed from the stack immediately by cleanStack, so the original leak (per-iteration anyOf accumulation on a never-completing aggregate) cannot re-occur in the post-completion path. Drops the isFired flag, firedResult/firedThrowable fields, and the makeCompletedFuture helper. 2. Replace the WeakReference list with a WeakHashMap-backed Set. Fresh pending futures get GC'd automatically when the caller's loop iteration drops its strong reference, with no manual prune loop in getFuture(). All five aggregate tests still pass; full server-sdk suite (1857 tests) still passes. Verified bug-proving discipline: temporarily reverting getFuture() to the pre-fix shared-instance behavior makes the two distinctness tests + the exceptional-path test go red, exactly as before.

kinyoklion requested a review from a team as a code owner May 15, 2026 21:07

cursor Bot reviewed May 15, 2026

View reviewed changes

Comment thread lib/sdk/server/src/main/java/com/launchdarkly/sdk/server/FDv2DataSource.java Outdated

kinyoklion marked this pull request as draft May 15, 2026 21:23

kinyoklion marked this pull request as ready for review May 15, 2026 21:43

beekld requested changes May 15, 2026

View reviewed changes

beekld approved these changes May 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: stop FDv2DataSource.Conditions from leaking on healthy primary#163

fix: stop FDv2DataSource.Conditions from leaking on healthy primary#163
kinyoklion wants to merge 4 commits into
mainfrom
rlamb/fdv2-conditions-leak-fix

kinyoklion commented May 15, 2026 •

edited by cursor Bot

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

kinyoklion commented May 15, 2026

Uh oh!

beekld left a comment

Uh oh!

beekld May 15, 2026

Uh oh!

beekld May 15, 2026

Uh oh!

kinyoklion May 15, 2026

Uh oh!

beekld May 15, 2026

Uh oh!

kinyoklion May 15, 2026

Uh oh!

beekld May 15, 2026

Uh oh!

kinyoklion May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kinyoklion commented May 15, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

The fix

Test plan

Context

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kinyoklion commented May 15, 2026

Uh oh!

beekld left a comment

Choose a reason for hiding this comment

Uh oh!

beekld May 15, 2026

Choose a reason for hiding this comment

Uh oh!

beekld May 15, 2026

Choose a reason for hiding this comment

Uh oh!

kinyoklion May 15, 2026

Choose a reason for hiding this comment

Uh oh!

beekld May 15, 2026

Choose a reason for hiding this comment

Uh oh!

kinyoklion May 15, 2026

Choose a reason for hiding this comment

Uh oh!

beekld May 15, 2026

Choose a reason for hiding this comment

Uh oh!

kinyoklion May 15, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kinyoklion commented May 15, 2026 •

edited by cursor Bot

Loading