fix: terminate infinite retry loop in LoadSkillResourceTool on RESOURCE_NOT_FOUND by Raman369AI · Pull Request #5 · Raman369AI/adk-python

Raman369AI · 2026-05-10T04:08:32Z

Summary

Closes a latent defect in SkillToolset that lets the LLM enter an unbounded retry loop when load_skill_resource returns RESOURCE_NOT_FOUND. Because the only existing backstop is RunConfig.max_llm_calls (default 500), a hallucinated path can quietly burn the entire per-invocation call budget before the framework intervenes — and max_llm_calls is a global cap on legitimate reasoning, not a defense against this specific failure mode.

This fix adds invocation-scoped termination to the tool itself, plus a complementary system-prompt rule, so the framework no longer relies on a perfect upstream prompt to avoid unbounded loops on a known error path.

Why this matters

This is a structural problem, not an edge case:

The L2 load_skill response intentionally omits a manifest of available files — that is the agentskills.io progressive-disclosure contract, and it is correct for token economy. But it means the LLM must infer paths from prose instructions inside SKILL.md. Inferred paths are routinely wrong, even with strong models.
RESOURCE_NOT_FOUND is returned as a structured soft string. From the model's perspective it looks transient and recoverable, exactly like a flaky network error — so it retries. There is no terminal signal in the current implementation.
The LLM hallucinates a different path on every retry. In live testing the model never repeated the same path twice — it generated a new plausible variant each time. A per-path guard is therefore insufficient; the termination must be invocation-wide.
Nothing in SkillToolset distinguishes "first miss" from "500th miss". The loop terminates only when max_llm_calls (default 500) is hit.
Confounding scope ambiguity: the default system prompt does not draw a boundary between skill-bundled files (the legitimate target of load_skill_resource) and runtime user inputs (e.g., a PDF the user is processing). A model asked to analyze a runtime document will sometimes route it through load_skill_resource, hit RESOURCE_NOT_FOUND, and loop — even though the file was never a skill resource.

Live trace evidence

Captured via GET /debug/trace/session/{session_id} against adk web with a patched build:

SPAN: execute_tool load_skill_resource
  args:       {'file_path': 'references/reference_doc.md', 'skill_name': 'document-classifier'}
  error_code: RESOURCE_NOT_FOUND
  error:      Resource 'references/reference_doc.md' not found in skill 'document-classifier'.

SPAN: execute_tool load_skill_resource
  args:       {'skill_name': 'document-classifier', 'file_path': 'references/Document1.md'}
  error_code: RESOURCE_NOT_FOUND_FATAL
  error:      Resource 'references/Document1.md' not found in skill 'document-classifier'.
              This is resource lookup failure #2 this invocation. Do not retry any path
              — report the error to the user and stop.

The model tried references/reference_doc.md first, then hallucinated a completely different path (references/Document1.md) on the retry. The counter-based guard caught the second failure and returned RESOURCE_NOT_FOUND_FATAL, terminating the loop. A per-path guard would have missed this entirely.

What changed

Two layers in one file (src/google/adk/tools/skill_toolset.py), plus tests:

1. Code: invocation-scoped failure counter in `LoadSkillResourceTool.run_async`

A total RESOURCE_NOT_FOUND count (across all paths) is tracked on tool_context.state under the key:

temp:_adk_skill_resource_not_found_count_<invocation_id>

The temp: prefix uses ADK's existing convention so the value is trimmed from the persisted event delta and never reaches durable session storage.
The <invocation_id> suffix ensures correctness on in-memory session backends, where temp: keys are added to session.state and are not auto-cleared between invocations. Without the suffix, a failure in invocation A would spuriously escalate the first attempt in invocation B.

Behavior:

First RESOURCE_NOT_FOUND within an invocation (any path) → returns RESOURCE_NOT_FOUND (unchanged).
Any subsequent RESOURCE_NOT_FOUND within the same invocation → returns the new RESOURCE_NOT_FOUND_FATAL error code with an explicit "do not retry — report the error and stop" message. The counter includes the failure number so the LLM has unambiguous context.

The guard is invocation-scoped, path-agnostic (catches hallucinated variants), and adds no overhead on the success path.

2. Prompt: two additions to `_DEFAULT_SKILL_SYSTEM_INSTRUCTION`

No-retry rule: "If load_skill_resource returns any error, do not retry the same path. Report the error to the user and stop."
Scope boundary added to rule 3: clarifies that load_skill_resource is only for skill-bundled files in references/, assets/, or scripts/ — not for runtime user input. This addresses the secondary failure mode where the model confuses a runtime document with a skill resource.

Why both layers

Defense-in-depth. Code-only termination produces confusing downstream behavior when the LLM has no semantic reason to stop. Prompt-only termination relies on the LLM following the rule, which imperfect upstream prompts can override. Together they are robust.

Considered and rejected

Alternative	Why not
Per-path retry guard	LLM hallucinates a different path on every retry — a per-path list is always empty at each attempt and never escalates
Tighten or default-lower `max_llm_calls`	Caps the agent's overall reasoning budget; punishes legitimate long-running tasks; does not address the specific defect
Symptomatic `after_tool_callback` workaround	Pushes the fix onto every user of `SkillToolset`; the framework still ships with the loop
Add a full `available_resources` manifest to the L2 `load_skill` response	Defeats the lazy-loading / token-saving design that the Skills spec is built around
Introduce a new `list_skill_resources` tool	Violates the L1→L2→L3 progressive disclosure contract from agentskills.io
Include available paths on the fatal response	Re-introduces the manifest cost; contradicts the "stop" semantic the fatal code is meant to enforce

Test plan

New tests in tests/unittests/tools/test_skill_toolset.py:

test_load_resource_repeated_failure_escalates_to_fatal — second call on the same missing path within an invocation returns RESOURCE_NOT_FOUND_FATAL.
test_load_resource_different_path_also_escalates_to_fatal — second call on a different missing path also returns RESOURCE_NOT_FOUND_FATAL; proves the counter is path-agnostic and catches the hallucinated-variant pattern observed in live testing.
test_load_resource_failures_isolated_per_invocation — failures from invocation A do not escalate the first attempt in invocation B even when sharing the same session state dict.
test_load_resource_counter_uses_temp_prefix — every key written by the guard starts with temp: so it is trimmed from the persisted event delta and never reaches durable storage.
test_load_resource_first_missing_returns_soft_error — first RESOURCE_NOT_FOUND in an invocation returns the unchanged soft error code and message.

Verification:

All 85 tests pass (81 pre-existing + 4 updated guard tests + 1 new soft-error coverage test), 0 regressions.
pyink --check clean on both modified files.
isort --check-only clean on both modified files.
mypy src/google/adk/tools/skill_toolset.py reports the same 17 pre-existing errors as main; zero new errors introduced.

Backwards compatibility

The first-failure behavior is unchanged: same RESOURCE_NOT_FOUND error code, same error string. Existing callers see no difference.
The new RESOURCE_NOT_FOUND_FATAL code is purely additive.
The new state key uses the temp: prefix and is guaranteed not to leak into persisted session storage.

gemini-code-assist

Code Review

This pull request implements an invocation-scoped retry guard for the load_skill_resource tool to prevent redundant attempts on missing files. The changes include updated tool instructions, logic to track failed paths using a temporary state key tied to the invocation ID, and the introduction of a fatal error code for repeated failures. Additionally, comprehensive unit tests were added to ensure correct behavior across different scenarios and invocations. I have no feedback to provide.

Raman369AI · 2026-05-12T02:29:36Z

E2E Reproduction Evidence — Runner

This comment adds live end-to-end evidence as required by CONTRIBUTING.md § Manual End-to-End Tests (Runner).

Minimal Reproduction Agent

# test_agent/agent.py
from google.adk.agents import Agent
from google.adk.skills import models
from google.adk.tools.skill_toolset import SkillToolset

greeting_skill = models.Skill(
    frontmatter=models.Frontmatter(
        name="document-classifier",
        description=(
            "Classifies the document supplied by the user based on the reference document"
        ),
    ),
    instructions=(
        "Use Document 1 as reference document and Document 2 as input. Classify each document as"
        " python-code, non-python-code, or mixed. Return a short comparison."
    ),
)
root_agent = Agent(
    model="gemini-3-flash-preview",
    name="root_agent",
    description="A repro agent for SkillToolset inline document handling.",
    instruction=("Always use the document as the reference and summarize it."),
    tools=[SkillToolset(skills=[greeting_skill])],
)

Run command:

adk web

Trigger prompt (any vague document reference is sufficient):

summarize my document

What happens (unpatched)

The skill instructions reference "Document 1" and "Document 2" as if they are bundled skill resources, but no files are attached to the skill (references/, assets/, or scripts/ directories are absent). The agent's top-level instruction reinforces this by saying "Always use the document as the reference", which nudges the LLM further toward attempting resource loads.

From the LLM's perspective, RESOURCE_NOT_FOUND is structurally indistinguishable from a transient failure (network hiccup, lazy-load not yet available). With no termination guard and no manifest of valid paths, the model hallucinates plausible path variants and retries indefinitely:

ADK Web trace — load_skill_resource call sequence (calls #6–google#18 visible, loop was still running at kill time):

⚡ #6 load_skill_resource
✓ google#7 load_skill_resource
⚡ google#8 load_skill_resource
✓ google#9 load_skill_resource
⚡ google#10 load_skill_resource
✓ google#11 load_skill_resource
⚡ google#12 load_skill_resource
✓ google#13 load_skill_resource
⚡ google#14 load_skill_resource
✓ google#15 load_skill_resource
⚡ google#16 load_skill_resource
✓ google#17 load_skill_resource
⚡ google#18 load_skill_resource
(continues…)

(⚡ = RESOURCE_NOT_FOUND error returned; ✓ = tool call dispatched successfully but returned the error string — the alternating pattern reflects the model immediately re-invoking after each soft failure.)

Server log — loop persisting through and beyond CTRL+C:

2026-05-11 21:20:00,837 - INFO - google_llm.py:208 - Sending out request, model: gemini-3-flash-preview
2026-05-11 21:20:00,861 - INFO - google_llm.py:276 - Response received from the model.
2026-05-11 21:20:02,762 - INFO - google_llm.py:276 - Response received from the model.
2026-05-11 21:20:02,785 - INFO - google_llm.py:208 - Sending out request, model: gemini-3-flash-preview
...
^CINFO:     Shutting down
INFO:     Waiting for connections to close. (CTRL+C to force quit)
^C2026-05-11 21:20:16,718 - INFO - google_llm.py:276 - Response received from the model.
2026-05-11 21:20:16,741 - INFO - google_llm.py:208 - Sending out request, model: gemini-3-flash-preview
...
^C^C^C^C^C^C^C2026-05-11 21:20:31,023 - INFO - google_llm.py:276 - Response received from the model.
2026-05-11 21:20:31,072 - INFO - google_llm.py:208 - Sending out request, model: gemini-3-flash-preview

The loop continued firing new LLM requests for ~20 seconds after the first CTRL+C, requiring repeated interrupt signals to force-quit. In-flight API calls are not cancellable — every iteration that completes before the process dies is billed.

Why ambiguous prompts make this a framework issue, not a user error

This reproduction agent is not contrived — skill instructions that reference documents by natural-language names ("Document 1", "reference document") are a normal and expected authoring pattern. The ambiguity is unavoidable: skill authors do not know at write-time whether end-users will supply input inline or whether the model will infer a resource load. The framework must be defensive here because:

RESOURCE_NOT_FOUND is a soft error by design — it communicates "file absent" but looks transient to the LLM.
No path manifest is returned (intentional, per the progressive-disclosure spec) — the LLM has no way to self-correct.
max_llm_calls does not help — it caps the agent's total budget, not this specific failure mode, so a short cap breaks legitimate long-running agents.
A prompt-only fix is insufficient — imperfect upstream system prompts (like this repro) will always exist in the wild. A $105 bill from a prior production incident confirms real-world exposure.

The defense-in-depth approach in this PR — RESOURCE_NOT_FOUND_FATAL on the second same-path failure within an invocation, plus an explicit no-retry instruction in the default system prompt — is the correct framework-level fix.

Change-Id: If593c0118bb44ec4a4dd836d8c5d5c4025591af2

…ape regex characters Change-Id: Id155218fa6feb7e004684d8f96cd0598fa4d2766

Change-Id: Icfa51c1323a751921bb35000cd6aababe9033c09

Change-Id: Id86475d867e31e3f8d3ee1429f2b8a7b8c53da34

Fixes google#5787 Change-Id: I0aadfb11fedc49c1d45d4fcd9f89087c27a0e6ea

Change-Id: Ifcfa08f31513e6b25023dbe81d3de70645912303

Importing `ToolContext` at the top level in `base_tool.py` created a circular import cycle when `google.adk.tools.function_tool` was imported. The cycle was: `base_tool` -> `tool_context` -> `CallbackContext` -> `agents` -> `BaseAgent` -> `Event` -> `LlmResponse` -> `LlmRequest` -> `BaseTool`. Moving `ToolContext` to `TYPE_CHECKING` breaks this runtime cycle, using `from __future__ import annotations` to preserve type hints. Change-Id: I34731f46cf5aa4665ad3c6604495ad5fd84f8aa5

Change-Id: I4e3e3e9f1846520f3fda86cc83453b30bdd336ef

Merge: google#5796 ORIGINAL_AUTHOR=Achuth Narayan Rajagopal <achuth.narayan@gmail.com> GitOrigin-RevId: ec8265e Change-Id: Ib77c4e2ad3df29a7ca734f142b9dc42d0890f429

Change-Id: I47379cb495725595571ae5274ad34d8201ba1ca3

Merge: google#5797 ORIGINAL_AUTHOR=Achuth Narayan Rajagopal <achuth.narayan@gmail.com> GitOrigin-RevId: 5c41f03 Change-Id: I07edf4c4fff92e37228f876f16cf53310146a257

Ensure both types of Copybara PR commits are correctly identified and parsed by checking author email, headers, and updated merge regex. Change-Id: Ic0cc19f67a3db4637fa597702393a11802b3f9aa

Introduces a generate_chart tool to the Data Agent sample, leveraging Altair and vl-convert to render Vega-Lite specifications into charts. Co-authored-by: Han Cao <huanc@google.com> Change-Id: I5765487406d511e650091f5dc884102c43568fd4

This change allows AgentEngineSandboxComputer to create new sandboxes using either a specified sandbox template or a sandbox snapshot. The environment variables VMAAS_SANDBOX_TEMPLATE_NAME and VMAAS_SANDBOX_SNAPSHOT_NAME are introduced to configure this behavior. Co-authored-by: Emily Feng <emilyfeng@google.com> Change-Id: Iebdd980a16966ba765cacbce6d63d0d5b691650a

Refactor live inference to emit synthetic text events for output transcriptions as they arrive (streaming), rather than accumulating all transcriptions and appending a single synthetic event after the turn completes. This preserves event ordering and enables downstream consumers to process transcriptions incrementally. Co-authored-by: Wenhua Zhang <wenhzhang@google.com> Change-Id: Idca79968729f2145ebf1f1c993d79ab7f6a3bd78

FunctionTool._preprocess_args only converted dict args to a Pydantic model for single-model and Optional[Model] annotations. A Union[ModelA, ModelB] parameter was left as a raw dict, so isinstance checks inside the tool failed with "Unexpected entity type: <class 'dict'>" Use pydantic.TypeAdapter to validate against the full Union so pydantic picks the matching member. None and instances of any declared union member pass through unchanged; instances of unrelated BaseModels fall back to the existing graceful-failure warning path. Close google#5799 Change-Id: Ie69f8efc8395162eac375a0eaad0c77ed2097cec

…endpoints Merge 29fcd80 into 104edc8 ORIGINAL_AUTHOR=agrawalradhika-cell <agrawalradhika@google.com> GitOrigin-RevId: 46a8802 Change-Id: I3c6c4e92c92c99567be2bb938e95ae950bbbc634

Change-Id: Ic0f6706714c31984783b0877446a5cf2f1a6af23

…CE_NOT_FOUND When load_skill_resource returned RESOURCE_NOT_FOUND, the LLM treated it as a transient soft error and retried the same path indefinitely, producing runaway invocations and large API bills. Two complementary guards are added: 1. Code (LoadSkillResourceTool.run_async): an invocation-scoped retry guard tracks already-attempted (skill, path) pairs in tool_context.state under a temp:_adk_skill_resource_failed_paths_<id> key. The temp: prefix prevents persistence to durable session storage; the invocation_id suffix prevents leakage across invocations on in-memory session backends (where temp keys are not auto-cleared). A second call on the same path within the same invocation returns RESOURCE_NOT_FOUND_FATAL with an explicit stop instruction, giving the LLM an unambiguous terminal signal. 2. Prompt (DEFAULT_SKILL_SYSTEM_INSTRUCTION): adds a rule prohibiting retrying the same path after any error, and a scope boundary clarifying that load_skill_resource is for skill-bundled files only, not for runtime user input (a common source of hallucinated paths). Neither guard alone is sufficient: a code-only stop produces confusing downstream behavior when the LLM has no semantic understanding of why to stop; a prompt-only guard can be ignored under imperfect system prompts. Both layers are required for defense-in-depth. Tests cover: repeat-failure escalation, distinct-path soft errors not escalating, cross-invocation isolation with shared session state, and the temp: prefix on tracking keys.

…try guard Rule 5 in _DEFAULT_SKILL_SYSTEM_INSTRUCTION said "do not retry the same path" but the actual guard tracks failures invocation-wide and the RESOURCE_NOT_FOUND_FATAL message instructs "Do not retry any path". Update the prompt to match the implemented behavior.

gemini-code-assist Bot reviewed May 10, 2026

View reviewed changes

Raman369AI force-pushed the fix/skill-resource-not-found-infinite-loop branch from f18f9f9 to f89e4b6 Compare May 13, 2026 03:03

Raman369AI force-pushed the fix/skill-resource-not-found-infinite-loop branch 2 times, most recently from b71d4c4 to 9251ef3 Compare May 21, 2026 22:49

wyf7107 and others added 22 commits May 21, 2026 16:15

chore: Add single folder support in adk web

1b86d8e

Change-Id: If593c0118bb44ec4a4dd836d8c5d5c4025591af2

fix: update EditFileTool to handle cross-platform line breaks and esc…

1f24553

…ape regex characters Change-Id: Id155218fa6feb7e004684d8f96cd0598fa4d2766

fix: resolve circular import caused by llm_request

7e38fc8

Change-Id: Icfa51c1323a751921bb35000cd6aababe9033c09

fix: Make google-cloud-storage import lazy in skill utils

416775d

Change-Id: Id86475d867e31e3f8d3ee1429f2b8a7b8c53da34

fix: lazy-import GCS evaluation managers in evals utility

5f91a9d

Fixes google#5787 Change-Id: I0aadfb11fedc49c1d45d4fcd9f89087c27a0e6ea

chore: Support test runs on v1 pull requests

9fd0e07

Change-Id: Ifcfa08f31513e6b25023dbe81d3de70645912303

chore: edit files

a4f394e

Change-Id: I4e3e3e9f1846520f3fda86cc83453b30bdd336ef

feat: Add user.id to gen_ai.user.message log records for telemetry

eb379be

Merge: google#5796 ORIGINAL_AUTHOR=Achuth Narayan Rajagopal <achuth.narayan@gmail.com> GitOrigin-RevId: ec8265e Change-Id: Ib77c4e2ad3df29a7ca734f142b9dc42d0890f429

chore: edit docs

68785f5

Change-Id: I47379cb495725595571ae5274ad34d8201ba1ca3

feat: Fix error message telemetry for tool calls

e56c021

Merge: google#5797 ORIGINAL_AUTHOR=Achuth Narayan Rajagopal <achuth.narayan@gmail.com> GitOrigin-RevId: 5c41f03 Change-Id: I07edf4c4fff92e37228f876f16cf53310146a257

ci: Support bot-authored commits in PR handler

ecb759c

Ensure both types of Copybara PR commits are correctly identified and parsed by checking author email, headers, and updated merge regex. Change-Id: Ic0cc19f67a3db4637fa597702393a11802b3f9aa

feat: Update check-file-contents.yml to check for non-mTLS hardcoded …

eaff9c0

…endpoints Merge 29fcd80 into 104edc8 ORIGINAL_AUTHOR=agrawalradhika-cell <agrawalradhika@google.com> GitOrigin-RevId: 46a8802 Change-Id: I3c6c4e92c92c99567be2bb938e95ae950bbbc634

chore: Set python unit test timeout to 10 minutes

7ad7994

Change-Id: Ic0f6706714c31984783b0877446a5cf2f1a6af23

fix: switch retry guard to invocation-wide counter

d323239

style: apply pyink formatting

187d581

Raman369AI force-pushed the fix/skill-resource-not-found-infinite-loop branch from 9251ef3 to c6b2311 Compare May 24, 2026 23:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: terminate infinite retry loop in LoadSkillResourceTool on RESOURCE_NOT_FOUND#5

fix: terminate infinite retry loop in LoadSkillResourceTool on RESOURCE_NOT_FOUND#5
Raman369AI wants to merge 22 commits into
mainfrom
fix/skill-resource-not-found-infinite-loop

Raman369AI commented May 10, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Raman369AI commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

Conversation

Raman369AI commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why this matters

Live trace evidence

What changed

1. Code: invocation-scoped failure counter in LoadSkillResourceTool.run_async

2. Prompt: two additions to _DEFAULT_SKILL_SYSTEM_INSTRUCTION

Why both layers

Considered and rejected

Test plan

Backwards compatibility

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Raman369AI commented May 12, 2026

E2E Reproduction Evidence — Runner

Minimal Reproduction Agent

What happens (unpatched)

Why ambiguous prompts make this a framework issue, not a user error

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

Raman369AI commented May 10, 2026 •

edited

Loading

1. Code: invocation-scoped failure counter in `LoadSkillResourceTool.run_async`

2. Prompt: two additions to `_DEFAULT_SKILL_SYSTEM_INSTRUCTION`