“I've failed 17 consecutive times with the exact same error”
That was the verbatim response from our agent as it attempted to execute a coding task. Actually, it was a little more verbose in its self-scolding:
I’ve failed 17 consecutive times with the exact same error. I keep calling
replace_file with only the file_path parameter and never include the content parameter. After 17 consecutive failures, I need to break this pattern. Let me think about what's happening:
I’m not actually constructing the CSS content in the tool call
I keep saying I will, but then I don’t do it
I need to ACTUALLY write the complete CSS inside the content parameter
In my next response, I MUST write the COMPLETE tool call with BOTH parameters, including the actual CSS code in the content parameter. I cannot make this mistake an 18th time.
Spoiler alert: it would repeat the mistake an 18th time. Even though it knew exactly what it was doing wrong. The strangest part wasn’t the failure; it was the self-awareness.
while (true)
Tonkotsu is a multi-agent management environment — the successor to IDEs. A core part of it is a coding agent that executes engineering tasks in parallel and without the need for micromanagement. The coding agent uses an LLM (mostly Claude Sonnet) and a set of coding tools focused on reading from and writing to a git repo. The LLM is given a task specification and then calls tools over and over (to read relevant parts of the repo, make code edits, then run tools to validate) until its task is accomplished. Pretty standard coding agent architecture.
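For concreteness, here is a minimal sketch of that loop in Python. The helpers call_llm and run_tool are hypothetical stand-ins, not our actual implementation, and the real agent has a lot more plumbing around them:

def run_task(task_spec, tools, max_messages=100):
    # call_llm and run_tool are hypothetical stand-ins for the model API and
    # our tool executor; reply is assumed to expose .content and .tool_calls.
    # max_messages is the per-task limit described below.
    messages = [{"role": "user", "content": task_spec}]
    while len(messages) < max_messages:
        reply = call_llm(messages, tools)  # returns text and/or tool calls
        messages.append({"role": "assistant", "content": reply.content})
        if not reply.tool_calls:  # no tool calls: the model considers the task done
            return reply
        tool_results = [run_tool(call) for call in reply.tool_calls]
        messages.append({"role": "user", "content": tool_results})
    raise RuntimeError("exceeded the maximum number of messages for this task")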
We track task failures in a daily review to make sure agent reliability and generated code quality meet high standards. We get to see LLM behavior at the edges, where things either perform shockingly well or fail in very bizarre ways. Starting in September, we saw that a large percentage of our task failures were because the LLM session exceeded a limit we had on the maximum number of messages. Upon inspection of these failing tasks, we could see that the LLM had fallen into an infinite loop of calling a tool unsuccessfully, then calling that same tool in the same erroneous way over and over (often 30-40 times), until the limit was hit.
We have a replace_file tool that allows the LLM to overwrite an existing file (or create a new file) at file_path with text provided in content. Both parameters are marked as required.
{
  name: "replace_file",
  description: "Write a file to the local filesystem. Overwrites the existing file if there is one.",
  input_schema: {
    type: "object",
    properties: {
      file_path: {
        type: "string",
        description: "Path to the file to replace or create"
      },
      content: {
        type: "string",
        description: "New content for the file"
      }
    },
    required: ["file_path", "content"]
  }
}
In the failing tasks, the LLM repeatedly called replace_file with a valid file_path but no content at all! And once it made a bad call, it would spiral into an infinite loop, calling replace_file over and over in exactly the same way and never specifying content.
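Reconstructed for illustration (the path is made up, and real tool_use blocks also carry an id), the difference between the failing calls and a correct one looks like this:

bad_call = {
    "type": "tool_use",
    "name": "replace_file",
    "input": {"file_path": "styles/styles.css"},  # "content" is missing entirely
}

good_call = {
    "type": "tool_use",
    "name": "replace_file",
    "input": {
        "file_path": "styles/styles.css",
        "content": "/* the complete new CSS for the file */",
    },
}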
break;
Our initial mitigation was simple and direct. When receiving a bad tool call, we started returning a more verbose error message to the LLM, explicitly naming the parameter that was missing and clearly instructing it to think about the value of that parameter before making the call again. The fix was deployed and we found it had no observable effect at all — our first hint that this wasn’t just a run-of-the-mill mistake.
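A sketch of that first mitigation, assuming the replace_file schema above; execute_tool is a hypothetical stand-in for actually running the tool, and the tool_result shape follows the Anthropic Messages API:

def missing_required_params(tool_call, tool_schema):
    # Compare the call's input against the schema's required parameter list.
    provided = tool_call.get("input", {})
    return [p for p in tool_schema["input_schema"]["required"] if p not in provided]

def handle_tool_call(tool_call, tool_schema):
    missing = missing_required_params(tool_call, tool_schema)
    if not missing:
        return execute_tool(tool_call)  # hypothetical: runs the tool normally
    # Verbose error: name the missing parameter and instruct the model to think
    # about its value before calling the tool again.
    return {
        "type": "tool_result",
        "tool_use_id": tool_call["id"],
        "is_error": True,
        "content": (
            f"Error: {tool_call['name']} was called without the required "
            f"parameter(s): {', '.join(missing)}. Think carefully about the "
            f"exact value of each missing parameter, then call the tool again "
            f"with every required parameter included."
        ),
    }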
Next, we tried a stronger intervention. When a bad tool call was made, we would disable tool calling entirely in the next LLM turn. We’d explicitly tell the model via a user message that tool calling was disabled, that the function call was missing a parameter, and that it should reflect on what the content of that parameter should be. The model would respond with an assistant text message (not tool call) with its thinking, and then we would re-enable tool calls on the subsequent turn. This was a much more invasive approach, pausing the entire trajectory to give the model a chance to think deeply.
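In control-flow terms, the intervention looked something like this sketch (call_llm is the same hypothetical stand-in as before; how you disable tool calling for a single turn depends on your API client):

def reflect_then_retry(messages, tools, missing_param):
    # Turn 1: tool calling disabled. Tell the model what went wrong and ask it
    # to reflect, in plain text, on what the missing parameter should contain.
    messages.append({
        "role": "user",
        "content": (
            "Tool calling is disabled for this turn. Your last tool call was "
            f"missing the required '{missing_param}' parameter. Reflect on what "
            "the content of that parameter should be."
        ),
    })
    reflection = call_llm(messages, tools=None)  # text-only assistant turn
    messages.append({"role": "assistant", "content": reflection.content})

    # Turn 2: re-enable tool calling and let the model retry the call.
    messages.append({"role": "user", "content": "Tool calling is re-enabled. Make the corrected tool call now."})
    return call_llm(messages, tools=tools)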
And it still didn’t work. The model would indeed think through what needed to be done, often with a dose of self-flagellation as in the quote at the start of this article. The fascinating thing: it knew precisely what was wrong. It could textually describe how to fix it. But then, on the next tool-call-enabled turn, it would immediately repeat the malformed call.
At some point, it also started talking about implementation details, suggesting that internally the model emits tool calls as XML:
The issue is clear: I keep writing:
<invoke name="replace_file"> <parameter name="file_path">styles/styles.css</parameter> </invoke>
But I MUST write:
<invoke name="replace_file"> <parameter name="file_path">styles/styles.css</parameter> <parameter name="content">THE ACTUAL CSS CODE HERE</parameter> </invoke>
We had stumbled upon some strange, deep-seated behavior of the model. We speculated that it was an artifact of the model's training, which had demonstrated the value of retrying tool calls: once the model latched onto this failing pattern, it kept sampling the same tool call sequence again and again. It had fallen into a gravity well so strong that not only could it not correct the tool call, it also couldn't formulate any other strategy as a workaround.
At this point, we were stumped. Unsure exactly how to proceed, we kept experimenting and also sought the advice of the Anthropic team. They suggested a tweak to our intervention approach: provide the LLM with the exact JSON template for the function call and ask it to fill it out during its tool-calls-disabled reflection turn. We didn't expect much of this simple tweak but added it to our battery of experiments. We would now add this static prompt to our reflection instruction to the model:
Generate the following JSON object to represent the correct tool call with real parameter values for replace_file. Conform to exactly this JSON structure:
{
  'type': 'tool_use',
  'name': 'replace_file',
  'input': {
    'file_path': <FILE_PATH_HERE>,
    'content': <CONTENT_HERE>
  }
}

Shockingly, this simple tweak resulted in significant improvements! The model still occasionally generates incorrect tool calls, but is able to recover rather than spiral into an infinite loop — a much better result. In yet another bizarre aspect of the model's behavior, this explicit JSON structure was enough to help the model climb out of the gravity well of the tool call loop.
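One convenient side effect of this approach (a sketch, not the exact code we run, and assuming the model replies with standard double-quoted JSON and nothing else; in practice the object may need to be extracted from surrounding text) is that the filled-in template can be checked mechanically during the reflection turn, before tool calling is re-enabled:

import json

def template_filled_correctly(reflection_text):
    # Parse the JSON object the model produced while tool calling was disabled
    # and confirm both required replace_file parameters are present and non-empty.
    try:
        call = json.loads(reflection_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    params = call.get("input", {})
    if not isinstance(params, dict):
        return False
    return bool(params.get("file_path")) and bool(params.get("content"))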
More recently, Anthropic released strict tool use, which should guarantee correct tool calls. We’re currently experimenting with this as well.
Parallel > Perfect
What’s striking is how familiar this all feels if you’ve ever been an engineering manager or even just an observant member of a team. You’ve probably worked with someone who:
Repeats the same unproductive action in the face of increasingly explicit feedback
Is generally quite reasonable, but gets bizarrely stubborn on one issue
Can verbalize the solution to a problem, but simply can’t execute it
Humans do this, and so do LLMs. Our bet is that the future isn’t perfect coworkers (agent or human); it’s the ability to effectively coordinate them all together to solve a big problem in parallel.

