There's a lot of discourse on Twitter about people using LLMs to solve CTF challenges. I used to write CTF challenges in a past life, so I threw a couple of my hardest ones at it.
We're screwed.
At least with text-file-style challenges ("source code provided", etc.), Claude Opus solves them quickly. For the "simpler" of the two, it just very quickly ran through the steps to solve it. For the more "ridiculous" challenge, it took a long while, and in fact as I type this it's still burning tokens "verifying" the flag even though it very obviously found the flag and knows it (it's leetspeak, and it identified that and that it's plausible). LLMs are, indeed, still completely unintelligent, because no human would waste time verifying a flag and second-guessing themselves when it is very obviously correct. (Also, you could just run it...)
But that doesn't matter, because it found it.
The thing is, CTF challenges aren't about inventing the next great invention or having a rare spark of genius. CTF challenges are about learning things by doing. You're supposed to enjoy the process. The whole point of a well-designed CTF challenge is that anyone, given enough time, effort, self-improvement, and learning, can solve it. The goal isn't actually to get the flag, otherwise you'd just ask another team for the flag (which is against the rules, of course). The goal is to get the flag by yourself. If you ask an LLM to get the flag for you, you aren't doing that.
(Continued)
-
This is, quite frankly, the same problem LLM agents are causing in software engineering and such, just way worse. Because with CTFs, there is no "quality metric". Once you get the flag you get the flag. It doesn't matter if your approach was ridiculous or you completely misunderstood the problem or "winged it" in the worst way possible or the solver is a spaghetti ball of technical debt. It doesn't matter if Claude made a dozen reasoning errors in its chain that no human would (which it did). Every time it gets it wrong it just tries again, and it can try again orders of magnitude faster than a human, so it doesn't matter.
I don't have a solution for this. You can't ban LLMs; people will use them regardless. You could try interviewing teams one-on-one after the challenge to see if they actually have a coherent story and clearly did the work, but even then you could conceivably cheat using an LLM, wait a bit to make the time spent plausible, study the reasoning chain, and convince someone that you did the work. It's like LLMs in academia, but much worse due to the time constraints and the explicitly competitive nature of CTFs.
LLMs broke CTFs.
Asahi Linya (朝日りにゃ〜): I really hope that LLMs are a temporary phenomenon. Sure, the local ones will remain even after the bubble finally bursts, but they're ridiculously bad; you do need millions of dollars' worth of GPUs to get to that "it's still bad but it looks plausible" level of output quality.
-
@lina I'm a geek... I like AI and all of that... but if I understood your post right, it's "complaining" about the consequences of the capabilities it provides. That reminds me of MMORPGs a long time ago, where you could marvel at someone's deeds, whereas now you just google the setup and technique and reproduce it... basically, humans are becoming less the center of intelligence and more cows following a line
-
@lina they're engineering their self-incapacitation. Or decapacitation, I suppose, because they flush some skills down the drain to do it.
-
@abacabadabacaba It's much easier to parallel construct a CTF solution than a programming challenge. CTF challenges are all about having a series of realizations that lead to the answer.
If you ban LLMs in a programming challenge, you could conceivably detect signs of LLM usage in the program in various ways (not perfectly, but you could try). A CTF challenge just has one output, the flag. Everyone finds the same flag. There is no way to tell how you did it. You'd have to introduce invasive monitoring like online tests, and even if you record people's screens, they could easily be running an LLM on another machine to have it come up with the "key points" of the solution, which they then just implement. You can't prove that someone didn't have some ideas on their own.
@lina There are programming competitions where participants run their solutions locally and submit the output. But they are usually also required to submit the code, even though it is not automatically judged. If cheating is suspected, the judges may look into the code. There may also be automated checks for plagiarism, etc. CTFs could do the same. There really isn't a good reason to keep solutions secret after the challenge concludes, and published solutions can serve as learning material for future challenges.
-
@nathan It's worse because it's not a linear game like chess. You aren't competing move-wise, you are going down your own path where there is no interaction between teams. There's no way to detect that in online competition, even heuristically. There's no realtime monitoring. There isn't any condensed format that describes "what you did". At most you could stream yourself to some kind of video escrow system, but then who is going to watch those? And if you make them public after the competition, you are giving away your tools to everyone. And you could still have an LLM on the side on another machine and parallel construct the whole thing plausibly.
Sure, you could go in-person only, but that would only work for the top tiers, and who is going to want to learn and grow online when a huge number of people are cheating there?
It's the same with any kind of game. Cheating is barely a concern in person, but people hate cheaters online, and companies still try hard to detect them. And detecting cheaters in a CTF is nigh impossible.
@lina Ah I didn't consider that there would be a culture of hiding tools/methods. Yeah that's definitely incompatible with a post-LLM world.
This is a general trend with GenAI: the only way to earn legitimacy is either in person or by publicizing the creative process. For a while now, visual and music artists have had to either rely on their existing credibility or share their creative process to establish their art's legitimacy. New anonymous art has sadly been made nearly worthless.
-
@lina@vt.social To be fair I'd argue this is strictly a people problem
I feel like this is the inherent nature of competition in places where cooperation would make much more sense
And this issue permeates so many areas that the world is more preoccupied with catching the people cheating the system instead of going "hey, maybe this system could incentivize actually getting invested in the thing instead of being a pure so-called meritocracy".
-
@lina
CTF = Capture the Flag, in case that helps anyone besides me. I try to do for initialisms and acronyms what alt text does for images.
Wikipedia: "In computer security, Capture the Flag (CTF) is an exercise in which participants attempt to find text strings, called 'flags', which are secretly hidden in purposefully vulnerable programs or websites."
-
I might still do a monthly challenge or something in the future so people who want to have fun and learn can have fun and learn. That's still okay.
But CTFs as discrete competitions with winners are dead.
A CTF competition is basically gamified homework.
LLMs broke the game. Now all that's left is self study.
@lina I wonder if you can still design a challenge to be "LLM unfriendly" by changing the wording, just like those papers showing how an LLM aces problems like "river crossing", but if you change the wording a bit, it fails in weird and spectacular ways.
-
@doragasu Possibly? I might try removing all the "hints" from one and trying again to see if it's any different. But that also affects human solvers... the hints are there to point you towards a website that explains the fundamentals of what's going on. The LLM didn't even read that; it just guessed from a filename and a comment and hulk-smashed its way to guessing the general concept right over multiple attempts...
-
@lina In those papers trying to confuse LLMs, what was very effective, IIRC, was adding data you don't need to the problem statement. The LLM tries to use all the data you give it to solve the problem, and fails. Just like when a child is solving maths problems from a textbook: all the problems look similar, so the child internalizes that you have to add two numbers and divide by the third one. Then you change the problem and the child fails, because they apply the same "formula".
-
@lina Like in here: https://arxiv.org/abs/2305.04388
-
@lina Or better, this one: https://arxiv.org/abs/2410.05229
-
@grishka FYI your instance seems to have a very old display name cached for me (that it is using for mentions) ;;
-
@nathan I don't think there's necessarily a culture of hiding methods outright (though some of the more competitive teams might); it's more that people build their own personal stash of scripts and things to build off of, and don't just outright post it on GitHub or whatever.
So like, "fucky stuff with QR codes" having shown up in CTF challenges more than once, I have a personal "do low-level analysis and extended recovery of damaged QR codes" script.
-
@abacabadabacaba The thing is, the solution isn't "the code". The solution is the process. You can have an LLM "solve" it for you, then rewrite the process and cheat that way. Yes, the solution will often involve some bespoke scripts and tooling, but that's just part of it. The "aha moments" are the part you can't provide proof of.
-
Hoshino Lina (星乃リナ) 🩵 3D Yuri Wedding 2026!!!: yeah, I only automatically reload actors when I receive activities from them and more than 24 hours has passed since the previous reload. Now that you've sent me a reply, it did trigger that. Maybe I should do the same when fetching things like a post that someone boosted.
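As a rough sketch, the policy is this (hypothetical names, not the actual Smithereen code):
```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=24)  # hypothetical name for the 24-hour threshold

@dataclass
class CachedActor:
    actor_id: str
    display_name: str
    last_refreshed: datetime

def on_incoming_activity(actor: CachedActor, fetch_actor_document) -> None:
    """Re-fetch a remote actor's profile (display name etc.) at most once
    per 24 hours, and only when an activity actually arrives from them."""
    now = datetime.now(timezone.utc)
    if now - actor.last_refreshed > STALE_AFTER:
        doc = fetch_actor_document(actor.actor_id)  # HTTP GET of the actor JSON
        actor.display_name = doc.get("name", actor.display_name)
        actor.last_refreshed = now
```
Fetching a boosted post would just be one more call site for the same staleness check.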
-
@grishka Yeah I think that name was possibly a year+ old ^^;;
-
And honestly, reading the Claude output, it's just ridiculous. It clearly has no idea what it's doing; it's just pattern-matching. Once it found the flag, it spent seven pages of reasoning and four more scripts trying to verify it, and failed to actually find what went wrong. It just concluded, after all that wasted time, that sometimes it gets the right answer and sometimes the wrong one, so the thing that looks like a flag is probably the flag. It can't debug its own code to find out what actually went wrong; it just decided to brute-force a different way.
It's just a pattern-matching machine. But it turns out that if you brute-force pattern-match enough times, in enough steps, inside a reasoning loop, you eventually stumble upon the answer, even if you have no idea how.
Humans can "wing it" and pattern-match too, but it's a gamble. If you pattern-match wrong and go down the wrong path, you've just wasted a bunch of time and someone else wins. Competitive CTFs are all about walking the line between going as fast as possible and being careful enough that you don't have to revisit, debug, and redo a bunch of your work. LLMs completely screw that up by brute-forcing the process faster than humans can.
This sucks.
AI is fast eradicating any learning activity.
In my current job, learning anything new is actively discouraged. As we were told, "they only care about numbers on a dashboard".
I got to the position I am in, at the level I am at, by being curious and very interested, by taking things apart and figuring out how they work.
An LLM, which in the eyes of a CEO means he can get rid of people like me, is the end of the road. We are all doomed.
-
@natty But the whole point of a for-fun (/prize) competition is to use the gamification to motivate people... that's kind of what games are?
You don't strictly need it; you can publish challenges to be solved for no points and no prize... but that demonstrably does not get as many people interested. Between the people for whom that works and the "I just want to win" people who would use LLMs, there are people who would be motivated to compete but not to just self-study, and you lose those when the LLM cheaters come in.
-
@lina I do feel like this is about how you use the LLM. I often find myself throwing something into my local Llama to get an ELI5, or to ask what the flags on some command do in combination.
But as someone who has designed CTFs and watched someone fly through one without learning a damn thing, it can be hard to keep the faith.
When I took physics all those years ago, my professor made us learn a slide rule before a calculator. If you skip over the basics and use a machine to do it... when the machine breaks or is wrong, who is gonna fix it, and how?
@ahasty But at least a calculator is always right. I have no problem with people using tools that can be understood and are reliable/engineered.
The problem is that LLMs are not that. They cannot be understood; they are black boxes that just brute-force their way through things. So they are particularly and uniquely toxic in the harm they cause, compared to the tools we've had until now as part of the general industrial/technological revolution.