There's a lot of discourse on Twitter about people using LLMs to solve CTF challenges.

63 Posts 33 Posters 74 Views
This topic has been deleted. Only users with topic management privileges can see it.
  • lina@vt.social

    This is, quite frankly, the same problem LLM agents are causing in software engineering and such, just way worse. Because with CTFs, there is no "quality metric". Once you get the flag you get the flag. It doesn't matter if your approach was ridiculous or you completely misunderstood the problem or "winged it" in the worst way possible or the solver is a spaghetti ball of technical debt. It doesn't matter if Claude made a dozen reasoning errors in its chain that no human would (which it did). Every time it gets it wrong it just tries again, and it can try again orders of magnitude faster than a human, so it doesn't matter.

    I don't have a solution for this. You can't ban LLMs, people will use them regardless. You could try interviewing teams one on one after the challenge to see if they actually have a coherent story and clearly did the work, but even then you could conceivably cheat using an LLM and then wait it out a bit to make the time spent plausible, study the reasoning chain, and convince someone that you did the work. It's like LLMs in academics, but much worse due to the time constraints and explicitly competitive nature of CTFs.

    LLMs broke CTFs.

    lina@vt.social
    #8

    And honestly, reading the Claude output, it's just ridiculous. It clearly has no idea what it's doing and it's just pattern-matching. Once it found the flag it spent 7 pages of reasoning and four more scripts trying to verify it, and failed to actually find what went wrong. It just concluded after all that time wasted that sometimes it gets the right answer and sometimes the wrong answer and so probably the flag that looks like a flag is the flag. It can't debug its own code to find out what actually went wrong, it just decided to brute force try again a different way.

    It's just a pattern-matching machine. But it turns out if you brute force pattern-match enough times in enough steps inside a reasoning loop, you eventually stumble upon the answer, even if you have no idea how.

    Humans can "wing it" and pattern-match too, but it's a gamble. If you pattern-match wrong and go down the wrong path, you just wasted a bunch of time and someone else wins. Competitive CTFs are all about walking the line between going as fast as possible and being very careful so you don't have to revisit, debug, and redo a bunch of your work. LLMs completely screw that up by brute forcing the process faster than humans.

    This sucks.

    • lina@vt.social
      #9

      I might still do a monthly challenge or something in the future so people who want to have fun and learn can have fun and learn. That's still okay.

      But CTFs as discrete competitions with winners are dead.

      A CTF competition is basically gamified homework.

      LLMs broke the game. Now all that's left is self study.

      mrdos@hachyderm.ioM coldclimate@hachyderm.ioC doragasu@mastodon.sdf.orgD 3 Replies Last reply
      0
      • mrdos@hachyderm.io
        #10

        @lina For in-person CTF competitions, would it be possible to do like programming competitions (specifically thinking of ACM ICPC) and disallow Internet access entirely? That would at least limit GenAI use to local models, which I suspect will remain uncompetitive at this sort of task for a very long time (due to the nearly inherent context size limitations).

        • lina@vt.social

          So it's not surprising that an LLM can solve them, because it automates the process. That just takes all the fun and all the learning out of it, completely defeating the purpose.

          I'm sure you could still come up with challenges that LLMs can't solve, but they would necessarily be harder, because LLMs are going to oneshot any of the "baby" starter challenges you could possibly come up with. So you either get rid of the "baby" challenges entirely (which means less experienced teams can't compete at all), or you accept that people will solve them with LLMs. But neither of those actually works.

          Since CTF competitions are pretty much by definition timed, speed is an advantage. That means a team that does not use LLMs will not win, so teams must use LLMs. This applies to both new and experienced teams. But a newbie team using LLMs will not learn, because the whole point is learning by doing, and you're not doing anything; and so it will never become experienced.

          So this is going to devolve into CTFs being a battle of teams using LLMs to fight for the top spots, where everyone who doesn't want to use an LLM is excluded, and where less experienced teams stop improving and getting better, because they're outsourcing the work to LLMs and not learning as a result.

          pandro@fedi.imowl.net
          #11

          @lina@vt.social
          That is (yet another) sad development to hear about.

          I just hope CTFs will be made/hosted regardless. I have never cared too much about the time and more about finishing them and it'd be a shame if that option of learning were to disappear or get way less accessible.

          • lina@vt.social
            #12

            @MrDOS Maybe, but in-person CTFs are themselves biased towards more privileged people and more advanced teams. We need online CTFs for the pipeline to work...

            • mrdos@hachyderm.io
              #13

              @lina Yeah, of course – and in reverse, I'm sure online competitive programming competitions are staring down the barrel of the same problem!

              • nathan@mastodon.e4b4.eu
                #14

                @lina

                How does this statement differ from "Deep Blue broke chess"? Cheat engines are similarly impossible to deterministically detect in online competition, yet the game is more popular than ever.

                The competition format will have to adapt, which sucks, but if the majority of participants can agree that LLMs are cheats, then the community should be able to adapt & self-police like any other game community where cheats are easily accessible. Unless I'm missing something special about CTFs?

                • lina@vt.social
                  #15

                  @nathan It's worse because it's not a linear game like chess. You aren't competing move-wise, you are going down your own path where there is no interaction between teams. There's no way to detect that in online competition, even heuristically. There's no realtime monitoring. There isn't any condensed format that describes "what you did". At most you could stream yourself to some kind of video escrow system, but then who is going to watch those? And if you make them public after the competition, you are giving away your tools to everyone. And you could still have an LLM on the side on another machine and parallel construct the whole thing plausibly.

                  Sure, you could go in-person only, but that would only work for the top tiers. And who is going to want to learn and grow online when a huge number of people there are going to be cheating?

                  It's the same with any kind of game. Sure cheating is barely a concern in-person, but people hate cheaters online, and companies still try hard to detect cheaters. And detecting cheaters for a CTF is nigh impossible.

                  • lina@vt.social

                    There's a lot of discourse on Twitter about people using LLMs to solve CTF challenges. I used to write CTF challenges in a past life, so I threw a couple of my hardest ones at it.

                    We're screwed.

                    At least with text-file style challenges ("source code provided" etc.), Claude Opus solves them quickly. For the "simpler" of the two, it just very quickly ran through the steps to solve it. For the more "ridiculous" challenge, it took a long while, and in fact as I type this it's still burning tokens "verifying" the flag even though it very obviously found the flag and knows it (it's leetspeak, and it identified that and that it's plausible). LLMs are, indeed, still completely unintelligent, because no human would waste time verifying a flag and second-guessing themselves when it is so obviously correct. (Also, you could just run it...)

                    But that doesn't matter, because it found it.

                    The thing is, CTF challenges aren't about inventing the next great invention or having a rare spark of genius. CTF challenges are about learning things by doing. You're supposed to enjoy the process. The whole point of a well-designed CTF challenge is that anyone, given enough time and effort and self-improvement and learning, can solve it. The goal isn't actually to get the flag, otherwise you'd just ask another team for the flag (which is against the rules of course). The goal is to get the flag by yourself. If you ask an LLM to get the flag for you, you aren't doing that.

                    (Continued)

                    zrb@social.hildebrind.space
                    #16

                    @lina I feel exactly the same about academia. I dunno about anyone else, but I genuinely enjoyed learning new things in school, building new ways of thinking and new skills. Sharpening my mind was the best feeling in the world back then, when I was young and still had all that neuroplasticity. It's mystifying that anyone would rather surrender their thinking to a chatbot than build new skills.

                    But then again, lots of young adults in the US are ushered into university and told that it's their only option, even if they don't particularly care for the subjects they're ostensibly supposed to be learning. I was privileged enough that I didn't have to work through college, for instance.

                    • alazar@infosec.exchange
                      #17

                      @lina I mean, yes, but I don't know if complete pessimism is warranted. It's definitely broken a lot of public CTFs but I think society will find a way, and maybe it's not even the worst thing.

                      Forever ago, when I was in uni, a few colleagues and I would do this thing every semester where we'd do one of the nominally individual projects together, ahead of time. Technically, it was cheating, but we did it specifically so we could go "off the rails" and try things that were not in the guide that our TAs handed out to us.

                      For instance, the "rite of passage" for every 3rd year student in my generation was a transformer design. It was something you'd work on, on and off, for the whole semester (we're talking big three-phase transformers for power distribution here so there was definitely *a lot* to work on).

                      They'd give us a sort of step-by-step guide to walk us through the whole process (start here, compute this quantity, check it against this standard table etc.) and you'd consult with the TAs along the semester. It was definitely interesting, if tedious at times, but tediousness was the lesser problem.

                      The bigger deal was that these guides weren't updated very often -- because the associated industrial standards don't get updated that often.

                      So what we did was that those of us who actually wanted to be there in the first place got together and we tried to experiment with various things not in the guide. Different isolation materials that we'd just read about, different cooling methods and so on. Not that we could show those to the TAs (can't blame them but most of them weren't very interested), and we didn't always have a lot of time or access to all the data we needed (we were students and had student budgets to contend with -- we couldn't buy standards, for example, and this was before libgen).

                      The cool thing about it was that it removed any kind of metrics pressure from the process. We weren't going to be ranked by anyone, there were no arsehole TAs to cater to, and no obtuse professors whose personal preferences about report formatting had to be placated.

                      We also didn't have to show our results to anyone who wasn't primarily interested in mentoring us. We worked *really* quickly because we had graded assignments to finish first and clung to whatever had remained of our social lives by the third year of an engineering degree, so "deadlines" were super tight.

                      That quickly removed any incentive to cheat. When there was no way around it (tl;dr outdated guides sometimes didn't work in the context we used them, I have some fun stories about that) we totally cheated on the "real" assignments -- but never on these ones. This was technically cheating, too -- in the process of working out these differences we'd obviously discuss how we'd gone through the "real" assignments, share results and so on -- but since we all had different design targets (tl;dr same transformer designs but with different target parameters, so you couldn't just copy your colleague's work) it wasn't really a big deal.

                      With no incentive to cheat and nothing to get ahead of other than the limits of our own knowledge and engineering abilities, we often found ourselves doing things we normally wouldn't do for our regular assignments. We couldn't try things out in a lab, so if we doubted our analytical results for some particular configuration, we'd compare them against general EM field numerical simulations. If we didn't have a good simulation package for what we were after, we'd try to work out different analytical solutions for related quantities and see if the results were similar.

                      We ended up learning a lot more than we did from the "real" assignments, mostly because our priorities were different. With a real assignment, your main objective was inevitably to get a high grade, and keeping the TAs and the prof happy was as critical as tracking the decimal point.

                      Whereas with our "social" assignments, our main objectives were 1. to learn new things and 2. to get something that looked like a workable design that was an improvement over the "real" one in some aspect of our choice (better efficiency, reduced size, less coolant, whatever). If you "cheated" your way through it, #1 was obviously not happening and you were never really sure of #2, so no one was motivated to do it.

                      I think this is what we're eventually going to converge towards in other spaces, too: CTFs organised in smaller circles, with fewer external metrics and motivators, and an emphasis on cooperation, shifting the "competition" towards external factors than competition among teams/team members.

                      When CTF scores matter because they could potentially get you ahead in the race for an internship, every twenty-year-old will eventually give in to cheating -- if only because it's the only way to stay in the race with people who do it because it's the only way they *can* do it. But if you take out the cheese, it's not much of a rat race anymore.

                      I'm old enough to have seen this happen to hackathons to some degree. At first, after hackathons had grown into their "competitive" form from their "let's hack shit together" roots, everyone was super enthusiastic and people of every age jumped in. After a while, when prep became intensive enough that the only way to a prize was to implement 90% of what you meant to do beforehand (e.g. in a library) and then show up on the day of the hackathon and just piece the frontend together, everyone who was in it primarily for the thrill of focused building noped out.

                      Did that stop hackathons? Not at all, it just "split" things into:

                      - Corporate-funded hackathons which almost no one attends after they finish school -- where people rarely produce anything of value, and it's fine, because everyone understands that's not what they're there for. The "cheese" wasn't explicitly removed here; it's just that at some point almost everyone recognised it's unattainable, and the number of hoops you have to jump through to attain it just isn't worth it when you're programming professionally.
                      - "Real" hackathons, where people get together to work on a real project together, and the only competition is maybe the how-much-wasabi-you-can-eat-without-crying competition when everyone goes out for sushi the next day.

                      • echedellelr@soc.masfloss.net
                        #18

                        @lina most CTFs include 6-7 challenges to be solved in 4 hours.

                        Those CTFs expect you to know a typical set of forensic tools, managed by an external guy/gal/entity who is somewhat known to be able to do it in time.

                        It stops being fun when you stop learning by doing and it starts being a "kill 'em all" competition.

                        • abacabadabacaba@infosec.exchange
                          #19

                          @lina Programming competitions are banning LLMs, see e.g. https://info.atcoder.jp/entry/llm-rules-en. How are CTFs any different?

                          • lina@vt.social
                            #20

                            @echedellelr The ones I've worked on are less about "forensic tooling" and more about diverse (reverse) engineering challenges. They also usually run for a couple of days with ~16 chals.

                            It levels the playing field, because pre-prepared tooling doesn't help you as much when the challenges tend to be quite novel. I much prefer those to "write a ROP chain and exploit this service" or "crack this password" (another hard design rule is that no challenge may require an inordinate amount of compute: no more than 1 hr of CPU time on a contemporary PC). There's usually one or two more typical infosec ones, but they aren't the majority.

                            One example is a CrackMe challenge that was written in Verilog (implementing a custom CPU to run the actual crackme binary).

                            • lina@vt.socialL lina@vt.social

                              I might still do a monthly challenge or something in the future so people who want to have fun and learn can have fun and learn. That's still okay.

                              But CTFs as discrete competitions with winners are dead.

                              A CTF competition is basically gameified homework.

                              LLMs broke the game. Now all that's left is self study.

                              coldclimate@hachyderm.ioC This user is from outside of this forum
                              coldclimate@hachyderm.io
                              wrote last edited by
                              #21

                              @lina thank you for this excellent thread

                              • lina@vt.socialL lina@vt.social

                                @echedellelr The ones I've worked on are less about "forensic tooling" and more about diverse (reverse) engineering challenges. They also usually run for a couple of days with ~16 challenges.

                                It evens out the playing field, because pre-prepared tooling doesn't help you as much when the challenges tend to be quite novel. I much prefer those to "write a ROP chain and exploit this service" or "crack this password" (a hard level-design rule is that no challenge may require an inordinate amount of compute: no more than 1 hr of CPU time on a contemporary PC). There's usually one or two more typical infosec ones, but they aren't the majority.

                                One example is a CrackMe challenge that was written in Verilog (implementing a custom CPU to run the actual crackme binary).

                                echedellelr@soc.masfloss.netE This user is from outside of this forum
                                echedellelr@soc.masfloss.net
                                wrote last edited by
                                #22

                                @lina at least the ones run by the National Police or other national agencies here are like that, and those are the typical ones I see.

                                Prolly a cultural thing that varies by country.

                                • echedellelr@soc.masfloss.netE echedellelr@soc.masfloss.net

                                  @lina at least the ones run by the National Police or other national agencies here are like that, and those are the typical ones I see.

                                  Prolly a cultural thing that varies by country.

                                  lina@vt.socialL This user is from outside of this forum
                                  lina@vt.social
                                  wrote last edited by
                                  #23

                                  @echedellelr CTFs run by organizations focusing on infosec and offensive capability would necessarily lean that way. That's not the world I'm interested in. There are many CTFs not associated with such organizations with different themes.

                                  • lina@vt.socialL lina@vt.social

                                    There's a lot of discourse on Twitter about people using LLMs to solve CTF challenges. I used to write CTF challenges in a past life, so I threw a couple of my hardest ones at it.

                                    We're screwed.

                                    At least with text-file style challenges ("source code provided" etc.), Claude Opus solves them quickly. For the "simpler" of the two, it just very quickly ran through the steps to solve it. For the more "ridiculous" challenge, it took a long while, and as I type this it's still burning tokens "verifying" the flag even though it very obviously found it and knows it (the flag is in leetspeak, and it identified both that fact and that the candidate is plausible). LLMs are, indeed, still completely unintelligent, because no human would waste time verifying a flag and second-guessing themselves when it is very obviously correct. (Also, you could just run it...)
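
                                    (To illustrate how trivial the check it was agonizing over is: a leetspeak candidate can be sanity-checked mechanically in a few lines. This is just a sketch in Python; the substitution table and the example flag are my own assumptions, not from the actual challenge.)

                                    ```python
                                    # Minimal sketch: undo common leetspeak substitutions to sanity-check
                                    # a candidate flag. Substitution table and flag format are assumptions.
                                    LEET = str.maketrans({"4": "a", "3": "e", "1": "l", "0": "o", "5": "s", "7": "t"})

                                    def deleet(candidate: str) -> str:
                                        """Map common leetspeak digits back to letters."""
                                        return candidate.translate(LEET)

                                    print(deleet("fl4g{l33t_5p34k}"))  # flag{leet_speak}
                                    ```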

                                    But that doesn't matter, because it found it.

                                    The thing is, CTF challenges aren't about inventing the next great invention or having a rare spark of genius. CTF challenges are about learning things by doing. You're supposed to enjoy the process. The whole point of a well-designed CTF challenge is that anyone, given enough time and effort and self-improvement and learning, can solve it. The goal isn't actually to get the flag, otherwise you'd just ask another team for the flag (which is against the rules of course). The goal is to get the flag by yourself. If you ask an LLM to get the flag for you, you aren't doing that.

                                    (Continued)

                                    nightwolf@defcon.socialN This user is from outside of this forum
                                    nightwolf@defcon.social
                                    wrote last edited by
                                    #24

                                    @lina I view CTFs mostly as a way to learn, and I think that still exists. If you LLM the whole thing, you just hamper your own ability to learn. Competition-wise, the jeopardy format has some problems. I think it may be interesting to see whether there's a shift to more Attack & Defense, King of the Hill, or other structures where LLMs would still help but one-shot single solutions aren't necessarily the best possible approach.

                                    • abacabadabacaba@infosec.exchangeA abacabadabacaba@infosec.exchange

                                      @lina Programming competitions are banning LLMs, see e.g. https://info.atcoder.jp/entry/llm-rules-en. How are CTFs any different?

                                      lina@vt.socialL This user is from outside of this forum
                                      lina@vt.social
                                      wrote last edited by
                                      #25

                                      @abacabadabacaba It's much easier to parallel construct a CTF solution than a programming challenge. CTF challenges are all about having a series of realizations that lead to the answer.

                                      If you ban LLMs in a programming challenge, you could conceivably detect signs of LLM usage in the program in various ways (not perfectly, but you could try). A CTF challenge just has one output, the flag. Everyone finds the same flag. There is no way to tell how you did it. You'd have to introduce invasive monitoring like online tests, and even if you record people's screens, they could easily be running an LLM on another machine to have it come up with the "key points" to the solution which you just implement. You can't prove that someone didn't have some ideas on their own.

                                      • lina@vt.socialL lina@vt.social

                                        @echedellelr CTFs run by organizations focusing on infosec and offensive capability would necessarily lean that way. That's not the world I'm interested in. There are many CTFs not associated with such organizations with different themes.

                                        echedellelr@soc.masfloss.netE This user is from outside of this forum
                                        echedellelr@soc.masfloss.net
                                        wrote last edited by
                                        #26

                                        @lina sorry, I replied because you were generalising and that wasn't my experience here.

                                        • lina@vt.socialL lina@vt.social

                                          This is, quite frankly, the same problem LLM agents are causing in software engineering and such, just way worse. Because with CTFs, there is no "quality metric". Once you get the flag you get the flag. It doesn't matter if your approach was ridiculous or you completely misunderstood the problem or "winged it" in the worst way possible or the solver is a spaghetti ball of technical debt. It doesn't matter if Claude made a dozen reasoning errors in its chain that no human would (which it did). Every time it gets it wrong it just tries again, and it can try again orders of magnitude faster than a human, so it doesn't matter.

                                          I don't have a solution for this. You can't ban LLMs, people will use them regardless. You could try interviewing teams one on one after the challenge to see if they actually have a coherent story and clearly did the work, but even then you could conceivably cheat using an LLM and then wait it out a bit to make the time spent plausible, study the reasoning chain, and convince someone that you did the work. It's like LLMs in academics, but much worse due to the time constraints and explicitly competitive nature of CTFs.

                                          LLMs broke CTFs.

                                          yalter@mastodon.onlineY This user is from outside of this forum
                                          yalter@mastodon.online
                                          wrote last edited by
                                          #27

                                          @lina perhaps having separate categories for LLMs allowed vs. banned would help with 90% of this problem? So ppl who want to use LLMs can do so at their pleasure, and only ppl who actively want to cheat (hopefully very few) will try to use LLMs in the banned category.
