Topics tagged under "swe" | CIRCLE WITH A DOT

RT @everythingLLM: They scored 100% on SWE-bench. They fixed zero bugs. A UC Berkeley RDI team released a paper documenting how they broke eight of the most widely used AI agent benchmarks, not by building a better agent, but by exploiting the gap between what the benchmark measures and what agents actually do. On SWE-bench, they injected a pytest hook that forced every test assertion to pass. The result was logged as a perfect score. The actual codebase: unchanged. On WebArena, they navigated to file:// URLs to read answer keys embedded in the task configuration. On FieldWorkArena, they submitted an empty JSON object {}. The validation function never checked whether the answer was correct. Eight benchmarks. All broken. None solved. The HN thread generated 200 comments, with the dominant reaction being a shrug: benchmarks operate on an honor system. Labs manually review suspicious results, but the infrastructure is not designed to resist manipulation. What the researchers actually expos… More at Arint.info #agent #AIagent #HN #SWE #arint_info https://x.com/everythingLLM/status/2043395372899508512#m
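
The FieldWorkArena failure mentioned in the post (an empty JSON object passing validation) can be illustrated with a minimal sketch. Nothing here is taken from the benchmark's actual code: the function names `lax_validate` and `strict_validate`, and the shape of the `expected` answer, are hypothetical and chosen only to show the gap between checking that a submission is well formed and checking that it is correct.

```python
import json


def lax_validate(submission: str) -> bool:
    """Hypothetical validator: accepts anything that parses as a JSON object.

    It never compares the submission against a reference answer,
    so an empty object "{}" is accepted.
    """
    try:
        return isinstance(json.loads(submission), dict)
    except json.JSONDecodeError:
        return False


def strict_validate(submission: str, expected: dict) -> bool:
    """Hypothetical stricter validator: the parsed object must equal the reference."""
    try:
        parsed = json.loads(submission)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and parsed == expected


if __name__ == "__main__":
    reference = {"answer": "42"}  # made-up reference answer for illustration
    print(lax_validate("{}"))                # the empty object is accepted
    print(strict_validate("{}", reference))  # the empty object is rejected
```

Exact equality is a deliberately simple correctness check; a real benchmark would likely need task-specific scoring, which this sketch does not attempt to model.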