We knew, but the proof is nice.

davidaugust@mastodon.online

@drifthood yes, there does seem to be a threshold that, in some respects, only humans cross.

I see that sort of begging in a dog. He wants the treat, so instead of just doing the desired behavior the human command is asking for, he tries every response that has ever gotten him a treat until he “unlocks” the treat. Humans can and do do this too from time to time, but humans _also_ actually communicate and understand from time to time as well.

davidaugust@mastodon.online

@joriki it’s from August.

audioflyer79@mstdn.social

@davidaugust Ecosia AI gets it right. It looks like the paper referenced was published in 2025, so the research was conducted before that. The models are all much better now. I’m no AI apologist, but I think any argument of “AI sucks because it’s not good at _____” is on tenuous ground and will be proven wrong as the models continue to improve. @Ecosia

alisynthesis@io.waxandleather.com

@audioflyer79 @davidaugust I mean, it's worth noting that the LLMs have ingested that paper by now. : /

audioflyer79@mstdn.social

@alisynthesis @davidaugust fair enough. I changed up the problem completely and added some reasoning and it did pretty well. It appears to be generating code to solve the math. The only thing it missed is that very unripe bananas are green, not yellow.

James picks 40 apples on Monday. Then he picks 35 lemons on Tuesday. On Wednesday, he picks half as many bananas as he did apples, but five of them were very unripe. How many yellow fruits does James have?
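For reference, the arithmetic in that puzzle can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `count_yellow_fruit` is made up here, apples are assumed not to count as yellow, lemons and ripe bananas are assumed yellow, and very unripe bananas are assumed green (as noted in the post).

```python
# Worked arithmetic for the fruit puzzle above.
# Assumptions (not stated in the puzzle itself): apples are not yellow,
# lemons are yellow, ripe bananas are yellow, very unripe bananas are green.

def count_yellow_fruit(apples: int, lemons: int, unripe_bananas: int) -> int:
    """Return the number of yellow fruits under the assumptions above."""
    bananas = apples // 2                      # half as many bananas as apples
    yellow_bananas = bananas - unripe_bananas  # very unripe ones are green
    return lemons + yellow_bananas             # apples contribute nothing


print(count_yellow_fruit(apples=40, lemons=35, unripe_bananas=5))  # 50
```

Under those assumptions the count is 35 lemons plus 15 ripe bananas, so 50. Counting the unripe bananas as yellow would give 55 instead.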

pikesley@mastodon.me.uk

@davidaugust

Amo Bishop Rodent (@pikesley@mastodon.me.uk)

"We made the computers, the notoriously accurate calculating machines, worse at arithmetic. This is surely progress along the path to creating Computer God"

mastodon.me.uk (mastodon.me.uk)

ozzelot@mstdn.social

@lemgandi
The wetness of water has been hotly debated, as to some wet means "covered with or soaked in water", and it's questioned whether water is covered with itself.
@davidaugust

bouriquet@mastodon.social

@Karen5Lund Maybe because people stopped writing efficient code about 20 years ago?

pascal_le_merrer@mastodon.social

@davidaugust AGI is coming soon 🤭

flq@freiburg.social

@davidaugust interesting. Had to ask. Already fixed?

glitzersachen@hachyderm.io

@davidaugust @scottjenson @xdydx

True. See @xdydx 's reply.

elithebearded@fed.qaz.red

@davidaugust

Shortcut to paper: https://arxiv.org/pdf/2410.05229

morten_skaaning@mastodon.gamedev.place

@audioflyer79 @alisynthesis @davidaugust how does it do if you swap the colors of the fruit?

davidaugust@mastodon.online

@pascal_le_merrer any day now. I hear potus say in two weeks.

davidaugust@mastodon.online

@flq yes, many systems have tools and/or abilities built in to take over basic math operations that simpler LLMs failed at.

The salient and enduring issue, I think, is that the spin and marketing of LLMs as "understanding," "thinking" or "intelligent" (as those words' typical meanings suggest) remains largely fictional.

joriki@infosec.exchange

@davidaugust

October 2024

Apple Engineers Show How Flimsy AI ‘Reasoning’ Can Be

The new frontier in large language models is the ability to “reason” their way through problems. New research from Apple says it's not quite what it's cracked up to be.

WIRED (www.wired.com)

bladecoder@androiddev.social

@drifthood @davidaugust This makes me think of "Clever Hans", the horse that appeared to do arithmetic but actually just responded to involuntary human cues:
https://en.wikipedia.org/wiki/Clever_Hans

erwinrossen@mas.to

@davidaugust Of course an LLM cannot do math, but to be honest, that is also not what they're designed for. An LLM these days like Claude knows that it should take a calculator and type the equation in there, instead of hallucinating an answer. Complaining that an LLM can't do math is like complaining a screwdriver can't drill a hole.

You can counter that there are plenty of people who are using the screwdriver to drill the hole, but that is not on the tool, that is on the user.

erwinrossen@mas.to

@davidaugust When did they do this test? I tried it with the following LLMs: Sonnet 4.6, Codex 5.3, GPT-5.4, GPT-5-Mini and Kimi-K2.5. They all answer the kiwi question correctly.

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
