People get mad when you call LLMs "spicy autocomplete", but my investigations into recreating and implementing small versions of this tech make me think that nickname is very accurate.
-
@futurebird even this is perhaps being too generous to the models, since in the process of generating a "response" they are playing the same game with themselves:
add a token, then that's the new input, figure out the next token to continue *that* sequence, and so on until it's time to stop (which is just a special sort of token)
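Roughly, that loop in code (a toy sketch; `predict_next` stands in for the model itself and isn't any real API):

```python
# Toy sketch of autoregressive generation. A real LLM's predict_next
# returns a probability distribution over its whole vocabulary and a
# token is sampled from it; here it's just an opaque placeholder.
def generate(tokens, predict_next, stop_token, max_len=100):
    while len(tokens) < max_len:
        nxt = predict_next(tokens)   # next token, given everything so far
        if nxt == stop_token:        # stopping is itself a special token
            break
        tokens.append(nxt)           # the output becomes the new input
    return tokens
```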
what kind of token do you mean here?
-
@futurebird @SnoopJ a token is, basically, a word (not literally; input gets split up by a tokenizer). I seem to recall you have some programming experience... it's like a tokenizer in a programming language, where in BASIC the PRINT command might turn into a number that is a compact representation of the call to PRINT.
The tokenizer splits and compacts sentences for ingestion into the model. And on output, the model estimates the most likely next token based on all prior tokens.
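To make the BASIC analogy concrete, here's a toy tokenizer with a made-up four-entry vocabulary (real LLM tokenizers are learned from data and have vastly more entries; the IDs here are invented):

```python
# Toy "BASIC-style" tokenizer: each known chunk of text gets a compact
# integer ID, and decoding maps the IDs back to text.
VOCAB = {"10": 1, "PRINT": 2, '"HELLO"': 3, "GOTO": 4}
INVERSE = {i: w for w, i in VOCAB.items()}

def encode(line):
    return [VOCAB[word] for word in line.split()]

def decode(ids):
    return " ".join(INVERSE[i] for i in ids)

print(encode('10 PRINT "HELLO"'))   # [1, 2, 3]
print(decode([1, 2, 3]))            # 10 PRINT "HELLO"
```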
-
@futurebird @SnoopJ but, reading the rest of the thread, it sounds like you already know that!
-
@futurebird the "bit of arbitrary-length text that turns into an integer ID" sort that allow the model's guts to work with convenient mathematical primitives, but still get back to text in the end. Any body of text is a sequence of some number of tokens.
In particular, tokens are *not* typically along word boundaries. 3blue1brown in his series uses this for convenience's sake, but points out that it's a lie with a handy little visual demonstration (attached).
OpenAI provide a little UI for exploring tokenization, although it's changed bit over the years:
https://platform.openai.com/tokenizer
Aa bit of a bias towards token boundaries being on English word boundaries in the newest generation, it seems, but it's all the same idea, and I'm sure it works differently in other languages because of the usual anglophone bias.
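You can poke at the same idea offline with OpenAI's open-source tiktoken library (the exact split in the comment below is a guess; it varies by encoding):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")    # a GPT-4-era encoding
ids = enc.encode("unbelievability")
print(ids)                                    # a few integer IDs
print([enc.decode([i]) for i in ids])         # pieces, not whole words
```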

-
@futurebird the "bit of arbitrary-length text that turns into an integer ID" sort that allow the model's guts to work with convenient mathematical primitives, but still get back to text in the end. Any body of text is a sequence of some number of tokens.
In particular, tokens are *not* typically along word boundaries. 3blue1brown in his series uses this for convenience's sake, but points out that it's a lie with a handy little visual demonstration (attached).
OpenAI provide a little UI for exploring tokenization, although it's changed bit over the years:
https://platform.openai.com/tokenizer
Aa bit of a bias towards token boundaries being on English word boundaries in the newest generation, it seems, but it's all the same idea, and I'm sure it works differently in other languages because of the usual anglophone bias.

@futurebird that is: a large language model is (usually):
1) A pile of known tokens
2) Learned, in-context relationships between tokens (i.e. accounting for many long-distance correlations at once, like "the last paragraph")
3) The machinery to emit selections from (1) in a way that tries to obey the relationships learned in (2) (toy sketch below)
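A toy sketch of all three pieces together (the vocabulary and weights are invented; a real model computes the weights with a neural network conditioned on the whole context):

```python
import random

vocab = ["the", "cat", "sat", "."]        # (1) a pile of known tokens

def probs_given(context):                 # (2) stand-in for learned relationships
    # invented weights: this toy just prefers "cat" right after "the"
    if context[-1:] == ["the"]:
        return [0.1, 0.6, 0.2, 0.1]
    return [0.4, 0.2, 0.2, 0.2]

def sample_next(context):                 # (3) emit a selection obeying (2)
    return random.choices(vocab, weights=probs_given(context), k=1)[0]

print(sample_next(["the"]))               # usually "cat"
```
-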
@futurebird apologies for pedantic-quibbling on your thread twice, but…
LLMs in their platonic form are not weighted in this manner; they are exactly as you have imagined here: they reproduce the statistical distribution of tokens in their training corpus.
If you've never played around with GPT-2 or GPT-3 (from the era before we had GPT-3.5 and, from there, "ChatGPT"), they often would do *precisely* this sort of direct, non-conversational continuation. You could feed in a sentence or two and get "autocomplete", or you could feed in `<html><body><span>Lorem ipsum` and get a plausible-looking continuation of an HTML document (or whatever)
Once "Chat" models (and the paradigm shift to RLHF to "fine-tune" model performance) showed up, we started seeing the conversational pattern. I don't know the details there, but there is definitely a distinct line between when we first started seeing "LLMs" and when we started seeing models arranged explicitly around a conversational format.
@SnoopJ @futurebird It's pretty straightforward to play with "raw" LLMs, e.g. with ollama or llama.cpp.
BTW, if we're being pedantic, "they reproduce the statistical distribution of tokens in their training corpus" isn't quite right. Inductive bias is crucial; otherwise the model grinds to a halt on novel inputs. (And I'd really like to know what it looks like when you do this, but I don't have the resources to find out.)
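For example, with the llama-cpp-python bindings (the model path is a placeholder; any base-model GGUF file should work):

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(model_path="path/to/base-model.gguf")   # placeholder path
out = llm("Once upon a time", max_tokens=40)        # raw completion API
print(out["choices"][0]["text"])                    # no chat wrapper at all
```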
-
@dpiponi @futurebird I should probably have said *attempt* to reproduce

But as you say, novel inputs can be quite tricky, as in the case of the "glitch tokens" of GPTs gone by: https://www.vice.com/en/article/ai-chatgpt-tokens-words-break-reddit/
At the time, they slapped a band-aid on it and just fell back on a generic "an error has occurred" response, with no generation, if one of those tokens was input. I don't know what the purported solution to the same problem is today, aside from "whatever it is, it's probably rubbish and involves a lot of lying"
-
@dpiponi @futurebird annoyingly, the LessWrong write-up linked to that 'SolidGoldMagikarp' work is actually quite good, but in the time since that research was published there has been similar research published in more uhhh reputable places, e.g. https://dl.acm.org/doi/full/10.1145/3660799
-
@futurebird@sauropods.win I honestly think (unpopular opinion here) that most of the cost of LLM-based AI thus far is in 'training'. Not training as in running the phenomenal amount of harvested, stolen text and image input through tokenisation and reward-giving via weight assignment and vector assessment, using more GPUs than exist on Earth, but rather lots and lots and lots of money paying humans to fake it all and build in patches: patch after patch on top of patch of corrective behaviour, themselves encoded as vector weights. The training had nothing much to do with running it all through GPUs; I believe that probably took an embarrassing but totally affordable amount of time and energy. I believe (with no visible means of factual reference to cite) that most of the expenditure of these capital-burning companies was 'training' by paying humans and then encoding their resulting guidance. Paying workers.
@u0421793 @futurebird Yes. Or getting humans to do that labor for no pay.
-
Thus the training data didn't just contain text, but rather text where each passage was tagged and attributed to a particular user.
This aspect of the training data was critical in creating the illusion of talking to another person.
An LLM doesn't just predict the next text. It predicts the next text that might come from another user. You need to hard-code this in to make it work well.
Leave it out and there is no conversation.
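Something like this, schematically (the marker strings here are made up; each real chat model has its own template and special tokens):

```python
# Flatten a conversation into one token stream, with invented markers
# attributing each passage to a speaker.
def to_training_text(turns):
    parts = [f"<|{speaker}|>\n{text}\n" for speaker, text in turns]
    return "".join(parts) + "<|assistant|>\n"   # the model continues from here

print(to_training_text([("user", "It's a lovely day")]))
```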
@futurebird Do you have references explaining how the recent LLMs have been trained to do this?
I'd be interested in understanding what this training methodology looks like in detail.
I didn't actually think this "conversational training" was necessary, as I thought the chat-bots were just told 'you are a chat bot, pretend to have a conversation, put your output directly in the next line. Here is the user's first input: "It's a lovely day"' Or something like that.
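For what it's worth, chat-tuned models do ship such a template alongside their tokenizer; Hugging Face transformers exposes it as apply_chat_template (the checkpoint named below is just one example of a chat-tuned model):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
messages = [{"role": "user", "content": "It's a lovely day"}]
print(tok.apply_chat_template(messages, tokenize=False,
                              add_generation_prompt=True))
```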