People get mad when you call LLMs "spicy autocomplete" but my investigations into recreating and implementing small versions of this tech make me think that nick name is very accurate.

Uncategorized · 69 Posts · 39 Posters · 63 Views
snoopj@hachyderm.io:

    @futurebird even this is perhaps being too generous to the models, since in the process of generating a "response" they are playing the same game with themselves:

    add a token, then that's the new input, figure out the next token to continue *that* sequence, and so on until it's time to stop (which is just a special sort of token)
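That loop can be sketched in a few lines of Python. A toy, of course: `parrot` below stands in for the real network, which would score every vocabulary token against the whole sequence rather than play back a script.

```python
STOP = "<END>"

def generate(prompt_tokens, next_token, max_len=50):
    """Repeatedly extend the sequence: each new token becomes part
    of the input used to pick the token after it."""
    seq = list(prompt_tokens)
    while len(seq) < max_len:
        tok = next_token(seq)   # the model sees the WHOLE sequence so far
        if tok == STOP:         # stopping is just a special sort of token
            break
        seq.append(tok)         # the output becomes new input
    return seq

# Stub "model": echo the last token until the sequence reaches length 5.
def parrot(seq):
    return STOP if len(seq) >= 5 else seq[-1]

print(generate(["spicy", "autocomplete"], parrot))
# -> ['spicy', 'autocomplete', 'autocomplete', 'autocomplete', 'autocomplete']
```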

futurebird@sauropods.win (#60):

@SnoopJ what kind of token do you mean here?


swelljoe@mas.to (#61):

@futurebird @SnoopJ a token is, basically, a word (not literally; input gets split into tokens by a tokenizer). I seem to recall you have some programming experience... it's like a tokenizer in a programming language, where in BASIC the PRINT command might turn into a number that is a compact representation of the function call for PRINT.

The tokenizer splits and compacts sentences for ingestion into the model. And on output, the model estimates the most likely next token based on all prior tokens.
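A toy version of that BASIC-style compaction (the vocabulary here is invented for the example; real LLM tokenizers learn much larger subword vocabularies from data):

```python
# Toy word-level tokenizer: each known word is compacted into a small
# integer ID, and unknown words fall back to a reserved <UNK> ID.
VOCAB = {"<UNK>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
INVERSE = {i: w for w, i in VOCAB.items()}

def encode(text):
    return [VOCAB.get(w, VOCAB["<UNK>"]) for w in text.lower().split()]

def decode(ids):
    return " ".join(INVERSE[i] for i in ids)

ids = encode("The cat sat on the mat")
print(ids)           # [1, 2, 3, 4, 1, 5]
print(decode(ids))   # the cat sat on the mat
```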


swelljoe@mas.to (#62):

        @futurebird @SnoopJ but, reading the rest of the thread, it sounds like you already know that!


snoopj@hachyderm.io (#63):

@futurebird the "bit of arbitrary-length text that turns into an integer ID" sort, which allows the model's guts to work with convenient mathematical primitives but still get back to text in the end. Any body of text is a sequence of some number of tokens.

In particular, tokens do *not* typically fall on word boundaries. 3blue1brown in his series uses word boundaries for convenience's sake, but points out that it's a lie with a handy little visual demonstration (attached).

OpenAI provides a little UI for exploring tokenization, although it's changed a bit over the years:

https://platform.openai.com/tokenizer

There seems to be a bit of a bias towards token boundaries landing on English word boundaries in the newest generation, but it's all the same idea, and I'm sure it works differently in other languages because of the usual anglophone bias.
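A minimal way to see non-word token boundaries is a greedy longest-match tokenizer over an invented subword vocabulary. Real tokenizers use learned BPE merges rather than this hand-picked list, but the boundary behaviour is the same idea:

```python
# Greedy longest-match subword tokenizer: token boundaries fall
# wherever the vocabulary dictates, not on word boundaries.
# (Vocabulary invented for the example.)
SUBWORDS = ["token", "iz", "ation", "auto", "comp", "lete", "s"]

def tokenize(text):
    out, i = [], 0
    while i < len(text):
        # longest known subword starting at position i,
        # falling back to a single character if none matches
        match = max((s for s in SUBWORDS if text.startswith(s, i)),
                    key=len, default=text[i])
        out.append(match)
        i += len(match)
    return out

print(tokenize("tokenization"))   # ['token', 'iz', 'ation']
print(tokenize("autocompletes"))  # ['auto', 'comp', 'lete', 's']
```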

snoopj@hachyderm.io (#64):

            @futurebird that is: a large language model is (usually):

            1) A pile of known tokens
            2) Learned relationships between tokens, in-context (i.e. accounting for long-distance correlations (like "the last paragraph") and many of them)
            3) The machinery to emit selections from (1) in a way that tries to obey the relationships learned from (2)
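A bigram model is about the smallest thing with all three of those parts, with the huge caveat that part (2) here only looks one token back, whereas the whole point of modern LLMs is conditioning on long-range context:

```python
import random
from collections import defaultdict, Counter

corpus = "the cat sat on the mat and the cat slept".split()

# (1) the pile of known tokens
vocab = set(corpus)

# (2) learned relationships: counts of what follows what
#     (only one token of context, unlike a real LLM)
follows = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    follows[a][b] += 1

# (3) machinery that emits tokens from (1) obeying the
#     relationships learned in (2)
def continue_text(start, n=5, rng=random.Random(0)):
    seq = [start]
    for _ in range(n):
        options = follows[seq[-1]]
        if not options:          # dead end: nothing ever followed this token
            break
        seq.append(rng.choices(list(options),
                               weights=list(options.values()))[0])
    return seq

print(continue_text("the"))
```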

snoopj@hachyderm.io:

              @futurebird apologies for pedantic-quibbling on your thread twice, but…

              LLMs in their platonic form are not weighted in this manner, they are exactly as you have imagined here: they reproduce the statistical distribution of tokens in their training corpus.

If you've never played around with GPT-2 or GPT-3 (from the era before we had GPT-3.5 and from there "ChatGPT"), they often would do *precisely* this sort of direct, non-conversational continuation. You could feed in a sentence or two and get "autocomplete", or you could feed in `<html><body><span>Lorem ipsum` and get a plausible-looking continuation of an HTML document (or whatever).

              Once "Chat" models (and the paradigm shift to RLHF to "fine-tune" model performance) showed up, we started seeing the conversational pattern. I don't know the details there, but there is definitely a distinct line between when we first started seeing "LLMs" and when we started seeing models arranged explicitly around a conversational format.

dpiponi@mathstodon.xyz (#65):

@SnoopJ @futurebird It's pretty straightforward to play with "raw" LLMs, e.g. with ollama or llama.cpp.

              BTW If we're being pedantic "they reproduce the statistical distribution of tokens in their training corpus" isn't quite right. Inductive bias is crucial otherwise the model grinds to a halt on novel inputs. (And I'd really like to know what it looks like when you do this but I don't have the resources to find out.)


snoopj@hachyderm.io (#66):

                @dpiponi @futurebird I should probably have said *attempt* to reproduce 🙂

                But as you say, novel inputs can be quite tricky, as in the case of the "glitch tokens" of GPTs gone by: https://www.vice.com/en/article/ai-chatgpt-tokens-words-break-reddit/

                At the time, they slapped a band-aid on and just fell back onto a generic "an error has occurred" response and no generation if one of those tokens was input. I don't know what the purported solution is to the same problem today, aside from "whatever it is, it's probably rubbish and involves a lot of lying"


snoopj@hachyderm.io (#67):

                  @dpiponi @futurebird annoyingly, the LessWrong write-up linked to that 'SolidGoldMagikarp' work is actually quite good, but in the time since that research was published there has been similar research published in more uhhh reputable places, e.g. https://dl.acm.org/doi/full/10.1145/3660799

u0421793@toot.pikopublish.ing:

@futurebird@sauropods.win I honestly think (unpopular opinion here) that most of the cost of LLM-based AI thus far is in 'training'. Not training as in running the phenomenal amount of harvested stolen text and image input through tokenisation processes and reward-giving through weight assignment and vector assessment, using more GPUs than exist on Earth, but rather lots and lots of money paying humans to fake it all and build in patches – patch after patch on top of patch of corrective behaviour, themselves encoded as vector weights. The training had little to do with running it all through GPUs; I believe that probably took an embarrassing but totally affordable amount of time and energy. I believe (with no visible means of factual reference to cite) that most of the expenditure of these capital-burning companies was 'training' by paying humans and then encoding their resulting guidance. Paying workers.

lauerhahn@sfba.social (#68):

                    @u0421793 @futurebird Yes. Or getting humans to do that labor for no pay.

futurebird@sauropods.win:

                      Thus the training data didn't just contain text, but rather text where each passage is tagged and attributed to a particular user.

                      This aspect of the training data was critical in creating the illusion of talking to another person.

                      An LLM doesn't just predict the next text. It predicts the next text that might come from another user. You need to hard code this in to make it work well.

                      Leave it out and there is no conversation.
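The attribution idea is easiest to see in the prompt formats chat models use. A hedged sketch: the tag strings below are invented for illustration, and every model family defines its own special tokens, but the shape is the same. Tagging each passage with a speaker turns "predict the next text" into "predict the next text *from the assistant*":

```python
# Render a conversation into a single tagged string for the model.
# The <|...|> markers are made-up stand-ins for a model's real
# special tokens; the final assistant tag cues the model to answer.
def render(messages):
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}\n<|end|>")
    parts.append("<|assistant|>\n")
    return "\n".join(parts)

prompt = render([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "It's a lovely day."},
])
print(prompt)
```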

poleguy@mastodon.social (#69):

                      @futurebird Do you have references explaining how the recent LLMs have been trained to do this?

                      I'd be interested in understanding what this training methodology looks like in detail.

I didn't actually think this "conversational training" was necessary; I thought the chat-bots were just told something like: 'You are a chat bot. Pretend to have a conversation and put your output directly on the next line. Here is the user's first input: "It's a lovely day."'
