yodon 4 days ago

This looks super valuable!

That said, it's concerning to see the reported probability for getting a 4 on a die roll is 65%.

Hopefully OpenAI isn't that biased at generating die rolls, so is that number actually giving us information about the accuracy of the probability assessments?

  • teej 4 days ago

    Fair dice rolls are not an objective that cloud LLMs are optimized for. You should assume that LLMs cannot perform this task.

    This is a problem when people naively use "give an answer on a scale of 1-10" in their prompts. LLMs are biased towards particular numbers (like humans!) and cannot linearly map an answer to a scale.

    It's extremely concerning when teams do this in a context like medicine. Asking an LLM "how severe is this condition" on a numeric scale is fraudulent and dangerous.

    • low_tech_love 4 days ago

      This week I was in a meeting for a rather important scientific project at the university, and I asked the other participants “can we somehow reliably cluster this data to try to detect groups of similar outcomes?” to which a colleague promptly responded “oh yeah, chatGPT can do that easily”.

      • stanislavb 4 days ago

        I guess, he's right - it will be easy and relatively accurate. Relatively/seemingly.

        • low_tech_love 4 days ago

          So that’s it then? We replace every well-understood, objective algorithm with well-hidden, fake, superficial surrogate answers from an AI?

          • yorwba 4 days ago

            "cluster this data to try to detect groups of similar outcomes" is typically a fairly subjective task. If the objective algorithm optimizes for an objective criterion that doesn't match the subjective criteria that will be used to evaluate it, that objectivity is just as superficial.

            • low_tech_love 3 days ago

              I’m not sure I follow. Every clustering algorithm that’s not an LLM prompt has a well-known, specified mathematical/computational functioning; no matter how complex, there's a perfectly concrete structure behind it, and whether you agree or not with its results doesn’t change anything about them.

              The results of an LLM are an arbitrary approximation of what a human would expect to see as the results of a query. In other words, it correlates very well with human expectations and is very good at fooling you into believing it. But can it provide you with results that you disagree with?

              And more importantly, can you trust these results scientifically?

              • yorwba 3 days ago

                If you use k-means to cluster your data into 100 clusters, it will do so, irrespective of whether it is meaningful to do so. Perfectly objective, but what does that objectivity buy you? If your pet theory is that there are 100 groups, you'll be actually less likely to get results that disagree with that than if you ask an LLM how many groups there are.

                But the real question is not whether you agree with the results, but whether they're useful. If you apply an objective method to data it is unsuitable for, it's garbage in, objective garbage out. Whether the method is suitable is not always something you can decide a priori; sometimes you need to check.

                And if trying it out shows that LLM-provided clusters are more useful than other methods, you should swallow your pride and accept that, even if you disagree on philosophical grounds. (Or it might show that the LLM has no idea what it's doing! Then you can feel good about yourself.)
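
                Both halves of this exchange can be seen in a toy run. The sketch below is a plain 1-D Lloyd's k-means written for illustration (not a production implementation; the data, `k=5`, and the seeds are arbitrary assumptions). Asked for five clusters in uniform noise, it returns five centroids, and with a fixed seed it returns the same ones every time.

                ```python
                import random

                def kmeans_1d(points, k, iterations=20, seed=0):
                    """Plain Lloyd's algorithm on 1-D data; returns (centroids, labels)."""
                    rng = random.Random(seed)
                    centroids = rng.sample(points, k)  # start from k of the data points
                    labels = [0] * len(points)
                    for _ in range(iterations):
                        # Assignment step: each point joins its nearest centroid.
                        labels = [min(range(k), key=lambda c: abs(p - centroids[c])) for p in points]
                        # Update step: move each centroid to the mean of its members.
                        for c in range(k):
                            members = [p for p, label in zip(points, labels) if label == c]
                            if members:  # keep the old centroid if a cluster empties out
                                centroids[c] = sum(members) / len(members)
                    return centroids, labels

                data_rng = random.Random(42)
                noise = [data_rng.random() for _ in range(200)]  # uniform noise, no groups
                centroids, labels = kmeans_1d(noise, k=5)
                print(sorted(round(c, 2) for c in centroids))
                ```

                Whether those five centroids mean anything is a separate check that the algorithm itself does not perform.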

                • low_tech_love 6 hours ago

                  This is a very interesting conversation. It correlates well with the responses I got from the colleague during the meeting. Would you ask ChatGPT to do a t-test for you and blindly accept its results as well, regardless of whether the math behind it was sound or not? The reason we use math and statistics in experimental research is that we want objective results, not simply results that correlate with our expectations (those we can get from watching YouTube or reading blogs). The objectivity of k-means buys me the trust that whatever clusters I get have been obtained with a well-known and well-understood method, on which my expectations have absolutely no influence. Also, I know that the next person will get similar results, which also gives me trust in their results. So we can all have a shared, independent, objective understanding of a piece of data.

                  I wonder, if well-educated and technically-literate people like him and you are willing to accept arbitrary results from a language model as a replacement for objective math, then what should we expect from the general public?

    • Terr_ 4 days ago

      It'll also give you different results based on logically-irrelevant numbers that might appear elsewhere in the collaborative fiction document.

  • dragonwriter 4 days ago

    > That said, it's concerning to see the reported probability for getting a 4 on a die roll is 65%.

    Finding that an LLM is biased toward inventing die rolls that are the median result rounded to an available result by the most common rounding method is...not particularly surprising. If you want a fair RNG, use an RNG designed to be fair, not an LLM where that would be, at best, an emergent accidental property.

  • ngrislain 4 days ago

    Thank you! The number comes from the sum of the logprobs of the tokens constituting each individual value (exponentiated to get a probability). So it does represent the likelihood of seeing this value. So yes, OpenAI is super-biased as a random number generator. We sampled other values from OpenAI and got other die roll values, but with much lower probs (5 has an 8% chance).

    • ngrislain 4 days ago

      More precisely it represents the likelihood of seeing this value conditional on the tokens before it.
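
      A minimal sketch of that arithmetic, assuming natural-log token logprobs. The function name and the numbers are made up for illustration; this is not code from the library.

      ```python
      import math

      def value_probability(token_logprobs):
          """Probability of a multi-token value: exp of the summed token logprobs.

          Each logprob is log P(token | all earlier tokens), so the sum is the log
          of the chain-rule product for the whole value, conditional on the prefix.
          """
          return math.exp(sum(token_logprobs))

      # Hypothetical value split into two tokens with P=0.9 and P=0.5.
      two_token_value = value_probability([math.log(0.9), math.log(0.5)])
      print(round(two_token_value, 3))  # 0.45
      ```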

      • elcritch 4 days ago

        Even without other tokens before it the LLM is probably showing the probability of dice rolls based on its training data. I’d guess humans tend to prefer “3” or “4” as it’s nearer the avg/median and feels fairer.

        AFAICT, the LLMs aren’t creating new mental mappings of “dice are symmetric and should give equal probability to land on any side followed by using that info to infer they should use an RNG.”

      • radarsat1 4 days ago

        And I guess it includes possibilities other than numbers, like 'f', which could lead to four or five. There's probably a separate probability for 'fi' and 'fo' too.

  • mmcwilliams 4 days ago

    What about the models they offer would make you think that it wouldn't be biased at generating random die rolls?

    • low_tech_love 4 days ago

      I think the problem is that for every person who actually understands that ChatGPT should not be used for objective things like a die roll, there are 10 or 20 who would say “well, it looks ok, and it’s fast, convenient, and it passes nicely for an answer”. People are pushing the boundaries and waiting for the backlash, but the backlash never actually comes… so they keep pushing.

      Think about this: suppose you’re reading a scientific paper and the author writes “I did a study with 52 participants, and here are the answers”. Would there be any reason to believe that data is real?

      • mmcwilliams 3 days ago

        I agree that the fundamental problem is a misunderstanding about what transformer models produce and how, but people not getting bitten until far down the road is a responsibility that service providers need to address, not everyone else.

        I'm not sure I follow your hypothetical. The author making the claim in a public paper can be contacted for the data. It can be verified. Auditing the internals of an LLM, especially a closed one, is not the same.

  • supernewton 4 days ago

    I feel like https://xkcd.com/221/ might be heavily influencing what the typical "random" die roll looks like on the internet ;)

    • prerok 4 days ago

      Based on this comic, I've seen unit tests use 4 as a replacement for a randomly generated number to ensure non-flakiness (of course, only when needed). But it might explain the LLM's bias?

    • ngrislain 4 days ago

      Haha, I didn't know that one! It's consistent with OpenAI's conception of a "random" dice roll :-D. Joking aside, I'm quite convinced many people would not find 1 or 6 to look "random" enough to be chosen as an example dice roll.

  • dotancohen 3 days ago

    Like most prejudices exhibited by LLMs, the reported probability for getting a 4 on a die roll is due to biases in the training data. Notably, a popular highly-cited comic hard-coded 4 as the return value of a pseudo-RNG based on a dice roll. I suspect that this influenced the LLM's choice.

    https://xkcd.com/221/

lyu07282 4 days ago

I was under the impression that log probabilities don't work like that / they aren't really useful to be interpreted as probabilities?

https://news.ycombinator.com/item?id=42684629

> the logits aren't telling you anything like 'what is the probability in a random sample of Internet text of the next token', but are closer to a Bellman value function, expressing the model's belief as to what would be the net reward from picking each possible BPE as an 'action' and then continuing to pick the optimal BPE after that (ie. following its policy until the episode terminates). Because there is usually 1 best action, it tries to put the largest value on that action, and assign very small values to the rest (no matter how plausible each of them might be if you were looking at random Internet text)
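
For reference, a small sketch of what a logprob is mechanically: the log-softmax of the logits. Exponentiating always yields a normalized distribution over tokens; whether that distribution is calibrated against any corpus is the separate question raised above. The logits here are invented to show the "one large value, small values on the rest" shape.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

logits = [4.0, 1.0, 0.5, 0.0]  # made-up values, one dominant "action"
logprobs = log_softmax(logits)
probs = [math.exp(lp) for lp in logprobs]
print([round(p, 3) for p in probs])
```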

  • ngrislain 4 days ago

    Yes, it is true that the model has undergone SFT, RLHF, and other alignment procedures, and hence the logprobs do not reflect the probability of the next token as in the pre-training corpus. Nevertheless, in concrete applications such as our main internal use case, structured data extraction from PDF documents, it proved very valuable. When the value was obviously well extracted, the logprob was high, and when the information was super hard or impossible to find, the model would output - or hallucinate - some value with a much lower logprob.

  • gardnr 4 days ago

    Perplexity, a metric often used to evaluate LLMs, is the exponential of the negative average logprob of the tokens in a test set. Lower perplexity indicates that the model assigns higher probabilities to the observed tokens, reflecting better language modeling.
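
    As a quick illustration, assuming natural-log token logprobs (the helper name and the data are made up): a model that spreads probability evenly over six outcomes has perplexity 6.

    ```python
    import math

    def perplexity(token_logprobs):
        """exp of the negative mean natural-log probability per token."""
        return math.exp(-sum(token_logprobs) / len(token_logprobs))

    # Ten observed tokens, each assigned probability 1/6 by a hypothetical model.
    uniform_die = [math.log(1 / 6)] * 10
    print(round(perplexity(uniform_die), 6))  # 6.0
    ```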

HanClinto 4 days ago

This is really brilliant stuff! Somehow I didn't realize that logprobs were being returned as part of the OAI requests, and I really like this application of it.

Any interest in seeing this sort of thing being added to llama.cpp?

  • HanClinto 4 days ago

    Looking at llama.cpp, it already supports the logprob field in its OAI API emulation, so it shouldn't be too difficult to use this library with it.

    It feels like this would be useful enough to build around -- I especially like the idea of asking the API to return the top K results for each field, and denoting their likelihood -- almost like a dropdown box with percentages attached for each possible result.
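
    A rough sketch of that dropdown idea for a single-token field. It assumes each alternative arrives as a dict with `token` and `logprob` keys, which mirrors the shape of `top_logprobs` entries in OpenAI-style chat completion responses; the 65% and 8% figures come from this thread, the 20% one is invented.

    ```python
    import math

    def dropdown(top_logprobs):
        """Turn one token position's alternatives into (token, percent) rows, best first."""
        rows = [(alt["token"], 100 * math.exp(alt["logprob"])) for alt in top_logprobs]
        return sorted(rows, key=lambda row: row[1], reverse=True)

    # Hypothetical alternatives for the die-roll field.
    alternatives = [
        {"token": "5", "logprob": math.log(0.08)},
        {"token": "4", "logprob": math.log(0.65)},
        {"token": "3", "logprob": math.log(0.20)},
    ]
    for token, percent in dropdown(alternatives):
        print(f"{token}: {percent:.0f}%")
    ```

    Multi-token values would need the alternatives combined across positions, which this sketch does not attempt.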

    • DrPhish 4 days ago

      I believe mikupad[0] supports showing logprobs from a llama.cpp backend.

      [0]:https://github.com/lmg-anon/mikupad

      • HanClinto 3 days ago

        Thank you for this link -- I had not seen this before. That is an absolutely gorgeous and intuitive interface!

juxtaposicion 4 days ago

This looks great; very useful for (for example) ranking outputs by confidence so you can do human reviews of the not-confident ones.
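
Something like the following is what I have in mind. It is only a sketch: the record shape (`id`, `field_logprobs`) and the 0.9 threshold are assumptions for illustration, not the library's output format.

```python
import math

def review_queue(records, min_prob=0.9):
    """Return ids of records whose least-certain field is below min_prob, least certain first."""
    scored = []
    for record in records:
        worst = min(math.exp(lp) for lp in record["field_logprobs"].values())
        if worst < min_prob:
            scored.append((worst, record["id"]))
    return [record_id for _, record_id in sorted(scored)]

# Hypothetical extraction results with per-field summed logprobs.
records = [
    {"id": "invoice-1", "field_logprobs": {"total": -0.01, "date": -0.02}},
    {"id": "invoice-2", "field_logprobs": {"total": -1.2, "date": -0.05}},
    {"id": "invoice-3", "field_logprobs": {"total": -0.4, "date": -0.01}},
]
print(review_queue(records))
```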

Any chance we can get Pydantic support?

  • themanmaran 4 days ago

    Fyi logprobs !== confidence.

    If you run "bananas,fishbowl,phonebook," and get {"sponge": 0.76}

    It doesn't mean that "sponge" was the 76% correct answer. Just that the word "sponge" was the most likely next word for the model to generate.

  • ngrislain 4 days ago

    Actually, OpenAI provides Pydantic support for structured output (see client.beta.chat.completions.parse in https://platform.openai.com/docs/guides/structured-outputs).

    The library is compatible with that but does not use Pydantic further than that.

    • juxtaposicion 4 days ago

      Right, the hope was to go further. E.g. if the input is:

      ```
      from typing import Literal

      from pydantic import BaseModel

      class Classification(BaseModel):
          color: Literal['red', 'blue', 'green']
      ```

      then the output type would be:

      ```
      from typing import Dict, Literal

      from pydantic import BaseModel

      class ClassificationWithLogProbs(BaseModel):
          color: Dict[Literal['red', 'blue', 'green'], float]
      ```

      Don't take this too literally; I'm not convinced that this is the right way to do it. But it would provide structure and scores without dealing with a mess of complex JSON.

      • lyu07282 4 days ago

        but this ultimately just converts to json schema, or the openai function calling definition format.

        One question I always had was what about the descriptions you can attach to the class and attributes ( = Field(description=...) in pydantic)? Is the model made aware of those descriptions?

Der_Einzige 4 days ago

BTW - Structured/Constrained Generation is the KEY to making AI agents better/scary good. Without it, you're leaving so much on the table. This library is awesome for augmenting that capability!!!!

Also, if you're "studying LLM based chess" and you don't use dynamic grammars to enforce that models can only make "valid" moves at each time step, your research is basically invalid.

And don't meme me with claims that structured/constrained generation harms creativity. The devs of outlines debunked that FUD already: https://blog.dottxt.co/say-what-you-mean.html

Similarly, if you think that RLHF/DPO or Lora or any of that harms creativity, you're really outing yourself as not having played with high temperature sampling.
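
To make the chess point concrete, here is a toy sketch of the masking step that grammar-constrained decoding relies on. It is heavily simplified: real systems constrain subword tokens with a grammar, while this treats each whole move as one "token", and the logits and move list are invented.

```python
import math

def constrained_distribution(logits, allowed):
    """Softmax over only the allowed tokens; every other token is dropped (probability 0)."""
    kept = {tok: logit for tok, logit in logits.items() if tok in allowed}
    if not kept:
        raise ValueError("grammar allows no token present in the vocabulary")
    m = max(kept.values())
    z = sum(math.exp(v - m) for v in kept.values())
    return {tok: math.exp(v - m) / z for tok, v in kept.items()}

# The unconstrained model prefers "Ke9", which is not even a square on the board.
logits = {"e4": 2.0, "Nf3": 1.5, "Ke9": 3.0, "d4": 1.0}
legal_moves = {"e4", "Nf3", "d4"}  # in practice recomputed from the board every turn
probs = constrained_distribution(logits, legal_moves)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

The "dynamic" part is that `legal_moves` changes at every time step, so the mask has to be rebuilt before each move is sampled.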

  • ngrislain 4 days ago

    Thank you! Yes indeed, structured output was instrumental in reliably extracting structured data from images from a client.

kelsolaar 4 days ago

I briefly took a look at the code. What is the reason to use Lark and not Python's native JSON parser? Is it to handle cases where the structured output is not JSON compatible?

  • ngrislain 4 days ago

    We need to build a syntax tree and be able to map each value (number, boolean, string) to a range of characters and then to the GPT tokens (for which OpenAI produces logprobs). This is the reason we use Lark.

potatoman22 4 days ago

How does the token usage compare to vanilla structured output? Many of these libraries do multiple requests to constrain output and measure logprobs.

  • ngrislain 4 days ago

    Same token usage. Actually OpenAI returns the logprob of each token conditional on the previous ones with the option logprobs=true. This lib simply parses the output json string with `lark` into an AST with value nodes. The value nodes are mapped back to a range of characters in the json string. Then the characters are mapped back to the GPT tokens overlapping the character ranges and the logprobs of the tokens are summed.
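
    The mapping described above can be sketched without Lark. This is not the library's code: the value's character range is located with `str.index` instead of a parsed syntax tree, and the tokenization and logprobs are made up for illustration.

    ```python
    import json
    import math

    def token_spans(tokens):
        """Return (start, end) character offsets of each token in the joined string."""
        spans, pos = [], 0
        for tok in tokens:
            spans.append((pos, pos + len(tok)))
            pos += len(tok)
        return spans

    def value_logprob(tokens, logprobs, start, end):
        """Sum the logprobs of every token overlapping characters [start, end)."""
        total = 0.0
        for (t_start, t_end), lp in zip(token_spans(tokens), logprobs):
            if t_start < end and t_end > start:
                total += lp
        return total

    # Hypothetical tokenization of '{"roll": 4}' with invented logprobs.
    tokens = ['{"', 'roll', '":', ' ', '4', '}']
    logprobs = [-0.01, -0.02, -0.01, -0.05, math.log(0.65), -0.01]
    text = "".join(tokens)
    assert json.loads(text) == {"roll": 4}

    start = text.index("4")  # stand-in for the position a parser would report
    prob = math.exp(value_logprob(tokens, logprobs, start, start + 1))
    print(round(prob, 2))  # 0.65
    ```

    A token that straddles a value boundary contributes its whole logprob here, which is one reason the result is best read as an approximation.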

    • potatoman22 4 days ago

      That's great to hear, thanks for the explanation! Super excited to try this out.