Top-P

The Top-P parameter is like setting a budget for how adventurous you want the AI to be with its word choices. It selects the next token from the smallest possible set of tokens whose cumulative probability meets or exceeds Top-P. Values range from 0.0 (deterministic) to 1.0 (more random), and the set always contains at least one token.

  • Set a higher Top-P (closer to 1) to give the AI freedom to choose less obvious, more diverse words. A very high Top-P can lead to unexpected and sometimes irrelevant results.
  • Set a lower Top-P (closer to 0) to tell the AI to stick to more likely, predictable words.
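
In practice, Top-P is usually exposed as a single generation parameter. As a concrete illustration, here is a minimal sketch using the Hugging Face transformers library; the model name "gpt2", the prompt, and the five-token limit are illustrative choices only:

    # Minimal sketch: enable nucleus (Top-P) sampling during generation.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("the quick brown fox", return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,    # sample instead of always taking the top token
        top_p=0.9,         # keep the smallest token set whose cumulative probability >= 0.9
        max_new_tokens=5,
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))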

How It Works

To illustrate Top-P sampling, consider the task of generating the next word following "the quick brown fox" in a sentence, using a model trained on a broad corpus of English text. Imagine the model provides the following probabilities for the next word:

jumps: 0.4
runs: 0.3
walks: 0.2
eats: 0.05
sleeps: 0.05

With Top-P sampling, instead of selecting a fixed number of likely words, we set a probability threshold, p, that determines how many words are considered based on their cumulative probability. If we set p=0.9, we include the smallest set of most probable words whose cumulative probability meets or exceeds 0.9.
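
To make the selection rule concrete, here is a small Python sketch that builds this set (often called the "nucleus") for the example distribution above; the probabilities are the illustrative values from this section, not real model output:

    # Keep the smallest set of words whose cumulative probability
    # meets or exceeds the threshold p, scanning from most probable down.
    probs = {"jumps": 0.4, "runs": 0.3, "walks": 0.2, "eats": 0.05, "sleeps": 0.05}

    def nucleus(probs, p):
        kept, cumulative = {}, 0.0
        for word, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
            kept[word] = prob            # the most probable word is always included
            cumulative += prob
            if cumulative >= p - 1e-12:  # small tolerance for floating-point rounding
                break
        return kept

    print(nucleus(probs, 0.9))  # {'jumps': 0.4, 'runs': 0.3, 'walks': 0.2}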

In this example, to reach or exceed a cumulative probability of 0.9, we include jumps, runs, and walks, since their combined probabilities (0.4 + 0.3 + 0.2) sum to 0.9. Unlike Top-K, where the number of choices is fixed, Top-P dynamically adjusts the number of choices based on the desired cumulative probability threshold.
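
Reusing the nucleus helper from the sketch above, the same threshold logic produces differently sized word sets as p changes:

    # Unlike a fixed Top-K cutoff, the nucleus grows and shrinks with p.
    print(nucleus(probs, 0.5))  # {'jumps': 0.4, 'runs': 0.3}  -> 2 words
    print(nucleus(probs, 0.9))  # jumps, runs, and walks       -> 3 words
    print(nucleus(probs, 1.0))  # the full vocabulary          -> 5 words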

These words are then used to create a new probability distribution that sums to 1, maintaining the relative likelihood of each word but narrowing the choices to those most aligned with the model’s predictions and the cumulative probability threshold (0.4/0.9 ≈ 0.44 for jumps, 0.3/0.9 ≈ 0.33 for runs, and 0.2/0.9 ≈ 0.22 for walks). This results in a distribution where jumps is still the most likely next word, but runs and walks remain viable options, allowing for varied yet coherent text generation based on a random draw from this adjusted distribution.
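
Continuing the example, renormalizing the nucleus and drawing from it could look like the following sketch, which uses only the Python standard library (the values are the three words kept at p=0.9 above):

    import random

    # The nucleus for p = 0.9, before renormalization.
    kept = {"jumps": 0.4, "runs": 0.3, "walks": 0.2}

    # Renormalize so the kept probabilities sum to 1 while preserving
    # relative likelihoods: 0.4/0.9, 0.3/0.9, and 0.2/0.9.
    total = sum(kept.values())
    adjusted = {word: prob / total for word, prob in kept.items()}
    print(adjusted)  # jumps ~0.444, runs ~0.333, walks ~0.222

    # Draw the next word from the adjusted distribution.
    next_word = random.choices(list(adjusted), weights=list(adjusted.values()))[0]
    print(next_word)  # most often "jumps", but "runs" and "walks" remain possible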

Top-P sampling effectively balances diversity and relevance in the model’s output, reducing the risk of including highly improbable words while still allowing for creative and contextually appropriate text generation.