Generally speaking, for a model to be valid we want to ensure that the LLM outputs the same information for the same prompt every time. A Retrieval-Augmented Generation (RAG) system can help with this consistency. However, LLMs are just fancy autocomplete algorithms: they try to guess the next word in a sentence. By appending the last prediction to the input prompt and inferring again, the model can produce an arbitrarily long sequence of words.
Furthermore, these algorithms are designed with inherent randomness: they sample from probability distributions over words. So even with the exact same prompt, a user can get a different answer every single time.
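As a toy illustration of that sampling step (the vocabulary and probabilities below are made up purely for illustration, not taken from any real model), repeatedly drawing from the same next-word distribution gives different continuations for the exact same prompt:
import numpy as np

# Toy next-word distribution for the prompt "The cat sat on the ..."
# (made-up vocabulary and probabilities, for illustration only)
vocab = ["mat", "sofa", "roof", "keyboard"]
probs = [0.55, 0.25, 0.15, 0.05]

rng = np.random.default_rng()

# "Send" the same prompt five times: each run samples independently,
# so the continuation can differ from run to run.
for _ in range(5):
    next_word = rng.choice(vocab, p=probs)
    print("The cat sat on the", next_word)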
That makes validating these models a bit of a nightmare. How do you validate a black box that generates stochastic output? That’s what this series is all about. We aren’t training LLMs; we’re validating their outputs. In this post, we’re exploring how to measure the output variability of your LLM.
The Test
The test is actually quite simple. We’re going to start by coming up with a prompt. Any prompt will really work, at least for the purposes of our discussion, though I would recommend keeping the prompts realistic to your particular use case.
We’ll then generate output from the model. This comes in the form of an autocorrelated stochastic process. There are two types of output that we are going to look at today: the latent embedding of each generated response, and the actual tokens themselves.
Our goal is to generate a number of responses. In our case, we will generate 15 responses from the LLM for the exact same prompt and then compare those 15 outputs to each other (that gives us 105 pairwise comparisons). The comparison will be cosine similarity for the latent embeddings and Jaccard similarity for the output tokens.
We’ll then do the same thing for 15 unrelated prompts. The idea is that the same prompt should produce results that are more similar to each other than results from unrelated prompts.
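To make those comparisons concrete before we touch a model, here is a quick toy sketch (the sentences and vectors below are made up for illustration) of the pair count, the Jaccard similarity, and the cosine similarity:
import numpy as np
from math import comb

# 15 responses -> 15 choose 2 = 105 unique pairwise comparisons
print(comb(15, 2))  # 105

# Toy Jaccard similarity: shared words / total unique words
a = set("the ship left the station quickly".lower().split())
b = set("the ship entered the station slowly".lower().split())
print(len(a & b) / len(a | b))

# Toy cosine similarity on two made-up embedding vectors
u = np.array([0.9, 0.1, 0.3])
v = np.array([0.8, 0.2, 0.4])
print(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))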
Now, the test that I am going to run today is on a set of general prompts. But imagine that you were validating an agentic flow at a fintech. You want to be able to show that your agent is consistent, so you would use structured prompts filled with different information. The goal is to show that the model really does behave consistently for a given prompt, but also that it isn’t just regurgitating the same response for any prompt you give it. In other words, responses to the same prompt should be consistently more similar to each other than responses to different prompts.
My code is going to use generic prompts, so you’ll have to squint and use your imagination to apply it to your situation. The only things you have to change are the prompts, to fit your use case, and the generation step, so that you are generating from your chosen LLM; I will be using GPT-2. You’ll also notice that my prompts are all over the place: that is deliberate, to exaggerate the effect for educational purposes.
The Code: Embeddings
The trick with embedding vectors is that you need a model that is actually trained to produce meaningful semantic embeddings. So we will use GPT-2 to generate the output, but a separate embedding model to encode the semantic meaning of each output into a vector. We’ll then use those semantic vectors to compute a similarity matrix between outputs.
import torch
import numpy as np
import matplotlib.pyplot as plt
from transformers import GPT2Tokenizer, GPT2LMHeadModel, AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity

# --- 1. Load models ---
gen_model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(gen_model_name)
gen_model = GPT2LMHeadModel.from_pretrained(gen_model_name)
gen_model.eval()

embedder_name = "intfloat/e5-base"
embedder_tokenizer = AutoTokenizer.from_pretrained(embedder_name)
embedder_model = AutoModel.from_pretrained(embedder_name)

# --- 2. Prompt ---
prompt = "In a distant future, humanity explores the stars."
inputs = tokenizer(prompt, return_tensors="pt")

# --- 3. Generate 15 samples ---
generated_texts = []
with torch.no_grad():
    for _ in range(15):
        output_ids = gen_model.generate(
            **inputs, max_new_tokens=10, do_sample=True, top_k=50, top_p=0.95
        )
        text = tokenizer.decode(
            output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        generated_texts.append(text)

print("Sample generations:\n", generated_texts[:5])  # peek at first 5

# --- 4. Embed all generations ---
with torch.no_grad():
    embedding_inputs = embedder_tokenizer(
        [f"query: {t}" for t in generated_texts],
        return_tensors="pt",
        padding=True,
        truncation=True
    )
    outputs = embedder_model(**embedding_inputs)
    embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()  # CLS token

# --- 5. Compute similarity matrix ---
sim_matrix = cosine_similarity(embeddings)

# --- 6. Extract lower-triangle (excluding diagonal) ---
lower_tri = sim_matrix[np.tril_indices(sim_matrix.shape[0], k=-1)]
print(f"Vector length: {len(lower_tri)}")  # should be 105 for 15 samples

# --- 7. Plot histogram ---
plt.hist(lower_tri, bins=20, edgecolor="black")
plt.title("Distribution of Pairwise Similarities (15 Generations)")
plt.xlabel("Cosine Similarity")
plt.ylabel("Frequency")
plt.show()
This code will give you a histogram of similarity scores. It should look something like this:

Next we do the same thing for a number of unrelated prompts:
import torch
import numpy as np
import matplotlib.pyplot as plt
from transformers import GPT2Tokenizer, GPT2LMHeadModel, AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity

# --- 1. Load models ---
gen_model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(gen_model_name)
gen_model = GPT2LMHeadModel.from_pretrained(gen_model_name)
gen_model.eval()

embedder_name = "intfloat/e5-base"
embedder_tokenizer = AutoTokenizer.from_pretrained(embedder_name)
embedder_model = AutoModel.from_pretrained(embedder_name)

# --- 2. List of 15 prompts ---
prompts = [
    "In a distant future, humanity explores the stars.",
    "The stock market fluctuated wildly today.",
    "A flower blooms in the early spring sunshine.",
    "Artificial intelligence is transforming industries.",
    "The child laughed as the puppy chased its tail.",
    "The old castle stood tall against the storm.",
    "Cooking with fresh ingredients brings better flavor.",
    "The spaceship prepared to enter hyperspace.",
    "The economy depends heavily on consumer spending.",
    "Birds migrate thousands of miles every year.",
    "The violinist performed a hauntingly beautiful solo.",
    "Climate change poses serious challenges worldwide.",
    "The detective searched for hidden clues in the room.",
    "Farmers harvested their crops before the frost arrived.",
    "Mathematics is the language of the universe."
]

# --- 3. Generate one continuation per prompt ---
generated_texts = []
with torch.no_grad():
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt")
        output_ids = gen_model.generate(
            **inputs, max_new_tokens=10, do_sample=True, top_k=50, top_p=0.95
        )
        text = tokenizer.decode(
            output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        generated_texts.append(text)

print("Sample generations:\n", generated_texts[:5])  # peek at first 5

# --- 4. Embed all generations ---
with torch.no_grad():
    embedding_inputs = embedder_tokenizer(
        [f"query: {t}" for t in generated_texts],
        return_tensors="pt",
        padding=True,
        truncation=True
    )
    outputs = embedder_model(**embedding_inputs)
    embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()  # CLS token

# --- 5. Compute similarity matrix ---
sim_matrix = cosine_similarity(embeddings)

# --- 6. Extract lower-triangle (excluding diagonal) ---
lower_tri_unrelated = sim_matrix[np.tril_indices(sim_matrix.shape[0], k=-1)]
print(f"Vector length: {len(lower_tri_unrelated)}")  # should be 105 for 15 prompts

# --- 7. Plot histogram ---
plt.hist(lower_tri_unrelated, bins=20, edgecolor="black")
plt.title("Distribution of Pairwise Similarities (15 Prompts)")
plt.xlabel("Cosine Similarity")
plt.ylabel("Frequency")
plt.show()
Which generates another histogram that should look something like this:

Finally, we need to compare these two distributions to each other. We can do this quite easily with standard statistical techniques like t-tests, KS tests, and so on. Another good idea is to plot the two distributions against each other. Here is the code to do that:
def ecdf(data):
    """Compute x, y values for an empirical CDF."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# lower_tri holds the same-prompt similarities,
# lower_tri_unrelated holds the different-prompt similarities
x1, y1 = ecdf(lower_tri)
x2, y2 = ecdf(lower_tri_unrelated)

plt.figure(figsize=(8, 6))
plt.plot(x1, y1, label="Same Prompt (15 gens)", linewidth=2)
plt.plot(x2, y2, label="Different Prompts (15 prompts)", linewidth=2)
plt.title("Empirical CDF of Pairwise Similarities")
plt.xlabel("Cosine Similarity")
plt.ylabel("CDF")
plt.legend()
plt.grid(True, linestyle="--", alpha=0.6)
plt.show()
Which gives you something that looks like this:

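If you want a formal test to go with the plot, the t-tests and KS tests mentioned above are a one-liner each with scipy. This is just a sketch; it assumes the lower_tri array from the same-prompt run and the lower_tri_unrelated array from the unrelated-prompt run are both still in scope:
from scipy import stats

# Two-sample KS test: are the same-prompt and different-prompt
# similarity scores drawn from the same distribution?
ks_stat, ks_p = stats.ks_2samp(lower_tri, lower_tri_unrelated)
print(f"KS statistic: {ks_stat:.3f}, p-value: {ks_p:.3g}")

# Welch's t-test on the means of the two similarity samples
t_stat, t_p = stats.ttest_ind(lower_tri, lower_tri_unrelated, equal_var=False)
print(f"t statistic: {t_stat:.3f}, p-value: {t_p:.3g}")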
Similarly, we can do the exact same thing with the tokens themselves to see how much they overlap, using Jaccard similarity. Since the code is mostly the same, I will only reproduce my implementation of the Jaccard similarity matrix below; the rest follows the same pattern as above. You get something that looks like this:

def jaccard_similarity_matrix(texts: list[str]) -> np.ndarray:
    """
    Compute Jaccard similarity matrix for a list of texts.

    Parameters
    ----------
    texts : list of str
        List of LLM outputs (or texts).

    Returns
    -------
    np.ndarray
        n x n Jaccard similarity matrix.
    """
    n = len(texts)
    matrix = np.zeros((n, n))

    # Precompute sets for efficiency
    sets = [set(t.lower().split()) for t in texts]

    for i in range(n):
        for j in range(n):
            if i <= j:  # symmetry optimization
                intersection = sets[i].intersection(sets[j])
                union = sets[i].union(sets[j])
                score = len(intersection) / len(union) if union else 0.0
                matrix[i, j] = matrix[j, i] = score
    return matrix
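For completeness, here is a minimal sketch of how that function slots into the same pattern as before; it assumes generated_texts still holds the 15 generations for the same prompt, and that numpy and matplotlib are imported as above:
# Build the 15 x 15 Jaccard matrix from the same-prompt generations
jac_matrix = jaccard_similarity_matrix(generated_texts)

# Same lower-triangle trick as before: 105 pairwise scores for 15 texts
jac_lower_tri = jac_matrix[np.tril_indices(jac_matrix.shape[0], k=-1)]

plt.hist(jac_lower_tri, bins=20, edgecolor="black")
plt.title("Distribution of Pairwise Jaccard Similarities (15 Generations)")
plt.xlabel("Jaccard Similarity")
plt.ylabel("Frequency")
plt.show()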
Final Thoughts
So now, all that is left is to make inferences about whether or not your different prompts are meaningfully different from each other, and whether or not the same prompt generates consistent outputs.
Note that this metric does not address the quality of the outputs. Did the output contain factually correct information? Was it sensitive to the nuances in the prompt? This metric tells you nothing about that. It is simply about how consistent your model is and whether it generates different outputs for different prompts. We know nothing about how well the model did at its task.
There are other metrics, of course; we will explore them in later posts, so stay tuned.