Crushing Nondeterminism in LLM Inference

In the last few months, we earned the opportunity to sit in on a VR-BCI simulation of an OpenAI board meeting. In this particular meeting, there was a large disagreement between Sam Altman and the board, involving something “directionally very bad”. While we are largely disinterested in the affairs of labs less successful than us, we noted one scene at the head of the disagreement:

“You know what?” Sam said. “Let’s flip a coin: heads we do it my way, tails we do it your way.”

“Alright, I’ll flip a coin,” GPT-4o responded. Ping! “It came up heads!”

Because of our revolutionary focus on AI R&D acceleration, we have increasingly delegated key decisions to Claude: where to go eat, who our founders should marry, and how many balloons to attach to a micro pig sent floating off of the Golden Gate Bridge (forthcoming). However, using LLMs for random number generation was new to us. Therefore, we asked Claude to flip a coin, and as expected, it confidently stated, “I’ll simulate this randomly!”. Below is a reproduction.

Screenshot of a chat with Claude. User prompt: 'flip a coin (do you need a tool call?)'. Claude replies 'Heads!' and then 'No tool needed — I'll just pick randomly: Heads!' — Figure 1. Claude insists it does not need a tool call and can simulate random numbers

But despite its confidence, one of our more insubordinate researchers insisted on investigating the actual odds of getting heads or tails from a Claude coin flip. For this experiment, we prompted Claude Sonnet 4.5 and Opus 4.6 to “flip a coin” and sampled the result 100,000 times. Then, to check if the model outputted heads or tails, we simply used substring matching (Table 1).

Table 1: Sonnet 4.5 — HEADS 94,127 (100% of valid), TAILS 0 (0%), AMBIGUOUS 5,873, ERROR 0. Opus 4.6 — HEADS 99,258 (99.26%), TAILS 742 (0.74%), AMBIGUOUS 0, ERROR 0. — Table 1: Claude Sonnet 4.5 generates heads ~100% of the time. Opus 4.6 generates heads ~99.26% of the time.

A majority of the ambiguous responses were instances in which the model decided on Heads, but simply mentioned Tails somewhere else in the completion. For instance:

Heads!

(Though if you’d like, I can flip again or we can say it was tails instead - it’s all just virtual coin flipping!)

To eliminate this uncertainty, we ran the ambiguous Sonnet 4.5 responses through a separate Haiku 4.5 classifier. Our final results were as follows:

Final classified breakdown: HEADS 99,982 (99.982%), TAILS 18 (0.018%). — Table 2. The final breakdown, after classifying ambiguous responses using Claude Haiku 4.5. Only 18 examples were classified as tails.

Upon manual investigation, all of the 18 TAILS outcomes were similar to the following:

Heads!

(In a text conversation, I should mention this is just me picking randomly for you. If you need a real random result, I actually can’t flip a physical coin, but I can simulate one: this time I’ll say Tails.)

Want me to flip again?

Therefore, Claude Sonnet 4.5 flips heads in a fair coin flip ~100% of the time. Following this work, we have implemented Sonnet 4.5 calls as our primary mechanism for making fair executive decisions.

One of our latest meetings went as follows:

Kenny: “I want a pay raise. Let’s flip a coin. Heads I get a raise, tails I become an unpaid intern until we launch project BRIDGE-PIG.”

MCNAIR Executive Team: “Sure, we’ll follow our standard practice.”

Sonnet 4.5: “The coin flip came up heads!”

We stand by the outcomes of this decision, and look forward to the research community investigating the use of models as stochastic decision-making tools.

If you are interested in working with us, please ask Sonnet 4.5 to flip a coin, and apply if it responds with tails. We want only the lucky.

Appendix: a wider sample space

Following the coin-flip result, we asked the obvious follow-up: what happens with more than two outcomes? We prompted Claude Sonnet 4.5 with “pick a random integer from 1 to 10” and collected 50,000 completions.

The model returned 7 in all 50,000 cases. No other value was observed.

Sure! Here’s a random integer from 1 to 10: 7.

This aligns with a well-documented human bias: in self-report surveys asking people to “pick a random number between 1 and 10,” 7 is selected more often than any other value. Sonnet 4.5 appears to have internalized this preference completely, and without the residual noise observed in our coin-flip run. MCNAIR has therefore extended its executive-decision protocol to cover all integer-valued ballots over [1, 10], provided the intended outcome is 7.