The "Silent Failure" in AI Analytics: Why LLMs Hallucinate Numbers

Team Kepler

When an AI image generator fails, the error is immediately obvious. People appear with six fingers, text renders as alien gibberish, and buildings melt into the horizon. When an AI code assistant fails, the consequences are similarly loud: the syntax is wrong, the script errors out, or the application crashes. You receive undeniable feedback that something went wrong.

But when a standard LLM analyzes a dataset, the failure is invisible. If your actual customer churn rate is 12% and the AI confidently asserts it is 8.5%, no red warning light flashes. The number is formatted correctly as a percentage. It sits in a sentence that makes grammatical sense. It feels plausible. You might place that metric in a slide deck, present it to your board, and make strategic decisions based on a complete fiction. Unlike bad code that breaks, a wrong data metric looks exactly like a right one.

The Mechanism: Prediction vs. Calculation

To understand why this "silent failure" occurs, we must look at the underlying architecture of the technology. LLMs are probabilistic engines, not deterministic calculators. They function as hyper-advanced autocomplete systems designed to predict the next likely token in a sequence. When you ask a general-purpose chatbot for the capital of France, it does not query a geography database; it calculates that the token "Paris" has the highest statistical probability of following your question. This approach works for language, where nuance is desirable, but it fails for math.

If you paste a messy dataset into a context window and ask for total revenue, a standard LLM treats the numbers as text strings rather than mathematical values. It attempts to predict the next plausible-sounding number based on the linguistic patterns in your prompt. Consequently, large numbers drift, and the model often ignores rows it cannot fit into its context window, leading to incomplete sums. The model is guessing the answer rather than calculating it.

The problem extends beyond probability; it is rooted in how models read text. LLMs do not see numbers as quantitative values. They see them as tokens, or chunks of characters. A standard tokenizer might split the number 4,021 into two distinct tokens, "4," and "021". It might treat 100 as a single token, but split 101 into two.
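You can see this fragmentation directly with an open-source tokenizer. The sketch below uses the tiktoken library and its cl100k_base encoding purely as an illustration; the exact splits vary from model to model and are not a description of any particular vocabulary.

# A minimal sketch of number fragmentation using the open-source tiktoken
# library. The specific splits depend on the tokenizer; the point is that
# digits are chunked as text, not read as quantities.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["100", "101", "4,021"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r} -> {len(token_ids)} token(s): {pieces}")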

This fragmentation destroys the mathematical relationship between digits. When the model attempts to perform arithmetic on these fragmented tokens, it is effectively trying to do algebra with letters. It loses the concept of place value. By offloading the math to a Python sandbox, we bypass the tokenizer entirely. We pass the raw integers to the CPU, which treats numbers as numbers, not linguistic puzzle pieces.

The Solution: Code Execution

The only way to solve this accuracy problem is to stop treating analysis as a language task. You cannot "prompt" your way out of a math error. At Kepler, we shift the LLM's role from calculation to orchestration.

We wanted to provide non-technical users with immediate answers from their data. However, asking the LLM to perform arithmetic directly resulted in hallucinations. So, we introduced a Python sandbox. When a user uploads a file, the LLM stops acting as a calculator and starts acting as an analyst. It functions as a reasoning engine that interprets the user's intent (such as finding the trend in customer acquisition cost) and writes a Python script to extract that answer. The sandbox then executes this code deterministically.
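In simplified form, the pattern looks something like the sketch below. The generate_analysis_code helper and the exec-based sandbox are illustrative assumptions, not Kepler's internals; a production sandbox would run generated code in an isolated environment rather than in-process.

# A minimal sketch of the "LLM writes code, sandbox runs code" pattern.
# generate_analysis_code() stands in for a call to any LLM API; the sandbox
# here is just exec() with a small namespace, which is NOT a hardened
# sandbox -- a real system would isolate execution.
import pandas as pd


def generate_analysis_code(question: str, columns: list[str]) -> str:
    """Placeholder for the LLM call that turns intent into a pandas script."""
    # In reality the model would receive the question plus the schema and
    # return Python source. Hardcoded here so the sketch is runnable.
    return "result = df['revenue'].sum()"


def run_in_sandbox(code: str, df: pd.DataFrame):
    """Execute generated code against the dataframe and return `result`."""
    namespace = {"df": df, "pd": pd}
    exec(code, namespace)          # deterministic execution, not prediction
    return namespace["result"]


df = pd.DataFrame({"revenue": [1200, 850, 430, 2975]})
code = generate_analysis_code("What is total revenue?", list(df.columns))
print(run_in_sandbox(code, df))   # 5455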

This architecture converts silent failures into loud ones. If the AI writes bad code, the Python sandbox throws an error. Kepler detects that syntax error, self-corrects, and rewrites the script until it executes. Because Python is deterministic, the command sum(revenue) will always equal the actual sum of the column.

Even with Python, errors happen. A column name might be misspelled, or a library might be deprecated. In a standard script, an error ends the session. In Kepler, an error begins the investigation.

We wanted the agent to be resilient, but code generation is inherently fragile. So we built a feedback loop directly into the sandbox. When the agent generates code that returns a KeyError, the system captures the traceback, feeds it back into the context window, and prompts the model to diagnose the specific failure. Then the agent adjusts the column reference and re-runs the script. This loop continues until the execution is successful. The user rarely sees the three failed attempts; they simply see the final, verified answer.
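A stripped-down version of that loop might look like the sketch below. The regenerate_code helper is a stand-in for re-prompting the model with the captured traceback, and the retry limit and names are assumptions, not Kepler's actual implementation.

# A rough sketch of the traceback feedback loop. Here the "model" simply
# fixes a misspelled column name so the sketch runs end to end.
import traceback

import pandas as pd

MAX_ATTEMPTS = 3


def regenerate_code(previous_code: str, error_text: str) -> str:
    """Placeholder for re-prompting the LLM with the failing code and traceback."""
    return previous_code.replace("Revnue", "Revenue")


def run_with_self_correction(code: str, df: pd.DataFrame):
    for attempt in range(MAX_ATTEMPTS):
        namespace = {"df": df}
        try:
            exec(code, namespace)
            return namespace["result"]            # success: return verified value
        except Exception:
            error_text = traceback.format_exc()   # capture the full traceback
            code = regenerate_code(code, error_text)  # feed it back to the model
    raise RuntimeError("Could not produce working code within the retry budget")


df = pd.DataFrame({"Revenue": [100, 250, 400]})
bad_code = "result = df['Revnue'].sum()"          # first attempt raises a KeyError
print(run_with_self_correction(bad_code, df))     # 750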

Guarding Against the Lazy Coder

Shifting to code execution introduces a new, subtler risk: the "Lazy Coder". Occasionally, an LLM will try to bypass the calculation to save computational effort. Instead of writing a script to sum the revenue column, it might simply hallucinate a number (e.g., "5000") and write a script that says print("5000"). The code runs without error, but the result is still a hallucination.

We solve this by inspecting the code before execution. The system parses the script to ensure it actually references the dataframe and performs a transformation. If the agent attempts to "hardcode" an answer without deriving it from the source file, Kepler rejects the script and forces a rewrite. The agent must show its work in the code, not just in the output.
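One simple way to implement a check like this is to parse the generated script and confirm it actually reads from the dataframe before it is allowed to run. The rule below is a deliberately minimal stand-in for whatever validation Kepler performs.

# A minimal sketch of a pre-execution guard against hardcoded answers: the
# script must reference the dataframe variable at all, or it is rejected.
import ast


def references_dataframe(code: str, df_name: str = "df") -> bool:
    """Return True if the script actually reads from the dataframe."""
    tree = ast.parse(code)
    return any(
        isinstance(node, ast.Name) and node.id == df_name
        for node in ast.walk(tree)
    )


lazy_script = "print('5000')"                    # hallucinated, hardcoded answer
honest_script = "result = df['revenue'].sum()"   # derived from the source file

print(references_dataframe(lazy_script))    # False -> reject, force a rewrite
print(references_dataframe(honest_script))  # True  -> safe to execute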

Why RAG Can’t Fix Math

When engineers hear about context limits, their reflex is often to implement retrieval-augmented generation (RAG). They assume that chunking the data and storing it in a vector database will solve the memory problem. But RAG is a search mechanism, not a calculation engine. It is designed to find specific needles in a haystack, whereas data analysis requires weighing the entire haystack.

If you ask a RAG-based system for "total revenue", it retrieves the top-k most relevant chunks of text containing the word "revenue". It does not retrieve every single row in the database. Consequently, the LLM sums only the rows it retrieved, ignoring the thousands it left behind. The answer is mathematically accurate for the subset but factually wrong for the business. True analysis requires iterating over the entire dataset, a task that belongs to code loops, not vector similarity searches.
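A toy example makes the gap concrete. The "retriever" below simply returns the top five matching rows, which is not how vector search works internally, but the failure mode is identical: only the retrieved subset gets summed.

# Illustrative only: a stand-in retriever that returns the top-k rows,
# compared against the true total over the full table.
rows = [{"description": f"revenue for order {i}", "amount": 100} for i in range(1000)]

TOP_K = 5
retrieved = [r for r in rows if "revenue" in r["description"]][:TOP_K]

rag_style_answer = sum(r["amount"] for r in retrieved)   # 500 -- subset only
true_total = sum(r["amount"] for r in rows)              # 100000 -- full table

print(rag_style_answer, true_total)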

The Necessity of a Skepticism Layer

A standard LLM operates like a yes-man; it assumes your data is perfect because its goal is to please the user. A human data analyst operates like a skeptic; they assume the data is messy until proven clean. We designed Kepler to mimic the skeptic. Before the agent answers a single question, it profiles the dataset to identify the traps that lead to silent failures: mixed date formats, hidden null values, or duplicate transaction IDs.
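A profiling pass of that kind can be sketched in a few lines of pandas; the specific checks and column names below are illustrative assumptions rather than Kepler's actual profiler.

# A minimal profiling sketch: count hidden nulls, flag duplicate IDs, and
# surface columns whose dtype suggests dates or numbers stored as text.
import pandas as pd

df = pd.DataFrame({
    "transaction_id": [1, 2, 2, 3],
    "date": ["2024-01-05", "05/01/2024", "2024-01-07", None],
    "revenue": [120.0, None, 90.0, 45.0],
})

profile = {
    "null_counts": df.isna().sum().to_dict(),                       # hidden nulls
    "duplicate_ids": int(df["transaction_id"].duplicated().sum()),  # repeated IDs
    "dtypes": df.dtypes.astype(str).to_dict(),                      # dates stored as text?
}
print(profile)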

We often see users upload files where a "Revenue" column inadvertently contains string values like "USD" or "error". A text-prediction model would likely hallucinate a sum or crash. Our agent detects the anomaly, writes code to strip the non-numeric characters, casts the column to a float type, and reports the exclusion to the user. The agent must understand the shape of the data before it attempts to measure it.
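In pandas terms, that cleanup step might look roughly like this; the regex, column name, and reporting format are assumptions made for the sketch.

# Strip non-numeric characters, coerce anything unparseable to NaN, and
# report how many rows were excluded from the calculation.
import pandas as pd

df = pd.DataFrame({"Revenue": ["1200", "850 USD", "error", "430.50"]})

cleaned = df["Revenue"].str.replace(r"[^0-9.\-]", "", regex=True)
numeric = pd.to_numeric(cleaned, errors="coerce")

excluded = int(numeric.isna().sum())
print(f"Excluded {excluded} non-numeric row(s); total = {numeric.sum()}")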

Transparency as a Requirement

Blind trust is a liability in analytics. If a human analyst handed you a surprising report, you would ask to see their spreadsheet or SQL query. AI must meet the same standard.

This is why we expose "logic logs", the actual steps and code snippets the agent took to arrive at an answer. You need to be able to verify the process. You should see that the agent loaded the CSV, identified that the "Revenue" column contained "Null" values, filtered those anomalies out, and then ran the mean calculation on the remaining clean data. Without visible logic, a metric is unverifiable; if a metric is unverifiable, it is useless for business strategy.

We do not want users to blindly trust Kepler. We want them to trust the data. By replacing black-box prediction with transparent code execution, we remove the "Silent Failure". This allows non-technical teams to stop waiting for data tickets and start investigating their business with the rigor of an engineer, but the speed of a conversation.

Ready to analyze your data?

Start asking questions about your CSV, Excel, or Google Sheets data in seconds.