Hidden Limits of LLMs: when they don't work

December 19, 2025

There is so much excitement and hype about generative AI and the large language models (LLMs) that power it. The massive bets being placed on it assume that LLMs will drive breakthroughs in productivity and innovation; however, from a first-principles view, their architecture has fundamental limitations.

Looking at AI companies in the stock market, the valuations are eye-wateringly high. So much so that I’ve had to recalibrate what ‘high valuation’ means in my mind.

Are we in a bubble? Possibly. Are stocks overvalued? Certainly.

This is not an AI doomer post. I am optimistic with a healthy dose of pragmatism.

I had a wonderful discussion recently with some colleagues and I want to share my perspective in more colour. You may find something of interest here.

TL;DR: We’re likely in an AI valuation bubble driven by hype around LLMs that fundamentally cannot invent new paradigms or reliably disambiguate context.

LLMs are great for rewording emails to be polite, generating yummy recipes, and passing benchmarks, but risky for complex business problems.

Expect an accelerating shift to diverse, hybrid architectures in 2026.

This post came about because of the news that, at the time of writing, OpenAI is raising $30 billion from SoftBank¹ and is looking for an investment of another $10+ billion² from Amazon to use its Trainium AI chips.

Meanwhile, the nature of LLMs means that they cannot create new concepts.

Bold words, no doubt. Read on to find out why.

Why LLMs cannot create new concepts

LLMs can only reason with the tools that they are given. The transformer architecture that underpins today’s LLMs is based on matrix mathematics. These matrices have a fixed number of parameters and a fixed vocabulary (tokens).

The corollary is that any concept that has not been previously trained into the model weights will be ignored or discarded by the matrix math that underpins the transformer architecture. It will appear to the algorithm as ‘noise’ and will be filtered out.

What this means in practice: any text submitted to the LLM, whether English or another language, is tokenized into a series of numbers. The LLM can only do math on the series of numbers that it has previously seen during training.

As Vishal Misra, Professor and Vice Dean of Computing and AI at Columbia University, explains, "…any LLM that was trained on pre-1915 physics papers would have never come up with the theory of relativity." ³

Today’s batch of LLMs works on the basis of numerical tokens at a vast scale of billions of parameters. Concepts are represented as vectors of numbers in a high-dimensional space, and relationships between concepts are represented as another vector, namely the offset between the concept vectors.

If the “new” concept does not have a token in the vocabulary, it cannot be represented as a vector in the model and, hence, cannot be computed.

Now, you might say that’s far too absolute of a statement. That’s fair – after all, LLMs today “create” novel combinations of tokens such as new code, new proofs for mathematical problems, new analysis of data, hypotheses not in their training set, as well as rewording our email from “No, go away!” to something more polite.

This technique is called recombination in latent space and, while it is not inventing something ex nihilo, it does accelerate discovery in ways that feel novel and creative to us.

One could argue that most “new concepts” in business are simply recombinations of existing concepts, and this space is where LLMs excel!

My counter to that argument is that, in theory and after the fact, it may be true. A recombination may be wrong or it may be right, and after a whole lot of testing the rate of false positives may drop to an acceptable level. All recombination is lossy in some regard, and what is omitted to make the modified concept work is undefined or unknown. The key point here is that finding out the truth is done after the fact.

In the “real world”, where decisions have to be made without the benefit of after-the-fact backtesting, it is a high-risk endeavour to pin one’s business on a recombination of existing concepts, given the rate of false positives that this approach produces.

Let’s take a look under the hood to see how LLMs work in practice - I’ll keep it high-level without the heavy algorithms.

Meet the Teeny Tiny LLM

Assume a tiny LLM with only 16 tokens, each numbered from 1 to 16, with the number zero representing empty or null space.

The tokens are:

Token	Text	Token	Text
1	`The`	9	`sat`
2	`cat`	10	`stood`
3	`dog`	11	`ran`
4	`horse`	12	`jumped`
5	`cow`	13	`mat`
6	`on`	14	`hill`
7	`the`	15	`grass`
8	`.`	16	`moon`

If our context window is 7 tokens long, then the input text “The cat sat on the mat.” would be tokenized as [1, 2, 9, 6, 7, 13, 8] (refer to the table above).

Similarly, the token sequence [1, 3, 11, 6, 7, 14, 8] would represent the text “The dog ran on the hill.”
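To make the mapping concrete, here is a minimal sketch of a tokenizer for this tiny vocabulary. The names `VOCAB`, `encode`, and `decode` are my own for illustration, and the whitespace-splitting rule is an assumption; it is not how a production tokenizer works.

```python
# Vocabulary copied from the token table above; 0 is reserved for empty space.
VOCAB = {
    "The": 1, "cat": 2, "dog": 3, "horse": 4, "cow": 5, "on": 6, "the": 7, ".": 8,
    "sat": 9, "stood": 10, "ran": 11, "jumped": 12,
    "mat": 13, "hill": 14, "grass": 15, "moon": 16,
}
ID_TO_TEXT = {token_id: text for text, token_id in VOCAB.items()}


def encode(sentence: str) -> list[int]:
    """Split on whitespace, treating each full stop as its own token."""
    words = sentence.replace(".", " .").split()
    return [VOCAB[word] for word in words]


def decode(token_ids: list[int]) -> str:
    """Join token texts with spaces, skipping the empty token 0."""
    words = [ID_TO_TEXT[i] for i in token_ids if i != 0]
    return " ".join(words).replace(" .", ".")


print(encode("The cat sat on the mat."))  # [1, 2, 9, 6, 7, 13, 8]
print(decode([1, 3, 11, 6, 7, 14, 8]))    # The dog ran on the hill.
```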

While this tiny LLM can produce over 268 million combinations ($$16^7$$), most of them will be garbage, nonsensical text.

We are using this tiny LLM to create English text in the form of “The {x:animal} {y:verb} on the {z:noun}”.

For those with keen eyes and pattern recognition, you will notice that the only sensible, grammatical patterns it can produce are those in the form:
[1, x, y, 6, 7, z, 8].

Where:

x is one of {2, 3, 4, 5},
y is one of {9, 10, 11, 12}, and
z is one of {13, 14, 15, 16}.

Everything else will be nonsensical or grammatically incorrect.
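As a quick sanity check on those numbers, the sketch below enumerates every sequence that fits the pattern and compares the count with all $$16^7$$ possible sequences. The variable names are mine, chosen for illustration.

```python
from itertools import product

ANIMALS = [2, 3, 4, 5]     # cat, dog, horse, cow
VERBS = [9, 10, 11, 12]    # sat, stood, ran, jumped
PLACES = [13, 14, 15, 16]  # mat, hill, grass, moon

# Every sequence of the form [1, x, y, 6, 7, z, 8].
valid = [[1, x, y, 6, 7, z, 8] for x, y, z in product(ANIMALS, VERBS, PLACES)]

# Every possible 7-token sequence over a 16-token vocabulary.
total = 16 ** 7

print(len(valid))           # 64
print(total)                # 268435456
print(total // len(valid))  # 4194304
```

In other words, only about one sequence in every 4.2 million is a sensible sentence.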

The next sections cover off what happens when dogs try to sit on mats or elephants try to dance on the moon.

The Vocabulary Barrier

The consequence here is that this tiny LLM can only produce sentences that fit this pattern and can only ever reference the 4 animals, 4 verbs, and 4 nouns that it has been trained on.

It cannot produce a sentence like “The elephant danced on the moon.” because there is no token for “elephant” or “danced”. It also cannot produce a sentence in another structure because that structure is not in its training set nor in its vocabulary.
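Here is a small sketch of that barrier, assuming a whole-word vocabulary like the one in the table. The helper `out_of_vocabulary` is hypothetical. Production tokenizers typically split text into subword pieces, which this toy does not model.

```python
# Vocabulary copied from the token table earlier in the post.
VOCAB = {
    "The": 1, "cat": 2, "dog": 3, "horse": 4, "cow": 5, "on": 6, "the": 7, ".": 8,
    "sat": 9, "stood": 10, "ran": 11, "jumped": 12,
    "mat": 13, "hill": 14, "grass": 15, "moon": 16,
}


def out_of_vocabulary(sentence: str) -> list[str]:
    """Return the words in the sentence that have no token in VOCAB."""
    words = sentence.replace(".", " .").split()
    return [word for word in words if word not in VOCAB]


print(out_of_vocabulary("The elephant danced on the moon."))  # ['elephant', 'danced']
print(out_of_vocabulary("The cow jumped on the moon."))       # []
```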

The LLM can only compute with numbers that it has been trained on, which means that any concept, expressed as a vector of numbers, a relation, or an edge between concepts has to be an already trained number in the LLM. If it is not a pre-trained number, the model simply discards it because it cannot compute a number that is not in its training set.

You might say that the current State-of-the-Art (SOTA) LLMs have hundreds of thousands of tokens, billions of parameters, and a vast vocabulary, so this problem is mitigated.

While it is true that scale does the heavy lifting, the fundamental problem remains.

If the fundamental concept is not in the training set, it cannot be represented as a number in the model and hence cannot be computed. The languages we use are constantly coming up with new words and phrases to represent concepts that have not existed before.

Here are several examples from the last decade or two:

Selfie
Cryptocurrency
Serverless
Fintech
Influencer

In order for the LLM to work with that concept, it will need to have been trained, fine-tuned, or have retrieval-augmented generation (RAG) techniques applied to it. Then, other concepts that it relates to via edges and weights will need to be applied.

This is not an issue if the LLM is solving problems in a known space or a known domain. However, if the LLM is being used to solve novel problems that require new concepts, then it will struggle to do so.

The Probability Trap

The second problem is that even if the LLM has been trained on a large vocabulary and many sentence structures, it will always revert to the most probable result. This is because the architecture is designed to predict the next token based on the previous tokens, and it does so by calculating probabilities.

In this case, if we had a dog sitting on the mat, our tiny LLM would struggle to select that outcome since the most common phrase is a cat sitting on the mat.

This is related to the alignment problem – we simply can’t look into the black box to see why the result is the result.

If you are using an LLM that has been, hypothetically, trained on text where cats sit on mats, and your business context is that dogs sit on mats, the LLM will always revert to the mean and produce the cat-on-mat result more frequently than the dog-on-mat result.

Meanwhile, you will struggle to understand why the LLM is not producing the dog-on-mat result, because you cannot see inside the black box.
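The pull towards the most probable result can be sketched with simple counting. The corpus below, with its 9:1 split between cats and dogs, is a made-up assumption for illustration, and a real LLM learns weights rather than keeping a table of counts.

```python
from collections import Counter

# Hypothetical training corpus: the 9:1 split is an assumption for illustration.
corpus = ["The cat sat on the mat ."] * 9 + ["The dog sat on the mat ."] * 1

# Count which word follows "The" in the training text.
next_after_the = Counter(sentence.split()[1] for sentence in corpus)
total = sum(next_after_the.values())
probabilities = {word: count / total for word, count in next_after_the.items()}

# Greedy decoding always picks the single most probable next word.
greedy_choice = max(probabilities, key=probabilities.get)

print(probabilities)  # {'cat': 0.9, 'dog': 0.1}
print(greedy_choice)  # cat
```

Even though the dog is in the training data, picking the most probable word never selects it.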

While fine-tuning can help, it is not a holistic solution to complex business problems because the underlying architecture still reverts to the mean. You are simply shifting the mean towards your desired outcome, but the fundamental activity that the LLM is performing is still reverting to the most probable result based on its training and fine-tuning data.

To be really, really clear: fine-tuning and RAG help to shift the mean of probabilities towards your desired outcome.

Sounds great, but it means you need to know what your desired outcome is in advance and have sufficient data to shift the mean into the range of probable possibilities.
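As a rough counting analogy, and not a description of how fine-tuning actually updates weights, the sketch below shows how many extra dog-on-mat examples it takes to move the dog past 50%. The starting counts of 9 cats and 1 dog are hypothetical, and RAG is not modelled here.

```python
def dog_probability(cat_count: int, dog_count: int) -> float:
    """Share of training examples in which the dog is the one on the mat."""
    return dog_count / (cat_count + dog_count)


# Hypothetical starting point: 9 cat-on-mat examples and 1 dog-on-mat example.
base_cat, base_dog = 9, 1

for extra_dog in (0, 4, 9, 90):
    p = dog_probability(base_cat, base_dog + extra_dog)
    print(extra_dog, round(p, 3))
# 0 0.1
# 4 0.357
# 9 0.526
# 90 0.91
```

The mean only moves because we knew in advance to add dog examples, and we had enough of them.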

If you need to explore undefined space, you need different machine learning architectures (yes, there are others!) or hybrid architectures that combine LLMs with traditional techniques.

The space of tooling and techniques to address these problems is evolving rapidly; you may come across terms such as Chain-of-Thought (CoT), self-reflection or self-consistency prompting, Mixture of Experts (MoE), and Retrieval Augmented Generation (RAG), to name a few.

We can call a general-purpose LLM that has had no domain-specific training or tuning a “common mode” LLM - that is, it knows things about the world in general but not about your specific business domain. Further training and fine-tuning it on your domain can mitigate a lot of the misalignment.

The key word being mitigate, not solve.

LLMs cannot disambiguate context

Humans win on this front hands down. We excel with sparse information, multimodal cues, and theory of mind.

Today’s LLMs struggle to resolve an ambiguous English term that would stump a human for only a quick second.

For example, take the phrase “I went to the bank.”

Did we mean:

A river bank?
A financial institution?

A human would use sparse context to disambiguate the meaning quickly. These are clues that humans use naturally to understand language and are not necessarily connected to a specific set of words.

For example:

“I went to the bank to deposit my paycheck.” → Financial institution
“I sat on the bank and watched the ducks.” → River bank
If the person is talking about work → Financial institution
If the person is talking about a picnic → River bank
If the day of the week is a weekday → Financial institution
If the day of the week is a weekend → River bank
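To show what happens when those cues have to be written down explicitly, here is a toy sketch that scores each meaning by counting matching cue words. The `CUES` table and the `guess_sense` helper are hypothetical and far cruder than anything a human or an LLM actually does.

```python
# Hypothetical cue words, taken from the examples in the list above.
CUES = {
    "financial institution": {"deposit", "paycheck", "work", "weekday"},
    "river bank": {"sat", "ducks", "picnic", "weekend"},
}


def guess_sense(context_words: set[str]) -> str:
    """Pick the meaning with the most matching cues, or report a tie."""
    scores = {sense: len(cues & context_words) for sense, cues in CUES.items()}
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    return winners[0] if len(winners) == 1 else "ambiguous"


print(guess_sense({"deposit", "paycheck"}))  # financial institution
print(guess_sense({"ducks", "sat"}))         # river bank
print(guess_sense(set()))                    # ambiguous
```

With no cues supplied, the bare phrase “I went to the bank.” stays ambiguous.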

There is a constant stream of sparse context that underlies the theory of mind that humans use in every moment. That stream includes factual information such as the current date, time, location, and social setting. It also includes qualitative information, such as the position of the sun in the sky, the feel of the conference room table, the tone of voice, and body language.

LLMs do not have access to this sparse context, and hence every conversation with an LLM has to explicitly include the relevant context.

Meanwhile, how do we decide what is relevant and what is not?

On one hand, we could dump all the sparse context we have into the prompt, but that would quickly exceed the context window of the LLM, making it useless for doing what we actually need it to do. We could do some multi-shot calls to try to build up context over time, but that is slow and expensive.

On the other hand, if we try to be selective about what context we include, we may miss something important that would have changed how the conversation progresses and what meaning we focus on.

To add to that, once we are deep in our context window, we may have lost the initial information (it has fallen out of the context window) or started to lose alignment with the original intent.
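A minimal sketch of information falling out of the window, assuming the simplest possible policy of keeping only the most recent tokens. Real systems handle overflow in various ways, such as truncation or summarisation, and the `visible_context` helper and the example conversation are hypothetical.

```python
def visible_context(tokens: list[str], window: int) -> list[str]:
    """Keep only the most recent `window` tokens, dropping everything older."""
    if window <= 0:
        return []
    return tokens[-window:]


# The clue that tells us which "bank" is meant appears at the very start.
conversation = ["picnic", "today", ".", "later", "I", "went", "to", "the", "bank", "."]

seen = visible_context(conversation, 7)
print(seen)              # ['later', 'I', 'went', 'to', 'the', 'bank', '.']
print("picnic" in seen)  # False
```

With a 7-token window, the word that disambiguated “bank” is no longer visible.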

An APAC experience

A lot of my work is with the largest enterprises across Asia-Pacific and many of them have dramatically scaled back or pivoted their LLM pilots because of these classes of problems.

In fact, LLM providers are now providing services with forward-deployed engineers, in part to help customers get value from their LLM pilots and in part to solve this alignment problem.

What happens once the forward-deployed engineers move on to their next project and the customer needs to adjust the solution?

That’s not to say that LLMs are not useful – they are very useful when applied correctly – however, trying to apply LLMs to complex business problems where novelty and disambiguation are required is fraught with risk.

Today’s LLMs can make you a recipe or pass a software engineering benchmark, but they often struggle with complex, ambiguous, or novel business problems.

This is not some theoretical hand-wringing worry. It is playing out in enterprises today and MIT published a study⁴ that found the following:

Generic tools like ChatGPT excel for individuals because of their flexibility, but they stall in enterprise use since they don’t learn from or adapt to workflows

There is heavy investment in generative AI that uses LLMs; however, I am seeing more diverse architectures being trialed that do not rely solely on LLMs. These include graph neural networks and hybrid AI and code architectures.

While valuations reflect the hype, the shift to hybrid architectures is a sign that pragmatism is setting in and that innovation is not solely in scaling LLMs with more parameters and more data.

2026 is going to be a phenomenal year for AI innovation. I am excited for it!

The tag #ai-for-busy-people is a series of articles designed to guide business executives through a learning journey about AI, Large Language Models (LLMs), and prompting.

My aim here is to empower you to understand AI and apply it effectively in your businesses.

¹ SoftBank completes $30B bet on OpenAI: Does it clear path to IPO? (2025, October 27). Retrieved December 19, 2025 from https://techfundingnews.com/softbank-completes-30b-bet-on-openai-does-it-clear-path-to-ipo/
² OpenAI in talks with Amazon about investment that could exceed $10 billion. (2025, December 16). Retrieved December 19, 2025 from https://www.cnbc.com/2025/12/16/openai-in-talks-with-amazon-about-investment-could-top-10-billion.html
³ Columbia CS Prof explains why LLMs can’t generate new scientific ideas. (2025, November 1). Retrieved December 19, 2025 from https://x.com/rohanpaul_ai/status/1984588319439638557
⁴ MIT report: 95% of generative AI pilots at companies are failing. (2025, August 18). Retrieved December 19, 2025 from https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/