Coding Agents in Large Codebases

January 18, 2026

If your AI agent cannot handle a simple TypeScript ticket, you are not broken. Your boss is not magically smarter either. The gap is not your capability; it is the context.

I have been there. I spent weeks wondering why others were praising their AI while mine just didn’t get it, until I read the model’s thinking traces.

That was the wake-up call I needed: I couldn’t just ask the AI to do stuff.

You have to think about how the model thinks. Annoying, I know.

If you’re busy and don’t want the details, skip to the short version at the end.

Why AI fails on codebases

It is mechanics. Mechanics of stochastic machines in particular.

Note: This article focuses on coding agents used interactively inside an IDE, such as Copilot in VS Code or Cursor. This is not the same as agentic systems, such as Claude Code, which take a different approach.

1. Context windows are theoretical limits

Know your model.

For example, your LLM has a 128k context window. This is a hard ceiling, not a usable target.

Most models degrade well before that, often at around 40% to 60% usage.

Once alignment drops, output quality collapses. It feels random because it is.

Make sure you’re starting a new conversation for every task.

Keep tabs on how much context you’re using and start fresh often.
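One rough way to keep tabs on usage is to estimate tokens yourself. This is a minimal sketch, not how any editor actually counts: it assumes about four characters per token, which is only a rule of thumb, and CONTEXT_WINDOW, USABLE_FRACTION, and the function names are illustrative values I made up for the example.

```python
# Rough context budget check. Every number here is an assumption for illustration.
CONTEXT_WINDOW = 128_000   # advertised window, in tokens
USABLE_FRACTION = 0.5      # middle of the 40% to 60% band
CHARS_PER_TOKEN = 4        # rule-of-thumb estimate, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Estimate the token count of a string from its length."""
    return len(text) // CHARS_PER_TOKEN + 1


def usable_budget(window: int = CONTEXT_WINDOW, fraction: float = USABLE_FRACTION) -> int:
    """Return how many tokens to treat as the practical ceiling."""
    return int(window * fraction)


def should_start_fresh(conversation: list[str]) -> bool:
    """True once the estimated conversation size reaches the practical ceiling."""
    used = sum(estimate_tokens(message) for message in conversation)
    return used >= usable_budget()
```

If should_start_fresh returns True for the messages so far, that is the cue to open a new conversation rather than push on.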

2. Your editor decides what the model sees

Tools like Copilot or Cursor inject files automatically into the model context.

There are good reasons for this: the extra files are meant to give the model helpful background.

Here’s why this is a problem: you have no idea what files are being sent.

Let’s role play this out step by step.

You have read some code or have a ticket to work on and decided you need to prompt an AI agent to do a task.
In your mind is a mental model of the code: what it does, how it fits together, what is relevant, and what is not.
Your editor (e.g., VS Code with Copilot) notes the files you have open, plus some other files in the workspace that it thinks are relevant.
The editor bundles files that it chooses and sends them to the model as context.

See the problem?

Unless you have specifically selected the files included in the context, you cannot be sure that what you are thinking about is in any way similar to what the model is thinking about.

What you are thinking about is likely not what the model is thinking about. This is why you get completely unrelated or nonsensical answers.

You don’t fully know what is sent to the model unless you explicitly control it.

Examples of files “helpfully” included are:

Source files (may be unrelated).
Comments, abstractions, and in some cases, factories of abstraction.
Large or binary files by mistake.

Sort out what the model is seeing first. Make sure the model is looking at the same files you would be looking at if you were to do the task manually.

I once caught VS Code trying to upload a 1 GB video file. That prompt was dead on arrival.
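If you pick the context files yourself, it is cheap to sanity-check them first. The sketch below is one possible filter, assuming a per-file size cap (MAX_BYTES is a number I picked, not a limit from any tool) and a simple heuristic that treats a file as binary when its first bytes contain a NUL byte.

```python
from pathlib import Path

MAX_BYTES = 200_000  # hypothetical per-file size cap


def looks_binary(path: Path, sample_size: int = 1024) -> bool:
    """Treat a file as binary if its first bytes contain a NUL byte."""
    with path.open("rb") as handle:
        return b"\x00" in handle.read(sample_size)


def audit_context_files(paths: list[Path]) -> tuple[list[Path], list[tuple[Path, str]]]:
    """Split candidate files into ones to keep and ones to reject, with a reason."""
    keep, reject = [], []
    for path in paths:
        if not path.is_file():
            reject.append((path, "not a file"))
        elif path.stat().st_size > MAX_BYTES:
            reject.append((path, "too large"))
        elif looks_binary(path):
            reject.append((path, "binary"))
        else:
            keep.append(path)
    return keep, reject
```

The reject list is the useful part: it tells you what would have polluted the prompt and why.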

3. Your code does not match the training data

LLMs are pattern machines. They are excellent at recognising familiar shapes.

If your code diverges from common public patterns, token cost explodes.

Here is a real example that changed how I wrote code.

Most Python repos online (which the LLMs are trained on) use this pattern:

from logging import getLogger
logger = getLogger()
logger.info("hello world")

I (still) prefer this format:

from logging import getLogger
log = getLogger()
log.info("hello world")

A while back, when I was diagnosing why my AI agents were misaligned, I spent a lot of time reading through the model’s thinking traces.

What I found startled me!

The LLM was confused by log = getLogger() and spent time wondering if this was a custom logging class and implementation.

It would then spend a bunch of time and tokens searching for logging implementations, find nothing, decide to assume that it was the standard logger under a different name, and carry on as before.

The model got confused. It spent thousands of tokens wondering if log was a custom abstraction.

This is because the concept called logger is entrenched in the training data and the LLM has a harder time mapping my code to its known concept.

Same behaviour. Different name. Massive cognitive tax.

Since then, I’ve stopped using log = getLogger() and started to rely on common and conventional patterns that LLMs thrive on.

It saves me time and tokens.

But I have lost some of my personal style, which is a bit sad.
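If you want to find the same kind of naming drift in your own code, a small check with the standard ast module will do it. This is a sketch under my own assumptions: it treats logger as the only conventional name and only looks at plain assignments from getLogger(...), so it will miss anything more indirect.

```python
import ast

CONVENTIONAL_NAME = "logger"  # assumption: the name most public code uses


def unconventional_logger_names(source: str) -> list[str]:
    """Return names assigned from getLogger(...) that are not 'logger'."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign) or not isinstance(node.value, ast.Call):
            continue
        func = node.value.func
        # Matches both getLogger(...) and logging.getLogger(...).
        called = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if called != "getLogger":
            continue
        for target in node.targets:
            if isinstance(target, ast.Name) and target.id != CONVENTIONAL_NAME:
                found.append(target.id)
    return found
```

Run over my old code, this would have flagged log straight away.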

Think of the model like a junior developer

This framing changed everything for me.

A junior developer:

Needs narrow, explicit instructions.
Struggles with unconventional abstractions.
Performs better with scaffolding and examples.
Has a lot of ideas, most of which are inconsistent with each other.

LLMs are the same, just faster, biased, and more literal.

If you ask for everything at once, expect garbage.

Separate domains first, then concerns second.

For example:

“Only read code related to the shopping cart checkout process.”
“Build the button UI only.”
“Then wire in the backend logic.”

How I actually get value from AI on real codebases

Here is what works, consistently.

1. Decompose the task aggressively

Do not ask for “build checkout button with backend logic”. That is multiple jobs.

Instead:

Generate the button markup and styling only.
Wire behaviour and side effects in a second pass.

Treat prompts as small, single-purpose functions that should achieve a single task.

This is where Cursor’s multi-step Plan mode and workflows shine. They can do many different things at once, but each one is still a small, single-purpose task.
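To make the "prompts as small functions" idea concrete, here is one way to write the checkout example down. It is only a sketch: PromptStep is a helper I invented for illustration, and the file paths are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptStep:
    """One small task for the agent, plus the only files it may read."""
    goal: str
    scope: tuple[str, ...]

    def render(self) -> str:
        """Turn the step into the text you would paste into the agent."""
        files = "\n".join(f"- {path}" for path in self.scope)
        return f"Task: {self.goal}\nOnly read these files:\n{files}\nDo nothing beyond this task."


# Hypothetical paths, for illustration only.
CHECKOUT_STEPS = [
    PromptStep("Generate the button markup and styling only.", ("src/checkout/Button.tsx",)),
    PromptStep("Wire behaviour and side effects.", ("src/checkout/Button.tsx", "src/checkout/api.ts")),
]
```

Each step gets its own fresh conversation, and each render() output names exactly what the model is allowed to look at.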

2. Stub non-standard code paths

If the logic is weird, domain-specific, or novel, stop fighting it.

Go back to your comp sci roots and focus on data structures first, algorithms second.

If you are coding something that is non-standard, your best bet is to have the AI agent stub that method or class out for you, and then you hand code the implementation.

Have the AI:

Stub the method or class.
Write tests and documentation.
Handle surrounding glue code.

You implement the core logic yourself. That is the highest-leverage split.
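In practice the split can look like the sketch below. The function allocate_stock and its rule are hypothetical stand-ins for whatever your weird domain logic is. The agent produces the signature, docstring, and tests, and the body stays empty until you write it by hand.

```python
import unittest


def allocate_stock(order_quantities: dict[str, int], on_hand: int) -> dict[str, int]:
    """Split limited stock across competing orders.

    Hypothetical domain rule used for illustration. The agent writes this
    signature, docstring, and the tests below; a human writes the body.
    """
    raise NotImplementedError("core allocation logic is hand-written")


class AllocateStockTests(unittest.TestCase):
    """Agent-written tests that pin down the contract before the logic exists."""

    @unittest.skip("waiting for the hand-written implementation")
    def test_never_allocates_more_than_on_hand(self):
        result = allocate_stock({"a": 5, "b": 5}, on_hand=6)
        self.assertLessEqual(sum(result.values()), 6)
```

Once the body is written, remove the skip decorator and the test starts guarding your logic.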

3. Control context with learning scaffolds

This is the big one.

For existing systems, I generate documentation for the model.

And as a bonus, humans benefit too, if they ever decide to RTFM.

I use a documentation scaffold structure like this:

project_root/
└── docs/
    ├── README.md
    ├── backend/
    │   ├── README.md
    │   ├── OVERVIEW.md
    │   ├── DETAILED_DESIGN.md
    │   ├── DATABASE.md
    │   ├── USAGE.md
    │   └── EXAMPLES.md
    ├── frontend/
    │   ├── README.md
    │   ├── OVERVIEW.md
    │   ├── DETAILED_DESIGN.md
    │   ├── WEBSOCKETS.md
    │   ├── USAGE.md
    │   └── EXAMPLES.md

Each file defines one concept, and each subsequent file builds on the previous ones with more detail.

Each folder defines a clear, domain-specific boundary, such as backend or frontend.

If I am working on the backend, I have a clear set of docs to teach the model what it needs to know.
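The tree above is quick to create with a small script. This is a sketch using pathlib: the file names mirror the tree, while create_scaffold and the placeholder heading it writes into each file are my own choices for the example.

```python
from pathlib import Path

# File names mirror the tree above; the generator itself is a hypothetical helper.
COMMON_DOCS = ["README.md", "OVERVIEW.md", "DETAILED_DESIGN.md", "USAGE.md", "EXAMPLES.md"]
DOMAIN_DOCS = {
    "backend": COMMON_DOCS + ["DATABASE.md"],
    "frontend": COMMON_DOCS + ["WEBSOCKETS.md"],
}


def create_scaffold(project_root: Path) -> list[Path]:
    """Create placeholder doc files that do not exist yet and return the new paths."""
    docs = project_root / "docs"
    wanted = [docs / "README.md"]
    for domain, names in DOMAIN_DOCS.items():
        wanted += [docs / domain / name for name in names]
    created = []
    for path in wanted:
        if not path.exists():
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(f"# {path.stem}\n", encoding="utf-8")
            created.append(path)
    return created
```

It never overwrites an existing file, so it is safe to re-run when you add a new domain to DOMAIN_DOCS.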

For example, I tell the agent (usually in AGENTS.md or directly in the prompt):

“Read docs/backend/README.md first.”
“Follow USAGE.md for behaviour.”
“Do not infer beyond this scope.”

That means if there is similar-but-different code in the frontend (say, a Python backend and a TypeScript frontend), the model ignores it and doesn’t load all that stuff into the context.

That gives me deterministic context. No guesswork. No pollution.
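Those three instructions are easy to generate per domain so they stay consistent between prompts. A minimal sketch, assuming the docs layout from earlier. scoped_instructions is a hypothetical helper, and its wording is mine rather than anything an agent requires.

```python
from pathlib import Path


def scoped_instructions(domain: str, docs_root: Path = Path("docs")) -> str:
    """Build the instruction lines that pin the agent to one domain's docs."""
    base = docs_root / domain
    return "\n".join([
        f"Read {(base / 'README.md').as_posix()} first.",
        f"Follow {(base / 'USAGE.md').as_posix()} for behaviour.",
        f"Do not read or infer from anything outside {base.as_posix()}/.",
    ])
```

Paste the result at the top of the prompt, or keep one copy per domain in AGENTS.md.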

4. Let the AI write the documentation first

For existing codebases, I have the AI generate the docs first.

This sounds backwards. It is not.

I ask the model to:

Read the code.
Document what exists today.
Explain design decisions as implemented.

Once that is done, future prompts reference the docs, not raw code.

This dramatically improves consistency.

This approach is called Spec-Driven Development (SDD),¹ where the source of truth moves from the source code to the specification markdown documents.

To use a Java example: a Java source file is compiled to Java bytecode. When a change is needed, you do not go in and try to flip bits in the bytecode. Instead, you make the change in the Java file, throw away the old bytecode, and generate new bytecode to replace it.

Same idea with C, Rust, and C++ which get compiled to executable object files. The executable object file is not valuable to humans. The source code is.

In SDD, the spec documents are the source of truth, not the code. They’re the valuable human-readable artefact.

If you change the spec documents, you throw away the old code and generate new code using generative AI agents.
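If the specs are the source of truth, you need to know which ones changed since the code was last generated. The sketch below is one way to track that, not part of any SDD toolkit: it fingerprints every markdown file under a docs folder with SHA-256 and compares against the previous run. Both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def spec_fingerprints(docs_root: Path) -> dict[str, str]:
    """Map each markdown spec (relative path) to a SHA-256 of its contents."""
    return {
        path.relative_to(docs_root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(docs_root.rglob("*.md"))
    }


def changed_specs(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return specs that are new or edited since the last generation run."""
    return sorted(name for name, digest in current.items() if previous.get(name) != digest)
```

Anything changed_specs returns is a candidate for regenerating the code that depends on it.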

Andrej Karpathy calls it Software 3.0.² I highly recommend watching the whole 40-minute video.

The practical reality

AI does not “understand” your codebase. It pattern-matches under severe constraints.

If you manage context deliberately, it performs well. If you do not, it flails.

The difference between success stories and frustration is rarely the person’s intelligence. It is almost always scaffolding and ensuring correct context.

Short version

AI fails because you are overloading it with poorly structured context. Treat it like a highly distractible junior developer and it starts becoming useful.

What you can do today

Context limits are real and much lower than advertised. Plan to use only 40% to 60% of the window.
Your editor adds files on its own, so the model sees a different picture from the one in your head. Control what gets sent.
Non-standard abstractions confuse models quickly and waste context.
Break prompts into small, single-purpose tasks.
Use documentation scaffolds to control what the AI learns and reads.

If this saved you hours and hours of frustration, good. That is the point.

¹ “Spec-driven development with AI: Get started with a new open source toolkit” (2025, September 2). Retrieved January 18, 2026, from https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/
² “Andrej Karpathy on Software 3.0: Software in the Age of AI (UPDATED with Full Transcript)” (2025, June 18). Retrieved January 18, 2026, from https://www.latent.space/p/s3