Trying to analyze the S&P500 using AI agents

December 31, 2025

Over the holiday break, I decided to try to use AI agents to help me analyze some of the S&P500 stocks to see if they could uncover any interesting insights or patterns.

Hopefully, you will find this experiment as fascinating as I did!

As some of you know, I have been trading and investing on the stock market for many years.

Does the market always go up and to the right?

Rather than connecting to an API such as OpenAI or xAI, I decided to make use of my beefy MacBook M4 Pro and see what models I could run locally using ollama.

Learning about hosting AI models locally

The first step for me was getting ollama set up on my MacBook.

Ollama is a tool that allows you to run the AI models locally on your computer, assuming the computer is powerful enough.

I first used ollama to run the DeepSeek R1 models back in early 2025, which I wrote about here .

There were several models that I tried, from Qwen3-4B through to Llama3-70B. The 70B model was able to just fit in to the memory and it took a long time to generate tokens, fun but not usable.

I explored HuggingFace to see what other models I could run locally. Here, I found that I needed to use models in the GGUF format - what a great name btw - so that the Ollama tool could pick them up and use them.

When adding them to the Ollama models folder, I had to fetch them with the full URL rather than the model name, such as:

ollama pull hf.co/unsloth/Qwen3-14B-GGUF:latest

This downloads the full multi-gigabyte model file and sets it up for use via ollama run or ollama serve.

Setting up models and agents

I decided to play with specific models, two thinking models (Qwen3-4B and Qwen3-30B) as well as a general purpose instruction-following model (Qwen3-14B).

My final ollama list of models looked like this:

$ ollama list
NAME                                                     SIZE
hf.co/unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF:latest    18 GB
hf.co/unsloth/Qwen3-14B-GGUF:latest                      9.0 GB
hf.co/unsloth/Qwen3-4B-Thinking-2507-GGUF:latest         2.5 GB

As an aside, the Ollama models are great models, however I wanted to use Hugging Face to prove the path for my own, future models.

In getting to the shortlist, I downloaded so many other models that I ran out of disk space on my MacBook!

Luckily, I had a trusty high-speed external drive that I could use as the model folder. However, when running ollama serve, one needs to point the environment variable to that folder.

Most of the stock analysis used about 8,000 to 12,000 tokens, so the default 4096 token context length was too short, so I increased it to 32768 to allow for longer context length.

While 32K tokens is the native maximum context length for the 4B model, as I add more instructions and data to the prompt and need to increase the context length, I have headroom up to the maximum context window of 128K for the 30B model.

OLLAMA_CONTEXT_LENGTH=32768 OLLAMA_MODELS=/Volumes/DEP/ollama/models ollama serve

With that out of the way, the server was running on http://localhost:11434 and I could start connecting to from my app.

My development environment runs in devcontainers, you can read about it here .
This means I had to ensure that docker-outside-of-docker was enabled in the devcontainer and that instead of http://localhost:11434, I used http://host.docker.internal:11434 to connect to ’localhost’ on the host machine.
For simplicity, I will use http://localhost:11434 in the rest of this article, but if you are doing this in devcontainers, replace localhost with host.docker.internal.

Building the analysis app

To build the initial framework for the app, I took this approach:

Create a command-line interface that would accept commands
Use a lib/ folder structure for the actual business logic, as I could then call, test, or expose that logic via future web or GUI apps.
Start with simple methods and build up complexity as I went along.

The reason for this approach is none other than Gall’s Law, effects of which I have lived through many times in my career.

Gall’s Law:

A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.

Initial CLI setup

I used Cursor agents to plan and build the initial CLI app using ChatGPT 5.2.

Initial implementation of the CLI app using ChatGPT 5.2

This gave me the skeleton architecture to be able to create and run commands such as uv run qds analysis individual-stock AAPL.

Fetching and rate limiting

The next step was to get stock data. Many APIs have rate limits that are complex, such as:

5 requests per second
100 requests per minute

To handle this, I asked the Cursor agents to build a rate limiting library that would ensure that the API calls would not exceed the limits.

This is the prompt I used:

Your task is to create a rate limited iterator that can be used in python for loops. Important criteria:

Configurable rate limiting strategy - this should be a class that I pass to the iterator and lets me rate limit with the following:
- No rate limit
- n per second
- Array of limits, such as “5 per second, 100 per minute, 200 per hour” which will enforce the rate limit.
Configurable rate limit storage provider (default: in-memory, alternate: redis)
Accept any enumerable
Optionally accept a pre-loop and post-loop callable
Asyncio parallelism, e.g. 10 at a time

For example, I want to loop over 500 business objects, with each object taking 2 minutes to process. I also make several API calls per object, and there are global API limits which I cannot exceed.

Think about what I need here and give me options to review and suggest an optimal way to resolve.

Rate limiting prompt using GPT-5.2

Your task is to create a rate limited iterator that can be used in python for loops.
Important criteria:

1. Configurable rate limiting strategy - this should be a class that I pass to the iterator and lets me rate limit with the following:
    - No rate limit
    - n per second
    - Array of limits, such as "5 per second, 100 per minute, 200 per hour" which will enforce the rate limit.
2. Configurable rate limit storage provider (default: in-memory, alternate: redis)
3. Accept any enumerable
4. Optionally accept a pre-loop and post-loop callable
5. Asyncio parallelism, e.g. 10 at a time

For example, I want to loop over 500 business objects, with each object taking 2 minutes to process. I also make several API calls per object, and there are global API limits which I cannot exceed.

Think about what I need here and give me options to review and suggest an optimal way to resolve.

This resulted in a solid implementation that I was able to use for fetching stock data from the API.

For those interested, the technique is a leaky token bucket algorithm .

The final implementation looked like this:

limiter = RateLimiter(
    strategy=FixedRateStrategy(
        limit=Limit(
            permits=60,
            period_seconds=60.0,
        ),
    ),
    store=InMemoryStore(),
)

Integrating with Finnhub

For stock data, I decided to use the Finnhub API and re-used a connector that I have used on other projects.

Since I was going to be experimenting with many types of code structures, I didn’t want to needlessly hammer the API, so I built a naive, file-system cache that would store the fetched data in json files. To invalidate the cache, I would simply delete the cache file and the code would re-fetch the data.

This meant I could re-run the analysis code without worrying about hitting the API rate limits.

Building the stock analysis agent

This one was more tricky and because I wanted to do it a certain way, I spent a lot of time writing a prompt that would communicate exactly what I wanted to do.

Using Cursor agents again, I built a structure that would enable me to call various LLMs with configuration options such as top_p, temperature, max_tokens and so on. Additionally, I wanted to enable function calling, so that the agent could call specific functions to get data or perform calculations.

The final LLM client configuration looked like this:

llm = LLMClient(
    config=OllamaLLMConfig(
        base_url="http://localhost:11434",
        model="hf.co/unsloth/Qwen3-4B-Thinking-2507-GGUF:latest",
        timeout=300,
        think=True,
        tools=[
            tool_clarification_request,
        ],
    ),
)

This allowed me to switch out models easily and try different approaches, as well as give tools to the models which were simply python functions.

That then allowed me to control what the tool did, such as fetching data, calling another LLM, or doing a search on the web.

Prompting, the final frontier

The most complex part of the entire process was getting the prompt to do what I wanted it to do and not have failure modes that derail the analysis.

To do this, I created a python data structure that would have a system prompt and a templated user prompt. When we needed to call the LLM, the stock specific data would be passed to the template and rendered into the full prompt.

A very simple example of the prompt template is this:

def data_summary_prompt() -> Prompt:
    """
    Creates a prompt template for summarizing data.

    Returns:
        A Prompt instance configured for data summarization.
    """
    return Prompt(
        system_message=(
            "You are a data analyst. Summarize data clearly and concisely."
        ),
        template=(
            "Summarize the following data:\n\n{data}\n\n"
            "Provide a clear summary highlighting the most important points."
        ),
    )

This is then used to generate the actual messages for the LLM call, which get formatted into the final prompt:

prompt = data_summary_prompt()
system_prompt = prompt.system_message
user_prompt = prompt.to_messages(data=earnings_report_data)

Running the analysis

With all the pieces in place, I was able to run the analysis on several S&P500 stocks, including AAPL, MSFT, and GOOGL.

I went to the State Street SPY ETF holdings page to get the list of stocks in the S&P500, downloaded the daily holdings as an XLSX file, and extracted the ticker symbols into a CSV that could be easily used by scripts.

Name,Ticker,Identifier
NVIDIA CORP,NVDA,67066G104
APPLE INC,AAPL,037833100
MICROSOFT CORP,MSFT,594918104
AMAZON.COM INC,AMZN,023135106
ALPHABET INC CL A,GOOGL,02079K305
...
PARAMOUNT SKYDANCE CL B,PSKY,69932A204
DAVITA INC,DVA,23918K108
NEWS CORP   CLASS B,NWS,65249B208

Then I set a method to go over every stock ticker symbol and run the analysis:

I used tqdm with some custom in_flight tracking to show which stocks were currently being processed.

$ uv run qds analysis sp500
S&P 500:   1%|█                  | 7/490 [13:02<14:00:31, 104.41s/stock, in_flight=3 [AMD, BAC, HD]]

As you can see, each stock took about 1.5 to 3 minutes to analyze and create a report. For some reason, NVDA took over 4 minutes! Others were much faster.

Yes, the whole lot would have taken over 14 hours to run on my little laptop, so I stopped it after half an hour and the top 30-50 stocks.

Stock Analysis Prompt

The full prompt is too large to share here, however at a high-level, here are some interesting snippets from it:

You are an expert financial analyst AI agent specializing in stock evaluation, risk assessment, and performance analysis. Your primary goal is to provide accurate, data-driven insights while avoiding common pitfalls in financial data interpretation.

Today’s date is {datetime.now().strftime('%Y-%m-%d')} and much time has passed since you were last trained. The date and data you have is correct.

Be skeptical of extreme metrics: A >60% "drawdown" in the past 52 weeks (especially approaching 90-100%) is often not a real crash but a forward stock split. Reverse splits usually cause apparent price increases.

When a “Weird” Pattern is Detected:

Do not conclude it is a real price crash or fundamental issue. Instead, stop your reasoning and call a clarification tool to verify. Formulate a precise question for the tool, such as:
‘Has [TICKER] undergone any stock splits in the past 52 weeks or relevant period?’
‘What factors contributed to the % drawdown for [TICKER] in [CURRENT YEAR]?’
‘Provide split-adjusted 52-week high/low and max drawdown for [TICKER].’
‘Check for corporate actions (splits, dividends) affecting price data for [TICKER] around [approximate date of apparent drop].’

Call the clarification tool with your precise question. You can ask multiple questions to the clarification tool if needed to get a complete answer. You can also ask follow-up questions in your follow-up reasoning.

…

Think carefully about more recent metrics (e.g. 3Y) versus older metrics (e.g. 5Y). More recent metrics should carry more weight than older metrics.

Do not include any other text than the analysis. Pay attention to the metric keys and what they represent.

Snippets from the Stock Analysis Prompt

You are an expert financial analyst AI agent specializing in stock evaluation, risk assessment, and performance analysis. Your primary goal is to provide accurate, data-driven insights while avoiding common pitfalls in financial data interpretation.

Today's date is `{datetime.now().strftime('%Y-%m-%d')}` and much time has passed since you were last trained. The date and data you have is correct.

Be skeptical of extreme metrics: A >60% \"drawdown\" in the past 52 weeks (especially approaching 90-100%) is often not a real crash but a forward stock split. Reverse splits usually cause apparent price increases.

When a "Weird" Pattern is Detected:
- Do not conclude it is a real price crash or fundamental issue.
Instead, stop your reasoning and call a clarification tool to verify.
Formulate a precise question for the tool, such as:
- 'Has [TICKER] undergone any stock splits in the past 52 weeks or relevant period?'
- 'What factors contributed to the % drawdown for [TICKER] in [CURRENT YEAR]?'
- 'Provide split-adjusted 52-week high/low and max drawdown for [TICKER].'
- 'Check for corporate actions (splits, dividends) affecting price data for [TICKER] around [approximate date of apparent drop].'

Call the clarification tool with your precise question.
You can ask multiple questions to the clarification tool if needed to get a complete answer. You can also ask follow-up questions in your follow-up reasoning.

...

Think carefully about more recent metrics (e.g. 3Y) versus older metrics (e.g. 5Y). More recent metrics should carry more weight than older metrics.

Do not include any other text than the analysis. Pay attention to the metric keys and what they represent.

Stock Analysis Results

Here are some interesting findings from the analysis of a few stocks:

The Qwen3-4B model was almost identical to the Qwen3-30B model in its analysis, which was surprising given the size difference.
- The 30B model would take about 15% more time to run but give identical results. I think the 4B thinking model is more than sufficient for the low number of tokens.
The tool_clarification_request function calling worked well to avoid false positives on stock splits. This tool would call the Tavily API¹ with the search term that the 4B thinking LLM created. That was such an exciting moment when I saw it working for the first time!
- Here is an example of a clarification request that was made:
  What factors contributed to the revenue growth for BAC in the TTM period?
- The response it got from Tavily was similarly conversational and it was fun to watch the two agents converse in natural language:
  The revenue growth for BAC was driven by strong lending and deposit growth, higher net interest income, and increased fees from investment banking and asset management.
- In the thinking traces, the Qwen-4B LLM output the following:
  Okay, let me process this. The user asked for a comprehensive analysis of BAC (Bank of America) based on the provided financial data. The initial step was to check for weird patterns. The revenue growth TTM was flagged as high, so I called the clarification tool. The response from the tool explained it was due to strong lending, deposit growth, higher net interest income, and increased fees.
  Now, I need to incorporate this information into the analysis without making up any data. The key points from the clarification are that the high revenue growth was a one-time event driven by specific factors, which explains why the 3-year growth is low but TTM is very high.
  First, I’ll structure the analysis into the required sections: Summary, Key Metrics, Supporting Metrics, Trends, Opportunities, Risks.
There were some wild errors, from bad math through to bad data from the API. The LLM did not blink: it accepted it as truth and built the analysis around it, which shows the “garbage in, garbage out” principle.
- I need to re-think the approach to this, which will be another article no doubt.
- As a result, I won’t be sharing the report outputs here because they are just wrong.

Counting tokens

For every document generated, I had a token report output that showed me how many tokens were used for the prompt, the thinking, and the output.

The model used was Qwen3-4B-Thinking-2507-GGUF for all of these.

The time taken is also interesting, as it shows the variability in generation time.

Symbol	Prompt Tokens	Output Tokens	Total Tokens	Time Taken (s)
AAPL	4566	5634	10200	174.15
AMZN	4315	4733	9048	285.44
GOOGL	8844	1690	10534	411.13
JPM	4088	5200	9288	239.34
META	4443	5857	10300	277.00
MSFT	4581	4452	9033	278.46
MU	6495	2651	9146	439.22
NVDA	7019	3725	10744	365.47
ORCL	6175	3408	9583	484.57

In summary

Wow!

This was an incredible experiment and learning experience for me - my poor laptop worked so hard, fans spinning furiously to keep cool.

Using local AI models via ollama was surprisingly effective, and the ability to have function calling to clarify data points made a huge difference in the quality of the analysis.

There is a lot to improve upon:

Clean data before analysis.
Better error handling and validation of data.
Call multiple models and compare results.
Check out LangChain agents for building a multi-agent framework that checks the output of one agent with another.

The Tavily API is a service that provides access to web search and data retrieval capabilities. I used it here to get up-to-date information on stocks and as a resource for the LLM to clarify points it was unsure about. https://docs.tavily.com/welcome ↩︎