Teaching an LLM to Read a Profiler
October 26, 2024
What happens when you hand an AI real performance numbers and ask it to do something useful with them?
Didn't I have something better to do?
LLMs already reason through chain of thought in text: the model produces intermediate "thoughts" and feeds them back in to reach a better final answer. But text‑only loops miss what actually counts: measurements that come from outside the language model. When the model can act (e.g. compile code, run a benchmark, read a sensor) and then fold those results back into its next thought, “chain of thought” turns into chain of action → measurement → reflection.
That opens the door to longer arcs: ship a patch, wait for production metrics, come back with a smarter patch; spin up a lab experiment, watch the overnight results, redesign the protocol. My 1BRC sandbox is deliberately small, but the mechanic is the same: a feedback loop tight enough for the model to see the world push back and adjust accordingly.
Tooling reality check
I think everyone's least favorite part about writing code is environments and tooling... I started in C++ because the goal was to be fast. Then I remembered that macOS likes Clang and considers GCC optional. Instruments shows pretty graphs but won’t cough up plain text. I switched to Python + Scalene for one reason: the profiler prints something an LLM can read.
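That readability is the whole trick: the report can be captured and pasted straight into a prompt. Here is a minimal sketch of grabbing it programmatically, assuming a recent Scalene where `--cli` forces text output and `--outfile` redirects the report; flag behavior may differ across versions.

```python
import subprocess
import sys


def profile(script_path: str) -> str:
    """Run the script under Scalene and return its plain-text report."""
    report_path = "scalene_report.txt"
    subprocess.run(
        [sys.executable, "-m", "scalene", "--cli",
         "--outfile", report_path, script_path],
        check=True,
    )
    with open(report_path) as f:
        return f.read()
```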
The control loop
- Start with seed code
- Run the code
- Check if it passes tests:
  - If no, collect the traceback
  - If yes, collect the Scalene report
- Send the collected information to Claude to suggest a minimal edit
- Use the suggested edit to update the seed code, and repeat the process (a sketch of this loop follows)
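As code, the loop is roughly the sketch below. It leans on the `profile()` helper from the earlier snippet, uses the official `anthropic` SDK (assuming `ANTHROPIC_API_KEY` is set), and simplifies "passes tests" to "exits cleanly"; the model name and prompt wording are placeholders, not what I actually ran.

```python
import subprocess
import sys

import anthropic  # official SDK; assumes ANTHROPIC_API_KEY is set

client = anthropic.Anthropic()


def run_once(script_path: str) -> tuple[bool, str]:
    """Run the candidate; report whether it exited cleanly, plus any stderr."""
    result = subprocess.run(
        [sys.executable, script_path], capture_output=True, text=True
    )
    return result.returncode == 0, result.stderr


def ask_for_edit(source: str, feedback: str) -> str:
    """Ask the model for the smallest possible change given the feedback."""
    msg = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model name
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "Current solution:\n```python\n" + source + "\n```\n\n"
                "Profiler report or traceback:\n" + feedback + "\n\n"
                "Suggest the smallest line-level edit that makes this faster "
                "while keeping the output identical."
            ),
        }],
    )
    return msg.content[0].text


def control_loop(seed_path: str, rounds: int = 10) -> None:
    for _ in range(rounds):
        passed, err = run_once(seed_path)
        # profile() is the Scalene helper from the earlier sketch.
        feedback = profile(seed_path) if passed else err
        with open(seed_path) as f:
            source = f.read()
        suggestion = ask_for_edit(source, feedback)
        # Applying the edit and deciding whether to keep the result are
        # sketched further down.
        print(suggestion)
```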
Each cycle:
- Run on a 1M-row sample (baseline 0.72 s).
- Save either the error trace or the profiler table.
- Ask the model for the smallest possible change.
- Keep the new champion if it’s faster and still correct (the timing gate is sketched below).
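Concretely, the "keep the new champion" step is just a timing gate. A minimal sketch, assuming the solution script takes the sample file as its first argument and prints its result to stdout; both are assumptions, since the real harness isn't shown here.

```python
import subprocess
import sys
import time


def benchmark(script_path: str, sample_path: str) -> tuple[float, str]:
    """Time one run on the 1M-row sample and capture its stdout for checking."""
    start = time.perf_counter()
    result = subprocess.run(
        [sys.executable, script_path, sample_path],
        capture_output=True, text=True, check=True,
    )
    return time.perf_counter() - start, result.stdout


def is_new_champion(candidate: str, sample_path: str,
                    reference_output: str, best_time: float) -> bool:
    """Keep the candidate only if it matches the reference output and beats
    the standing record. A single run is noisy; repeating and taking the
    median would be more robust."""
    try:
        elapsed, output = benchmark(candidate, sample_path)
    except subprocess.CalledProcessError:
        return False  # crashed: automatic veto
    return output == reference_output and elapsed < best_time
```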
Progress (minus the confetti)
| Iter | Time (s) | Model’s headline tweak |
| --- | --- | --- |
| 0 | 0.72 | Baseline |
| 2 | 0.42 | Use mmap instead of readline() |
| 5 | 0.32 | Slice bytes directly; skip split(',') |
| 7 | 0.22 | Reuse a single buffer |
| 9 | 0.13 | multiprocessing (2 cores) |
Things that needed guardrails
- Syntax errors: eventually fed the raw compiler output back into the prompt and asked for fixes.
- Performance regressions: compared each patch against the standing record and vetoed anything slower.
- Full rewrites: asked for line-level edits only; the diffs got smaller and the code got safer (one way to apply such edits is sketched below).
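For the "line-level edits only" guardrail, I won't claim this is the exact format I used, but one workable shape is to ask the model for JSON edits keyed by line number and apply them mechanically. A sketch under that assumption:

```python
import json


def apply_line_edits(source: str, edits_json: str) -> str:
    """Apply edits of the assumed form
    [{"line": 12, "new": "for row in rows:"}, ...].

    Lines are 1-indexed; a "new" value of null deletes the line. Deletions are
    deferred so later edits still refer to the original line numbers.
    """
    lines: list = source.splitlines()
    for edit in json.loads(edits_json):
        idx = edit["line"] - 1
        if 0 <= idx < len(lines):
            if edit.get("new") is None:
                lines[idx] = None  # mark for deletion
            else:
                lines[idx] = edit["new"]
    return "\n".join(line for line in lines if line is not None)
```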
Reinforcement learning?
Right now it’s more like evolutionary search: generate a variant, test it, keep the fittest. A proper RL setup would treat runtime and memory as the reward signal and adjust the policy weights accordingly. The ingredients are here; the recipe isn’t finished.
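If I do push this toward RL, the reward can come straight from the numbers the loop already collects. A minimal sketch, with an entirely made-up weighting:

```python
def reward(runtime_s: float, peak_mem_mb: float, correct: bool,
           mem_weight: float = 0.001) -> float:
    """Reward = negative runtime, lightly penalized for memory, floored if wrong.

    The weighting is arbitrary; a real setup would tune it (or normalize
    against the baseline run) before using it to update policy weights.
    """
    if not correct:
        return -1.0  # hard penalty for incorrect output
    return -(runtime_s + mem_weight * peak_mem_mb)
```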
Where to push next
- Full 1B rows — streaming IO; maybe hop back to C++ once the tooling is tamed.
- Cross‑language editing — Python loop, C++ target code.
- Built‑in unit tests — validate correctness before the expensive run.
- Public sandbox — paste code, watch an AI trim the fat.
- Beyond parsing — same loop on CUDA kernels, SQL queries, or anything else with a measurable objective.
Takeaway
Give a model real numbers and it stops hallucinating improvements and starts earning them. Today that saved a few hundred milliseconds. Tomorrow it could shave weeks off a simulation schedule. Not bad for a few lines of glue code and a profiler.
Aspirationally, I hope we see LLMs get real feedback as much as possible. Executing code and connecting to real-world sensors are the holy grail that can ground new models and lead to much better chains of thought.