Thoughts on coding agents

These are my current, and somewhat random, thoughts on coding agents. I am writing this mostly for myself so I can revisit it in 1-2 years and be amazed at how fast things are changing.

The biggest shift in software engineering I’ve seen in 20+ years #

I’ve been writing software for almost 25 years, and coding agents are probably the biggest shift in workflow I have experienced. They are also overhyped, with claims that “software engineering is a solved problem”, promises to replace experienced engineers, and talk of 10-100x productivity increases.

I do not think they are about to replace experienced developers, and I do not believe most developers are seeing gains anywhere near 10x in their regular day to day work. But even something like a consistent ~1.5x gain in productivity is a huge deal, and I think this new way of writing software is here to stay. We are still extremely early, and the tools we use in a few years will probably look very different from what we use today.

My new workflow and the shift from IDE to agent-first #

Like many others, my first genuinely useful AI-assisted coding experience was Copilot autocomplete in VS Code. I tried Cursor early on but bounced off quickly because I wasn’t getting great results. I came back several times as models improved. Eventually the models became good enough that Cursor became my go-to editor, and a large chunk of my code was generated by LLMs. But that didn’t last long: I found that I’m much more productive with CLI agents like Claude Code. The Cursor/VS Code IDE didn’t really add much value; it was mostly a resource-hungry desktop app that got in the way of the agent.

Today that’s still my setup. I use Codex CLI as my main coding agent, but I switch to Claude regularly and try different models. I sometimes use their respective desktop apps because they make it easier to manage multiple sessions, but those are just slow Electron-based wrappers around the CLI, so to me they are essentially equivalent. I also don’t see a huge difference between models. Sometimes Codex wins for me, sometimes Claude. Whatever difference exists is overshadowed by the effect of providing slightly different context or prompts. Unless the prompt is extremely simple, I always run my agents in planning mode first. I still use an IDE, Zed, but mostly as a lightweight interface for navigating, understanding, and reviewing code, not as a tool for writing code. Git diff is probably the single most important feature of an IDE for me now. I wouldn’t be surprised if in the next few years we see a new wave of IDEs focused on code understanding and diffs without any advanced editing features.

How I structure code has also changed. I now think about organizing code and documentation in a way that is optimal for an agent: more comprehensive docstrings directly in the code, tests that double as API and use-case examples, and so on. I want the agent to read as few files as possible to gather the necessary context for a task.
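
As a rough illustration of what I mean (the rate-limiter class and test below are made up for this example, not taken from a real project), a docstring that spells out behavior plus a test that doubles as a usage example give an agent most of the context it needs without reading many files:

```python
# rate_limiter.py: hypothetical module, used only to illustrate agent-friendly structure.
import time


class SlidingWindowRateLimiter:
    """Allow at most `max_calls` calls per `window_seconds`.

    Intended usage:
        limiter = SlidingWindowRateLimiter(max_calls=5, window_seconds=60)
        if limiter.allow():
            do_request()

    Calls outside the window are forgotten. The limiter is not thread-safe.
    """

    def __init__(self, max_calls: int, window_seconds: float) -> None:
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._timestamps: list[float] = []

    def allow(self) -> bool:
        now = time.monotonic()
        # Drop timestamps that have fallen out of the window.
        self._timestamps = [t for t in self._timestamps if now - t < self.window_seconds]
        if len(self._timestamps) >= self.max_calls:
            return False
        self._timestamps.append(now)
        return True


# In a real project this test would live in test_rate_limiter.py; it doubles
# as a use-case example the agent can read instead of guessing the API.
def test_rate_limiter_blocks_after_budget_is_spent():
    limiter = SlidingWindowRateLimiter(max_calls=2, window_seconds=60)
    assert limiter.allow()
    assert limiter.allow()
    assert not limiter.allow()  # third call within the window is rejected
```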

“I don’t write code anymore” is true, but also misleading #

By now I have worked on several medium-sized projects where I did not type a single line of code. That statement is true, but it is also misleading because the work did not magically disappear. It just moved to a different language.

I often have to specify libraries, interfaces, tradeoffs, and simplifications in painful detail. “Use this package, not that one, it’s outdated and unmaintained.” “This abstraction is too complex, rewrite it with these primitives.” “That interface is not future-proof,” etc. This is still the same old software engineering; it just happens in a language closer to natural language. It still requires the same low-level decisions and engineering knowledge. For reasonably complex systems I cannot imagine removing an experienced engineer from the loop. The hard part is not typing code. The hard part is knowing what should be built and understanding the right context at the right time.

One may argue that in the future LLMs will be so advanced that they can figure out the perfect context all by themselves, but I don’t see this happening. How “smart” an agent can be is a direct function of its context, which leads to a recursive problem: to figure out the perfect context, an agent needs the perfect context to figure out that context. There needs to be an expert human somewhere in this loop to manage the initial input, all the way back to filtering spam out of the training data. The further the initial input is removed from the real-world task, the more likely the agent is to go off the rails. What is happening is that the everyday work of an engineer is shifting more towards being a context provider and manager/mentor. And I expect we’ll see plenty of companies trying to solve the context provider problem.

Software development is model-building, not text generation #

Software development is as much about building mental models of the problem as it is about producing code. It is a process of learning. The exact requirements are rarely known upfront, and implementation is often the process that reveals the edge cases and bad assumptions to us. Manual coding had a hidden benefit: while writing and debugging code we were constantly thinking and learning about the problem, both consciously and unconsciously. Even while writing boilerplate that doesn’t require conscious effort, your brain is synthesizing related information in the background.

If agents remove too much of that loop we ship faster in the short term but understand less of what we are building and why. That gap shows up later as bad architecture decisions, slower debugging, and inability to drive agents effectively due to lack of context. Before using LLMs I never really appreciated this fact. Modeling the problem was something that just happened automatically in my brain. It was a background task always running while writing code. But with coding agents this understanding no longer comes “for free” as a side effect of implementation. It has to be built deliberately and explicitly, and we’ll probably need new tools and workflows for that.

Parallel agents vs. human attention #

I rarely use parallel agents and I think people are somewhat overestimating their usefulness. If my task is well-defined it doesn’t usually take more than 5 minutes for an agent to finish. I don’t see much point in firing off 10 parallel tasks because the bottleneck is not writing code. It is me giving feedback and providing more context.

But more importantly, the context switching that comes from parallel threads creates mental fatigue. If you feel burned out more quickly when using coding agents, this is probably why. While waiting for a task to finish you start doing other things: looking at your browser, social media, firing off or checking in with another agent, checking your phone, etc. This can happen hundreds of times throughout the day and our brains don’t deal well with that. We all know this problem from mobile phones and social media.

Instead of more parallelization I think we may need better ways to stay involved in the work an agent does in real time and get into a flow state. Watching an agent work should be fun and engaging. Part of this is just speed: if models were 10-100x faster we would wait less and there would be no time to do something else. Another approach is new interface types that keep developers engaged by visualizing the agent’s work and using that to help build the understanding and mental models that we’re losing by not writing code ourselves.

There are exceptions where parallel agents can be truly useful though. For example, long “deep research” loops or workflows blocked on slow external APIs, rate limits, slow compilation, and so on.

Prompts versus code #

The ambiguity of natural language is both a blessing and a curse. We don’t want to write 1,000 words of instructions to generate 100 lines of reliable code. But we also don’t want to be forced to iterate on sloppy generated code multiple times because we failed to provide sufficient context upfront. Natural language is inefficient for specifying exact requirements; something like a formal specification language, or code, would be better. But natural language excels at compressing instructions by assuming that everyone shares a similar mental model of the world/domain/task. Often that shared world model doesn’t exist. You see humans arguing with each other about the semantics of words all the time, and for an agent we have to explicitly provide any context that is not in its weights.

Code is deterministic and fast, but verbose and low-level. Prompts are usually more concise because they are ambiguous and rely on a shared world model as a dependency, but they are slower and more expensive to execute. That creates an interesting design question: should a feature (or app) be implemented as a prompt, code, or something in between?

I recently built a small personal tool that uses proxy rotation to fetch audio from YouTube videos, transcribes it with a transcription service, then reformats it with a fixed template. At one end of the spectrum, the whole workflow could be a prompt. The model can figure out how to use yt-dlp, route through residential proxies to get around bot detection, and call the transcription APIs. I’m pretty confident it would one-shot the task. For each new video I could submit a new prompt with just a link. But without memory or cached context the model has to rediscover the workflow and edge cases each time.

At the other end of the spectrum I could write a large Python script that executes the workflow step by step, handles all edge cases, and only uses an LLM for transcription and summarization. That version is faster, cheaper to run, and deterministic to debug. But it costs more engineering time up front, is harder to maintain, and assumes I already know most failure modes, which is usually not true.
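
To make the tradeoff concrete, here is a rough sketch of that code-heavy end of the spectrum. It is not the actual tool: the proxy list, output template, and transcribe_audio helper are placeholders, and only the yt-dlp invocation reflects a real CLI.

```python
# fetch_and_transcribe.py: a deterministic sketch of the workflow (not the real tool).
# Assumes yt-dlp is installed; PROXIES, TEMPLATE, and transcribe_audio are placeholders.
import random
import subprocess
import sys
from pathlib import Path

PROXIES = ["http://proxy1:8080", "http://proxy2:8080"]  # placeholder proxy pool
TEMPLATE = "# Transcript\n\n{body}\n"  # fixed output template


def download_audio(url: str, out_dir: Path) -> Path:
    """Fetch the audio track with yt-dlp, rotating through proxies until one works."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for proxy in random.sample(PROXIES, k=len(PROXIES)):
        result = subprocess.run(
            ["yt-dlp", "-x", "--audio-format", "mp3", "--proxy", proxy,
             "-o", str(out_dir / "%(id)s.%(ext)s"), url],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            # Assumes one download per run; a real script would track the exact output file.
            return next(out_dir.glob("*.mp3"))
    raise RuntimeError("all proxies failed")


def transcribe_audio(audio_path: Path) -> str:
    """Placeholder for whatever transcription service the workflow calls."""
    raise NotImplementedError("call the transcription API here")


def main(url: str) -> None:
    audio = download_audio(url, Path("downloads"))
    transcript = transcribe_audio(audio)
    print(TEMPLATE.format(body=transcript))


if __name__ == "__main__":
    main(sys.argv[1])
```

Even at sketch level the cost is visible: every edge case the prompt version would improvise around (proxy failures, audio formats, template quirks) has to be written down and maintained as code.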

There are many approaches in between. I could ask the model to analyze prior runs and write lessons into a memory.md file so it avoids repeating mistakes. Or I can push towards code by having it generate Python scripts for specific substeps, plus explicit markdown instructions and best practices, which is basically what an agent skill is. The process can also be iterative: start prompt-first, collect feedback and edge cases from real runs, then move parts of the workflow into deterministic code to reduce costs. That is one pattern I expect to become quite common. The loop can also be run in the other direction. It can be useful to replace a brittle 5,000-line Python script with a simple prompt routed to a cheaper model.

Guardrails beat vibes #

In my experience, productivity with coding agents ranges anywhere from 0.1x (net negative) to 10x depending on the task. The biggest gains usually come in situations where strict correctness does not matter much (quick prototype/experiment/MVP), or correctness is tightly constrained by strong guardrails.

Here are a few examples of tasks where agents truly shined and I saw productivity increases of 10x or more:

  • Large mechanical refactors in well-tested codebases: rename symbols, move files, update imports, tweak interfaces, etc. If I can say “don’t change tests except names/imports, and everything must still pass,” that is often close to a one-shot task. I have seen refactors like this go from 1-2 hours manually to ~5 minutes with an agent.
  • Scaffolding greenfield projects where most code is standard boilerplate the model has seen many times in its training data, e.g. a FastAPI or React app with a few standard features.
  • One-off data analysis tasks with common setup like loading/parsing JSON, pandas wrangling, duckdb queries, matplotlib plotting, etc. A lot of data analysis code is throwaway code anyway and the only thing that matters is the output artifact, which is a great fit for agents (see the sketch after this list).
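
To illustrate the kind of throwaway analysis script I mean (the events.jsonl file and its columns are invented for this example):

```python
# throwaway_analysis.py: typical disposable analysis; file and column names are made up.
import json
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

# Load newline-delimited JSON events from a hypothetical events.jsonl file.
records = [json.loads(line) for line in Path("events.jsonl").read_text().splitlines() if line]
df = pd.DataFrame(records)

# Basic wrangling: parse timestamps and count daily events per type.
df["timestamp"] = pd.to_datetime(df["timestamp"])
daily = df.groupby([df["timestamp"].dt.date, "event_type"]).size().unstack(fill_value=0)

# The plot is the only artifact that matters; the script itself gets thrown away.
daily.plot(kind="bar", stacked=True, figsize=(10, 4), title="Daily events by type")
plt.tight_layout()
plt.savefig("daily_events.png")
```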

Unfortunately these are not typical day-to-day engineering tasks for most people. You are not scaffolding new MVPs or refactoring large existing codebases every day (unless you are an AI influencer). Most daily work is messy brownfield work in more complex codebases with a lot of implicit context. In those settings agents are still extremely useful, but I would estimate the average productivity gain to be closer to 1-2x, not 10x. While it doesn’t happen often, I have had a couple of sessions where agents failed spectacularly because the task or context was underspecified, and I spent more time cleaning up the output than I would have spent coding it myself. The challenge is that it’s not obvious upfront exactly what context an agent needs and what it can figure out by itself.

I do expect that guardrails optimized for agentic coding will become a more important part of writing code. Strongly typed languages and compilers are at an advantage because they give immediate feedback. Dynamic languages still work well with the right tooling, but optional type checkers like mypy are often easier to skip, harder to enforce consistently, and noisier in prompts. Tests, especially higher-level integration tests (or behavior specifications) that can survive larger refactors and act as a type of documentation, have also been quite helpful. Maybe we will see a bit of a comeback of the kind of behavior-driven development that was popular in the 2000s.
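
As a hedged illustration of such a behavior-level test (the order_total function and its discount rules are invented for this example): it pins down observable behavior rather than internals, so it survives refactors and reads like documentation.

```python
# test_order_behavior.py: runs under pytest; everything here is a made-up example.
from __future__ import annotations


def order_total(prices: list[float], discount_code: str | None = None) -> float:
    """Sum the prices and apply a single 10% discount for the WELCOME10 code.

    In a real project this would live in application code and be imported by the test.
    """
    total = sum(prices)
    if discount_code == "WELCOME10":
        total *= 0.9
    return round(total, 2)


def test_discount_is_applied_once_per_order():
    # The assertion describes observable behavior, not implementation details,
    # so it keeps passing across internal refactors.
    assert order_total([20.0, 20.0], discount_code="WELCOME10") == 36.0


def test_unknown_discount_code_changes_nothing():
    assert order_total([20.0, 20.0], discount_code="BOGUS") == 40.0
```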

Conclusion #

Only LLMs write conclusions these days.