I tried using Claude Code while writing Caledonia, and these are the notes I took on the experience. It’s possible some of the deficiencies are due to the model’s smaller training set of OCaml code compared to more popular languages, but there’s work being done to improve this situation.
It needs a lot of hand-holding, often finding it very difficult to recover from simple mistakes. For example, it frequently forgot to bracket nested match expressions:
```ocaml
match expr1 with
| Pattern1 -> match expr2 with
  | Pattern2a -> result2a
  | Pattern2b -> result2b
| Pattern2 -> result2
```
It found this difficult to fix, as the compiler error message only showed the line with `Pattern2`. An interesting note here is that tools that are easy for humans to use, e.g. those with great error messages, are also easy for the LLM to use. But, unlike (I hope) a human, even after I added a rule to `CLAUDE.md` to avoid this, it frequently ignored it.
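For reference, the fix is to bracket the inner match (with parentheses or `begin ... end`) so the outer match’s final arm isn’t swallowed:

```ocaml
match expr1 with
| Pattern1 ->
    (match expr2 with
     | Pattern2a -> result2a
     | Pattern2b -> result2b)
| Pattern2 -> result2
```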
It often makes code very verbose or inelegant, especially after repeated rounds of back-and-forth with the compiler. It rarely shortens code, whereas some of the best changes I make to codebases have a negative impact on the lines of code (LoC) count. I think this is how you end up with 35k LoC recipe apps, and I wonder how maintainable these codebases will be.
If you give it a high-level task, even after creating an architecture plan, it often makes poor design decisions that don’t consider future scenarios. For example, it combined all the .ics files into a single calendar, which will make it impossible to write edits back when it comes to modifying events. Another example of where it unnecessarily constrained interfaces was by making query and sorting parameters variants, whereas porting to a lambda and comparator allowed for more expressivity with the same brevity.
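To illustrate the latter point, here is a minimal sketch with hypothetical types and names (Caledonia’s actual interfaces aren’t reproduced here), contrasting a closed variant with a comparator:

```ocaml
(* Hypothetical event type, for illustration only. *)
type event = { summary : string; start : float }

(* Variant approach: the set of orderings is closed; each new
   ordering needs a new constructor and a new branch here. *)
type sort_order = By_start | By_summary

let sort_with_variant order events =
  let cmp =
    match order with
    | By_start -> fun a b -> compare a.start b.start
    | By_summary -> fun a b -> compare a.summary b.summary
  in
  List.sort cmp events

(* Comparator approach: callers can express any ordering without
   touching this module, with the same brevity at the call site. *)
let sort_with_cmp cmp events = List.sort cmp events

let _ =
  let events = [ { summary = "b"; start = 2. }; { summary = "a"; start = 1. } ] in
  (sort_with_variant By_start events,
   sort_with_cmp (fun a b -> compare a.summary b.summary) events)
```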
But while programming I often find myself doing a lot of ‘plumbing’ things through, and it excels at these more mundane tasks. It’s also able to do more intermediate tasks, with some back and forth about design decisions. For example, once I got the list command working, it was able to get the query command working without me writing any code – just prompting with design suggestions like pulling common parameters into a separate module (see the verbosity point again). Another example of a task where it excels is writing command-line argument parsing logic, with more documentation than I would have the will to write myself.
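Caledonia’s actual CLI code isn’t shown here, but as a sketch of the kind of parsing logic involved, assuming Cmdliner (the usual OCaml choice) and hypothetical flag names:

```ocaml
open Cmdliner

(* Hypothetical flags, not Caledonia's real interface. *)
let calendar =
  let doc = "Calendar to list events from." in
  Arg.(value & opt (some string) None & info [ "c"; "calendar" ] ~docv:"NAME" ~doc)

let limit =
  let doc = "Maximum number of events to show." in
  Arg.(value & opt int 10 & info [ "n"; "limit" ] ~docv:"N" ~doc)

let list_events calendar limit =
  Printf.printf "listing %d events from %s\n" limit
    (Option.value calendar ~default:"all calendars")

let cmd =
  let doc = "List upcoming events." in
  Cmd.v (Cmd.info "list" ~doc) Term.(const list_events $ calendar $ limit)

let () = exit (Cmd.eval cmd)
```

The `~doc` strings are where that willingness to write documentation pays off, since Cmdliner turns them into `--help` and man-page output for free.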
It’s also awesome to get it to write tests, which I would otherwise never do for a personal project, even with the above caveats applying to them. Tests also give the model something to check against when making changes, though when it encounters a failing test it tends to change the test so that it passes – even if that makes the test incorrect – rather than fixing the underlying problem.
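As a sketch of what such a generated test might look like, assuming Alcotest (a common OCaml test framework) and a hypothetical function under test:

```ocaml
(* overlaps is a hypothetical stand-in for whatever function
   is actually under test. *)
let overlaps (s1, e1) (s2, e2) = s1 < e2 && s2 < e1

let test_overlap () =
  Alcotest.(check bool) "adjacent events do not overlap" false
    (overlaps (0., 1.) (1., 2.))

let () =
  Alcotest.run "caledonia"
    [ ("events", [ Alcotest.test_case "overlap" `Quick test_overlap ]) ]
```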
It’s somewhat concerning that this agent is running without any sandboxing. There is some degree of control over what directories it can access, and what tools it can invoke, but I’m sure a sufficiently motivated adversary could trivially get around all of them. While deploying Enki on hippo I tested out using it to change the NixOS config, and after making the change it successfully invoked `sudo` to do a `nixos-rebuild switch`, as I had just used `sudo` myself in the same shell session. Patrick’s work on shelter could prove invaluable for this, while also giving the agent ‘rollback’ capabilities!
Something I’m wondering about while using these agents is whether they’ll just be another tool to augment the capabilities of software engineers, or whether they’ll increasingly replace the need for software engineers entirely. I tend towards the former, but only time will tell.
If you have any questions or comments on this feel free to get in touch.