I tried using Claude Code while writing Caledonia, and these are the notes I took on the experience. It’s possible some of the deficiencies are due to the model’s smaller training set of OCaml code compared to more popular languages, but there’s work being done to improve this situation.
It needs a lot of hand-holding, often finding it very difficult to recover from simple mistakes. For example, it frequently forgot to bracket nested match expressions:
```ocaml
match expr1 with
| Pattern1 -> match expr2 with
  | Pattern2a -> result2a
  | Pattern2b -> result2b
| Pattern2 -> result2
```
It found this difficult to fix, as the compiler error message only showed the line with `Pattern2`. An interesting note here is that tools that are easy for humans to use, e.g. those with great error messages, are also easy for the LLM to use. But, unlike (I hope) a human, even after I added a rule to `CLAUDE.md` to avoid this, it frequently ignored it.
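For reference, the fix is to bracket the inner match (with parentheses or `begin ... end`) so the outer match’s final arm isn’t swallowed:

```ocaml
match expr1 with
| Pattern1 ->
    (match expr2 with
     | Pattern2a -> result2a
     | Pattern2b -> result2b)
| Pattern2 -> result2
```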
It often makes code very verbose or inelegant, especially after repeated rounds of back-and-forth with the compiler. It rarely shortens code, whereas some of the best changes I make to codebases have a negative impact on the lines of code (LoC) count. I think this is how you end up with 35k LoC recipe apps, and I wonder how maintainable these codebases will be.
If you give it a high-level task, even after creating an architecture plan, it often makes poor design decisions that don’t consider future scenarios. For example, it combined all the .ics files into a single calendar, which will make it impossible to write edits back when it comes to modifying events. Another example of where it unnecessarily constrained interfaces was by making query and sorting parameters variants, whereas porting to a lambda and comparator allowed for more expressivity with the same brevity.
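To illustrate the latter point, here is a minimal sketch with hypothetical types and names (Caledonia’s actual interfaces aren’t reproduced here), contrasting a closed variant with a comparator:

```ocaml
(* Hypothetical event type, for illustration only. *)
type event = { summary : string; start : float }

(* Variant approach: the set of orderings is closed; each new
   ordering needs a new constructor and a new branch here. *)
type sort_order = By_start | By_summary

let sort_with_variant order events =
  let cmp =
    match order with
    | By_start -> fun a b -> compare a.start b.start
    | By_summary -> fun a b -> compare a.summary b.summary
  in
  List.sort cmp events

(* Comparator approach: callers can express any ordering without
   touching this module, with the same brevity at the call site. *)
let sort_with_cmp cmp events = List.sort cmp events

let _ =
  let events = [ { summary = "b"; start = 2. }; { summary = "a"; start = 1. } ] in
  (sort_with_variant By_start events,
   sort_with_cmp (fun a b -> compare a.summary b.summary) events)
```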
But while programming I often find myself doing a lot of ‘plumbing’ things through, and it excels at these more mundane tasks. It’s also able to do more intermediate tasks, with some back and forth about design decisions. For example, once I got the list command working, it was able to get the query command working without me writing any code – just prompting with design suggestions like pulling common parameters into a separate module (see the verbosity point again). Another example of a task where it excels is writing command-line argument parsing logic, with more documentation than I would have the will to write myself.
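Caledonia’s actual CLI code isn’t shown here, but as a sketch of the kind of parsing logic involved, assuming Cmdliner (the usual OCaml choice) and hypothetical flag names:

```ocaml
open Cmdliner

(* Hypothetical flags, not Caledonia's real interface. *)
let calendar =
  let doc = "Calendar to list events from." in
  Arg.(value & opt (some string) None & info [ "c"; "calendar" ] ~docv:"NAME" ~doc)

let limit =
  let doc = "Maximum number of events to show." in
  Arg.(value & opt int 10 & info [ "n"; "limit" ] ~docv:"N" ~doc)

let list_events calendar limit =
  Printf.printf "listing %d events from %s\n" limit
    (Option.value calendar ~default:"all calendars")

let cmd =
  let doc = "List upcoming events." in
  Cmd.v (Cmd.info "list" ~doc) Term.(const list_events $ calendar $ limit)

let () = exit (Cmd.eval cmd)
```

The `~doc` strings are where that willingness to write documentation pays off, since Cmdliner turns them into `--help` and man-page output for free.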
It’s also awesome to get it to write tests, which I would otherwise never do for a personal project, even with the above caveats applying to them. Tests also give the model something to check against when making changes, though when it encounters a failing test it tends to change the test so that it passes – even if that makes the test incorrect – rather than fixing the underlying problem.
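As a sketch of what such a generated test might look like, assuming Alcotest (a common OCaml test framework) and a hypothetical function under test:

```ocaml
(* overlaps is a hypothetical stand-in for whatever function
   is actually under test. *)
let overlaps (s1, e1) (s2, e2) = s1 < e2 && s2 < e1

let test_overlap () =
  Alcotest.(check bool) "adjacent events do not overlap" false
    (overlaps (0., 1.) (1., 2.))

let () =
  Alcotest.run "caledonia"
    [ ("events", [ Alcotest.test_case "overlap" `Quick test_overlap ]) ]
```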
It’s somewhat concerning that this agent is running without any sandboxing. There is some degree of control over what directories it can access, and what tools it can invoke, but I’m sure a sufficiently motivated adversary could trivially get around all of them. While deploying Enki on hippo I tested out using it to change the NixOS config, and after making the change it successfully invoked `sudo` to do a `nixos-rebuild switch`, as I had just used `sudo` myself in the same shell session. Patrick’s work on shelter could prove invaluable for this, while also giving the agent ‘rollback’ capabilities!
Something I’m wondering about while using these agents is whether they’ll just be another tool to augment the capabilities of software engineers, or whether they’ll increasingly replace the need for software engineers entirely. I tend towards the former, but only time will tell.
If you have any questions or comments on this feel free to get in touch.