Using LLMs Daily
06 Jul 2025 artificial-intelligence

The software industry is moving towards a world where every engineer is expected to use LLMs more; the practice of keeping track of LLM usage frequencies and advocating for more usage will probably become the norm in a few years.1 I don't believe the hype; I think we're in a fairly early state. If one looks at what is possible today, and at this area's historical tendency to over-promise and under-deliver, it seems highly unlikely that significant changes are just around the corner. Yet every article about AI carries the mandatory suffix: "AI can't do that; at least, not yet." I am not sure what that confidence is based on. As a full-time software engineer, I have been unable to ignore these products completely. People in other professions should definitely consider ignoring them, even though there are vague reports of LLMs having some impact everywhere. This post is a summary of what I have been using LLMs for recently and where I think they are usable.
I did not use LLMs for real until early this year. I would occasionally use some bot as a toy, mainly for amusement. My favorite was to plug in queries such as "why this movie bad". The response was an unnecessarily long essay in which the bot pretended to be thoughtful and went on to list a bunch of extremely mundane and generic reasons, which would apply to almost any bad movie. The same thing happened with most of the other prompts I was using. Now that I have started using LLMs more seriously and on a semi-regular basis, I am starting to see where these systems are helpful.
This post will be provider-agnostic. A cottage industry has sprung up around comparing and ranking
models, and how they perform on a variety of “tasks.” Another one has sprung up around comparing the
products that are built on top of these models. Most of these are like performance benchmarks for
traditional software: they have little relevance to using said software in real life. MySQL
benchmarks look great on paper, and I have had a lot of fun using sysbench
to prove that a single
configuration change will make binary log processing 5 times more performant. However, when this
change is applied in production, the results are underwhelming.
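To make the analogy concrete, the kind of experiment I mean looks roughly like the sketch below. sysbench and MySQL are real tools, but sync_binlog is only my stand-in for "a single configuration change", not necessarily the one I actually measured; the credentials and sizes are made up.

```
# Prepare a small write-heavy workload (assumes sysbench 1.x and a local MySQL).
sysbench oltp_write_only --mysql-user=sbtest --mysql-password=secret \
  --tables=8 --table-size=100000 prepare

# Baseline run with durable binary log syncing (the hypothetical "before").
mysql -u root -e "SET GLOBAL sync_binlog = 1"
sysbench oltp_write_only --mysql-user=sbtest --mysql-password=secret \
  --tables=8 --table-size=100000 --threads=16 --time=60 run

# Relax binlog syncing and run again; on paper the difference looks dramatic,
# but in production it rarely translates one-to-one.
mysql -u root -e "SET GLOBAL sync_binlog = 0"
sysbench oltp_write_only --mysql-user=sbtest --mysql-password=secret \
  --tables=8 --table-size=100000 --threads=16 --time=60 run
```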
I remember reading about models being able to pass standardized tests, and win games like Diplomacy, in articles from a couple of years ago; these seem like fuel for the hype cycle to me: put your model up against 10,000 tests, publish the 10 that it does exceptionally well on. One post I read attempted to compare a few models on a single task. The main thing I took away from it was the pointless nature of the whole ordeal. Despite testing many setups, the author proved nothing, because there was no way to reduce these models to systems that can be compared one-to-one against each other: if one model did better on one metric, another would have done better on a completely different metric. There is a lot of subjective stuff happening inside the LLM; no one understands how to compare this behavior objectively.
What Works
Some things do work well when conversing with an LLM. Something I have found useful is summarizing simple information and converting unstructured information into structured information (tables, JSON, YAML). For example, if you were comparing phones and wanted the release dates, screen sizes, and prices as a table, most LLMs can create this easily. This works irrespective of the input and output formats; supplying a CSV file and asking for a new column with some data, or any other kind of text processing that could be done with awk and sed, will probably work. I have mostly used these on YAML files, which I come across periodically while dealing with configurations. These use-cases are not surprising. Spreadsheets are able to do a lot of great and magical things, and the conversational interface is probably an improvement for some users. The model's ability to explain its methodology is certainly the most useful part.
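As an illustration of the kind of mechanical transformation I mean, here is the "add a column to a CSV" task done the old way; the file name, the columns, and the conversion rate are all made up for the example.

```
# phones.csv is hypothetical, with columns: name,release_date,screen_inches,price_usd
# Append a price_eur column using a fixed, purely illustrative conversion rate.
awk -F',' 'NR == 1 { print $0 ",price_eur"; next }
           { printf "%s,%.2f\n", $0, $4 * 0.92 }' phones.csv
```

An LLM will usually get this right too; the point is that the task is deterministic and checkable either way.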
One surprising use-case for me personally has been talking through open-ended questions. I don't expect a real or accurate answer from the LLM at all; I expect responses that point me to places where a piece of configuration or some code might be wrong. I am engaging in a conversation in order to figure out which way to go, what to test, and which parts of the code to investigate further. This has been useful when dealing with new technologies, where I am unfamiliar with the standard patterns. By asking questions, looking into the suggested options, and poking holes in what the LLM is offering, I was able to learn the underlying technology iteratively, and it helped me get to a good solution. I don't think this process is faster, but I definitely felt that I learned more than I would have if there had been no LLM at all and I was just trying a bunch of things to get to the same solution. In some sense, it is like talking to a peer about a problem that you're working on. However, the solution's quality is limited by two factors that get lost in the AI-driven hype fog:
- How much do you already know about the basics of the problem space? Even if you are not familiar with the specifics of the system you are using, are you able to model the solution well enough in your mind? When an LLM suggests something, are you able to verify that it is going in the right direction?
- How much are you willing to trust the model? If the LLM suggests a solution, are you going to implement it as-is in production? Do you have other environments where you can test the solution? Do you have peers who can validate the solution before you accept it? Do you understand all the side-effects of the suggested solution?
I don't think I could ever really trust a solution that comes from the model directly. I would like to run a sandbox environment and test what I am planning to do. Depending on the task's complexity and the potential for subtle mistakes, there might be places where the model output is usable as-is (say, a script which only reads from an API). I believe LLMs work better as conversation starters than as solution finders.
Another use-case is answering questions whose answers could be found deterministically, but for which the software does not exist yet: Why do some people prefer to use non-ISO timestamps in non-UTC time zones? Why is it hard to convert those times into something that is easier to understand? Not having to visit worldtimebuddy.com in order to find out what "CST" means would be great. Again, the questionable trustworthiness of LLMs rears its ugly head. A deterministic program would be easier to trust and use on a daily basis.
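For the time-zone case, the deterministic tool mostly exists already; this is a minimal sketch assuming GNU coreutils date, and "America/Chicago" is only my guess at what a person writing "CST" usually means.

```
# Convert a hypothetical "CST" timestamp to UTC (assumes GNU date).
date --date='TZ="America/Chicago" 2025-07-06 14:30' --utc

# Or render it in your own local time zone instead.
date --date='TZ="America/Chicago" 2025-07-06 14:30'
```

The hard part is not the conversion; it is knowing which of the several "CST"s the author meant, which is exactly where an LLM's confident guess is least trustworthy.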
A variation of this use-case is one where I am willing to roll the dice: simple syntax questions where the cost of being wrong is almost non-existent. "How do I specify defaults in Helm templates?"; "How do you use getopts in Bash again?"; "Is it case-when or switch-whatever in this language?" … Mistakes are caught right away, if you have the decency to use a compiled language.2
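The getopts question is a good example of how cheap it is to verify the answer; a quick sketch, with the flag names invented for illustration:

```
# Parse a boolean -v flag and an -o flag that takes an argument (hypothetical options).
while getopts "vo:" opt; do
  case "$opt" in
    v) verbose=1 ;;
    o) output="$OPTARG" ;;
    *) echo "usage: $0 [-v] [-o file]" >&2; exit 1 ;;
  esac
done
shift $((OPTIND - 1))
```

Run it once with a bad flag and you know immediately whether the model made something up.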
What Doesn’t Work
There are many things that do not work well, though. These are, in my mind, segmented into two experiences: being the generator of content, and being the reader of AI-generated text that is unnecessarily verbose garbage.
As a generator, for me, LLMs have not worked for anything to do with general Linux configuration. The mind-boggling multitude of valid configurations, the various versions of software involved, an inability to really parse structured documentation (such as manual pages), and a general lack of "convention over configuration" make them a very bad hammer. There are as many ways to use Helm as there are engineers using it. Understanding where a variable is being set from inside an Ansible task is quite difficult. There are a variety of ways in which systemd configurations are written, stored, and referenced. How Gnome does something is explained in documentation, but this might change slightly from one version to the next, and these changes might not be rigorously documented anywhere.
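To show what I mean by the Ansible case, this is the kind of manual digging it takes to find out where a variable's value is actually coming from; the inventory path, host name, and variable name are all hypothetical.

```
# Print the value a host ends up with for a variable, after inventory-level
# precedence (group_vars, host_vars, extra vars) has been applied.
ansible web01 -i inventory/production -m ansible.builtin.debug -a "var=my_setting"

# Dump the inventory variables that apply to that host (group_vars and host_vars merged).
ansible-inventory -i inventory/production --host web01
```

An LLM that has never seen the repository's group_vars layout has no way to answer this reliably; it can only guess at the convention.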
On the Gnome side, I was never able to figure out some simple things, such as how Gnome's session start-up really works, and whether I can control which services are started up along with the gnome-shell session. (I wanted the GSD Smartcard service to stop starting up automatically whenever I had a YubiKey plugged in; the LLM sent me on a merry-go-round for a while, asking me to edit various files and check the result by rebooting. Finally, I gave up, looked for StackOverflow-style answers from other people, and found one that explained how to do it, and why the solution works.)
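For what it is worth, the deterministic poking that eventually settles this kind of question looks something like the sketch below. It assumes a recent GNOME session where the settings-daemon components run as systemd user units; the exact unit names may differ by version, and this is not necessarily the answer I ended up using.

```
# List the GNOME Settings Daemon units wired into the user session
# (assumes a systemd-managed GNOME session, roughly 3.34 or newer).
systemctl --user list-units 'org.gnome.SettingsDaemon.*'

# Inspect what pulls the smartcard component in, then mask it so it stops auto-starting.
systemctl --user status org.gnome.SettingsDaemon.Smartcard.service
systemctl --user mask org.gnome.SettingsDaemon.Smartcard.service
```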
There is also the constant risk of LLMs making up non-existent syntax, or using non-existent packages and asking the user to install those packages. I don’t think there will be any huge “jump” here. It is simply harder to deal with a significant Ansible repository (with hundreds of roles, playbooks, group variables) than it is to build something new and self-contained.
My perspective on the value of being able to build something new is that it is a useful skill, but it does not come up frequently enough to be worth automating. In my limited experience, engineers rarely build completely new things. Most of the time, software engineers are iteratively improving existing software. When it does come time to build something completely new, it would be much better to start from a blank canvas than by making small changes to an AI-generated base application.
(Wasn’t the whole genesis of the failed experiment known as Web 3 that software engineers were bored with building small improvements to existing stuff and wanted to … start from an empty canvas and revolutionize the world (of finance)?)
Generating boilerplate (or unit tests) is cited as another common use-case. Even writing boilerplate code can be useful (if not fun). If the process of writing it once helps you improve your personal tooling, enabling you to do the same thing much faster the next time you need to do it, then writing boilerplate is a required step in the process of sharpening your tools. Tooling improvements compound: a change to your editor configuration that makes boilerplate easier to write will make many other things easier. It was by writing Golang's table-driven tests manually for a month that I started thinking about snippets and eventually found packages that made storing and inserting snippets into files easier. Once I knew about them, I used snippets for many other use-cases. Another one I remember: I learned about visual block editing in Vim while editing some constant definitions in PHP; that skill later let me deal with TSV files much faster than before.
Learning how to do it by customizing your tool is better in the long run than what might be the knee-jerk reaction if an LLM were close at hand: open an LLM in your web browser, ask it to do the thing, and wrestle with it until it produces the right output. This is the classic short- vs. long-term time-knowledge trade-off: spending a bit more time today to gain some knowledge that will make doing everything faster months or years from now.
Being a reader of AI-generated text has been absolutely frustrating. Whenever you see a document with too many headings, with each heading followed by a short sentence that regurgitates the title and a numbered list below that sentence, you know what follows is AI-generated. These documents are often incredibly hard to read, because they drone on and on about obvious things. While most of what is generated is not wrong, it focuses on the wrong things, the generic things. The most important thing about text written to communicate an idea to another human being is an understanding of the constraints one has to operate under. Is there a reason that the latest re-org makes this idea the most important thing now? Is there a business reason that we cannot use the right technology to solve this problem, requiring some hideous workaround? Is there a general feeling that fully-managed services are better for a company because we are not sure what the headcount of our team will be in 4 months? These are the hard questions people want to see answered when they are evaluating your idea.
Reading bad AI-generated descriptions makes some of the descriptions I have seen before look much better (though still not justifiable as "effective communication"):
- TTSIA (or "the title says it all")
- A link to another page, without any explanation
  - Bonus: that page does not have any explanation either
- A link to a Slack thread; again, without any explanation
I used to believe that those were the absolute worst; but AI-generated descriptions are worse. Descriptions are for communication between people.
Another pattern that is catching on is putting task descriptions into an LLM and submitting the generated code as-is for review. Most of the companies which build "coding assistants" are actively boosting this use-case, because they believe that it will really move the needle (for them). The products that I have used are nowhere near this level: they struggle to understand the file that I have open in the IDE in front of me right now; even when they can understand a single file, they are not able to look for other files and understand the structure of the code; most are unable to generate a valid Git patch that will apply to a file that I submitted as input.
Red Lines
There are some red lines for me: areas in my life where I don’t see myself ever using LLMs.
- I don't want to use LLMs for my personal projects, or for writing notes or posts on this blog. I do these for fun and learning; the output is secondary. Debugging an existing Ansible role written by an LLM is substantially less fun than writing something from scratch and iteratively building it into something useful. It is the latter that aids fun and learning.
- I would not use LLMs in any setting where the accuracy of the output cannot be verified easily and immediately; the temptation to just believe the model and move forward is strong, because the natural-language output resembles what a very confident engineer would write. However, making decisions based on the summary of a report is worse than not using the report at all and relying exclusively on your intuition.
- I wouldn’t use LLMs to plan trips or prepare itineraries. These are again things that I like to do, and even though there are a million small decisions to make during the planning phase, it is usually quite pleasant. And even if the places I end up going to are not the greatest, I don’t mind that either: Once again, the output is secondary to the process.
I have been meaning to write about this for a few months now, but I kept putting it off because I could not come up with a general framework for where AI usage is acceptable to me. I still don't have such a framework, and I believe it will be difficult to come up with one, because the tool itself is not static, which makes it hard to pin down.
One of the existing problems with the Internet is that primary sources are behind paywalls, while there are far too many secondary and tertiary sources (news sites, commentary, commentary about the commentary, …). LLMs exacerbate this situation by reducing the cost of running a tertiary site to zero: publishing the summary of an article which summarizes a paper that appeared in a scientific journal requires no effort anymore. There are sites that mirror StackOverflow's content and use SEO to get their websites to the top of search results.3 LLMs will help these owners summarize questions and put a short blurb at the top, further helping them game the search engine into ranking them above the primary source.
LLMs will create a world where text abounds and no one reads that text anymore. It is no surprise, then, that everyone is migrating towards short videos: forms of media in which an actual person appears.
I am under no illusion here: this boulder has gained enough velocity that standing in its way would be foolhardy. In my opinion, in software, it has become a requirement to use AI/LLMs at work. Not using an LLM leads to the question, "What is wrong with it? Which product should we buy a license for instead?", leaving little room for opinion or choice. I believe that defining the situations where you don't want to use an LLM, no matter how good it gets, is the starting point in getting used to this new tool. The tool is still developing; as products and companies navigate towards its eventual steady-state usage patterns, it is important that everyone thinks about their red lines before using it becomes the norm.
1. I am talking about companies which have not realized this yet. This is already a practice in many places, from what I have heard anecdotally. ↩
2. If you are using an interpreted language, you have much bigger problems – I have forgotten the source of this quote. ↩
3. Why do such sites exist? What are their economics? ↩