Better Ansible Playbooks and Roles

02 Feb 2025 ansible · automation

Ansible is one of those technologies that most SREs have come across at some point or another. Despite the mind-numbing increase in abstraction layers over the past few years, trusty old Ansible is not going anywhere.¹ Ansible is based on a very simple idea: declarative management of the state of a computer. The implementation of this idea and its eventual use-cases are fairly complex. Ansible is used for everything from setting up computers and servers to deploying continuously to VMs. I have been using Ansible for the servers that I manage: a DNS server and an RSS reader. I have also used Ansible to set up my personal Linux computer. With both of these, I have had some limited success, but I almost always end up with a list of pain points. The pain points are indicative of how much complexity is involved in setting up computers, installing software and keeping it up-to-date.

The point at which I started considering writing a post on this topic was when I realized that I had over 50 roles, spread across three directories. I know that a bunch of the logic in some of these roles is simply duplicated across the various machines that I have set up at various points in time. Before diving in and attempting to cull the duplicates and bring everything back under control, I wanted to take a step back and write down what I want from the eventual setup.

What Do You Need?

Scope creep is rather common in engineering projects. What starts out as a minimalist RSS reader might end up containing a full-fledged XML parser. My approach to Ansible has been admittedly haphazard. At first, after seeing a playbooks-global monorepo at work, I attempted to keep roles in their own repositories.

This quickly gave rise to the problem which the monorepo is supposed to solve: duplication and sharing. A lot of the work that Ansible does at the beginning of system setup is routine: it applies the natural defaults which all servers should have anyway. These include things like:

Don’t use the default user that is provided by the cloud provider
Disable password-based SSH login
Disable root login over SSH
Set the shell to /sbin/nologin for every user except the one that you are going to use
Enable a firewall
Stop systemd services that you don’t want: Why would a server want CUPS? What is avahi?

Instead of solving this problem properly, by looking through what was required and what was not, I took a shortcut, and this was a mistake. I started using the ansible.cfg setting roles_path everywhere. Now, all my Ansible directories import roles from every other directory. Resolving this is easy, just time-consuming. It is a good lesson in not taking shortcuts: it is better to postpone the implementation of a known solution than to work around the problem using a suboptimal one.²

Naming

Ansible is very readable and imposes no structure on the names of roles. So, you might end up with a bunch of directories, each of which has a tasks/main.yml file and a name which is not exactly descriptive. If you develop things gradually, then there might even be roles which are unfinished (half-baked ideas) or roles which are completely developed but were never tested. Returning to a role’s files a few months later and trying to remember where the role is in its journey is hard and unnecessary. Here is a naming pattern that I intend to use:

under-construction.[role-name]: during development and testing
No prefix, [role-name]: once the role is stable and well-tested
deprecated.[role-name]: once the role is no longer used anywhere else

Roles with the under-construction prefix may be included in the appropriate playbook, but MUST NOT be visible on the main branch.

The role name itself should have its own prefix to clarify where the role is / can be used:

shared.[role]: can be used in multiple playbooks
[service].[role]: can only be used in the playbooks that match the file name glob [service]-*.yml

The [service]. prefix ensures that the monorepo can still work, while the roles remain scoped. I will also attempt to give each role a clear purpose and a clear set of variables that they need to have to run well. This idea seems simple and good enough for the elementary use-cases that I have in my mind.
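A naming scheme like this is easy to check mechanically. The following is a minimal sketch in Python, not an existing tool: the function names, the RoleName tuple and the example role and playbook names (dns.unbound, dns-server.yml and so on) are all hypothetical. It also assumes that a deprecated role should never be referenced by a playbook, which is my reading of "no longer used anywhere".

```python
from fnmatch import fnmatchcase
from typing import NamedTuple

# Lifecycle prefixes from the naming pattern; no prefix means "stable".
LIFECYCLE_PREFIXES = ("under-construction", "deprecated")


class RoleName(NamedTuple):
    lifecycle: str  # "under-construction", "stable" or "deprecated"
    scope: str      # "shared" or a service name such as "dns"
    name: str       # the rest of the role name


def parse_role_dir(dirname: str) -> RoleName:
    """Split a role directory name into lifecycle, scope and name."""
    lifecycle = "stable"
    rest = dirname
    for prefix in LIFECYCLE_PREFIXES:
        if dirname.startswith(prefix + "."):
            lifecycle = prefix
            rest = dirname[len(prefix) + 1:]
            break
    scope, sep, name = rest.partition(".")
    if not sep or not scope or not name:
        raise ValueError(f"{dirname!r} is missing a shared. or [service]. prefix")
    return RoleName(lifecycle, scope, name)


def allowed_in_playbook(dirname: str, playbook: str) -> bool:
    """Return True if the role may be used by the given playbook file."""
    role = parse_role_dir(dirname)
    if role.lifecycle == "deprecated":
        return False  # assumption: deprecated roles are used nowhere
    if role.scope == "shared":
        return True
    # [service].[role] is limited to playbooks matching [service]-*.yml
    return fnmatchcase(playbook, f"{role.scope}-*.yml")
```

Something like allowed_in_playbook("dns.unbound", "rss-reader.yml") would then flag a scoped role that has leaked into the wrong playbook. Keeping under-construction roles off the main branch is a branch policy, so this sketch does not attempt to check it.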

Testing

Maybe Docker?

I saw this post about using Docker to test Ansible recently. It is a brilliant idea and I want to try it out. A single monthly run which attempts to set up a “container” and tests the output will work for stateless services such as a DNS server or an RSS reader. These are self-contained services whose setup can be tested fairly easily.
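As a rough idea of what "testing the output" could look like for the SSH defaults listed earlier, here is a small Python sketch. It assumes the test run can read the text of the container's sshd_config; the function names are hypothetical. It only looks at the global section of the file, stops at the first Match block and ignores Include directives, so it is a sanity check rather than a full parser.

```python
def global_sshd_options(config_text: str) -> dict[str, str]:
    """Return keyword -> value for the global section of an sshd_config."""
    options: dict[str, str] = {}
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        keyword = parts[0].lower()  # sshd keywords are case-insensitive
        value = parts[1].strip() if len(parts) > 1 else ""
        if keyword == "match":
            break  # conditional blocks are out of scope for this sketch
        # sshd uses the first value it obtains for a keyword.
        options.setdefault(keyword, value)
    return options


def baseline_violations(config_text: str) -> list[str]:
    """List the baseline SSH settings that are not explicitly disabled."""
    options = global_sshd_options(config_text)
    problems = []
    if options.get("passwordauthentication") != "no":
        problems.append("PasswordAuthentication is not explicitly 'no'")
    if options.get("permitrootlogin") != "no":
        problems.append("PermitRootLogin is not explicitly 'no'")
    return problems
```

A monthly run could fail whenever baseline_violations returns a non-empty list for the file that the role produced.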

I still don’t know how the more complex services would work. If one were to use Ansible to set up a VPN server for instance, one would have to set up two containers, and then test that the two containers are on the same network which is not the shared network in Docker. This setup is too complicated and I don’t really want to think about it at the moment.

Another option is to skip this stage completely: have no automated testing at all and instead attempt to keep the setup up-to-date manually. This approach is not good at all and I don’t have much belief that it will work. (This is the approach that I was using until now: I would use Ansible to set up a host, then come back to the Ansible code only when the host broke. The roles themselves would be broken by then, rendering them useless for fixing the host.)

Internet Access

The ultimate variable might be access to the Internet. I have decided to give up on this one though, because even VMs in the cloud can have problems which are temporary and undiagnosable. Recently, I found that apt update did not work on a GCE VM in the asia-northeast1 region. I opened a Support ticket. When a person from Support attempted the same thing two hours later, the package mirrors which the GCE VM referenced by default were already fixed. So, was what I faced a true problem? Yes. Will I ever be able to find the root cause behind it? No. So, it is a good idea to just assume that your hosts will be able to download files from any website that is up during setup time.

Now, if it just so happens that the website where you are downloading files from was taken down, or the URL of the file that you need changed, then that is just bad luck. There is nothing one can do about such things except to rely on one's intuition about what a role is supposed to do, along with the written-down purpose of the role.

Conclusion

Everything I have mentioned in this post is just an idea. I will implement some of these ideas over the next few days and post an update here on how it goes.

¹ Not to mention shell scripts.
² Sometimes.

Siddharth Kannan's Blog
