I did something stupid again. I noticed it when I received notifications from my Prometheus monitoring. The probe that sends HTTP requests to my Nextcloud server was failing. Strangely, no other probes were failing. No high load, no memory exhaustion, no filesystem running full.
Manual testing confirmed that all my other web-apps worked as expected; only Nextcloud was responding with 502 errors. It looked like my nginx reverse-proxy could not reach the Nextcloud backend.
I SSHed into the underlying host and re-checked the vitals. They looked fine. Next, I checked all the Docker containers that run my web-apps. All of them looked fine, except for my Nextcloud container, which was caught in a restart loop. And in the container’s logs, dozens of messages like this one had piled up:
Can't start Nextcloud because the version of the data (31.0.9.1) is higher than the docker image version (31.0.8.1) and downgrading is not supported. Are you sure you have pulled the newest image version?
Kudos to the maintainers of the Nextcloud Docker image! I wish all software had such precise and helpful log messages. At least in my case, this told me exactly what had gone wrong and how to fix it.
Indeed, I had accidentally downgraded to an older image version. I quickly upgraded to the latest one, which fixed the whole problem. Case closed.
Background – doing Docker the weird way
But how had I managed to downgrade my Nextcloud Docker image in the first place? Before answering this, let me share some background…
I’m using Ansible to manage most of my personal IT infrastructure. It’s a rather hybrid setup. It manages everything that runs here on meque.de, but also my laptop computer, some desktop VMs, and a couple of Raspberry Pis. In particular, I’m using Ansible’s community.docker.docker_container module to manage my Docker containers.
When configuring my Docker containers, I’m referencing Docker images by digest rather than by tag. For example, instead of this Docker image reference:
nextcloud:apache
I would use this one:
nextcloud@sha256:11f158050216614d585886600445e0a1b75ee224d6a6dddc5eba996cc9499fa6
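In Ansible terms, with the community.docker.docker_container module mentioned above, a digest-pinned container definition looks roughly like this. The container name, ports, and volumes here are made up for illustration and are not my actual config:

```yaml
# Illustration only: run Nextcloud from a digest-pinned image.
# Everything except the module name and the image reference is made up.
- name: Run Nextcloud container
  community.docker.docker_container:
    name: nextcloud
    image: nextcloud@sha256:11f158050216614d585886600445e0a1b75ee224d6a6dddc5eba996cc9499fa6
    restart_policy: unless-stopped
    ports:
      - "127.0.0.1:8080:80"
    volumes:
      - nextcloud_data:/var/www/html
```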
This approach gives me more control over what code is actually running on my infra at any given time. I’ve worked with organizations that had policies mandating this approach. And I was one of the guys who were supposed to encourage the implementation of these policies. Since I like to live what I preach, I started to adopt this policy for my own infra. Also, it kinda aligns with my control-freak tendencies.
There are some benefits to referencing software artifacts by immutable cryptographic identifiers (like Docker image digests) rather than by ephemeral aliases (like Docker tags). If something goes horribly wrong, it can make forensic analysis much easier. For example, after a vulnerability becomes known for a specific image version – be it caused by neglect or by a deliberate supply-chain attack.
However, I believe that it is far more important to update to new software versions as early as possible. In general, this is far more beneficial for information security than compliance with software traceability policies.
Here is the low-tech solution that I use to address this trade-off:
- In my Ansible IaC I’m maintaining a Docker image catalog (in an Ansible variables file in YAML format; a minimal sketch follows after this list). Each image catalog entry contains:
  - (a) an image repository name,
  - (b) a specific image digest,
  - (c) an image tag that I want to follow.
- When configuring a Docker container, my Ansible code references an image catalog entry to specify the image. It only uses the repo name and digest (i.e. (a) and (b)), but not the tag (c).
- I’m also using Ansible to set up Prometheus monitoring for all images in the catalog. Whenever the upstream Docker registry (i.e. Docker Hub) moves a Docker image tag to a new Docker image digest, this will show up in my Prometheus alerts dashboard. (I’m using Skopeo and Prometheus Script Exporter to implement this; the lookup is sketched after this list.)
- Whenever I see alerts about outdated Docker image digests, I run a small update script on my Ansible code base. This just walks through my image catalog and updates all image digests (b) so that they follow the desired image tag (c) again. (This script also leverages Skopeo.)
- Then I run my main Ansible playbook, which recreates all Docker containers based on the new image digests from the catalog. It also adjusts the Prometheus config to match the updated Docker image catalog.
- Finally, I check for new Prometheus alerts and do some manual smoke-testing. If I’m happy with the result, I commit the updated Docker image catalog to the git repo that holds my Ansible code. (If I’m not happy with the result, I can roll back by adjusting the image catalog manually.)
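To make the first two bullets a bit more concrete, here is a minimal sketch of how such a catalog could be laid out. The variable and field names are simplified for this post and are not my literal code; the digest is the one shown earlier:

```yaml
# Sketch of the Docker image catalog (an Ansible variables file).
# Field names are illustrative.
docker_image_catalog:
  - repository: nextcloud      # (a) image repository name
    digest: sha256:11f158050216614d585886600445e0a1b75ee224d6a6dddc5eba996cc9499fa6  # (b) pinned digest
    tag: apache                # (c) tag that I want to follow
```

A container definition like the one sketched further above then assembles its image reference from (a) and (b) only, e.g. image: "{{ item.repository }}@{{ item.digest }}" in a loop over the catalog; the tag (c) is only consulted by the monitoring and by the update script.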
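The tag-to-digest lookup itself is not magic either. My actual setup runs it from a standalone update script and from a Prometheus Script Exporter probe, but just to illustrate the idea, the same Skopeo call could be expressed as an Ansible task roughly like this (field names match the catalog sketch above; the docker.io/library/ prefix only fits official images like nextcloud):

```yaml
# Illustration only: ask the registry which digest the followed tag points to
# right now. My real setup does this from a script, not from a playbook task.
- name: Resolve the digest currently behind each followed tag
  ansible.builtin.shell: >
    skopeo inspect docker://docker.io/library/{{ item.repository }}:{{ item.tag }}
    | jq -r '.Digest'
  loop: "{{ docker_image_catalog }}"
  register: upstream_digests
  changed_when: false
```

Comparing each result against the pinned digest (b) is then all that’s needed to decide whether an alert should fire or whether a catalog entry needs a refresh.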
This approach balances my personal need for control against my need for timely updates quite nicely. Most of it is automated, but it still requires human interaction.
Root cause & fix
Generally, my Docker image update approach works well. In itself, it was not the root cause of my Nextcloud outage. Instead, I had to make some mistakes on top of it…
In my Ansible git repo, I’m currently working with multiple branches. Recently, I was mostly working on a feature branch to implement new functionality that is not related to Nextcloud at all. However, I did update the image catalog in this branch and I did run it against my production infrastructure. This ensured that all my Docker containers were running up-to-date image versions. So far so good.
Yesterday, I switched back to my default branch, because I wanted to roll out some small improvements that were not related to my new feature. I was planning to merge these small changes into my feature branch later. But first, I ran my main Ansible playbook on the default branch. I had totally forgotten that my default branch contained an outdated version of my Docker image catalog. Once Ansible was done, my Nextcloud container was suddenly running an outdated Docker image version. But as the log messages later showed, Nextcloud does not like that (and I don’t blame it). So, we’re finally at the root cause of my Nextcloud outage.
To fix the problem, I simply ran my update script again on the default branch. Since the script is idempotent, it updated my Docker image catalog to the same image digests that I already had on my feature branch. After re-running the Ansible playbook, Nextcloud came up again with no complaints.
Lessons learned
According to my Prometheus metrics, the Nextcloud outage lasted about half an hour. This is not great, and I should be more careful in the future. And I should respond to Prometheus alerts more quickly. But overall, it wasn’t a big deal for a Nextcloud instance that’s just for my personal use.
Maybe I’ll need more automation, testing, rollbacks, etc. in the future. And a dedicated staging environment for all my infra. But for the time being, I’ll just be more careful when switching between branches. Always update the image catalog after switching to a new branch! Or, always update it on the default branch and merge it into all active feature branches afterwards.
I’ve had Nextcloud up and running for almost 2 years now. Afaict, this was the worst outage so far. This is fine.