Tales from the Developer: Troubleshooting and fixing issues with Varnish
By: Karn Kallio | September 8, 2022 | Development and Tales from a Developer
When things go seriously wrong in a well-built but complicated system, the cause is often a cascade of small failures that pile up. Not that we’re building rockets over here, but an excellent example is the ill-fated Ariane flight V88. Getting to the bottom of such failures can be difficult: everything from the extensive use of caching to the need to convincingly reconstruct the failure in retrospect adds to the challenge. Debugging this kind of failure, especially under pressure, is hard, and we consider the ability to do it the hallmark of a senior developer. Even so, it is essential to have a team of experts working together to troubleshoot issues and find solutions promptly to help keep your site running.
The first signs of trouble
Recently, we had a perfect storm of cascading failures on some of our clients’ sites. At its heart, the issue was caused by a security update colliding with a poorly defined practice that came back to haunt us. As with many catastrophes, the problem arose at the most inopportune time. At 3:20 am Pacific, we started to get a flood of notifications that one of our client websites was down. While this was much earlier than anyone could be expected to be on the clock for the day, we got on the issue immediately. Mugo has a particular strength in the diversity of our developers. We have team members from many different backgrounds, with many different skill sets, spread across the globe (and across many time zones). From Australia to Germany, Brazil, the United States, and Canada, there’s always someone ready to jump on a problem when it arises. Team members started working on the issue right away and brought in others to help.
Identifying the problem
The first things to check in most troubleshooting situations (not necessarily in this order) are:
- What changed
- Cache
- Logs
We had not done a release the previous day, and at that hour no edits had been made in the CMS, let alone edits that could take down the entire site.
At that time of day, there wasn’t much harm in clearing the principal cache as a basic reset. However, the cache clear command yielded an error about not being able to connect to Varnish. (Varnish is an important caching layer that insulates the backend content servers.)
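As a quick sanity check when a cache clear can’t reach Varnish, a couple of generic commands can confirm whether the daemon is running at all. This sketch assumes a systemd-managed Varnish on its default Debian/Ubuntu ports; adjust to your setup:

# Is the Varnish service running?
sudo systemctl is-active varnish

# Is anything listening on Varnish's usual ports (6081 for HTTP, 6082 for the admin CLI)?
sudo ss -ltnp | grep -E ':6081|:6082'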
The next step was to check the logs. There are many logs available for a website, including but not limited to:
- CMS logs
- Application logs (Apache, Varnish)
- Infrastructure logs (AWS, including the load balancer)
- Network logs
- Operating system logs
It was in the Ubuntu operating system log at /var/log/syslog that we got the best clue:
Jun 9 03:13:37 prod-server-acme systemd[1]: Stopping Varnish HTTP accelerator...
Jun 9 03:13:37 prod-server-acme varnishd[1887511]: Child (1887524) died signal=15
Jun 9 03:13:37 prod-server-acme varnishd[1887511]: Child cleanup complete
Jun 9 03:13:37 prod-server-acme varnishd[1887511]: manager stopping child
Jun 9 03:13:37 prod-server-acme varnishd[1887511]: manager dies
Jun 9 03:13:37 prod-server-acme systemd[1]: varnish.service: Main process exited, code=exited, status=64/USAGE
Jun 9 03:13:37 prod-server-acme systemd[1]: varnish.service: Failed with result 'exit-code'.
Jun 9 03:13:37 prod-server-acme systemd[1]: Stopped Varnish HTTP accelerator.
Jun 9 03:13:37 prod-server-acme systemd[1]: Started Varnish HTTP accelerator.
Jun 9 03:13:37 prod-server-acme systemd[1]: Starting Varnish HTTP accelerator log daemon...
Jun 9 03:13:37 prod-server-acme varnishd[1819260]: Error:
Jun 9 03:13:37 prod-server-acme varnishd[1819260]: Message from VCC-compiler:
Jun 9 03:13:37 prod-server-acme varnishd[1819260]: Could not load VMOD xkey
Jun 9 03:13:37 prod-server-acme varnishd[1819260]: #011File name: libvmod_xkey.so
Jun 9 03:13:37 prod-server-acme varnishd[1819260]: #011dlerror: (null)
Jun 9 03:13:37 prod-server-acme varnishd[1819260]: ('/etc/varnish/default.vcl' Line 5 Pos 8)
Jun 9 03:13:37 prod-server-acme varnishd[1819260]: import xkey;
Jun 9 03:13:37 prod-server-acme varnishd[1819260]: -------####-
Jun 9 03:13:37 prod-server-acme varnishd[1819260]: Error:
Jun 9 03:13:37 prod-server-acme varnishd[1819260]: Message from VCC-compiler:
Jun 9 03:13:37 prod-server-acme varnishd[1819260]: Could not load VMOD xkey
This told us that Varnish had been updated and restarted by an automatic update script. Automatic updates are generally a good practice because they install only bug-fix and security patches, which should not introduce backward-compatibility issues. In this case, however, Varnish failed to come back up after the update.
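On Ubuntu, automatic updates like this are typically handled by unattended-upgrades. We haven’t reproduced the server’s exact configuration here, but a minimal sketch of the relevant setting looks something like this:

# /etc/apt/apt.conf.d/50unattended-upgrades (excerpt)
# Only packages from the security pocket are upgraded automatically.
Unattended-Upgrade::Allowed-Origins {
    "${distro_id}:${distro_codename}-security";
};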
So this gave us clear direction to review the Varnish configuration. There are a few ways to check on the status of Varnish, including “sudo systemctl status varnish”. The status output shows which service file systemd has loaded; in this case, it pointed to /lib/systemd/system/varnish.service (see the commands sketched after this list). That didn’t seem right for a couple of reasons:
- We typically didn’t edit the service file there
- The contents looked rather default, including various settings such as the Varnish port
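A few commands can help pin down which unit file systemd is actually using and whether any overrides are in play. These are generic examples rather than a transcript of our session:

# Show the unit's status, including the path of the loaded service file
sudo systemctl status varnish

# Print the full unit file (plus any drop-in overrides) that systemd is using
sudo systemctl cat varnish.service

# List unit files that have been overridden or extended
sudo systemd-delta --type=overridden,extended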
When we compared this against the service file in our repo, we also discovered that the two pointed to different Varnish executable paths. That’s when the cascade of issues revealed itself and gave us a narrative for “what changed”:
- We had been editing the stock (“kernel”) Varnish service file directly for a while, instead of creating a proper override service file.
- One of our edits pointed to a manually installed Varnish 6.0. Using 6.0 was a deliberate decision, as it was the Long-Term Support (LTS) version.
- We also had Varnish 6.2 installed on the server.
- Varnish 6.2 had auto-updated due to a security patch, and that auto-update restored its packaged (“kernel”) service file, wiping out our edits. That service file pointed to our Varnish VCL configuration file, which was tailored to Varnish 6.0. Trying to run Varnish 6.2 against a Varnish 6.0 configuration therefore failed.
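When two Varnish versions end up side by side like this, a few quick checks can show which binary is actually in play. Again, these are generic examples; the /usr/local path is just a common location for a manual install, not necessarily ours:

# Which varnishd binaries are on the PATH, and in what order?
which -a varnishd

# What version does each binary report?
/usr/sbin/varnishd -V          # the packaged binary (Ubuntu's default location)
/usr/local/sbin/varnishd -V    # a typical spot for a manually installed build (assumption)

# Which Varnish packages are installed from the repositories?
dpkg -l 'varnish*'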
The most immediate fix was to restore our service file override – properly this time – by adding /etc/systemd/system/varnish.service or running “sudo systemctl edit --full varnish.service”. Then we could start Varnish 6.0 again to get the website back, solving the immediate issue. From there, we could work on a full retrospective and fix.
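For reference, here is a minimal sketch of what such an override unit at /etc/systemd/system/varnish.service can look like. The binary path, ports, and cache size below are illustrative assumptions rather than our production values:

[Unit]
Description=Varnish HTTP accelerator
After=network-online.target

[Service]
Type=simple
LimitNOFILE=131072
# Assumed path for a manually installed Varnish 6.0 build
ExecStart=/usr/local/sbin/varnishd -F -a :6081 -T localhost:6082 -f /etc/varnish/default.vcl -S /etc/varnish/secret -s malloc,256m

[Install]
WantedBy=multi-user.target

After saving the file, running “sudo systemctl daemon-reload” and then “sudo systemctl restart varnish” makes systemd pick up the override. Because units in /etc/systemd/system take precedence over those in /lib/systemd/system, a package update can no longer silently replace the unit that is actually in use.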
Reevaluating & revising best practices
After a quick triage identified other sites where this could also be an issue, we applied the same solution to prevent further outages. We also codified our policy regarding where we source our 3rd party software. The first (and best) choice is the Ubuntu Main repository: it contains free software that is officially supported by Canonical for the life of the LTS release. If the necessary software isn’t available there, the next step is to get it upstream from the original developers themselves. Since updates are not automatic in that case, a staff member must take ownership of the package, regularly check for updates, and deploy them. As a last resort, we will source software from Ubuntu’s Universe (community) repository. While this can be a valuable resource, the packaged versions do not necessarily track the original developer’s releases, and they are not guaranteed by the original developer. Security patches for software in the Universe repository only happen if someone in the community decides to provide them. And, as happened in our case, such packages can conflict with manually installed upstream versions.
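When software is installed upstream rather than from the Ubuntu repositories, it is also worth making sure apt won’t later install or auto-upgrade a conflicting repository package on top of it. A couple of generic commands for that (the package name is just an example):

# Show which repository (Main, Universe, a PPA, etc.) the candidate package would come from
apt-cache policy varnish

# Mark the repository package as held so it is not automatically installed or upgraded
sudo apt-mark hold varnish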
Troubleshooting as a team saves trouble in the long term
Troubleshooting can be frustrating, but part of the reason we found the issue in this instance so quickly and addressed it promptly was our team’s strength and collaborative effort.
Troubleshooting as a team is a much more efficient process than going it alone, for several reasons. You don’t get stuck during the investigation, because another team member always has a different idea to try. Screen sharing and having other eyes on the problem mean teammates can point out errors you would miss. You can bounce ideas off each other and find the best path forward. When the solution proves to be time-consuming, you can split tasks and share the burden. One person might have more experience and expertise in certain areas, and you can play those strengths off each other to complement the work. And sometimes, someone will remember a resource or a similar problem from the past; relying on a coworker with greater institutional memory than your own can be a valuable tactic.
Learning & growing from mistakes
While this was ultimately an issue on our end, it was solved quickly because we have a strong team ready to rely on each other’s strengths and find things that might otherwise have been overlooked. Having multiple perspectives meant that the problem was not prolonged. Having best practices in place, understood, and used by the whole team means there is less chance of the same mistake happening again.
When it comes to websites, things happen. What matters most is that when they do, developers are ready to jump in to fix a problem, learn, and grow together to become a stronger team. This is also why having a development partner for your business is essential. When something happens to your site, you don’t want to face the problem alone. You want a team of experts with experience in similar issues at your back, ready to get your site back up and running as soon as possible.