What do some production incidents and airplane disasters have in common ? Let me tell you a very short, but terrible story.
On the 4th of October 1992 at 6:35 PM, El Al flight 1862 en route to Tel Aviv crashed into two high-rise apartment complexes, exploding in a fireball. This caused the building to partially collapse inward, destroying dozens of apartments. The 3 pilots and 1 passenger on board were killed instantly, as well as 39 on the ground.
The root cause of such a terrible event ? Insidious metal fatigue: for years, unnoticed by anyone, cracks formed in the jet engine supports, leading to one of them eventually snapping in flight.
Now, I’m pretty sure anyone who took part in a fair number of Post Incident Reviews can see where I’m going with this. For the others, my point is that a lot of production incidents are caused by unnoticed problems. More than once in my career have I had the extreme displeasure to find out that system services or applications had been dutifully logging failures and errors sometimes for days all for nothing, because nobody was watching.
How could this happen ? Well, pick one or more: real world, understaffed teams, budget cuts, busy firefighting recurring issues, badly tuned alerting leading to an excessive number of false positives in return leading to complacency… Nothing out of the ordinary in the corporate world. These are decades-old challenges that have been plaguing the software industry since its infancy.
There is an easy fix though, or at least a way to bust unnoticed problems out of the logs: The Almighty United Fails Query. The name might sound silly, but the idea is good. In this article, I will show you how to shine a light in the dusty dark corners of your logs and how to wipe them clean of any nasties. I will use Sumologic in my examples, but you can absolutely do the same with Splunk or any other half-decent log analytics tool.
An interesting experiment
Have you heard of dumpster diving ? If not, here is a link for you.
If yes, sorry I compared your logs repository/storage/server/concentrator/blackhole to a dumpster, but you have to admit it’s…pretty accurate. You can’t deny it, because like everyone else, you frowned during the trash compactor scene in Star Wars episode 4.
So let’s merrily go fishing for unwanted treasures in the depth of your logs.
Open that door Gandalf !
The Sumologic query
(« error » OR « fail » OR « failed » OR « failure » OR « time out » OR « timeout »
OR « unable ») AND !(« _timeout » OR « with timeout » OR « using timeout » OR
« -timeout » OR « and timeout ») AND !(« error. » OR « error events ») AND
!(« may fail »)
| parse field=_sourceHost « *-i- » as stack
| count by stack
| sort by _count
You might have to customize it a bit, but it’s pretty functional as it is. Pay close attention to the _sourceHost field as your naming convention might be different from what is shown here.
That query parses all the logs looking for keywords like:
- time out
It also contains a few NOT clauses added there to keep false positives from happening; tailor that part to your own needs.
What you will get on your screen after starting the search process depends on several parameters:
- the size of your infrastructure
- the number of applications deployed
- the way dev teams are organized
- the general karma of the company
Over the years, I identified 3 scenarios:
- dry logs: the application is almost mute, logging very few information, sometimes even just a pretty useless message like « Starting main loop » or « Alpha mesh counter (double-float): 10 »,
- noisy logs: the opposite of the case described above, the application is way too chatty, filling megabytes of disk space every second with what looks like debug information or worse, stack traces (these must go in a separate file),
- cryptic logs: the application is logging a lot of uninformative information that perplexes anyone looking at it. The developers meant well, but they missed the point: once an application is in prod, logs are not there to help developers do their job anymore, but to assist the DevOps Engineers during incidents.
Often, you will find a mix of all of these: some apps being very chatty, some others not saying much or detailing absolutely everything that’s going on in the pipes. What to do ? Make a list with 4 columns: Dry, Noisy, Cryptic, Owner
Supportability is a feature
There are 3 stages in the life of an application:
- development: developers are racing towards the deadline, doing their best to get the initial release ready in time. At that point, logs are there mostly to check the execution flow and get some key values printed out. Log-wise, the focus is put on the developers needs,
- production: the umbilical has been cut, the baby is now roaming the wilderness on its own. Logs are now used to prevent abuse by malevolent people and to assist during incidents. Log-wise, the focus is put on the Ops team needs,
- legacy: the application is old and is in the process of being replaced. Logs verbosity level is sometimes reduced depending on what the app remains in use for.
Developers and Ops have to collaborate closely right from the conception phase in order to make sure the code being produced and run on laptops will be as easy to maintain as possible once in the Cloud. Often, logs remain stuck in the development phase: they give good information about the execution flow, but are not that useful to DevOps Engineers trying to solve a production incident.
What Ops need is not messages about the workflow, but rather messages about the problems preventing a smooth run:
- Unresponsive 3rd party API,
- Database connection failure,
- Running out of heap space,
- Shared resource unavailable
Supportability is an indicator of how convenient it is to look after an application once it’s in production/legacy. An app that makes it easy for Ops to keep running is way less likely to experience extended downtime. It also helps greatly in keeping a smile on everyone’s face: developers feel like they are doing a great job, ops feel confident in their ability to keep end-users satisfied and business people appreciate the 99.99% average uptime they see along the year.
Who is noisy ?
Noisy logs are not good for supportability, as they tend to increase the number of filters required in a search before getting to useful information. As explained above, production logs are not development logs, they need to be fairly straightforward about the issues encountered by the app.
Who hardly makes sense ?
Cryptic logs are also not good for supportability, as they don’t give Ops, the people in charge of keeping the app alive in production, relevant to-the-point information about the problem at hand. Reading between the lines is a tricky time-consuming business and should not be necessary.
Who is awkwardly silent ?
Having to fire gdb and tcpdump whenever a production incident happens is the best way to create useless tension between dev and ops. Do your best to ensure that every application put into production provides enough information to BAU & Security people.
Now that you have a list of dry, noisy and cryptic applications, it’s your responsibility to engage their current owners: there is no such thing as a finished application.
All along its life an app is likely to require adjustments like extra features or security fixes. A dead code base is a liability: an application that was considered secure in 2010 is now likely to be vulnerable to the most common attacks.