It’s a common story: a team builds software, the user base expands, and a few years pass with developers adding new features and fixing bugs. Slowly but surely, more alarms fire and more errors are thrown. When the code was written years ago, it can be tricky to trace an alarm or exception back to its source. Logging and monitoring start to feel like a game of Whack-a-Mole as problems pop up and disappear, seemingly at random.
This is the situation our customer found itself in with one of its popular security software products. After years of market success, the company found that its customers were using the software in ways it hadn’t predicted and on bigger workloads than it had ever imagined. While it was terrific to see the software being used in new and creative ways, this created challenges that several different developers helped to address. With so many developers contributing to the code, however, log levels became inconsistent between classes. Over time, the team also found that false alarms were being raised and that the logs were so verbose that it became increasingly difficult to tell what was happening. As a result, engineers were spending more time than desired wading through logs to troubleshoot.
The development team decided things needed to change and turned to NTT DATA to help wrangle their logs into a manageable herd.
Our Site Reliability Engineering (SRE) team started by reviewing logs and talking with the team members responsible for log monitoring and root cause analysis. The team quickly found several inconsistencies in how developers had added logging to the application, especially in their choice of log levels. Another major problem was that the logger’s threshold was set very low, so a lot of unnecessary information was written to the log files. Sifting through this unneeded data added time to the log management process.
Let’s pause here for a quick recap of log levels. The key thing to remember is that it’s all about how much detail you want to see in your logs. You can log every line your program hits, log nothing at all, or log only the important events such as errors and warnings. Log levels help you categorize your logs by importance and urgency. Some common log levels, ordered from least to most verbose, are FATAL, ERROR, WARN, INFO, DEBUG, TRACE and ALL.
The further you go down this list, the more detail you get. For example, if you set your log level to FATAL, you won’t get any error, warning or info logs at all. Conversely, if you set your log level to ALL, you’ll have a very noisy log file, and it can be a real headache to find what you’re looking for. Yet if you choose to log little or nothing, you could miss the very piece of information that would help you solve the problem at hand. So adding the right logging lines in the right places and setting their levels properly can save you priceless time, especially when you’re dealing with problems in your production environment.
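To make the threshold idea concrete, here is a minimal sketch in Python. It is an illustration only, not the customer's code: Python's standard logging module names its levels CRITICAL, ERROR, WARNING, INFO and DEBUG and has no built-in TRACE level, and the helper name emit_at_threshold is hypothetical.

```python
import io
import logging


def emit_at_threshold(threshold):
    """Log one message per level and return the level names that pass the threshold."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    # Write only the level name so the output is easy to inspect.
    handler.setFormatter(logging.Formatter("%(levelname)s"))

    # Hypothetical logger name, used only for this demonstration.
    logger = logging.getLogger(f"demo.threshold.{threshold}")
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.propagate = False  # keep demo output away from the root logger
    logger.setLevel(threshold)

    logger.critical("service cannot continue")
    logger.error("request failed")
    logger.warning("disk usage is high")
    logger.info("request served")
    logger.debug("cache lookup details")

    return stream.getvalue().split()


if __name__ == "__main__":
    for level in (logging.CRITICAL, logging.WARNING, logging.DEBUG):
        print(logging.getLevelName(level), "->", emit_at_threshold(level))
```

With the threshold at CRITICAL only the most severe message is written, while DEBUG lets all five through, which is the noise-versus-detail trade-off described above.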
With this explanation in hand, the idea was simple: get rid of unnecessary logs, set proper log levels and observe the results. However, with a large, old code base, things are not that easy.
Addressing log levels
The first challenge we had was the monolithic structure of the application. We had to scan each file that produced logs and assign each log statement an appropriate level. Additionally, because we were not the ones who developed the application, it was difficult to tell which details should be in the logs and which should not.
Collaborating with the customer, we went through the whole application and adjusted the logs and log levels. We also added conditionals around debug logs. Putting calculations that are needed only for logging behind a log level check saved computing resources, too, because the application skips them unless the log level is set to debug. It is a small change, but when you have hundreds of lines of such calculations, they can slow down your program and be quite costly.
When developing software, as in life, preventing problems is much better than trying to fix them after they occur. Adding logs to your software is no different; it’s crucial to start correctly and keep that posture as your software grows. It’s always a good idea to have standards and make sure your developers follow them.
Standards we recommend include making testing and documentation a major part of the development process and keeping your logs sanitized and in order. The second is especially important if you don’t want to waste valuable working hours and resources.
After we helped our customer assess its log levels and tune for the right level of data and alerts, its developers saw a significant decrease in the time they spend on log management and in the number of false alerts. Moreover, tracing alarms has become faster and less of a ‘needle in a haystack’ exercise, allowing the team to address issues quickly and keep the reputation of its market-leading software intact.
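The log level check described above might look like the following Python sketch. It is not taken from the customer's application: the logger name and the functions expensive_summary and process are hypothetical, and the list summary_calls exists only to show when the costly work actually runs.

```python
import logging

# Hypothetical logger for this sketch; NullHandler keeps the demo silent.
logger = logging.getLogger("demo.guard")
logger.addHandler(logging.NullHandler())
logger.propagate = False

summary_calls = []  # records each time the costly calculation runs


def expensive_summary(items):
    """Stand-in for a calculation that is only needed for a debug message."""
    summary_calls.append(len(items))
    return f"{len(items)} items, total={sum(items)}"


def process(items):
    # Without this check, expensive_summary would run on every call,
    # even when debug messages are discarded.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("batch summary: %s", expensive_summary(items))
    return len(items)


if __name__ == "__main__":
    logger.setLevel(logging.INFO)
    process([1, 2, 3])
    print("calculations at INFO:", len(summary_calls))

    logger.setLevel(logging.DEBUG)
    process([1, 2, 3])
    print("calculations at DEBUG:", len(summary_calls))
```

Passing arguments with %s placeholders already defers the string formatting, but the arguments themselves are still evaluated before the call, so an explicit check is what avoids the calculation.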
Post Date: 07/20/2021