Troubleshooting is a skill often overlooked in the IT industry. While I’ve not studied computer science formally, it doesn’t appear to be a skill that many, if any, courses focus on; similarly, many professional qualifications, certifications, and courses prefer to discuss the features and functions of their subject matter rather than problem solving when that technology is not working as expected. None of this is to disparage the value of these courses, but it’s important to recognise that troubleshooting is a skill that also needs to be developed and practiced. I would argue that for most IT professionals troubleshooting should be viewed as a core skill. The challenge is, how do you develop a skill that’s only needed when something goes wrong? Let’s explore that now…
This post may end up a little long as I try to give you detail and context on how to approach this vital skill. For those in a hurry let me try and distill it down to a few key points:
- Understand a working system - If you don’t know what working looks like, how can you identify a problem?
- Identify the problem - Ask questions, look for patterns and indicators, and try to isolate the problem to a single user or system.
- Understand the problem - Use your tools to help you understand the problem, recreate the problem, and plan your actions.
- Resolve the problem - Make one change at a time, test it, and repeat until the problem is resolved.
- Document - Document everything you do, the tests you run, the results, the actions you take, and the results of those actions.
For those with a little more time, let’s explore this in more detail…
What is troubleshooting?
Let’s start by exploring what we mean when we say “troubleshooting”. Dictionary.com defines troubleshooting as: “the act or process of discovering and resolving problems, disputes, or mechanical or technical issues.”
So what we’re talking about here is the process of identifying, understanding, and resolving problems that develop in a system. We’re going to focus on troubleshooting in the context of IT systems, but much of the discussion is applicable to any system, electrical, mechanical, or otherwise.
I think it’s important while we’re getting started to differentiate the various aspects and phases of troubleshooting. It’s not particularly useful to have skills to implement a fix if you don’t know there is a problem to start with. I typically think about troubleshooting as having three phases:
- Identifying that there is a problem
- Understanding the mechanics of the problem and how it manifests
- Resolving the problem.
There’s actually a hidden fourth phase that I think is vital for any engineer working with the same system for a period of time: understanding a working system. By understanding the systems we work with, we can troubleshoot more effectively and prepare prevention and mitigation steps to minimise the number of issues and reduce their impact when they do occur. We’ll start with understanding a working system first.
Understand a working system
I believe that this is the single most important skill an engineer can bring to the table when troubleshooting: an in-depth understanding of a working system. At the end of the day, if you don’t know what “working” looks like, how can you identify a problem successfully? Sure, we can look in a log file and find words like error, but do they actually, truly represent an issue?
An example from my own experience: one product I used to support would routinely log [ERROR] events in the application logs when a client device it was communicating with lost connection. From the server’s perspective it was an error; a client device that was there one moment suddenly stopped responding. However, the client device was a wireless device worn by human users, who sometimes did things like walk outside for a break, or into a lift, areas where wireless coverage was poor or non-existent. These events could be expected and would, in many cases, recover by themselves, so is this truly an error? Programmatically, from the server’s perspective, yes: the server cannot differentiate between a user walking into a lift or out of the building and a client device going offline due to a fault.
The point of the example is that every new engineer joining the company, including me, would gravitate to these “errors” in the log and try to chase them down as a problem when working a support case. Some experience and understanding of what “normal” looks like would have told each engineer that these events did not need investigating.
I give over a significant proportion of time when training new engineers to experimenting with a system in a working condition and deeply reviewing logs, network traces, and other data. I encourage us all to make notes of what a working scenario looks like. Do you know what you should see in the logs for your application or process when it starts up, when it connects to a server, when it connects with a client, when an API call is made, etc.?
It’s not just software developers and support engineers either, if you’re a network engineer what path or paths should traffic be taking and under what conditions? If you’re a storage engineer, what’s a good latency value for your environment? Hopefully, you get the picture? When a log tells you an action took 100ms, is that good or bad? If you don’t know what good looks like, how can you tell?
An approach to learning a system
There’s no secret sauce here; just allocating the time to understand the interactions a system is expected to make and receive and to then review data such as logs and traces to understand what we see when these interactions take place. I suggest the following as an approach:
- List all the systems, servers, and services that make up a system
- Is it a standard 3-tier web application with a web front end, application server, and database?
- Is it a microservice architecture with multiple services?
- List all the expected ingress interactions for these systems; where do user interactions or data come into the system?
- Web API endpoints
- User web forms
- List all the expected interactions between system components; how do the systems interact with each other?
- Is the system a monolithic application or a microservice architecture?
- Are system components loosely or tightly coupled? (Cohesion and Coupling)
- API calls
- Database queries
- List where data and/or artifacts are stored
- File or object storage
- List all the expected egress interactions for these systems; where does data leave the system?
- Web API endpoints
- Outbound messages, reports, or exports
Once we understand the architectural structure of the system, we can then start to look at how users and other systems interact with the system. For example, where does a user click, or what data is mandatory vs. what might be optional. We can also look at API schemas to understand the required and optional data and the format it should be sent in, should it be encoded as JSON, XML, or something else? One common term here might be “User Journeys”; what are the sequences of steps and interactions a user will take to achieve a certain outcome?
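To make the “required vs. optional” idea concrete, here’s a minimal Python sketch of checking an incoming payload against the fields we expect. The field names are purely illustrative assumptions, not from any real schema:

```python
# Hypothetical field lists for a login payload; in practice these would
# come from your API schema (JSON Schema, OpenAPI, etc.).
REQUIRED = {"username", "password"}
OPTIONAL = {"remember_me", "locale"}

def check_payload(payload: dict) -> list[str]:
    """Return a list of problems found in the payload; empty means it looks OK."""
    problems = []
    for field in REQUIRED - payload.keys():
        problems.append(f"missing required field: {field}")
    for field in payload.keys() - REQUIRED - OPTIONAL:
        problems.append(f"unexpected field: {field}")
    return problems
```

Running `check_payload({"username": "sam"})` would flag the missing password, which is exactly the kind of gap between expected and actual structure that this mapping exercise helps you spot.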
At this point all we’re focusing on are normal, working interactions and behaviours. We’re not looking for errors or issues yet; we’re just looking to understand what normal looks like. We’re also not looking at the data itself, we’re looking at the structure of the data and the interactions that take place. We’re looking for the “how” and “where” of the system, more than the “what” and “why”.
As an example, an old system that I used to work with would ingest a raw text stream on an HTTP API endpoint and then store the received data stream as a file written to a local disk. The data was formatted as plain text using characters such as ^ as separators; some of you might have heard of HL7 Version 2? Further components of the system would then pull the files from the disk across a local network to continue processing them.
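To illustrate the kind of delimited plain-text format described here, a tiny Python sketch, which is a deliberate simplification and not a real HL7 v2 parser, of splitting a field into its components:

```python
# Toy example of a ^-delimited field in the style of HL7 v2 (simplified
# assumption): knowing the expected separators at each hop in the pipeline
# tells you where a malformed record could break downstream parsing.
def split_components(field: str, sep: str = "^") -> list[str]:
    """Split a single delimited field into its component parts."""
    return field.split(sep)

# e.g. a name field of "Doe^John^A" carries three components
```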
By looking at systems in this way we can start to understand what format the data is in at any given point in the process and determine where we’re converting formats, which is where parsing issues with unexpected data, or other similar errors, might occur. We can also start to understand the expected flow of data through the system and where we might expect to see data at any given point in the process. For example, if we’re expecting to see data in a database but we don’t, we can start to look at the interactions that should have put the data there and see if we can identify where the data is being lost.
Tools to help understand the system
OK, so we understand the architecture of the systems and the expected data flow. How can we validate, prove, and build further understanding of what we believe is happening? I suggest that we follow the User Journeys that we have identified and then use tools such as those I list below to align the interactions we have made to log and trace data. This will help us to understand what we see in the logs and traces and how they relate to the interactions we have made.
- Log files
- Network traces and flow logs
- Database traces
We can also use this process to explore common mistakes and errors; what happens in the logs if the user clicks this button at the wrong time, or adds the wrong kind of data in a field?
Documentation, documentation, documentation…
Once we have explored a working system, we need to make sure that our understanding is captured and documented. This includes all the things we’ve discussed above such as architectural diagrams, user journeys, log snippets and traces associated with a given user journey, and the same snippets and traces for common mistakes and errors. This documentation should be kept up to date as the system evolves and changes over time. It should also be shared with the wider team so that everyone has a collective understanding of the system and how it works.
Tools such as Confluence, SharePoint, even OneNote can help with documenting and sharing this information and using tools like Visio or draw.io can help with creating architectural diagrams. If you don’t have a tool like Confluence or SharePoint available to you, then consider using markdown files in a git repository to document and share the information. The key is to make sure that the information is available to everyone who needs it, that it is kept up to date and that you can see the history of changes to the documentation over time.
Identifying a problem
OK! We now understand our system; so, ask yourself, how can we identify that we have a problem?
The first steps are…
- Don’t jump in…
- Take a breath…
Now you’re ready…
The initial report
Often, in an IT sense, this may start as a report of unexpected behaviour, or lack of behaviour, coming in from users, e.g. “I cannot log in to system X”. Perhaps we have monitoring in place that can indicate that something is not behaving correctly, e.g. a rapid increase in 404 errors from our web service?
In themselves though, these are merely indicators of a problem, not the identification of the problem itself, although they may contain clues. Particularly with user reports, we need to bear in mind that users are not always technical and may not be articulating the problem so much as the thing that they cannot do. For example, in a previous role, I received a call from a user reporting that they could not log in to their computer; however, I happened to know that the entire site was experiencing a power outage at the time. The user was in an office well-lit by windows, so the first thing they noticed was that their computer was not responding; they had not yet tried a light switch or anything else electrical, which would have revealed that this was just a manifestation of the wider issue: the entire site was without power. This is not to disparage users, but to highlight that we need to be careful not to take reports at face value without investigation or interrogation, and to consider the wider context.
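The monitoring indicator mentioned earlier, a rapid rise in 404 errors, can be sketched as a simple baseline comparison. This is a rough Python illustration; the window shape and the 3x threshold are assumptions you would tune for your own environment:

```python
from statistics import mean

def is_error_spike(counts_per_minute: list[int], threshold: float = 3.0) -> bool:
    """Flag the latest minute as a spike if it exceeds threshold x the
    baseline average of the preceding minutes (illustrative heuristic)."""
    if len(counts_per_minute) < 2:
        return False
    *baseline, latest = counts_per_minute
    base = mean(baseline)
    # max(base, 1) avoids flagging tiny absolute jumps on a near-zero baseline
    return latest > threshold * max(base, 1)
```

For example, `is_error_spike([4, 5, 3, 4, 40])` flags the jump, while steady traffic does not trip it; but as with user reports, treat such a signal as an indicator to investigate, not as the problem itself.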
Finding the pattern(s)
We have an initial report of a potential issue, what else can we do? I would suggest that we look to quantify the issue; does it happen every time? Does it only affect this one user, a small subset of users or everyone? If it’s a limited number of users or systems that are affected, is there anything that differentiates them? Does the issue only appear at a certain time?
A key skill during this phase is to start looking for patterns. It’s easy for 2 or 3 vocal users to report an issue and over-inflate it to try and force a priority response. It may seem like a major, system-wide failure is occurring. Your job is to cut through the noise and find the reality of the situation. For many IT systems, monitoring can help a lot here, but be aware that, like user reports, if you focus on a single signal you may miss the wider context. Spending time here to understand the high-level picture of what is happening can be a great investment of time and effort, and helps you start the process of isolating and understanding the issue.
To give another example; while working in one of my previous roles we had reports that a customer system was down and that all users couldn’t use the system as expected, it simply wouldn’t work. Every call that came into the customer’s internal helpdesk was the same and implied that we had a complete system outage. However, as I started to question the scenario, we noticed that every user who called was in one of a small handful of departments and that all those departments were physically located in the same building. Further investigation of log data suggested that we had a substantial number of users working normally. So, I asked the customer engineer I was supporting to clarify this and contact users in other areas and determine how the system was for them. Most of their reports were that the system was fine and the few who reported issues when questioned further were experiencing issues communicating with users in the impacted departments. So, for more than 80% of the user base everything was fine, and they weren’t calling in because they weren’t experiencing issues. This helped us isolate the issue to the one building and reduced the issue from a complete system outage to a localised outage. Still important to diagnose and fix, but not as critical as the initial calls made it seem.
Who, What, When and Where…
OK! We’ve got our initial reports and we’re starting to look for a pattern, we understand how our system works under normal conditions, now how do we identify the problem? I recommend clarifying the following information through questions to yourself or to the reporter(s) of the issue:
- Who is affected?
- Who is not affected?
- Is it a single user, a group of users, or everyone?
- Ideally you want to get down to a single user or system that you can trace in the logs as an example, perhaps a few examples if you have a few different scenarios.
- You’re looking for something identifiable in the logs and traces that you can find and follow.
- Also consider who isn’t affected, this can help you to understand the scope of the issue.
- What are they trying to do?
- What is the expected behaviour?
- What is actually happening?
- What has changed recently?
- What are the steps to reproduce the issue?
- You’re looking for details that you can use to try and recreate the issue yourself and/or actions that you might be able to see in the logs.
- When did the issue start?
- When does it happen?
- When does it not happen?
- You’re looking for a time frame to focus your investigation on, and/or a time frame to look for events in the logs.
- Where does it happen?
- Where does it not happen?
- You’re looking for a location to focus your investigation on, and/or a location to look for events in the logs.
- Can you replicate the issue?
- Can the reporter(s) replicate the issue on demand?
After you’ve worked through these questions you should have a tight time box to search through logs and traces. You should also have specific users or systems within that time box that you can follow through the logs and traces. Make sure that you have a clear understanding of what the user is trying to do and what they are expecting to happen, ideally with an example from earlier (or later) where things worked as expected for comparison.
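The “tight time box plus a traceable user” idea can be sketched as a simple log filter. This is a minimal Python illustration; the log format assumed here (a leading ISO timestamp, space-separated) is an assumption for the example, not a standard:

```python
from datetime import datetime

def in_time_box(lines: list[str], start: str, end: str, needle: str) -> list[str]:
    """Keep log lines whose leading ISO timestamp falls within [start, end]
    and which mention the identifier we're tracing (user, session, etc.)."""
    lo, hi = datetime.fromisoformat(start), datetime.fromisoformat(end)
    hits = []
    for line in lines:
        stamp = line.split(" ", 1)[0]
        try:
            ts = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines that don't start with a parseable timestamp
        if lo <= ts <= hi and needle in line:
            hits.append(line)
    return hits
```

In practice you would reach for grep, your log aggregator, or a SIEM query for this, but the principle is the same: narrow by time first, then follow one identifiable user or system through the results.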
Review and brief
This applies whether you have a team to help you troubleshoot or if it is just yourself; think of it as comparable to rubber duck debugging. The aim here is to make sure that you have a clear understanding of the problem and that you can articulate it to others. This is important for several reasons: firstly it helps you clarify your own understanding of the problem, secondly it helps you communicate the problem to others, and thirdly it helps you identify any gaps in your understanding. If you can’t explain the problem to someone else, you don’t understand it yourself.
One approach I use here is the STAR technique. I’ve discussed STAR before in the setting most people know it from: interview technique. What you should understand, though, is that STAR is much more versatile than just interviews; it’s really a method for clearly communicating a situation. So, for troubleshooting, think of STAR like this:
- Situation - What is the situation currently? What is the problem? What is the issue? What’s the impact, criticality, and context?
- Task - What are the users trying to do and cannot? What are they expecting to happen?
- Action - What actions have you taken, and what was the result of the action(s)?
- Response - Changing slightly from Result to Response; what are you going to do next based on the actions that you have already taken?
Before you start troubleshooting, go through this process to make sure that you’re clear on things. Then, as you proceed through troubleshooting, keep coming back to this process to make sure that you’re still clear and that you’re communicating clearly with others. It also helps you to assess the results of anything you have tried in attempts at resolution and to plan your next steps.
Understanding the problem
OK! We’ve used the skills above to isolate an example, or some examples, of the problem and we’ve got a clear understanding of what the user is trying to do and what they are expecting to happen. Now what?
Review your tools
What tools do you have available to you which might help troubleshoot the issue? Do you have any or all of:
- Monitoring output
- Network captures
- OS performance data
- An issue which is persistent or that you can replicate on demand
Also think about external tools and whether they might be relevant. For example, if the issue is with an industry standard component like an Apache web server could you use a search engine like Bing or Google to search for others who have seen the error or issue before? You might be able to ask a tool like ChatGPT for guidance by describing the issue and seeing what suggestions it produces? The obvious warning here; do not put any controlled, sensitive, or customer data into a public search or AI tool.
Recreate the problem
A great way to understand the problem is to work through recreating it. This allows you to step over each action or interaction in turn, and probe and test the system to see the conditions under which the issue does or doesn’t occur. In an ideal world this might be in a lab environment; after all, you have one for understanding a working system, right? Lab systems are a great place to experiment and test and are a go-to recommendation from me. If you’re building anything you should consider having one, and potentially several, environments for testing and experimentation. With virtual machines, containers, images, and snapshots, this is often easier than you might think. If you don’t have a lab though, don’t panic; if the system you’re troubleshooting is already impacted then continuing to test and recreate the issue while parsing logs and traces is unlikely to do more damage in most cases, although use your judgement here.
Take the scenario(s) from your Who, What, When and Where questions and work through them while inspecting the logs and/or capturing traces. Compare the current log responses to the expected responses from understanding a working system. Look for differences and anomalies. If you don’t have known good data, then you will need to rely on your instincts and intuition a bit more here; look for warnings or errors in the logs at the moment you execute your tests, which is why accurate time stamping for events is so important. Look for unexpected responses from the system, such as a 404 error when you expected a 200 OK. Look for unexpected delays, such as a 200 OK response that took 10 seconds when you expected it to take one.
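The status-and-latency comparison above can be sketched as a tiny check against your known-good baseline. This is an illustrative Python sketch; the expected status and the one-second latency budget are assumptions you would take from your own documented working system:

```python
def assess(status: int, elapsed_s: float,
           expected_status: int = 200, budget_s: float = 1.0) -> str:
    """Classify an observed response against the known-good baseline:
    wrong status, too slow, or ok (thresholds are illustrative)."""
    if status != expected_status:
        return f"unexpected status {status} (expected {expected_status})"
    if elapsed_s > budget_s:
        return f"slow: {elapsed_s:.1f}s against a {budget_s:.1f}s budget"
    return "ok"
```

So `assess(404, 0.2)` flags the status mismatch, and `assess(200, 10.0)` flags the delay, mirroring the two anomaly types described above.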
Based on your findings, and with your system knowledge from understanding a working system, you can assess whether you have enough information to identify the problem or whether you need to continue to investigate further; remember to keep coming back to your STAR process to make sure you’re clear on things. All being well, your existing knowledge combined with what you have found in logs and recreation testing will lead you to an initial assessment of the problem and a plan of action to resolve it.
If you’re new to the system, or the background knowledge isn’t available then stay calm, start from the What and When questions and work through the process from the beginning. You may need to go through this process several times to build up a picture of the system and the problem. This is OK, it’s part of the process and it’s why we have the STAR process to help us keep track of things. As your knowledge and picture of the situation develops you can start to plan what actions may make sense based on what you are seeing.
Resolving the problem
OK! We’ve used the skills above to isolate an example, or some examples, of the problem and we’ve got a clear understanding of what the user is trying to do and what they are expecting to happen. We’ve also used our tools to help us understand the problem and we’ve got some ideas of actions that we can take to resolve the issue. Now what?
One thing at a time
I cannot stress this one enough: do not make multiple changes at once. If you make multiple changes at once and the issue is resolved, you will not know which change resolved it. This is a common mistake that I see people make, particularly when they are under pressure to resolve an issue quickly. Not only will you not know what resolved the issue, but you can end up asking whether one of the changes you made blocked another, later change from fixing the situation, or whether you inadvertently made the situation worse.
So, I’ll say it again: make one change at a time.
Once you have made a change, test the system, and review the log and trace data again. Assess whether the change you made fixed the issue, improved but didn’t fully resolve it, had no impact, or made it worse. Whatever the answer, go back to STAR to assess the current situation; if you think it is resolved, make sure you’re clear that it is and that you can articulate the resolution to others. If the issue persists, or is only partially resolved, then you can start to plan your next action.
Change, test, repeat
As we said above, there is a process here. Make a single change and test it to the best of your ability; if it looks good to you, have the users try again and validate your results. If the user results are not positive, or your own tests show no fix, re-evaluate the situation and plan your next action.
In all cases try to let the data guide you. If you’re seeing errors in the logs, or unexpected responses from the system, then you can use these to guide your next actions.
Document, document, document…
We’ve already talked about this, so hopefully it is self-evident now but as you work through a troubleshooting process take copious notes, ideally in a centralised place that others can access. Note the tests you ran, the results with log snippets you found, the actions you planned and took, and the results. Also document any gut feeling, intuition, or instinct you have about the situation, it may help you or others later. Keep the notes short and to the point, bullet points are great. Most of all, keep the notes in order and ideally time stamp them as you go. This will help you to review the process later and to understand what you did and why.
From a previous role, I used to work in technical support. We used a support ticketing tool, and I would have a case comment open throughout all steps of troubleshooting. I would bullet point note every relevant comment or test from the users, the results, what I was thinking at the time (such as systems or logs to go and check), and my findings from logs and traces. I’d frequently time stamp comments, or commit the case note to the case file which would apply a time stamp for the comments. Often, I’d have to step away from an issue, or it was a longer running ticket, and whether it was me coming back or someone else taking over, having a clear view of what has occurred and what the current situation is was vital to being able to pick up the case and continue to work on it.
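The timestamped, ordered case notes described above can be sketched very simply. This is a minimal Python illustration of the habit, not a tool recommendation; in practice this would live in your ticketing system or a shared document rather than an in-memory list:

```python
from datetime import datetime, timezone

# Each note is a (timestamp, text) pair so the sequence of events is preserved.
notes: list[tuple[str, str]] = []

def note(text: str) -> None:
    """Append a bullet note stamped with the current UTC time."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    notes.append((stamp, text))

# Example entries of the kind described above (illustrative content):
note("User reports login failure from building B")
note("Checked auth logs: no requests from B since the reported start time")
```

The point is the discipline, not the code: every observation, test, and action gets a short, ordered, timestamped entry that anyone picking up the case can follow.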
Root Cause Analysis (RCAs)
Just as important as getting a fix is documenting the fix and working to understand how the issue occurred. This last step, investigating how the issue occurred and developed, is often bypassed in favour of the next issue or thing arising.
Taking the time to investigate the causes that led up to the incident can help you plan system improvement to prevent or mitigate the issue in the future. It can also help you to identify other issues that may be developing and to take action to prevent them from becoming incidents in the future.
In most cases these would take the form of reviewing your test results, the final fix(es) and the logs associated, then reviewing the original log data leading up to the start of the incident and looking to determine what was changed or introduced to the system that led to the incident. This can be a time-consuming process, but it is a vital one. It can also be a great learning opportunity for you and your team to understand the system better and to improve your troubleshooting skills.
Troubleshooting is a vital skill for any IT professional. It is a skill that needs to be developed and practiced. Remember these key points:
- Stay calm
- Wherever possible, try to understand a working system before issues occur
- Take the time to fully understand the issue by asking questions before you start troubleshooting
- Make one change at a time
- Document everything
I hope that this post has given you some ideas on how to develop and practice your troubleshooting skills, and on how to approach troubleshooting in a structured way. I’d love to hear your thoughts and feedback on this post, along with any ideas you have for improving it. If this article helped or inspired you, please consider sharing it with your friends and colleagues. You can find me on LinkedIn or Twitter. As always, if you have any ideas for further content you might like to see, please let me know too.