Incident Management vs Problem Management: Is there a Difference?
ITIL has been around since the late 1980s. At the time of writing, the current version is v3. There are plenty of books and courses about ITIL, yet there is still real confusion about where incident management stops and problem management begins, and about how the two differ. If it were just a terminology issue I wouldn't be so worried about it, but the reality is that confusion about incident and problem management hurts us all.
Confusion between two terms would not normally be a big deal, but not understanding the differences between these two processes can have a huge negative impact not only on your infrastructure, but on your business as a whole.
Here is what each term means according to the ITIL definitions, and how those meanings translate into how quickly a fix is needed:
What is an Incident?
An incident is an event that leads to an unplanned
disruption of service. The important part to remember is ‘disruption of
service,’ because if an issue does not disrupt service, even if it was
unplanned and unexpected, it is not an incident. For example, if a piece of
hardware fails after hours when nobody is using the system, it is not an
incident, because it did not disrupt service. However, if the same equipment
failed during the regular workday, it would be defined as an incident because
service was, in fact, disrupted. The IT help desk is often the first to be
made aware of an incident, as it is usually the first point of contact for
users experiencing issues with the system.
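To make the definition concrete, here is a minimal Python sketch that encodes the rule above. The `Event` fields and the `is_incident` helper are hypothetical names chosen for illustration, not part of ITIL or any service desk tool; the point is only that an unplanned failure counts as an incident when it actually disrupts service.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical record of something that happened to a system."""
    description: str
    unplanned: bool          # was the event unexpected (not scheduled maintenance)?
    service_disrupted: bool  # did any user actually lose service?


def is_incident(event: Event) -> bool:
    # An incident is an unplanned disruption of service:
    # being unplanned alone is not enough, service must be disrupted.
    return event.unplanned and event.service_disrupted


# The two scenarios from the text: same failure, different timing.
after_hours = Event("Hardware fails at night, no users online",
                    unplanned=True, service_disrupted=False)
workday = Event("Same hardware fails during business hours",
                unplanned=True, service_disrupted=True)

print(is_incident(after_hours))  # False
print(is_incident(workday))      # True
```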
What is Incident Management?
Picture Courtesy: http://www.seriosoft.com/
The main goal of incident management is to resolve the disruption and restore normal service operation as quickly as possible, so that the service continues to meet its Service Level Agreements (SLAs). The process is primarily aimed at the user level.
Because even minor disruptions in service can have a huge impact on the organization, incidents need to be fixed immediately. Incident management usually involves recording the details of the incident and resolving it.
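As a rough illustration of recording an incident, resolving it, and checking the result against an SLA, the sketch below assumes a hypothetical `IncidentRecord` type and a hypothetical four-hour restoration target; real service desk tools and SLA terms will differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class IncidentRecord:
    """Hypothetical incident record: details are captured when it is opened."""
    summary: str
    opened: datetime
    resolved: Optional[datetime] = None
    resolution: str = ""

    def resolve(self, when: datetime, resolution: str) -> None:
        # Record when service was restored and how.
        self.resolved = when
        self.resolution = resolution

    def within_sla(self, target: timedelta) -> bool:
        # Unresolved incidents are treated as not (yet) meeting the target.
        if self.resolved is None:
            return False
        return self.resolved - self.opened <= target


inc = IncidentRecord("Users cannot reach email", opened=datetime(2024, 1, 8, 9, 0))
# A quick restart restores service but does not explain why it failed.
inc.resolve(datetime(2024, 1, 8, 10, 30), "Restarted mail gateway service")
print(inc.within_sla(timedelta(hours=4)))  # True
```

Note that the recorded resolution restores service without touching the cause, which is exactly the gap problem management is meant to fill.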
What is a Problem?
Also according to ITIL, “a problem is a cause of one or more incidents”. The cause is usually unknown at first and comes to light through a number of related incidents that share common issues. Problems are not classified as incidents, but incidents can raise problems, especially if they may recur or already have. To return to the example above, a server that is only used during the day but crashes after office hours points to a problem: although it isn't currently disrupting service, it could happen again during the workday and become an incident.
What is Problem Management?
Picture Courtesy: http://www.seriosoft.com/
The goal of problem management is to identify the root cause of incidents and try to prevent them from happening again. It might take multiple incidents before problem management has enough data to analyse what is going wrong, but if undertaken correctly, it turns the problem into a “known error”: a problem whose root cause has been identified and documented, so that steps can be put in place to correct it.
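The idea that a problem emerges from several related incidents can be sketched as a simple grouping step. This assumes, purely for illustration, that incidents affecting the same component (the hypothetical `config_item` field) are related, and that three such incidents are enough to open a problem record; in practice the grouping criteria and threshold are a judgement call for the problem manager.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Incident:
    incident_id: str
    config_item: str  # hypothetical field: the component the incident affected


def problem_candidates(incidents: Iterable[Incident],
                       threshold: int = 3) -> Dict[str, List[str]]:
    """Return components with at least `threshold` incidents.

    Grouping by affected component is an assumed stand-in for
    'related incidents that share common issues'.
    """
    by_item: Dict[str, List[str]] = defaultdict(list)
    for inc in incidents:
        by_item[inc.config_item].append(inc.incident_id)
    return {item: ids for item, ids in by_item.items() if len(ids) >= threshold}


history = [
    Incident("INC-1", "db-server-01"),
    Incident("INC-2", "mail-gw"),
    Incident("INC-3", "db-server-01"),
    Incident("INC-4", "db-server-01"),
]
print(problem_candidates(history))
# {'db-server-01': ['INC-1', 'INC-3', 'INC-4']}
```

Grouping by component is only one heuristic; incidents can share a root cause even when they surface on different systems, which is why root cause analysis still needs a person.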
Problem management is sometimes described as a reactive process that begins only after incidents have occurred. In practice, it should be thought of as a proactive process, because its end goal is to prevent the problem from ever happening again. Its main tasks are to identify the problem, troubleshoot it, document the issue and its causes, and ultimately resolve it. Where incident management restores service, problem management deals with the underlying cause of one or more incidents: its emphasis is on resolving the root cause of errors and finding permanent solutions. This process operates at the enterprise level.
Now, let’s look at an analogy comparing incident management and problem management:
Incident management is like a firefighter at a house fire: it comes in, immediately deals with the emergency, and saves the day. Firefighters arrive on the scene, assess the situation, and work to put out the fire as quickly as possible without stopping to ask how it started. Incident management works the same way. While it is necessary for incident management to deliver fast results and repair issues within the infrastructure, it doesn’t tell us what ultimately went wrong or why there was an issue in the first place. That’s where problem management comes in.
Problem management is like the detective who arrives after the fact. They weren’t there to put out the flames themselves, but they can still investigate what went wrong, figure out how the fire started, and help educate people to take preventive steps so something similar doesn’t happen again. Problem management is a vital piece of the puzzle: it addresses the root cause of incidents and proactively prevents them from recurring and potentially causing major issues in the future. Without taking the time to review incidents and solve the underlying problems, they will keep happening and may grow more serious.
Conclusion
Understanding the difference between incident management and problem management, and having dedicated managers for each, ensures that you are not just putting out fires all day. Fixing issues in the infrastructure immediately through incident management provides temporary relief, but without finding the root of the problem it will soon exhaust your resources and employees. Bringing in problem management helps investigate the cause of the incidents and put steps in place so they don’t keep occurring. By having a specific manager or team for this process, you will be one step closer to reducing the rate of incidents in your organization and preventing major outages and service disruptions.