Technology failure is an inevitable reality that often defies convenience and challenges developers who are charged with resolving issues. Since remediation begins with triaging and escalation, our experts examine proven procedures that help developers expedite a resolution to get operations back on track.
It’s 3 p.m. on Friday, and your IT department is flooded with reports from the customer help desk about a huge ecommerce issue. Customers who are trying to submit orders are receiving an error message, and that spells L-O-S-S in revenue and customer loyalty. The Tier 1 engineer who is unable to fix the problem in this mission-critical system either kicks your IT crisis management process into high gear or punts. It all depends on your organization’s issue remediation playbook.
Triage issue impact
Which issues are most urgent and must be resolved immediately as they arise? Which are medium-priority issues to be resolved when there are no top-priority issues? Which are low-priority issues to be dealt with when time permits?
To triage the issue, the developer must quickly determine:
- How pervasive is the issue?
- What are the damages (lost productivity and hard costs) caused by the issue?
- Are additional resources required to solve the issue?
Our order processing snafu is a “business essential” failure that makes it a top priority. For the plethora of other imaginable issues, a documented evaluation procedure will expedite triage.
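The priority ordering described above (top priority first, medium when nothing urgent is waiting, low when time permits) can be sketched as a small priority queue. This is an illustrative sketch only: the three level names, the `Issue` and `TriageQueue` classes, and the sample issues are assumptions made for the example, not part of any particular help desk tool.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

# Hypothetical priority levels; a lower number means more urgent.
TOP, MEDIUM, LOW = 0, 1, 2

# Tie-breaker so issues with equal priority are handled first-in, first-out.
_sequence = count()


@dataclass(order=True)
class Issue:
    priority: int
    order: int = field(default_factory=lambda: next(_sequence))
    summary: str = field(default="", compare=False)


class TriageQueue:
    """Hands out the most urgent open issue first."""

    def __init__(self):
        self._heap = []

    def report(self, priority, summary):
        heapq.heappush(self._heap, Issue(priority, summary=summary))

    def next_issue(self):
        # Returns None when nothing is waiting.
        return heapq.heappop(self._heap) if self._heap else None


queue = TriageQueue()
queue.report(LOW, "Typo on the help page")
queue.report(TOP, "Customers cannot submit orders")
queue.report(MEDIUM, "Nightly report is slow")

worked_order = []
while (issue := queue.next_issue()) is not None:
    worked_order.append(issue.summary)
print(worked_order)
```

The order-submission failure is worked first even though it was reported second, because priority rather than arrival time decides the order.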
Unlike the ecommerce issue, an error may be isolated to a single user and still have a huge impact.
For instance, a user who needs to authorize a million-dollar payment to earn a discount cannot access the system, and the missed discount costs thousands of dollars. That means the on-call application support person needs to ask the person(s) reporting the problem:
- How many internal users are affected?
- What business processes are affected? What are the repercussions?
- Are the organization’s external customers affected? What are the repercussions?
Categorizing issues using the following criteria helps evaluate the severity of the issue:
- Widespread business stoppage with significant revenue impact
- Risk to human health, safety and/or environment
- Public, widespread damage to organization’s reputation
- Direct revenue impact
- Direct negative customer satisfaction
- Regulatory compliance violation
- Non-public damage to organization’s reputation
- Indirect revenue impact
- Indirect negative customer satisfaction
- Significant employee productivity degradation
Business Supporting
- Moderate employee productivity degradation
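One way to make these criteria usable under pressure is to encode them as a lookup that returns the most severe level matching what the reporter describes. The sketch below is an assumption-laden illustration: the numeric levels 1 to 4, the grouping of the criteria into those levels, and the snake_case criterion names are all chosen for the example, and your own playbook's category names should replace them.

```python
# Severity levels, 1 = most severe. The numeric levels and the grouping of
# criteria are assumptions for illustration; substitute your own categories.
SEVERITY_CRITERIA = {
    1: {
        "widespread_business_stoppage",
        "health_safety_environment_risk",
        "public_reputation_damage",
    },
    2: {
        "direct_revenue_impact",
        "direct_customer_dissatisfaction",
        "regulatory_compliance_violation",
        "non_public_reputation_damage",
    },
    3: {
        "indirect_revenue_impact",
        "indirect_customer_dissatisfaction",
        "significant_productivity_degradation",
    },
    4: {"moderate_productivity_degradation"},
}


def classify(observed):
    """Return the most severe level whose criteria overlap the observed impacts."""
    observed = set(observed)
    for level in sorted(SEVERITY_CRITERIA):
        if SEVERITY_CRITERIA[level] & observed:
            return level
    return None  # nothing matched; needs a human judgment call


print(classify({"indirect_revenue_impact", "direct_revenue_impact"}))
```

An issue that matches criteria at several levels is assigned the most severe one, which errs on the side of escalating too early rather than too late.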
Determine required resources
Since few developers understand an entire system, they need access to the actual system or a mirrored system so they can analyze the components of the system—log files, source code, etc.—in order to uncover the root cause. Then they can determine the need for specific additional resources by asking:
- Can I handle the issue with my playbook?
- What additional resources are needed to solve the issue?
- What level of experience should these additional resources have to fix the problem?
A current list of knowledge owners of the various components of a system should be available to application support personnel, along with the unfettered ability to reach out to them for escalating an issue.
Topping this list are always-available experts for the highest severity issues. The importance of keeping this list updated—job/personnel changes, vacations, personal leaves, illnesses—can’t be overstated.
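A knowledge-owner list is easier to keep honest when availability is stored next to each name. The following is a minimal sketch, assuming a simple in-memory list; the `Owner` class, the `find_contacts` helper, and the names, components, and dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Owner:
    name: str
    component: str
    # Inclusive (start, end) date pairs for vacations, leaves, and so on.
    unavailable: tuple = ()

    def is_available(self, on):
        return not any(start <= on <= end for start, end in self.unavailable)


def find_contacts(owners, component, on):
    """Names of owners for a component who can be reached on the given date."""
    return [o.name for o in owners if o.component == component and o.is_available(on)]


# Hypothetical roster for illustration only.
owners = [
    Owner("Avery", "order-processing", unavailable=((date(2024, 7, 1), date(2024, 7, 12)),)),
    Owner("Blake", "order-processing"),
    Owner("Casey", "payments"),
]

print(find_contacts(owners, "order-processing", date(2024, 7, 5)))
```

If the lookup returns an empty list for a component, that is a gap in coverage worth fixing before an incident exposes it.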
With input from the appropriate knowledge experts, different types of fixes emerge. Consideration should be given to the pros and cons of each, including their ability to prevent future incidents.
Eliminate escalation guesswork
Documenting your comprehensive crisis management process and training the entire IT support staff on how to use it turns emergency handling into a matter of implementing decisions that were made in advance. This includes assigning priority levels to different types of issues, delegating responsibilities to specific personnel, and defining how much time each support level should spend attempting to fix a given issue before the problem is “escalated” to the next support tier.
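The time limits per support tier can be written down as data so that nobody has to guess when to escalate. This is a sketch under stated assumptions: the priority names, tier names, and time budgets below are invented for the example, and `current_tier` is a hypothetical helper rather than a feature of any ticketing product.

```python
from datetime import timedelta

# Hypothetical policy: for each priority, the tiers in order and how long each
# may work the issue before escalating. None means the final tier keeps it.
ESCALATION_POLICY = {
    "top": [
        ("Tier 1", timedelta(minutes=15)),
        ("Tier 2", timedelta(minutes=30)),
        ("Tier 3", None),
    ],
    "low": [
        ("Tier 1", timedelta(hours=4)),
        ("Tier 2", None),
    ],
}


def current_tier(priority, elapsed):
    """Return which tier should own the issue after `elapsed` time without a fix."""
    remaining = elapsed
    for tier, budget in ESCALATION_POLICY[priority]:
        if budget is None or remaining < budget:
            return tier
        remaining -= budget
    # Every budget is used up: the last tier in the chain keeps the issue.
    return ESCALATION_POLICY[priority][-1][0]


print(current_tier("top", timedelta(minutes=20)))
```

Keeping the policy in one table makes it easy to review in a drill and to adjust when a tier turns out to need more or less time.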
Communicate, communicate, communicate
Just as important as an IT crisis management plan is a documented plan for updating the organization on progress and on the estimated time to a fix. This plan should detail who is responsible for making sure this internal and external communication takes place, reaching everyone from those who are experiencing the problem to those impacted by it.
Of course, stakeholders are anxious to know when they can expect a fix, but what they really want to know is how soon they’ll be able to return to their normal work routine. In some cases, a temporary workaround will need to precede a true fix that will take longer.
In any case, it’s important to keep communication about the problem and solution at a high level to ensure that business stakeholders understand the situation. Identify business liaisons in your organization who can translate tech into stakeholder speak.
In keeping with Agile methodologies, we recommend keeping these relationship managers informed through daily 15-minute stand-up meetings that highlight updates and roadblocks. They, in turn, can keep stakeholders apprised and manage expectations.
Practice to perfect your process
Take the IT crisis management plan that looks great on paper a step further by testing it on a regular basis to make sure it works. Practice will uncover flaws as well as out-of-date information about the escalation chain. And testing will make your team more proficient in triaging, troubleshooting, and escalating. As a result, they’ll solve problems more quickly and communicate more effectively.
Or if you discover your team is lacking certain critical abilities, you might consider filling gaps with expertise from an outside service company. In that case, let’s have a conversation.