Do you want to become an A-player in your company? That employee whose value instantly doubles? The secret is simple: learn to solve problems from the ground up. Especially those that others call unsolvable.
In the business world, there is a special caste of specialists: troubleshooters. These are the people who "shoot" problems away. Let's explore who they are, how they differ from crisis managers, and what tools they use to put out fires within 24 hours.
Troubleshooter vs. Crisis Manager: What's the Difference?
Many people confuse these concepts, but these are fundamentally different roles with different planning horizons and tools.
Troubleshooter (firefighter)
- Metaphor: A fire specialist. He arrives, contains the fire, installs a temporary patch, and saves the unit.
- Focus: specific (a specific process, function, warehouse failure, website crash).
- Time horizon: hours and days (Day 0, Day 1).
- Success metrics: MTTR (time to recovery), damage reduction.
- Tools: war room, SAFE 24, Change Sweep.
Crisis manager (ship captain)
- Metaphor: A ship captain. He saves the entire sinking liner.
- Focus: broad perimeter (entire company, reputation, liquidity).
- Time horizon: weeks and months.
- Success metrics: business viability, cash flow, reputation.
- Tools: anti-crisis program, restructuring, work with investors.
The main rule
- The troubleshooter fixes a specific problem quickly and efficiently.
- The crisis manager saves the entire system.
- They often work in tandem: the troubleshooter extinguishes acute problems, while the crisis manager develops a long-term strategy for exiting the nosedive.
Problem or Trouble? How to Know When to Sound the Alarm
Not every difficulty is a Trouble.
- Problem: a planned task, risk is controllable, damage is limited, time is available.
- Trouble: an emergency situation, damage is growing hourly, deadline is approaching, key clients are affected.
SAS 3H + Trouble Gate Tool
To understand your situation in 2 minutes, use the SAS scoring system. Rate the situation on a scale of 1 to 4 in three areas:
- Severity: from inconvenience (1) to stopping a critical process and legal risks (4).
- Urgency: from “will wait a week” (1) to “point of no return in 24 hours” (4).
- Spread: from one employee (1) to the entire company/key clients (4).
Incident score and class: sum of S + U + S → class and mode
- 9–12 = P1 (Trouble): Red mode: SAFE-24, frequent updates, DO assigned
- 7–8 = P2 (Trouble): Orange mode: containment today, fix planned for tomorrow
- 4–6 = P3 (Problem): Standard track, slot for analysis within the next 3 days
- 3 = P4 (Problem): Backlog/monitoring
Trouble Gate: Add up the points.
- 7+ points: This is a Trouble (P1 or P2). Enable troubleshooting mode.
- Less than 7: This is a Problem (P3 or P4). We will resolve it as planned.
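The scoring and the gate above can be sketched in a few lines of Python. This is only an illustration of the rules as described here; the names `Triage` and `triage` are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triage:
    total: int           # sum of the three scores, 3..12
    incident_class: str  # "P1".."P4"
    is_trouble: bool     # Trouble Gate: True when total >= 7
    mode: str            # response mode for the class


def triage(severity: int, urgency: int, spread: int) -> Triage:
    """Score an incident on three 1-4 scales and map the sum to a class."""
    for name, value in (("severity", severity), ("urgency", urgency), ("spread", spread)):
        if not 1 <= value <= 4:
            raise ValueError(f"{name} must be between 1 and 4, got {value}")

    total = severity + urgency + spread
    if total >= 9:
        cls, mode = "P1", "Red: SAFE-24, frequent updates, DO assigned"
    elif total >= 7:
        cls, mode = "P2", "Orange: containment today, fix planned for tomorrow"
    elif total >= 4:
        cls, mode = "P3", "Standard track, analysis slot within 3 days"
    else:
        cls, mode = "P4", "Backlog / monitoring"
    return Triage(total, cls, total >= 7, mode)


# Example: critical process stopped (4), a day or two left (3), one department hit (2)
result = triage(severity=4, urgency=3, spread=2)
print(result.incident_class, result.is_trouble)  # P1 True
```

A score of 3 + 2 + 2 = 7 is the lowest sum that passes the gate; it lands in P2.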
The 4 Steps of a Troubleshooter's Work
The work of a troubleshooter is not chaos, but a clear algorithm.
Step 1: Contain (Stabilize)
Goal: Stop the "bleeding" (damage).
Here we use the Day 0 tactic. Don't try to find the cause and the perfect solution right away. First, apply a tourniquet to stop the bleeding.
SAFE 24 Tool (Playbook for Day 0)
This is a checklist of actions for the first 24 hours:
- S (Stop the bleeding): Isolate the affected area. Disable the integration, halt shipments, throttle the load.
- A (Alert): Notify everyone involved and assign a DO (Decision Owner).
- F (Freeze changes): Freeze all changes. No new releases, deployments, or changes to procedures until the fire is extinguished.
- E (Eye / Monitor): Enable monitoring. Display 3-5 indicators (leading and lagging) on the dashboard.
- Workaround: Implement a temporary “crutch” (manual process, alternative contractor).
- Log: Keep a log of your actions. This will serve as a basis for future debriefing.
Step 2: Find the Cause (Diagnosis)
Goal: To make a quick diagnosis, like Dr. House.
Change Sweep Tool (6D)
What changed before everything broke? Check 6 domains in 15 minutes:
- People: shifts, new employees, layoffs?
- Process: new instructions, regulations, schedules?
- Platform: releases, patches, software updates?
- Data: data import, stock recalculation, master data modification?
- Policy: new rules, discounts, return policies?
- Environment: weather, external events, supplier actions?
Event Timeline Builder Tool
Create a timeline (like in the Chernobyl series). Reconstruct the events minute by minute, from the first symptoms to the peak. Overlay the changes (from the Change Sweep) on the metrics chart. A change that lands right before the problem spike is your first hypothesis.
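As a rough sketch of the overlay idea, the snippet below lists the changes that landed shortly before the spike, closest first. The `Change` type, the `suspects` function, the 6-hour window, and all sample data are hypothetical and exist only for this example. Being close in time makes a change a candidate, not a proven cause.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Change(NamedTuple):
    at: datetime
    domain: str       # one of the six Change Sweep domains
    description: str


def suspects(changes: List[Change], spike_at: datetime,
             window: timedelta = timedelta(hours=6)) -> List[Change]:
    """Return changes made within `window` before the spike, closest first."""
    candidates = [c for c in changes if timedelta(0) <= spike_at - c.at <= window]
    return sorted(candidates, key=lambda c: spike_at - c.at)


# Hypothetical sample data
spike = datetime(2024, 5, 1, 14, 0)
changes = [
    Change(datetime(2024, 5, 1, 9, 0), "People", "new shift started"),
    Change(datetime(2024, 5, 1, 13, 40), "Platform", "release deployed"),
    Change(datetime(2024, 5, 1, 15, 0), "Policy", "new discount rule"),
    Change(datetime(2024, 4, 30, 9, 0), "Data", "stock recalculation"),
]

for c in suspects(changes, spike):
    minutes_before = int((spike - c.at).total_seconds() // 60)
    print(f"{c.domain}: {c.description} ({minutes_before} min before the spike)")
```

Here the release deployed 20 minutes before the spike comes first, and the shift change five hours earlier comes second. The change made after the spike and the one from the previous day are filtered out.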
Step 3: Propose a Solution
Goal: To choose the fastest and safest solution.
Evaluate options using the Impact/Risk/Effort/Reversibility matrix. Choose the option that provides the greatest impact with the least risk.
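One possible way to make the matrix concrete is a simple score per option. The article does not prescribe a scale or a formula, so the 1-5 scale and the equal-weight sum below (impact and reversibility count for an option, risk and effort count against it) are assumptions for illustration. The option names and numbers are invented too.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Option:
    name: str
    impact: int         # 1 (low) .. 5 (high); higher is better
    risk: int           # 1 (low) .. 5 (high); lower is better
    effort: int         # 1 (low) .. 5 (high); lower is better
    reversibility: int  # 1 (hard to undo) .. 5 (easy to undo); higher is better


def score(option: Option) -> int:
    """Assumed equal-weight score: benefits minus costs."""
    return option.impact + option.reversibility - option.risk - option.effort


def rank(options: List[Option]) -> List[Option]:
    """Best option first."""
    return sorted(options, key=score, reverse=True)


# Hypothetical options for a failed release
options = [
    Option("Hotfix in production", impact=5, risk=4, effort=3, reversibility=2),
    Option("Roll back the release", impact=4, risk=1, effort=2, reversibility=5),
    Option("Manual workaround", impact=3, risk=2, effort=4, reversibility=5),
]

for option in rank(options):
    print(f"{score(option):>3}  {option.name}")
```

With these numbers the rollback scores 6, the manual workaround 2, and the hotfix 0. Adjust the weights if one dimension matters more in your situation, for example if an irreversible step is unacceptable.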
Step 4: Implement Under Control
Goal: To roll out the solution gradually while avoiding side effects.
Use pilot groups and canary releases. Don't lift the change freeze until you're sure the system is stable.
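A canary rollout can be sketched as a loop that widens exposure step by step and stops at the first unhealthy stage. The stage percentages and the `is_healthy` callback are assumptions made for this example. In practice the check would read the indicators on your incident dashboard.

```python
from typing import Callable, Sequence, Tuple


def staged_rollout(stages: Sequence[int],
                   is_healthy: Callable[[int], bool]) -> Tuple[bool, int]:
    """Widen exposure stage by stage.

    Returns (completed, last_healthy_percent). On the first unhealthy stage
    the rollout stops, so the caller can fall back to the last healthy level.
    """
    last_ok = 0
    for percent in stages:
        if not is_healthy(percent):
            return False, last_ok
        last_ok = percent
    return True, last_ok


# Hypothetical check: the fix starts misbehaving once half of the users get it
completed, safe_level = staged_rollout([5, 25, 50, 100], lambda percent: percent < 50)
print(completed, safe_level)  # False 25
```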
Managing Chaos: Communication and Accountability
RACI + DO Tool
Democracy doesn't work in times of crisis. There must be a single Decision Owner (DO)—the person with the mandate to make the final decision "here and now."
Complete the RACI matrix for the incident, clearly identifying who does the work (R), who is accountable (A = DO), who is consulted (C), and who is kept informed (I).
STICC (Update Format) Tool
To avoid panic, communicate clearly and on a regular schedule (update cadence). Use the STICC format:
- S (Situation): Facts and metrics now.
- T (Task): What we are doing right now.
- I (Intent): Why are we doing this? Expected effect.
- C (Concern): Risks and blockers. What's stopping us?
- C (Calibration): What are we waiting for? When's the next update?
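The five fields above map naturally onto a small template, so every update has the same shape. This is a minimal sketch; the class name `StatusUpdate` and the sample text are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class StatusUpdate:
    situation: str    # facts and metrics now
    task: str         # what we are doing right now
    intent: str       # why we are doing it, expected effect
    concern: str      # risks and blockers
    calibration: str  # what we are waiting for, time of the next update

    def render(self) -> str:
        """Format the update as five labeled lines, always in the same order."""
        return "\n".join([
            f"Situation: {self.situation}",
            f"Task: {self.task}",
            f"Intent: {self.intent}",
            f"Concern: {self.concern}",
            f"Calibration: {self.calibration}",
        ])


# Hypothetical update for a warehouse incident
update = StatusUpdate(
    situation="Shipments halted since 10:00, 120 orders delayed",
    task="Switching label printing to the manual process",
    intent="Resume at least half of the shipments by 14:00",
    concern="Only two people are trained on the manual process",
    calibration="Waiting for supplier confirmation; next update at 13:00",
)
print(update.render())
```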
The biggest mistake newcomers make is jumping straight to the fix without triage or stabilization. Don't be a hero who puts out fires with gasoline. Be a systematic troubleshooter.
Want to delve deeper?
Watch the TV series "Chernobyl": it's a perfect tutorial on how to handle (or how not to handle) P1 incidents.
Better yet, I recommend taking the Troubleshooter PRO course.
🚩 Be the one who solves the unsolvable. Don't miss out.