As part of continuous improvement, and just like firefighters, IT Operations should spend some time training on various scenarios. I believe it helps the team prepare for unpleasant events and disasters, or at the very least produces documentation that validates that all the procedures are in place. This notion has been around for a while, most famously as the Disaster Recovery Plan, which used to be a nightmare for a lot of teams and was mostly reserved for large companies driven by regulatory compliance.
Here, however, I propose to broaden the scope of this practice a bit, give some hints on how to integrate it in an agile manner for continuous improvement, and gamify the exercise. Hopefully this will serve as a light introduction to Risk Management and how to implement it in your organization.
1 – What are the training sessions about?
Fiction versus Reality
There are two steps in operational training:
- Make a hypothesis and build a procedure upon it
- Run it through fire and see if it actually works
Depending on the scenarios chosen, it might be tricky to test everything, but I believe that even just brainstorming around the hypothesis and documenting it greatly helps teams feel more comfortable with unexpected events, and plays a big part in the continuous improvement policy of the organization. And trust me, these events will occur during your career, so it’s always worth seeing how to prepare for them. A few main categories usually drive the training sessions.
The most straightforward category is availability, and it boils down to asking the question “What if X fails?”. The scope can be quite broad, but it is usually divided into the following layers:
- software component (service, database)
- hardware component (server crash)
- network connection (ISP problem, 3rd party connection)
We don’t focus so much on the cause of the problem or how to troubleshoot it yet, but merely on its impact on the current architecture and how to cope with it.
In a second pass, however, we can categorize events by root cause and propose proactive remediation.
Disaster Recovery Plan
As I said earlier, this availability concern is mainly found in the Disaster Recovery Plan, which basically asks what would happen if your data center was hit by a meteor or a similarly destructive event and became nonoperational. How would the business keep going, and what recovery time could be expected? It is mainly required for compliance purposes in large companies and is really heavy to put in place. As the cloud trend prospers, this responsibility has shifted from the regular IT department to cloud providers.
In this category we explore the data side of availability and its integrity, with the question “What happens if this source of data disappears?”. It could be a production database, as we’ve seen recently with the GitLab incident, or the financial reports for the quarterly review.
How would you recover that data? Can you recover it at all? How does it impact your business and processes?
The next category concerns people. It is also straightforward, but rarely mentioned in compliance or disaster training. You probably know it better under the title “The bus effect”. What would happen if one team member was hit by a bus on the way to work and spent several months recovering? Do you have key members who hold too much information in their heads, and whose absence for a period of time would dangerously impact your business?
Security : Confidentiality, Integrity
In today’s world it is hard to go a week without seeing new security incidents. I believe this should be a legitimate part of risk management and training for operations. It’s delusional to believe you will never be affected by a security breach. Even if the practical side is hard to set up, it is worth walking through some scenarios:
- User/Password stolen or compromised
- Root access gained to one of your machines
- Service password stolen or compromised (DB access password, API token, deploy tool key …)
- Private key leakage (SSH or certificate)
Red Team vs Blue Team (Security special case)
Regarding security, there is a specific kind of practical exercise called Red Team / Blue Team. The principle is quite simple since it’s based on attack and defense. The red team tries to hack services using any means, while the blue team tries first to detect the attack and then to block it or deal with it as it unfolds. Once the session is over, both teams gather and hold a retrospective to identify the weak spots, both in visibility and in countermeasures. While organizing such sessions requires a particular set of skills, brainstorming around security events is a good introduction to the practice. You can ask security-oriented companies to help you by providing “real-life” scenarios, or search for them yourself in security incident post-mortem articles online.
2 – Setting up your training sessions
Now that we’ve seen the theory and the different scenarios we can implement, let’s see how to run these sessions and integrate them nicely into a continuous improvement practice as a team.
The objective here is to choose a set of topics, gather your team, and randomly pick one subject to brainstorm about in order to see the impact and risks associated with that specific scenario. Once that is done, a retrospective method such as KPT helps you mitigate the risks or document a solution in a knowledge base.
2 – 1 > Preparation
Unless you have been living in a cave for the past few years, you have certainly heard of agile processes and continuous improvement. The aim is to incrementally improve the condition of a service or organization. It is mainly done through short cycles of action followed by a retrospective, which is then used as the basis for the next cycle. Here we propose to do the same by setting up brainstorming sessions to get a grasp on risk management and the associated mitigation. It helps to train people for unexpected events, supports compliance, and lets you respond proactively to problems.
The frequency and length of the sessions are at your discretion, but I have usually run them once a month for an hour and a half. You may think that is too infrequent to be effective, but it was the result of a negotiation with management, and it has still proven valuable both for the team and for management.
2 – 2 > Choosing topics
You are free to choose topics any way you like, but if you lack imagination, incident post-mortem stories are a really good place to start. Whether the incident occurred in your organization or you saw it in the news, you can usually draw inspiration from it to create your scenarios.
2 – 3 > Spin the Wheel of Misfortune !
To gamify the process a bit and add some random fun to the brainstorm, I like to use the Wheel of Misfortune! It’s a simple piece of SVG code driven by JSON data that we use in training sessions to decide which topic we’ll talk about this time. It takes some weight off the shoulders of the session facilitator and allows a fair choice of scenario, not only what the boss wants us to look into. No offense, but the manager is not always aware of the impact an incident might have, and the priorities they choose to focus on are not always the best, so randomizing this choice seemed like the best option.
I created this little piece of code for fun; feel free to use it, share it, or improve it as you please!
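To give an idea of the mechanism, here is a minimal Python sketch of the random draw. It is not the Wheel of Misfortune code itself (which is SVG driven by JSON); the JSON layout, the scenario titles, and the `spin` helper are assumptions made up for illustration. The one design choice it adds is remembering which scenarios were already played, so the same topic does not come up twice in a row until the whole wheel has been used.

```python
import json
import random

# Hypothetical scenario data: the categories mirror the ones discussed in
# this article, but the JSON layout itself is an assumption.
SCENARIOS_JSON = """
[
  {"category": "availability", "title": "The main database server crashes"},
  {"category": "availability", "title": "The ISP link goes down"},
  {"category": "data", "title": "The production database is deleted"},
  {"category": "people", "title": "A key team member is away for months"},
  {"category": "security", "title": "An API token is leaked"}
]
"""


def spin(scenarios, played, rng=random):
    """Pick a random scenario whose title has not been played yet.

    When every scenario has been played, the history is cleared and the
    whole wheel becomes available again.
    """
    remaining = [s for s in scenarios if s["title"] not in played]
    if not remaining:
        played.clear()
        remaining = list(scenarios)
    choice = rng.choice(remaining)
    played.add(choice["title"])
    return choice


if __name__ == "__main__":
    scenarios = json.loads(SCENARIOS_JSON)
    played = set()  # in practice, persist this between sessions
    topic = spin(scenarios, played)
    print(f"Today's misfortune ({topic['category']}): {topic['title']}")
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible, which can be handy if someone suspects the wheel of being rigged.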
2 – 4 > Play the scenario
Once the topic is chosen, the real work begins. You have to play the character who comes in one morning and discovers the incident! If you only do the theoretical part, you first need to list all the impacts generated by the incident. I recommend that you bring the current documentation to help you map the impact. If you don’t have any documentation… well, you know your first task 😉
Senior engineers are usually the quickest to determine the impact and see the damage, but it is actually much more interesting to ask a junior to do it. It helps them understand the connections between services, might bring a fresh view of the problem, and gives them a sense of belonging to the team. Even if you correct them later on, everyone benefits.
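If your documentation already lists which service depends on which, the impact mapping can be sketched as a walk through that dependency graph. The example below is a minimal Python sketch under that assumption: the service names, the `DEPENDS_ON` layout, and the `impacted_by` helper are all hypothetical, and a real architecture will need more nuance (degraded modes, fallbacks, and so on).

```python
from collections import deque

# Hypothetical architecture: each service maps to the components it needs.
DEPENDS_ON = {
    "website": ["api"],
    "api": ["database", "cache"],
    "reporting": ["database"],
    "database": [],
    "cache": [],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the graph: component -> services that rely on it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk from the failed component up to its dependents.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


if __name__ == "__main__":
    # "What if the database fails?"
    print(sorted(impacted_by("database", DEPENDS_ON)))
    # prints ['api', 'reporting', 'website']
```

Comparing the junior’s answer with this kind of list is also a cheap way to find out where the documentation and the real architecture disagree.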
Once the damage is clear, it’s time to see why it failed and how we could have prevented it or how to cope with it. This might not be straightforward, and depending on the scenario it can be rather broad.
2 – 5 > Validation of lessons learned
Once you’re into the scenario I personally like to use the KPT method that I described on this blog before.
- Keep : What is good in the current process to cope with this problem
- Problem : What is missing from the current process that could damage us
- Try : What should we implement to mitigate the impact of the incident or prevent it
I recommend that you keep a written record of the sessions. It can be used as a basis for Risk Management later in the life of the company, or even as supporting material for compliance-related documents.
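The written record does not need to be fancy. As a minimal sketch, assuming you want something structured enough to search later, a session can be stored as one small JSON document with the three KPT lists. The `SessionRecord` class, its field names, and the sample values are made up for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SessionRecord:
    """Written record of one training session (field names are illustrative)."""

    date: str
    scenario: str
    keep: list = field(default_factory=list)
    problem: list = field(default_factory=list)
    try_next: list = field(default_factory=list)  # "try" is a Python keyword

    def to_json(self):
        """Serialize the record so it can be stored in a knowledge base."""
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    # Sample values are invented for the example.
    record = SessionRecord(
        date="2024-01-15",
        scenario="The production database is deleted",
        keep=["Nightly backups exist"],
        problem=["Nobody has ever tested a restore"],
        try_next=["Schedule a restore test on a staging server"],
    )
    print(record.to_json())
```

One file per session in a shared repository is usually enough, and the "Try" items can be copied straight into your backlog.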
You can then go even further by categorizing the problems encountered in these sessions:
- Fatal for the company (the only data center perishes in a fire)
- Life-threatening for the company (ransomware encrypted your entire customer database)
- Flesh wound (the database went down for 2 hours)
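These three levels can double as a crude priority order for the follow-up work. Here is a minimal Python sketch, assuming each finding is simply tagged with one of the levels; the `Severity` enum and the `by_priority` helper are illustrative names, and the findings reuse the examples from the list above.

```python
from enum import IntEnum


class Severity(IntEnum):
    """The three levels above, ordered from least to most serious."""

    FLESH_WOUND = 1
    LIFE_THREATENING = 2
    FATAL = 3


def by_priority(findings):
    """Sort (description, severity) pairs with the most serious first."""
    return sorted(findings, key=lambda finding: finding[1], reverse=True)


if __name__ == "__main__":
    findings = [
        ("The database went down for 2 hours", Severity.FLESH_WOUND),
        ("The only data center perishes in a fire", Severity.FATAL),
        ("Ransomware encrypted the customer database", Severity.LIFE_THREATENING),
    ]
    for description, severity in by_priority(findings):
        print(f"{severity.name}: {description}")
```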
That’s basically it, so don’t hesitate to give it a try! I’d love to hear your feedback and how you handled the discussion. In my experience, even though the scenarios are hypothetical, most engineers really enjoy the exercise and benefit from the risk management training when they have to deal with real-life situations and crises.
Playing through scenarios is also an excellent excuse to present obscure parts of your architecture to a broader audience and possibly fix some of them.
So, are you ready to spin the Wheel of Misfortune?