If you’ve spent much time in the tech world, you have most likely experienced at least a few disasters. Whether it’s a customer-facing service outage, a data leak, or an account hack, sooner or later you are going to face a crisis.
This article is an overview of disaster response strategies to mitigate the impact on your company and on your personal reputation.
Don’t Panic!
"Don't Panic. It's the first helpful or intelligible thing anybody's said to me all day."
- Douglas Adams, The Hitchhiker's Guide to the Galaxy
Your phone starts blowing up with monitoring alerts: “Service Down.” Maybe it’s 10am on a Tuesday (best case), maybe it’s 2am Saturday morning (ugh). First thing to do: take a deep breath. No problem ever got solved by panicking. It’s imperative to keep a calm, level head when dealing with emergencies. If you find yourself flustered and unable to focus, try box breathing - if it’s good enough for the Navy SEALs, it should work for you too. Both panic and calm confidence are contagious. Your attitude will influence those around you, and a calm team will be able to handle an emergency effectively, whereas a panicked team will not.
Be Prepared
Incident Response Team
When emergencies happen, you need a response team in place that can react as soon as possible. At least two senior members of the tech team should be on call and reachable during off hours if need be. This can be a rotating schedule where, on any given week, a couple of senior members of the tech team will leave their phones on and be available to respond in case of emergency. I recommend having two team members available because (a) it’s much easier to bear the stress and responsibility of handling an emergency with help, and (b) it doubles the likelihood that someone will be able to respond right away. Even when people are on call, they may not be able to respond immediately because they are running an errand, getting dinner, etc.
The Incident Response Team Leader will be a senior- or director-level member of the engineering team, who will organize the response and communicate with the heads of the affected business units of your company.
Disaster Recovery
Developing disaster response and recovery plans is a large topic that is outside the scope of this article; however, I would like to highlight a few key concerns.
For most organizations, data stores are business-critical. Backups should be automated. The frequency of backups is determined by weighing the business impact of losing a given period of data against the effort and resources more frequent backups require; daily is a typical minimum. Many managed data storage services, such as AWS RDS and DynamoDB, come with automated backups that are easy to configure and provide snapshots multiple times per day. Geo-redundancy of backups should also be considered, in case a disaster takes out an entire regional data store.
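To make this concrete, here is a minimal sketch, using boto3, of enabling automated backups on an RDS instance and point-in-time recovery on a DynamoDB table. The instance and table names are hypothetical, and the retention period should follow your own recovery objectives.

```python
# Minimal sketch: turn on automated backups for RDS and point-in-time recovery
# for DynamoDB. "orders-db" and "customer-events" are hypothetical resource names.
import boto3

rds = boto3.client("rds", region_name="us-east-1")
dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Keep 7 days of automated RDS snapshots (adjust to your recovery objectives).
rds.modify_db_instance(
    DBInstanceIdentifier="orders-db",
    BackupRetentionPeriod=7,
    ApplyImmediately=True,
)

# Enable point-in-time recovery on a DynamoDB table.
dynamodb.update_continuous_backups(
    TableName="customer-events",
    PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
)

# For geo-redundancy, automated snapshots can also be copied to a second region
# (e.g. with rds.copy_db_snapshot from a client in the destination region).
```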
Once you develop a backup and recovery strategy for your data stores, it is important to rehearse the data recovery plan to verify that it works and to learn how long a recovery takes. Rehearsing data recovery at regular intervals throughout the year is advisable.
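A rehearsal can be as simple as restoring the most recent snapshot into a throwaway instance and timing how long it takes. Below is a rough sketch for RDS; the instance identifiers are hypothetical.

```python
# Rough sketch: rehearse recovery by restoring the latest automated RDS snapshot
# into a temporary instance and recording how long the restore takes.
# "orders-db" and "orders-db-restore-drill" are hypothetical identifiers.
import time
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Find the most recent completed automated snapshot for the source instance.
snapshots = rds.describe_db_snapshots(
    DBInstanceIdentifier="orders-db", SnapshotType="automated"
)["DBSnapshots"]
latest = max(
    (s for s in snapshots if s["Status"] == "available"),
    key=lambda s: s["SnapshotCreateTime"],
)

start = time.time()
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="orders-db-restore-drill",
    DBSnapshotIdentifier=latest["DBSnapshotIdentifier"],
)
# Block until the restored instance is available, then report the elapsed time.
rds.get_waiter("db_instance_available").wait(
    DBInstanceIdentifier="orders-db-restore-drill"
)
print(f"Restore drill completed in {time.time() - start:.0f} seconds")
```

Remember to tear the drill instance down afterwards, and to verify that the restored data actually looks right, not just that the instance comes up.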
A plan to rebuild platform infrastructure is another important aspect of disaster recovery. This is much more feasible when your infrastructure is defined as code, using tools like CloudFormation, Terraform, or Ansible.
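For example, if the platform is described in a CloudFormation template kept in version control, rebuilding it largely reduces to launching a new stack from that template. The stack name and template URL below are placeholders.

```python
# Sketch: recreate platform infrastructure from a versioned CloudFormation template.
# The stack name and template URL are placeholders.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

cfn.create_stack(
    StackName="platform-recovery",
    TemplateURL="https://s3.amazonaws.com/example-infra-bucket/platform.yaml",
    Capabilities=["CAPABILITY_NAMED_IAM"],  # required if the template creates IAM resources
)
cfn.get_waiter("stack_create_complete").wait(StackName="platform-recovery")
print("Infrastructure stack recreated")
```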
Incident Response Playbooks
Having playbooks that your response team can refer to for various situations will improve the speed and effectiveness of response. A playbook typically defines the key roles needed, contact details, and steps to be taken after the occurrence of an incident. Playbooks often include a decision tree based on the facts of the incident.
Example: Data Breach Playbook
This is an example of what a data breach playbook can look like. Yours may differ depending on the nature of your business and organizational structure.
Key Roles:
Technical Security Lead: A senior member of the engineering team who has a high degree of knowledge regarding the production data stores. This person will lead the technical investigation and communicate with the Incident Response Team.
Business Lead: This person will serve as the point of contact with the heads of the business units affected by the breach. They will be responsible for communicating the impact the breach is having on the business and for relaying updates from the Incident Response Team.
Legal Resource: If the data breach involves customer or client data, a legal resource may need to be consulted on how best to disclose the data breach.
Steps:
Investigation: The Technical Security Lead will conduct the investigation and report the following to the Business Lead:
Date and time of the breach
Cause and affected data stores
Scope of the breach (amount of data leaked)
Categories of data leaked
Containment: Take action to stop an ongoing breach, and correct the point of failure so it cannot happen again.
Communicate: The Business Lead should be notified as to the nature and scope of the breach.
Assess Impact: The type of data and the impact on the business and customers will determine what further steps need to be taken. If it is determined that the breach is a notifiable data breach, those steps may include:
Consulting the Legal Resource to determine an appropriate communication to customers
Communicating with clients. Often your SLAs will set a maximum time in which you must communicate in the event of a data breach.
Follow-up: Further investigation to see if any additional steps need to be taken to harden your platform, infrastructure, or software.
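One way to keep a playbook like this close at hand is to store it as structured data in version control alongside your runbooks, so contact details and steps get reviewed and updated like any other change. A rough sketch, with placeholder contacts:

```python
# Rough sketch: the data breach playbook above captured as structured data,
# suitable for keeping in version control. All contacts are placeholders.
DATA_BREACH_PLAYBOOK = {
    "roles": {
        "technical_security_lead": "security-lead@example.com",
        "business_lead": "biz-lead@example.com",
        "legal_resource": "legal@example.com",
    },
    "steps": [
        "Investigation: date/time, cause, affected data stores, scope, categories of data",
        "Containment: stop the ongoing breach and correct the point of failure",
        "Communicate: brief the Business Lead on the nature and scope of the breach",
        "Assess impact: decide if the breach is notifiable; consult the Legal Resource",
        "Follow-up: harden platform, infrastructure, or software as needed",
    ],
}
```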
Assess The Situation
When an incident occurs, before reaching out to the response team, take a few minutes to assess the situation. Maybe it’s a false alarm. On one occasion, after business hours, I received alerts from our monitoring service saying all our production servers had gone offline. While this could’ve indicated a serious AWS outage, I suspected there might be an issue with the monitoring platform. Indeed, their health dashboard showed the monitoring platform was having significant issues. After spot-checking our platform and finding everything working as normal, I relaxed; there was no need to bother the rest of the team.
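A spot check like this is easier at 2am if it is scripted ahead of time. Here is a minimal sketch that pings a few health-check endpoints; the URLs are hypothetical and should be replaced with your own services’ health checks.

```python
# Minimal sketch: spot-check production health endpoints before escalating.
# The URLs are hypothetical placeholders for your own services' health checks.
import requests

HEALTH_CHECKS = {
    "api": "https://api.example.com/healthz",
    "web": "https://www.example.com/healthz",
}

for name, url in HEALTH_CHECKS.items():
    try:
        resp = requests.get(url, timeout=5)
        status = "OK" if resp.ok else f"HTTP {resp.status_code}"
    except requests.RequestException as exc:
        status = f"UNREACHABLE ({exc.__class__.__name__})"
    print(f"{name}: {status}")
```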
Once you determine that there is a problem, reach out to the response team and take steps to:
1. Determine the Scope of the Issue
What services are affected?
What is the impact on the end user? E.g. increased latency, or full outage.
What level of response is required?
Full outage on critical service - immediate all hands on deck response
Increased latency - more investigation required, as it may lead to a full outage or may simply be the result of a temporary increase in traffic or of service degradation
Data or account info exfiltration - may require an immediate response if it’s an ongoing attack
2. Investigate the Cause
Determine who is best qualified and/or available to investigate
Set up a “war room” communication channel for the team to discuss the issue
Ask the team to post status updates on what they are looking into, theories on potential causes, and what they discover (a minimal sketch for automating these updates follows this list)
3. Plan a Mitigation Strategy
Brainstorm on how best to alleviate the issue as soon as possible
Refer to the relevant playbook if one is available
If data loss is involved, create a plan on how to backfill or recover the lost data
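As referenced in step 2, status updates can also be pushed into the war-room channel with a small helper, which keeps a timestamped record of the investigation. The sketch below assumes a Slack-style incoming webhook; the webhook URL is a placeholder, and most chat tools offer a similar mechanism.

```python
# Sketch: post incident status updates to a war-room channel via an incoming webhook.
# WEBHOOK_URL is a placeholder; Slack-style incoming webhooks accept a JSON "text" field.
import requests

WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def post_status(update: str) -> None:
    """Send a short status update to the incident war room."""
    requests.post(WEBHOOK_URL, json={"text": update}, timeout=5)

post_status("14:30 UTC - payments API latency elevated; investigating the last deploy")
```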
Communication
Communicating with internal and external stakeholders is critical to mitigating the impact and perception of a disaster. You want to convey the scope of the issue, but also alleviate concerns and project calm. Be clear not only about what has occurred and its impact on the company and customers, but also describe what has been done in response and how you plan to repair and remediate the impact of the incident.
Internal Communication
Communicate with company stakeholders as soon as you determine the scope of the issue. Send an email that outlines the affected services or data, notes that the response team is investigating the cause, and says that you will follow up with status updates as soon as you have new information.
In the case of critical service failure or platform outage, you should provide updates at regular intervals. It may be helpful to set a timer to go off every 30 minutes to remind you to provide an update, even if it’s simply “we are still investigating…” This provides reassurance that the matter is of the highest priority and being actively investigated.
Once you have determined the cause and mitigation strategy, update everyone with the details. Assuming the stakeholders are non-technical, be clear in your communication. Avoid technical jargon; before sending, proofread your emails, putting yourself in the mind of a non-technical reader.
External Communication
If an incident, whether it is a service outage or data loss, will take some time to remedy, the customer service team may need some help in communicating with customers. Likewise, once a mitigation/recovery strategy has been developed, this too will need to be communicated clearly.
If customer accounts have been hacked, or customer data has been exfiltrated, you may have an SLA in place that determines how much time you have before the issue must be communicated to the customer. While no one likes to hear that their account was hacked, your response can greatly affect how forgiving the customer will be. A solid communication will contain:
A full description of the extent of the issue
A summary of how you determined there was an issue and what caused it, i.e. the results of your investigation. This should be a high-level description, light on technical detail.
What was done as an immediate response, e.g. “we locked your account login and sent you a request to change the password” or “a vendor had stored your data on a public S3 bucket - we deleted that bucket and locked them out of our systems”.
What further steps can be taken for recovery, with a schedule that may depend on their deliverables - “Our API was down for 4 hours. We can backfill this data with a CSV if you can provide one. We will have the data backfilled within 2 business days of receiving the CSV.”
What steps will be taken to make sure this issue doesn’t happen again.
Communication that contains the above will reassure your customers, and inspire confidence in your ability to be accountable and take proper action.
Post Mortem
After every incident, once remediation actions have been taken, it’s important to do a post mortem. The goals of the post mortem are to fully understand what went wrong, to define steps to prevent the same type of incident in the future, and to provide reassurance to internal and external stakeholders.
It’s important to understand that, even if the issue was caused by an individual’s actions, it is unlikely that person is solely to blame for what happened. There is almost always a problem with process, e.g. the code deployment process, or methods used to access production data and systems, etc.
For example: a developer deployed their branch to production, and it contained a bug that caused a service outage. Questions to ask are:
How did that code get past code review with such a catastrophic bug?
Should there have been unit tests written that would have revealed the bug?
Should there be integration tests that run automatically before deployment that would have caught the bug?
Should the bug have been caught by QA testing before deployment?
There isn’t a developer who hasn’t written buggy code before. You should have a development and deployment process that will catch most bugs. In this example, the developer isn’t to blame; the process needs to be improved.
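As one example of a process improvement, a smoke test that runs automatically before every deployment would catch a change that breaks a critical endpoint. This is only a sketch; the staging URL and endpoint are hypothetical.

```python
# Sketch: a pre-deployment smoke test (run with pytest in CI) that fails the
# pipeline if a critical endpoint breaks. The staging URL and path are hypothetical.
import requests

STAGING_URL = "https://staging.example.com"

def test_orders_endpoint_is_healthy():
    resp = requests.get(f"{STAGING_URL}/api/orders?limit=1", timeout=10)
    assert resp.status_code == 200
```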
Another example: a developer was asked to provide a report from data stored in Elasticsearch and accidentally sent a DELETE request instead of a GET request, dropping the index that stored the data. Sufficient safeguards should be put in place to prevent this from happening, such as restricting direct developer access to Elasticsearch and creating a tool that can only run GET requests to query the data, as sketched below. While the developer may have fat-fingered a request that deleted the data, mistakes like that should be expected when unrestricted access is allowed without proper safeguards.
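As a rough sketch of such a safeguard, a small helper that only ever issues GET requests against Elasticsearch’s _search endpoint lets developers pull report data without being able to modify or delete anything. The cluster URL, index, and field names are hypothetical, and in practice this should be paired with read-only credentials enforced on the cluster side.

```python
# Rough sketch: a read-only query helper that can only issue GET requests to
# Elasticsearch's _search endpoint, so a fat-fingered DELETE is impossible here.
# The cluster URL, index name, and field are hypothetical; pair this with
# read-only credentials on the cluster rather than relying on the client alone.
import requests

ES_URL = "https://es.internal.example.com:9200"  # hypothetical cluster address

def search(index: str, query: dict, size: int = 100) -> dict:
    """Run a read-only search against a single index and return the raw response."""
    resp = requests.get(
        f"{ES_URL}/{index}/_search",
        json={"query": query, "size": size},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

# Example: pull the last week of (hypothetical) order documents for the report.
results = search("orders", {"range": {"created_at": {"gte": "now-7d/d"}}})
print(results["hits"]["total"])
```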
Once you have a remediation plan for this situation, create or update an emergency playbook to handle a similar situation in the future. Communicate the findings of the post mortem and steps that are being taken to internal stakeholders. In situations where customers or clients were affected, you may wish to communicate a summary of the plan to prevent a similar issue in the future.
Summary
Every tech company will face emergencies, from bot attacks to hacking to data loss. How you handle these situations will make the difference between losing and gaining confidence in yourself and your company. Preparation for disaster is the key to successful recovery. Being honest and accountable with your company’s internal stakeholders, clients, and customers will inspire confidence and reassurance that you are capable of handling issues in a timely and responsible manner.