Wednesday, September 9, 2015

Incident Response 101

A few weeks ago, we had a minor emergency: a water supply line burst in a wall and decided to flood the floor of the IT department at a rather impressive rate. Being located in the basement, the water really had nowhere to go, and it started pooling rather quickly. Fortunately, the burst pipe was a fresh water supply line, rather than a waste disposal line.

The gut response of most people working in a service job is to jump in and actively help wherever help is needed. In any emergency scenario, as is the case with computer security incident response, there are a few things to remember. I'll list them here, in the hope that they are useful to someone.

1) Slow down. Initial reports from others, as well as your own initial assessment, are most likely incorrect and incomplete. Count to ten, take a deep breath, and re-assess the situation.

2) Verify that there actually is an incident. If you get reports that something is going on, always verify them to the extent reasonably possible. In many cases, you'll find that reports are well-intended, but often wrong. However, always thank people for reporting and encourage them to keep doing it. You never want to discourage folks; it is better to get 100 reports that were unfounded than to miss the one that isn't.

3) Put someone in charge. Somebody needs to be put in charge of a scene. That person tells others what to do. Anyone who is not in charge should NOT initiate response actions on their own. This is the hardest one of all. Most technical folks are type A personalities who feel the need to be in control. Yielding that control to somebody else is hard, but doing so ensures that nobody is put in harm's way, that no unnecessary effort is made, and that all necessary steps are taken. The National Incident Management System does a pretty good job of describing a structure to handle emergencies.

4) Secure the area. Whether you are dealing with a physical emergency or with a cyber emergency, securing the affected area is a necessary prerequisite for containment. Securing the area includes sending people who are not directly involved in response on their way, making sure that all persons are physically safe (and stay that way), and protecting property.

5) Contain the badness. Stop the situation from getting worse. In this example, it is as simple as shutting off the water supply. In other scenarios, it may mean transferring live traffic to a secondary server, shutting down a system, or null-routing certain IP space.
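As a sketch of that last containment tactic: on a Linux router, null-routing boils down to installing a black-hole route for the offending address space. The range below is the RFC 5737 documentation block, standing in for whatever space you would actually cut off; the sketch only prints the command rather than running it, since the real thing needs root on the router.

```shell
#!/bin/sh
# Containment sketch: null-route (black-hole) a suspect network block.
# 203.0.113.0/24 is the RFC 5737 documentation range, used as a stand-in.
SUSPECT_NET="203.0.113.0/24"

# On a Linux router, run as root, this drops all traffic routed toward
# the suspect range. Here we only print the command so the sketch is
# safe to run anywhere.
CMD="ip route add blackhole ${SUSPECT_NET}"
echo "$CMD"

# Once the incident is contained, the route is removed the same way:
#   ip route del blackhole 203.0.113.0/24
```

The appeal of a black-hole route over a firewall rule is that it is cheap, obvious in the routing table, and trivially reversible once containment ends.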

6) Eradicate. Make the problem go away. In our flood scenario, we had a plumbing crew come in to replace a cap that had let go. In a server compromise, it may require a full system rebuild and data restoration from known-good media, or a thorough malware removal exercise. Your mileage may vary.

7) Restore. Go back to a normal situation. In our case, the pipes were repaired, carpets dried out, sheetrock replaced and walls repainted. Always continue to watch for continued signs of trouble: however good a job you may have done eradicating the problem, it is easy enough to miss something small, or to accidentally leave the root cause unaddressed.

8) Learn. Once things are humming along nicely, go back and find out what you can do to make things better for the future. Looking back to place blame is unproductive.

Each of these steps has a distinct set of tactics associated with it. For example, when receiving a report of a potentially dangerous situation, keeping a distance to assess further risk and damage is probably wise. It is easy enough to slip in water. When containing a situation, messing around with electrical equipment in the middle of a flood doesn't make things better. In order to restore from a known-good backup, you a) need to have a backup, b) need to be able to read it, c) need to know when the badness started, and d) need archives that go back far enough.
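The first two of those backup preconditions can be checked mechanically before you commit to a restore. The sketch below does just that: it creates a tiny demo archive (paths under /tmp are assumptions purely for the sake of a self-contained example), then verifies the archive both exists and is actually readable, since a file that exists but fails to list is exactly the corruption you want to discover before an incident, not during one.

```shell
#!/bin/sh
# Restore-readiness sketch: check preconditions (a) and (b) from the
# text before trusting a backup. All paths here are made-up demo paths.
BACKUP="/tmp/ir101-demo-backup.tar.gz"

# Build a tiny demo archive so the sketch is self-contained.
mkdir -p /tmp/ir101-demo-data
echo "known-good config" > /tmp/ir101-demo-data/app.conf
tar -czf "$BACKUP" -C /tmp ir101-demo-data

# (a) The backup must exist.
if [ ! -f "$BACKUP" ]; then
    echo "FAIL: no backup found at $BACKUP"
    exit 1
fi

# (b) The backup must be readable: listing the archive contents catches
# corruption that a simple existence check would miss.
if tar -tzf "$BACKUP" > /dev/null 2>&1; then
    echo "OK: backup exists and is readable"
else
    echo "FAIL: backup at $BACKUP is unreadable"
    exit 1
fi

# (c) and (d) -- knowing when the badness started, and having archives
# that reach back before that point -- can't be scripted generically:
# compare the archive's date against the suspected compromise date and
# your retention policy before restoring from it.
```

Running a check like this on a schedule, rather than on the day of the incident, is the difference between preconditions you have and preconditions you hope you have.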

Having an incident response strategy, and people trained in executing that strategy, is paramount.

Remember that, often, sending people out of harm's way is a good initial response. Removing unnecessary people from the equation without making them feel undervalued reduces chaos and complexity. It also ensures that 'need-to-know' is maintained. However, sending people out of harm's way also requires established tactics: making sure that supervisors account for their reports, as well as for their area's guests and contractors, is a good plan.

All of this comes down to preparation: plan for the worst, validate the plan through exercises, and train people in the tactics.


  1. Very good Kees. I'm pleased you noted the 'learn' bit, often forgotten, and mentioned exercises which are arguably the best way to train people in the special processes that accompany serious incidents. But what about resilience? If, say, the flood had happened in the dead of night with nobody around to respond in good time, leading to a power short and complete shutdown, would you have lost essential IT services, or would they have continued (albeit perhaps at a degraded level) from another location?

  2. Well-- there is good news and there is bad news on the resilience front. The bad news is that if this had happened after office hours, it would probably have taken hours to be detected, and, consequently, the impact would have been far worse.

    However, since we are on a lower level, we are a little bit prepared. Specifically, our on-site Public Safety does walkthroughs at least twice a night. They would have caught this without a doubt. Secondly, there are several water sensors under the floor; unfortunately, they were under the floor on the other side of the building, and it would have taken the water quite some time to reach them and trigger the alarm.

    As far as resilience goes: we do have plenty of backup power (UPS + large diesel generators), but if your outlets are compromised, that doesn't help much. What happened in Amsterdam earlier this week illustrates that.

    Then we come to secondary processing capabilities; our primary data center is indeed located in the basement (don't ask), but we have a secondary server room in a different building, on the second floor. Furthermore, we do leverage Amazon's Web Services for business continuity purposes.

    If it had come to transitioning services from the primary data center to a secondary location, it would have had an (acceptable) impact on our users. The biggest problem with failing over to secondary processing is that we also, at some point, have to transition back to primary processing. And transitioning back to primary systems is in many cases far harder than failing over to secondary ones.