How To Run Cyber Incident Response Simulations (Game Days) in AWS

Posted by Ben Potter on Tuesday, December 8, 2020


The best way to learn is hands-on, and the best way to get better at something is to practice. When I speak to security teams I like to ask when they last practiced responding to a cyber security incident, and the answer is normally “we’re planning to run one”. The more time you spend on planning, the less precious time you have to practice and learn! An incident response simulation or game day can take anywhere from an hour to multiple days; the amount of time you put into it is up to you.

Why Practice Incident Response?

Running game days is a Well-Architected best practice, and you may also be required by a regulator or even a cyber insurance policy to practice your response to incidents. You should approach a cyber security incident as seriously as you would a physical one, and in many cases more seriously. A cyber security incident could be anything from an employee clicking on an email and getting infected with malware, all the way to an organisation-ending event where you lose control over all systems and data. When your business uses cloud computing, it’s very important that you follow recommendations and best practices from your cloud provider to properly secure your accounts. For example, if you inadvertently gave someone your login credentials to a cloud account, then whoever has those credentials can do the same things you can, like deleting all data or closing the account. With traditional IT you might have set a burglar alarm on your server room or data centre; in the cloud you can do similar things that make the cloud more secure than your old server room. You can learn from Code Spaces about how someone took over their cloud account and deleted all their data. They went on to say they wished they had a runbook on how to do a lockdown, and in this blog post you will learn some steps to improve your responses to a cloud-based incident.

Militaries across the globe run simulations, also known as exercises or drills, to test theories and processes, and for training. We can take some approaches and learnings from the military and apply them to cyber security. Practicing response through running simulations helps you:

  • Create and maintain plans in the form of runbooks, playbooks, escalation, and others
  • Identify who can help you from your team, others in your organization, and external specialists
  • Identify and test tools to investigate, contain, eradicate, and recover
  • Train and coach team members
  • Encourage cross-team collaboration
  • Meet regulatory or other requirements
  • Build upon problem solving skills


1. Select a Scenario

Select a scenario that could become real for your environment, or has impacted you in the past. For example, if you have an internet-facing web application then a DDoS attack could be a relevant starting scenario. You could also base it on emerging threats such as ransomware. Each time you run a simulation you can pick a new scenario, or choose to re-run an existing one if you want to hone your skills and improve. If you’re stuck for ideas you can select an Amazon GuardDuty finding that you may have seen before or that is most relevant to you.

2. Select a Format

You need to decide if you are going to run the simulation as a workshop (often called a table top exercise), where you discuss and talk through the hypothetical situation, or if you are going to do a real simulation on a non-production environment. If it’s your first simulation then starting with a workshop format without getting hands-on is highly recommended. You’ll be able to discuss with each other, whiteboard, and collaborate to identify many gaps and vulnerabilities or areas where you need more visibility. For example, do you have detection if resources are created in AWS regions which you do not use? The workshop helps you get started quickly instead of playing with tools; tools are cool, but not much use if you don’t have a process to use them.
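The unused-region question above is one you can turn into a concrete check. The following is a minimal sketch, assuming CloudTrail records have already been exported and loaded as dictionaries with the standard `awsRegion`, `eventName`, and `eventSource` fields. The `APPROVED_REGIONS` allow-list and the idea that event names starting with `Create` or `Run` indicate resource creation are illustrative assumptions, not an AWS-defined rule.

```python
# Hypothetical allow-list: replace with the regions your organization actually uses.
APPROVED_REGIONS = {"ap-southeast-2", "us-east-1"}

# Assumption: API calls whose names start with these prefixes create resources.
CREATE_PREFIXES = ("Create", "Run")


def find_unexpected_region_activity(records, approved_regions=APPROVED_REGIONS):
    """Return CloudTrail records that create resources outside approved regions."""
    findings = []
    for record in records:
        region = record.get("awsRegion")
        event_name = record.get("eventName", "")
        outside = region not in approved_regions
        if outside and event_name.startswith(CREATE_PREFIXES):
            findings.append(record)
    return findings
```

In a workshop, a check like this is mostly a discussion prompt: who would see the result, and what would they do next?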

3. Approvals & Collaboration

Depending on the structure of your organization, let management and relevant teams know what you’re planning, and verify that you can run a simulation. Use this as an opportunity to socialize the scenario and the idea of running a simulation with other teams and individuals. This will help you work out who should participate. Remember to treat it as a learning and collaboration exercise where everyone helps each other.

When running a real simulation on AWS you need to review the AWS Customer Support Policy for Penetration Testing. You need to be aware of what is allowed if your simulation involves testing activities. There is also a simulated events form if you are planning activities which include additional services.

Permitted services:

  • Amazon EC2 instances, NAT Gateways, and Elastic Load Balancers
  • Amazon RDS
  • Amazon CloudFront
  • Amazon Aurora
  • Amazon API Gateways
  • AWS Lambda and Lambda Edge functions
  • Amazon Lightsail resources
  • Amazon Elastic Beanstalk environments

Prohibited Activities:

  • DNS zone walking via Amazon Route 53 Hosted Zones
  • Denial of Service (DoS), Distributed Denial of Service (DDoS), Simulated DoS, Simulated DDoS (These are subject to the DDoS Simulation Testing policy)
  • Port flooding
  • Protocol flooding
  • Request flooding (login request flooding, API request flooding)

4. Define Goals & Scope

Define the goals you want to achieve in the simulation. If you are doing your first workshop style simulation then your goals may be to simply iterate on your draft runbook, and identify technology and skills gaps for next time. Your scope should include the components in your system that are in the simulation, and how you propose to simulate to meet the objectives. Unless you are performing a specific targeted test of your production environment, minimize the risk of impact on production by performing simulations in non-production environments.

5. Draft a Runbook

The definition of a runbook is sometimes confused with that of a playbook, and both are interpreted in different ways. Essentially, runbooks have a known outcome, e.g. investigate which user logged in and you will find out just that. Playbooks are processes with branching paths and no predefined destination, e.g. if a user does this then you’ll need to check these things, and depending on what they’ve done you’ll need to dive in deeper. The runbook should start simple and have many iterations, and be easily accessible and consumable in a hurry. If you’re creating a runbook for an environment you’re looking after and know very well, think about how you can make it easy for someone without your knowledge to follow. The worst time to work out what you need to do is during an event when everyone is under pressure, and probably not thinking clearly.

There are a few AWS sample runbooks that you can use as a start that cover DDoS, S3 and credentials. You can also create your own starting with the following sections:

Incident Type
Incident Handling
 - 1: Acquire, Preserve, Document Evidence
 - 2: Contain the Incident
 - 3: Eradicate the Incident
 - 4: Recover from the Incident
 - 5: Post-Incident Activity
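If you prefer to keep the draft in a structured form, the outline above can be sketched as a small data structure. This is only an illustration: the `Runbook` class, its method names, and the rendered layout are hypothetical, and the step titles simply follow the sections listed above.

```python
from dataclasses import dataclass, field

# Step titles follow the outline above.
HANDLING_STEPS = [
    "Acquire, Preserve, Document Evidence",
    "Contain the Incident",
    "Eradicate the Incident",
    "Recover from the Incident",
    "Post-Incident Activity",
]


@dataclass
class Runbook:
    incident_type: str
    # Maps each handling step to the actions drafted for it so far.
    steps: dict = field(default_factory=lambda: {name: [] for name in HANDLING_STEPS})

    def add_action(self, step, action):
        if step not in self.steps:
            raise KeyError(f"Unknown step: {step}")
        self.steps[step].append(action)

    def gaps(self):
        """Steps with no actions yet; useful input for the next iteration."""
        return [name for name, actions in self.steps.items() if not actions]

    def render(self):
        """Render the runbook as plain text in the same shape as the outline."""
        lines = [f"Incident Type: {self.incident_type}", "Incident Handling"]
        for number, (name, actions) in enumerate(self.steps.items(), start=1):
            lines.append(f" - {number}: {name}")
            lines.extend(f"     * {action}" for action in actions)
        return "\n".join(lines)
```

Calling `gaps()` after a workshop gives a quick list of steps that still have no agreed actions, which fits the idea of starting simple and iterating.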

6. Identify Mechanisms & Tools

Identify the mechanisms and tools to help with each stage of your runbook. Start with how you’re going to detect the incident; for example, a detective control like Amazon GuardDuty, which is a threat detection service, could be the source of your scenario. In AWS, a core skill that you need to be familiar with is querying logs of AWS API actions using AWS CloudTrail. I created a Well-Architected Lab with Byron Pogson on incident response for the IAM service that uses Jupyter notebooks to create a fusion of runbooks with text instructions, and code that can investigate and contain an incident.
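As an illustration of querying API activity, here is a minimal sketch built on the CloudTrail `lookup_events` call. It is not the code from the lab mentioned above. The function name and the shape of the returned dictionaries are my own assumptions, and the client is passed in so that the sketch can be exercised without AWS credentials; in practice it would be something like `boto3.client("cloudtrail")`.

```python
import json


def lookup_user_activity(cloudtrail_client, username, start_time, end_time):
    """Collect CloudTrail events recorded for one username in a time window."""
    events = []
    kwargs = {
        "LookupAttributes": [{"AttributeKey": "Username", "AttributeValue": username}],
        "StartTime": start_time,
        "EndTime": end_time,
    }
    while True:
        page = cloudtrail_client.lookup_events(**kwargs)
        for event in page.get("Events", []):
            # CloudTrailEvent holds the full record as a JSON string.
            detail = json.loads(event["CloudTrailEvent"])
            events.append({
                "time": event["EventTime"],
                "name": event["EventName"],
                "source_ip": detail.get("sourceIPAddress"),
                "region": detail.get("awsRegion"),
            })
        token = page.get("NextToken")
        if not token:
            return events
        kwargs["NextToken"] = token  # fetch the next page of results
```

Passing the client as an argument also makes it easy to rehearse the investigation step in a workshop with canned data before anyone touches a real account.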

7. Schedule a Time

Schedule a time in advance that is most suitable for everyone you need involved. If you start with a workshop format and allow a couple of hours you’ll feel like you’ve achieved more as the discussion flows quickly. Allow time for breaks, and ensure you agree on an end time, allowing for discussion straight after, and even a retrospective, which I have described in the last section. If you have run a few simulations in the past and are doing it hands-on, you will need to allocate more time, especially if it’s a complex scenario and multiple teams are involved. Depending on your organization and working arrangements, it can be a fun team get-together at the end of the week; treat it like a collaborative learning and semi-social event.

Running the Simulation

At the time you scheduled, and with the materials you prepared during planning, it’s time to have fun! If you have someone experienced in running simulations you could use them as a supervisor or coach. You’ll also want to have a note taker, even someone outside of your team, to take notes on the timing, steps you’ve taken, and general observations. It’s also good to have an observer to capture insights into what the participants did and the challenges they faced, which can feed into your retrospective, especially if you’re running a real simulation. You should iterate on your runbook as you go, otherwise you might forget what those important bits were! Take note of what you spent the most time on, the challenges, and think about what would have happened if it was real.
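A note taker can capture timing with nothing more than a shared document, but if you want the numbers afterwards, a small helper is enough. The sketch below is a hypothetical example (the `SimulationLog` class and its methods are not from any AWS tool). It assumes time spent on a step is measured from one note to the next, so the last note only marks the end of the exercise.

```python
from datetime import datetime, timezone


class SimulationLog:
    """Timestamped notes for a game day, kept by the note taker."""

    def __init__(self):
        self.entries = []  # list of (timestamp, step, note) tuples

    def note(self, step, text, when=None):
        """Record a note; 'when' defaults to the current UTC time."""
        when = when or datetime.now(timezone.utc)
        self.entries.append((when, step, text))

    def time_per_step(self):
        """Approximate seconds per step, measured between consecutive notes."""
        totals = {}
        ordered = sorted(self.entries, key=lambda entry: entry[0])
        for (start, step, _), (end, _, _) in zip(ordered, ordered[1:]):
            totals[step] = totals.get(step, 0.0) + (end - start).total_seconds()
        return totals

    def slowest_step(self):
        """Step with the most accumulated time, or None if there is no data."""
        totals = self.time_per_step()
        return max(totals, key=totals.get) if totals else None
```

The output of `time_per_step()` is a useful starting point for the retrospective, alongside the observer's notes on why a step took so long.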

Iterate & Share

Straight after your simulation, have a short retrospective to share what worked well, areas for improvement, and actions. If you’re in an organization with multiple teams, write a short email report to share the scenario, what you learnt, and the retrospective. Often in a real-life event, not everyone will be available, so if there is an individual who was key to your simulation, you could make them an observer next time and test what everyone learnt from them. Share the runbooks somewhere central so different teams can collaborate and save time by re-using them. Don’t forget to set a time aside for your next one!

The next step you can take after you have practiced a number of simulations is to investigate how you could achieve the Well-Architected best practice “Automate containment capability”. Using automation to contain, and even recover from, an incident will save you time and reduce human error during an incident.
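To make the idea of automated containment concrete, here is one possible sketch for a leaked-credentials scenario: deactivating the access keys of a single IAM user. It uses the IAM `list_access_keys` and `update_access_key` calls. The function name and the `dry_run` flag are my own assumptions, the client is passed in (for example `boto3.client("iam")`), and this covers only one containment action, not a complete response.

```python
def deactivate_access_keys(iam_client, username, dry_run=True):
    """Mark every active access key of one IAM user as Inactive.

    Returns the key IDs that were (or, in dry-run mode, would be) deactivated.
    """
    response = iam_client.list_access_keys(UserName=username)
    affected = []
    for key in response["AccessKeyMetadata"]:
        if key["Status"] != "Active":
            continue  # already inactive, nothing to contain
        affected.append(key["AccessKeyId"])
        if not dry_run:
            iam_client.update_access_key(
                UserName=username,
                AccessKeyId=key["AccessKeyId"],
                Status="Inactive",
            )
    return affected
```

Defaulting to a dry run is a deliberate choice for game days: you can rehearse the step and review what it would change before letting it act in a non-production account.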

Thank you to Brian Carlson for your help on this!

Further Reading

AWS Well-Architected - Incident Response
AWS Incident Response Guide
Incident Response Playbook with Jupyter - AWS IAM
Orchestrating a security incident response with AWS Step Functions