Uncomplicating Cloud Security - Incident Response (Part 6)

This article is the capstone of a six-part endeavor to condense the security pillar of the AWS Well-Architected Framework. We have covered Security Foundations, IAM, Detection, Infrastructure Protection, and Data Protection, and incident response is the missing piece of the puzzle. It goes to show just what an amazing body of work the AWS framework is that it took over 12,000 words to make a “summary” in the hopes of providing useful and actionable advice. Hopefully, it goes without saying that by no means do I propose this series of articles as any sort of substitute for the official, full-length AWS documentation. These blog posts are simply the words and thoughts of your humble correspondent over at Tailwarden.

Purplin in a well protected cloud environment

The importance of the different elements of cloud security has no correlation with the order in which they are organized in the Well-Architected Framework. They are topics that combine to give the cloud professional a holistic and comprehensive understanding of how cloud security should be approached.

It’s also worth pointing out that an incident can originate from many sources. From a business value point of view, your organization should be as well equipped to react to an external attack as to an internal change or blunder. What matters is how customers are affected and how effective your team is at keeping the SLO within a comfortable range. Having said that, let’s jump into the topic at hand.

What’s Incident Response?

Incident response is the process of preparing for, identifying, triaging, and responding to incidents that could compromise the security of an organization's systems and data. A number of frameworks, such as the NIST Computer Security Incident Handling Guide, can be used to prepare and inform your incident response approach, but additional considerations apply when responding to incidents in the cloud. Incidents include security breaches, data breaches, network outages, system failures, and other disruptions to an organization's IT infrastructure.

The goal of incident response is to minimize the impact of incidents on the organization's business operations and protect its systems and data from further harm. To achieve this goal, organizations typically have an incident response plan in place outlining the steps to take in the event of an incident.

Typical steps in an incident response plan include:

  1. Preparation: This involves creating an incident response team, defining roles and responsibilities, establishing procedures and protocols, and conducting training and drills.
  2. Identification: This involves detecting and identifying an incident as soon as possible, determining its scope and impact, and activating the incident response team. Tools such as Amazon GuardDuty help detect threats and malicious activity. AWS WAF is also an effective managed service for protecting web applications and environments.
  3. Triage: This involves prioritizing the response to the incident based on the severity of the impact and the likelihood of success in mitigating it.
  4. Response: This involves taking action to contain the incident, minimize its impact, and prevent it from spreading. This may include isolating affected systems, restoring data or services, and implementing security measures to prevent future incidents. Setting up incident response calls or bridges with the incident response team is crucial. Additionally, create a designated chat room, in Slack for example, in which the incident resolution communication is documented. This will serve as an organized record that easily shows the sequence of events. There are even Slack tools that make incident response management easier, such as FireHydrant.
  5. Recovery: This involves returning affected systems and services to normal operation and restoring any lost data from backups. Check out our last article, which covered data backup strategies.
  6. Review: This involves reviewing the incident response process, identifying areas for improvement, and updating the incident response plan as needed. Post-mortems are extremely helpful review tools. The difference between simply bouncing back from an attack or outage and actually building resilience is the capacity an organization has to understand and learn from errors or mistakes. If you are new to post-mortem activities, grab this template from Atlassian or this one from PagerDuty.
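The triage step above can be made concrete with a small scoring rule. The following is only a minimal sketch: the `Incident` class, the 1-to-5 impact scale, and the "impact times likelihood of successful mitigation" formula are all illustrative assumptions, not something prescribed by AWS or NIST. Your own plan should define its own scale.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    name: str
    impact: int                   # 1 (minor) to 5 (severe); hypothetical scale
    mitigation_likelihood: float  # 0.0 to 1.0; estimated chance the response succeeds

    def priority(self) -> float:
        # Assumed scoring rule: weigh impact by how likely mitigation is to succeed.
        return self.impact * self.mitigation_likelihood


def triage(incidents):
    """Return incidents ordered from highest to lowest priority."""
    return sorted(incidents, key=lambda incident: incident.priority(), reverse=True)


# Hypothetical open incidents, used only to show the ordering.
queue = triage([
    Incident("login page outage", impact=5, mitigation_likelihood=0.8),
    Incident("stale dashboard data", impact=2, mitigation_likelihood=0.9),
    Incident("suspicious API calls", impact=4, mitigation_likelihood=0.5),
])

for incident in queue:
    print(f"{incident.priority():.1f}  {incident.name}")
```

A rule this simple will not fit every team, but writing it down forces the conversation about what "severe" means before the alarms start ringing.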

Additionally, it’s important to couple these steps with automated processes and to use methods of redeployment aligned with your level of expertise and tech stack. By understanding the cloud and how your application is built, you will be in the best position to know where events and data will need to be acted upon in the case of an incident. It’s important to place these incident response steps in the context of your team's awareness, and your own, of the cloud in general and your environment in particular. It’s only when the terrain is familiar that the steps can be effective. An analogy that comes to mind is a holiday in Japan: you take the subway to downtown Tokyo, but suddenly the train stops, alarms start ringing, and people start frantically heading for the exits. Assuming you are not a Japanese speaker, you won’t be able to understand the instructions bellowing from the megaphones, and you will quickly find yourself confused, lost, and in need of help.

AI representation of the analogy above

What’s the plan?

Developing an incident management plan is crucial because it helps organizations prepare for and respond to incidents that could compromise the security of their systems and data in a quick, systematic, and hopefully rehearsed way. The correct assumption is that you will more than likely face an incident in your environment at some point, probably sooner than you think. Nobody is safe.

Having an incident management plan in place helps minimize the impact of incidents on business operations and protects systems and data from further harm. A well-developed incident management plan should include:

  1. A clear definition of what constitutes an incident.
  2. A process for identifying and triaging incidents.
  3. Procedures for responding to incidents, including steps for containing and mitigating the incident.
  4. Recovery procedures for returning affected systems and services to normal operation.
  5. A process for reviewing and improving the incident management plan.

Developing an incident management plan requires organizations to consider the types of incidents that could occur, the potential impact on their business operations, and the resources needed to effectively respond to these incidents. It also requires organizations to establish an incident management team, define roles and responsibilities, and conduct training and drills to ensure that team members are prepared to respond to incidents.
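One lightweight way to keep a plan honest is to store it as structured data and check it against the five elements listed above. This is a minimal sketch under stated assumptions: the section keys in `REQUIRED_SECTIONS` and the sample `draft_plan` are hypothetical names made up for illustration, not a standard schema.

```python
# Hypothetical keys, one per element of a well-developed plan.
REQUIRED_SECTIONS = (
    "incident_definition",
    "identification_and_triage",
    "response_procedures",
    "recovery_procedures",
    "review_process",
)


def missing_sections(plan: dict) -> list:
    """Return the required sections that are absent or empty in the plan."""
    return [
        section
        for section in REQUIRED_SECTIONS
        if not str(plan.get(section, "")).strip()
    ]


# Illustrative draft: recovery is left blank and the review process is missing.
draft_plan = {
    "incident_definition": "Any event that degrades customer-facing services or exposes data.",
    "identification_and_triage": "On-call engineer validates alerts and assigns a severity.",
    "response_procedures": "Isolate affected systems, open an incident channel, notify the lead.",
    "recovery_procedures": "",
}

print(missing_sections(draft_plan))  # ['recovery_procedures', 'review_process']
```

A check like this could run in CI against the plan document, so that gaps surface during a quiet week rather than during an outage.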

Scenario

Most of us have been there, but if you haven’t, this is what it’s like ⬇️

Incident response comic strip

Let’s break down the steps the engineers in the comic above went through. The story revolves around a hypothetical situation where a company's web server is compromised by a malicious actor:

  1. Identification: The incident is identified when a security analyst receives an alert from a detection or monitoring service such as Amazon GuardDuty or Amazon CloudWatch indicating that the web server has been compromised. The analyst verifies the alert and determines that a malicious actor has gained unauthorized access and is attempting to exfiltrate sensitive data.
  2. Containment: The analyst immediately shuts down the web server to prevent the attacker from exfiltrating more data and to prevent the attacker from further compromising the network. The analyst also isolates the web server from the rest of the network to prevent the attacker from moving laterally within the network.
  3. Eradication: The analyst works to remove the cause of the incident by identifying the vulnerability that was exploited by the attacker. The analyst then patches the vulnerability on the web server and scans the server for any malware that may have been installed by the attacker. Any malicious files or processes are removed.
  4. Recovery: The analyst restores the web server from a known good backup and brings the server back online. The analyst also performs a thorough security review of the web server to ensure that it is properly configured and secured.
  5. Lessons learned: The incident is reviewed to identify what went well and what could be improved in the incident response process. The analyst makes recommendations for improvements such as implementing two-factor authentication for remote access, regularly reviewing logs, and hardening the web server.
  6. Communication: The incident is communicated to management and relevant stakeholders, including the incident's cause, impact, and status. The incident is also reported to law enforcement and, if the company has one, to the incident response team.
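The scenario above can be turned into a rough runbook generator. What follows is a sketch, not a definitive implementation. The `sample_finding` dictionary is invented and only loosely shaped like a GuardDuty finding, so treat its field names as assumptions. The severity bands reflect my understanding of GuardDuty's documented ranges and should be verified against the current documentation. No AWS calls are made; the function only builds a checklist.

```python
def severity_band(score: float) -> str:
    """Map a numeric severity to a label (bands are an assumption, verify in the docs)."""
    if score >= 9.0:
        return "critical"
    if score >= 7.0:
        return "high"
    if score >= 4.0:
        return "medium"
    return "low"


def response_steps(finding: dict) -> list:
    """Build an ordered checklist mirroring the six steps of the scenario."""
    instance_id = finding["Resource"]["InstanceDetails"]["InstanceId"]
    band = severity_band(finding["Severity"])
    steps = [f"identify: verify {finding['Type']} on {instance_id} (severity: {band})"]
    if band in ("high", "critical"):
        steps.append(f"contain: stop {instance_id} and isolate it from the network")
    else:
        steps.append(f"contain: keep {instance_id} running under close monitoring")
    steps += [
        "eradicate: patch the exploited vulnerability and remove malware",
        "recover: restore from a known good backup and review the configuration",
        "lessons learned: record what went well and what to improve",
        "communicate: brief management and relevant stakeholders",
    ]
    return steps


# Hypothetical finding; the instance ID and values are made up for illustration.
sample_finding = {
    "Type": "Trojan:EC2/DNSDataExfiltration",
    "Severity": 8.0,
    "Resource": {"InstanceDetails": {"InstanceId": "i-0123456789abcdef0"}},
}

steps = response_steps(sample_finding)
for number, step in enumerate(steps, start=1):
    print(f"{number}. {step}")
```

In a real environment the containment line would typically be backed by automation, for example swapping the instance into a restrictive security group, but that depends entirely on your own setup and is left out here.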

Can’t be ready for game day without training

Training your team in incident response methods is essential to ensure that they are prepared to effectively respond to incidents that could compromise the security of your systems and data. Without proper training, team members may not know what to do or how to proceed during an incident, leading to a slower response and longer downtime. This can result in a greater impact on your organization's business operations and a higher risk of financial loss.

One way of training might involve junior members shadowing more experienced team members as they rehearse what to do if a certain incident takes place. Comprehensive access to well-documented post-mortems on previous incidents makes a great bundle of required reading that can be woven into the onboarding process of new team members. An emphasis on senior-to-junior knowledge transfer sessions in the form of rehearsals or pre-mortem sessions can be hugely beneficial.

Take a deep breath, now speak

Effective communication is also crucial during an incident response situation. Without proper communication, team members may be unsure of their roles and responsibilities, leading to confusion and misunderstandings. This can hinder the response and prolong downtime. It is important to establish clear lines of communication within the incident response team, as well as with other stakeholders (e.g., customers, partners, regulators) to ensure that everyone is informed and aware of the situation.

At a previous company, I was most positively influenced by a senior team member who kept his cool and professionalism during a particularly hairy outage that was directly affecting the login page for a large sub-section of the customer base, effectively rendering the platform inaccessible to them. He impressively kept his cool even when the AWS support team was pulled in, was equally baffled by the peculiar networking anomaly, and needed time to come up with a solution. At all times the team lead was not only calm but had the awareness to hear everybody's opinion, and he included the whole troubleshooting party by routinely reciting out loud what had been tried, what the running hypothesis was, and what the possible ways forward might be. It was through this constant rehashing of the issue out loud that someone finally voiced something along the lines of “Have we tried looking at route x on Internet gateway y?”, which ended up pointing us in the direction of a positive outcome.
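The habit described above (restating what has been tried, the running hypothesis, and the possible ways forward) also works in written form in the incident chat room. Here is a minimal sketch of such a running log. The `IncidentLog` class and its three entry kinds are hypothetical, and the sample entries are invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

ENTRY_KINDS = ("tried", "hypothesis", "next")  # assumed categories


@dataclass
class IncidentLog:
    title: str
    entries: list = field(default_factory=list)

    def record(self, kind: str, text: str) -> None:
        """Append a timestamped entry so the sequence of events is preserved."""
        if kind not in ENTRY_KINDS:
            raise ValueError(f"unknown entry kind: {kind}")
        self.entries.append((datetime.now(timezone.utc), kind, text))

    def _texts(self, kind: str) -> list:
        return [text for _, entry_kind, text in self.entries if entry_kind == kind]

    def summary(self) -> str:
        """Recap everything tried, the latest hypothesis, and the open next steps."""
        hypotheses = self._texts("hypothesis")
        return "\n".join([
            f"Incident: {self.title}",
            "Tried: " + ("; ".join(self._texts("tried")) or "nothing yet"),
            "Hypothesis: " + (hypotheses[-1] if hypotheses else "none yet"),
            "Next: " + ("; ".join(self._texts("next")) or "to be decided"),
        ])


# Invented entries, loosely based on the outage described above.
log = IncidentLog("login page unreachable")
log.record("tried", "restarted the load balancer targets")
log.record("hypothesis", "a route on the internet gateway is misconfigured")
log.record("next", "inspect the route tables")

print(log.summary())
```

Posting the summary at regular intervals gives late joiners the full picture and doubles as raw material for the post-mortem.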

Of course, nobody goes to bed or wakes up hoping to have to scrap the open tasks and tackle a live incident. Those of you who have spent time on an on-call rotation know this all too well. But once you accept that incidents are a matter of time, you open the door to proper preparation and planning. When you surround yourself with a well-synchronized and trained team and implement thoughtful, environment-specific security automation or deployment rollbacks, you learn the language of incident response, if you will, and when the time comes you can face the issue like the chap below.

Whether you are a developer, DevOps engineer, or cloud engineer, dealing with the cloud can be tough at times, especially on your own. If you are using Tailwarden or Komiser and want to share your thoughts, doubts, and insights with other cloud practitioners, feel free to join our Tailwarden Discord server, where you will find tips, community calls, and much more.