Take a shot at some SOC Playbooks

egrizzly · May 2020

Just going by experience which steps do y'all think should go into these SOC playbooks? The use cases are from the top 10 you would usually find in an average-sized SOC.

#1 Abnormal number of failed login attempts.

#2 Abnormal Number of user accounts created.

#3 Abnormal Number of Distinct Emails Deleted.

#4 Abnormal Number of Distinct Emails Archived.

#5 Unauthorized user account creation.

#6 Robotic Pattern Observed - Failed Authentication

JDMurray · May 2020

The structure of your SOC playbooks will mirror the incident handling framework (e.g., NIST) used by your SOC. Your ticketing (or incident workflow) system will also mirror this same framework for case management and documentation. The framework will have the broad categories (e.g., detection, triage, analysis, containment, eradication, recovery, post-incident activities) that are used to create your playbook procedures and your ticketing and documentation systems structure.

Based on your six examples, you will be writing response procedures to service automated threat alerting from a SIEM or alerting events generated by a security appliance. The playbook will contain information on how to interpret the alert and its possible causes, what additional information to collect that was not provided by the alert (i.e. triage), how to correlate the information with other alerts and gather additional information (from both technology and humans) to give the analyst a better situational awareness of system and network activity at the time of the possible incident.

The actual playbook procedures depend greatly on the structure and content of the organization that the SOC monitors. Using #1 as an example, what system is generating the failed login attempt events (e.g., UNIX/Linux or Active Directory)? Is the system signalling an invalid login attempt also alerting that an invalid login threshold has been reached, or is a rule in a SIEM keeping track of this alerting threshold? What does your environment consider an "abnormal" number of login events? Does this threshold change for different systems in your organization? How many different types of systems will be generating these events?
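To make the threshold question concrete, here is a minimal sketch of a per-system "abnormal" check. The system names, the threshold numbers, and the `accounts_over_threshold` helper are all assumptions for illustration; in practice this logic usually lives in a SIEM rule rather than in a script.

```python
from collections import Counter

# Hypothetical per-system thresholds: failed logins per account within one
# counting window. The numbers are placeholders, not recommendations.
THRESHOLDS = {"active_directory": 10, "linux": 5}
DEFAULT_THRESHOLD = 8  # assumed fallback for systems not listed above


def accounts_over_threshold(events):
    """Return (system, account, count) for accounts above their system's threshold.

    `events` is an iterable of (system, account) tuples, one per failed login,
    already limited to a single time window.
    """
    counts = Counter(events)
    flagged = []
    for (system, account), count in sorted(counts.items()):
        limit = THRESHOLDS.get(system, DEFAULT_THRESHOLD)
        if count > limit:
            flagged.append((system, account, count))
    return flagged


# Six failures is abnormal for the Linux host here but not for Active Directory.
window = [("linux", "svc_backup")] * 6 + [("active_directory", "jdoe")] * 6
print(accounts_over_threshold(window))
```

The point of the sketch is only that the same raw count can be an alert on one system and noise on another, so the playbook has to say which threshold applies where.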

You must also consider how much work you are creating for your (few) SOC analysts. If failed login attempts are a common occurrence in your organization, and your playbook requires the SOC analyst to contact the human operator of a system to ask about the event, and this happens 20 times a day with a 99.9% non-malicious finding, you will quickly burn out your analysts unless you either tune your alerting threshold or create automation to handle the bulk of these alerts for them.
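A back-of-the-envelope sketch of that workload, using the 20 alerts a day and 99.9% non-malicious figures from the paragraph above. The 15 minutes of analyst time per alert is purely an assumption for illustration, as is the `yearly_workload` helper.

```python
ALERTS_PER_DAY = 20         # from the example above
NON_MALICIOUS_RATE = 0.999  # from the example above
MINUTES_PER_ALERT = 15      # assumption: analyst time spent on each alert


def yearly_workload(alerts_per_day, non_malicious_rate, minutes_per_alert, days=365):
    """Return (total alerts, expected malicious alerts, analyst hours) per year."""
    total = alerts_per_day * days
    malicious = total * (1 - non_malicious_rate)
    hours = total * minutes_per_alert / 60
    return total, malicious, hours


total, malicious, hours = yearly_workload(
    ALERTS_PER_DAY, NON_MALICIOUS_RATE, MINUTES_PER_ALERT
)
print(f"{total} alerts/year, about {malicious:.1f} malicious, {hours:.0f} analyst hours")
```

Under those assumptions the analysts work through thousands of alerts a year to find a handful of real ones, which is the argument for tuning or automating this alert.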

Finally, I would suggest keeping response procedures for alerts produced by technology (e.g., SIEM) and humans in one set of playbooks, and procedures for threat hunting in a separate set of playbooks. Many SOCs these days are trying both to work security alerts and to perform threat hunting; threat response and threat hunting are more different than they are similar.

egrizzly · May 2020

Great insights as always @JDMurray .... You provided a great overview that applies to all SOCs. Can you pls give it your best shot given your own experience as a security analyst.

JDMurray · May 2020

In a hypothetical organization the generic response procedures for a SOC analyst might be:

#1 Abnormal number of failed login attempts.

Contact the user of the account and inquire why this situation occurred.
If the user fails to respond, contact the user's manager.
If the user has no knowledge of this situation, contact the system's admin.
If the admin has no knowledge of this situation, engage the IR team to start an incident investigation.

#2 Abnormal Number of user accounts created.

Contact the admin(s) who have the privilege to perform this operation and inquire why this situation occurred.
If the admin fails to respond, contact the admin's manager.
If the admin has no knowledge of this situation, engage the IR team to start an incident investigation.

#3 Abnormal Number of Distinct Emails Deleted.

Contact the user/admin performing this operation and inquire why this situation occurred.
If the user/admin fails to respond, contact that person's manager.
If the user/admin has no knowledge of this situation, engage the IR team to start an incident investigation.

#4 Abnormal Number of Distinct Emails Archived.

Contact the user/admin performing this operation and inquire why this situation occurred.
If the user/admin fails to respond, contact that person's manager.
If the user/admin has no knowledge of this situation, engage the IR team to start an incident investigation.

#5 Unauthorized user account creation.

Contact the admin(s) who have the privilege to perform this operation and inquire why this situation occurred.
If the admin fails to respond, contact the admin's manager.
If the admin has no knowledge of this situation, engage the IR team to start an incident investigation.

#6 Robotic Pattern Observed - Failed Authentication

Contact the admin(s) of the system that is the source of this activity and inquire why this situation occurred (or is still occurring).
If the admin fails to respond, contact the admin's manager.
If the admin has no knowledge of this situation, engage NetOps to make a determination.
If NetOps has no knowledge of this situation, engage the IR team to start an incident investigation.
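The six procedures above share one shape: ask a contact, fall back to that contact's manager on no response, and move down the chain on "no knowledge" until the IR team is engaged. A minimal sketch of that shape as data follows. The `CONTACTS` table, the outcome strings, and `next_action` are hypothetical names, and applying the manager fallback to every contact in a chain (not only the first) is an assumption beyond what the steps state.

```python
FINAL_ESCALATION = "IR team"

# Ordered contacts per alert number, transcribed from the procedures above.
CONTACTS = {
    1: ("account user", "system admin"),
    2: ("privileged admin",),
    3: ("user/admin performing the operation",),
    4: ("user/admin performing the operation",),
    5: ("privileged admin",),
    6: ("source system admin", "NetOps"),
}


def next_action(alert_id, step, outcome):
    """Return the analyst's next action after contacting CONTACTS[alert_id][step].

    `outcome` is "no_response" or "no_knowledge" (hypothetical labels).
    """
    contacts = CONTACTS[alert_id]
    current = contacts[step]
    if outcome == "no_response":
        return f"contact the manager of: {current}"
    if outcome == "no_knowledge":
        if step + 1 < len(contacts):
            return f"contact: {contacts[step + 1]}"
        return f"engage: {FINAL_ESCALATION}"
    raise ValueError(f"unknown outcome: {outcome!r}")


print(next_action(6, 0, "no_knowledge"))
print(next_action(6, 1, "no_knowledge"))
```

Keeping the chains as data rather than prose makes it easier to see which alerts differ only in who gets asked first.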

You can see why being a SOC analyst is not a "sexy" job and most analysts move on to something more creative/challenging after a couple of years.

egrizzly · May 2020

Well, you gave it a shot @JDMurray .... All your steps are 100% what you do to rule out a false positive. I guess some SIEM queries would've been in order too but I guess those would fall into the start an investigation part of it.

JDMurray · May 2020

SIEM queries are in the Analysis stage of the initial investigation. You might also use SIEM queries to gather Triage information, but ideally your SIEM's content provides all of the preliminary triage information that the SOC analyst needs to jump right into the Analysis stage.

In your playbooks, it is important to help the analyst discern between "false positive" events and "misuse" and "Business As Usual" (BAU) activity. A false positive is the detection of a condition which causes a security alert but the condition never actually occurred, or the condition is not an actual security concern. False positives are therefore caused by an error in the logic of SIEM rules, misinterpretation of event information, incorrect threshold levels, etc. and can be "tuned-out" so as never to occur in the SIEM alerting.

Misuse is generally divided into three causes: accidental, negligent, or malicious activity. BAU is normal activity that, in some context, can be detected as misuse, such as typical system/network activity that occurs during a change control, or activity that occurs very infrequently, such as a batch job that is run only once a month (i.e., too infrequently to be remembered by behavioral security as a BAU activity). Documenting Lessons Learned is invaluable for preserving the tribal knowledge of such occurrences within the org that the SOC is monitoring.
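One way to make those distinctions usable is to record a disposition on every closed alert and then look at which rules mostly produce false positives. This is only a sketch: the `Disposition` names, the rule names, and the 80% cutoff in `tuning_candidates` are assumptions, not part of any particular SIEM or ticketing product.

```python
from collections import Counter
from enum import Enum


class Disposition(Enum):
    FALSE_POSITIVE = "false positive"
    MISUSE_ACCIDENTAL = "misuse (accidental)"
    MISUSE_NEGLIGENT = "misuse (negligent)"
    MISUSE_MALICIOUS = "misuse (malicious)"
    BAU = "business as usual"


def tuning_candidates(closed_alerts, min_share=0.8):
    """Return rule names whose closed alerts are mostly false positives.

    `closed_alerts` is an iterable of (rule_name, Disposition) pairs.
    `min_share` is an assumed cutoff for "mostly".
    """
    totals = Counter()
    false_positives = Counter()
    for rule, disposition in closed_alerts:
        totals[rule] += 1
        if disposition is Disposition.FALSE_POSITIVE:
            false_positives[rule] += 1
    return sorted(
        rule for rule in totals if false_positives[rule] / totals[rule] >= min_share
    )


closed = (
    [("failed_logins", Disposition.FALSE_POSITIVE)] * 9
    + [("failed_logins", Disposition.MISUSE_ACCIDENTAL)]
    + [("emails_deleted", Disposition.BAU)] * 3
)
print(tuning_candidates(closed))
```

Note that BAU findings are deliberately not counted as false positives here; they are the cases worth writing up as Lessons Learned rather than tuning away.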

LonerVamp · May 2020

Which playbooks you create and the steps in them will be determined by your capabilities. For the former, I'd create playbooks on the things that come in the most and that match any incident response function or policies or concerns you have. An early one many orgs do is handling phishing emails, since they happen daily. Another one is a ransomware playbook, because that's a huge IR and business concern.

If you're looking for the steps, those will be sort of the same all the time (detect, identify, contain, respond...) but the details will be different for every org depending on their tools and infrastructure.

For your early ones, I'd keep them useful and common and easy. I will tell you tracking down failed logins becomes a low priority pretty quickly once you've gone through 1 that is untraceable without huge effort or you go through 3 service accounts that keep failing and SMEs don't know wtf they're doing to stop it.

Now, in your list, there is a nuance here. You can also put effort into identifying which security events are important to you. Then go through the process to figure out if you can gather the information to create that event (usually logs into a SIEM and then rules), then figure out what authority you have to fix any events or qualify them into true incidents, and then...well...actually ask if you get true security value for the effort or if it is just busywork. For instance, external port scans against the firewall can seem important to know, and maybe they are depending on your risk profile. But for most, it's a non-starter and a waste of time.

And some of them can be a beast to track down, especially if you have poor internal controls, poor change management, and poor documentation of proper process. For instance, I like the idea that you are tracking down unauthorized user account creation...but how do you know what is authorized or not? It might be a better use of effort to match account creations up against ITSM tickets or change requests, and anything outside of that intersection becomes an event.
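The matching idea can be sketched as a simple set comparison. Everything here is hypothetical: the account names, the shape of the inputs, and the `unmatched_account_creations` helper. A real version would pull creations from directory logs and approvals from whatever ITSM tool the org uses.

```python
def unmatched_account_creations(created, approved_accounts):
    """Return (account, creator) pairs that have no matching approved request.

    `created` is an iterable of (account, creator) pairs from account-creation
    logs; `approved_accounts` holds account names named in approved tickets.
    Names are compared case-insensitively (an assumption about the data).
    """
    approved = {name.lower() for name in approved_accounts}
    return [
        (account, creator)
        for account, creator in created
        if account.lower() not in approved
    ]


created = [("jsmith", "admin_a"), ("tmp_svc01", "admin_b")]
approved = {"JSmith"}
print(unmatched_account_creations(created, approved))
```

Anything this returns is only an event to look into, not proof of unauthorized activity; a missing ticket is often a process gap rather than an attacker.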

Good efforts, and welcome to the rabbit hole.
