Virtually all organizations running mature information security programs have accepted that attacks are a matter of when, not if. As such, detecting and responding to threats has emerged as arguably the most critical domain of cyber risk.
The head of the team responsible for addressing malicious activity at Google wouldn’t disagree. Charged with leading the Security Surveillance Team, Tim Nguyen oversees the defense of Google’s massive attack surface, which spans countless entry points and 150,000 employees working across every conceivable combination of operating system, operating model, and production server.
During a recent appearance on the Cloud Security Podcast by Google, Nguyen pulled back the curtain for a candid and rare peek into the inner workings of one of the planet’s most important detection and response teams. The wide-ranging interview covered an array of topics, from toil to metrics, and featured a slew of insights and perspectives that your security operations team may find immediately valuable. Here are a few of those highlights.
“It’s literally impossible to do our mission without automation.”
Given the massive infrastructure under SST’s watch, building automation into everything it does was part of a deliberate effort to avoid redundancy and ensure humans, who Nguyen describes as “uniquely good at nuance,” are able to make decisions as quickly as possible when a threat arises.
Automation is so core to SST’s work that Nguyen likes to joke the “half-life” of his team wanting to automate something is 30 minutes. “When you find yourself doing something over and over again, that should make you think, ‘Hey we should automate this,’” he said.
Principles of site reliability engineering, which have long been ingrained in Google’s software backbone, are a common theme across SST’s work. The team uses common developer frameworks and enterprise tooling built by Google, including BigQuery, to ingest indicator-of-compromise inputs such as IPs, domains, and hashes. This automation provides vital enrichment and context that would be impossible to achieve manually.
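The enrichment idea can be illustrated with a minimal sketch. Everything here is an assumption for illustration: in practice SST feeds indicators into BigQuery-backed pipelines, whereas this example stands in a local dictionary for that threat-intelligence table, and the `Event` and `enrich` names are hypothetical.

```python
from dataclasses import dataclass, field

# Local stand-in for a threat-intelligence table; real pipelines would
# query a store such as BigQuery for this context.
THREAT_INTEL = {
    "203.0.113.7": {"type": "ip", "verdict": "known_c2"},
    "evil.example.com": {"type": "domain", "verdict": "phishing"},
}

@dataclass
class Event:
    indicator: str
    context: dict = field(default_factory=dict)

def enrich(event: Event) -> Event:
    """Attach threat-intel context so a human sees it at triage time."""
    intel = THREAT_INTEL.get(event.indicator)
    event.context["intel"] = intel if intel else {"verdict": "unknown"}
    return event

print(enrich(Event("203.0.113.7")).context["intel"]["verdict"])  # known_c2
```

The point is not the lookup itself but that enrichment happens before a human ever opens the case, so the analyst starts with context rather than raw indicators.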
“All of our security engineers need to know how to code, period.”
Of course, to build all the automation that SST relies on, the team must be staffed with engineers replete with robust coding skills, which allow members to be competent in a wide range of responsibilities, including threat modeling, log acquisition, data modeling, signals development, analysis automation and triage/IR. (SST follows the SRE model where its personnel spend 40% of time on operations, 40% on engineering and 20% on various projects.)
Coding skills also enable engineers to share knowledge with each other and ensure clean handoffs from shift to shift. By encoding analytical steps and logic in software, the team makes it possible for others to pick up tasks efficiently.
Between 95% and 97% of SST’s events are fed through automation, usually resulting in a fast triage via the assignment of risk scores. But Nguyen admitted that there is still ample room for human expertise to run deeper looks and follow up on hunches, “as long as it is building to something larger and adding value.”
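A risk-scored auto-triage step might look something like the sketch below. The rule names, weights, and threshold are all invented for illustration; the source only tells us that automation assigns risk scores and routes most events without human involvement.

```python
# Hypothetical signal weights and escalation threshold -- not Google's values.
RULES = {
    "known_bad_hash": 80,
    "new_admin_grant": 40,
    "off_hours_login": 15,
}
ESCALATE_AT = 50

def risk_score(signals):
    """Sum the weights of the signals observed on an event."""
    return sum(RULES.get(s, 0) for s in signals)

def triage(signals):
    """Auto-close low-risk events; route high-risk ones to a human."""
    score = risk_score(signals)
    action = "escalate_to_human" if score >= ESCALATE_AT else "auto_close"
    return action, score

print(triage(["off_hours_login"]))                    # ('auto_close', 15)
print(triage(["known_bad_hash", "off_hours_login"]))  # ('escalate_to_human', 95)
```

Because the scoring logic lives in code rather than in any one analyst’s head, it is reviewable, testable, and consistent across shifts, which is exactly the handoff benefit Nguyen describes.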
“It’s OK for them to take the right amount of time to investigate.”
Automation has clear fiscal benefits. By replicating pieces of logic and presenting information to humans, SST can process many more tickets, driving down cost-per-ticket and cost-per-event and driving up the number of events the team can process. But which metric does Nguyen value above all else?
Nguyen again invoked SRE: service-level objectives specify a target level for the reliability of a service. In SST’s case, the objective is reducing time to triage, and by association dwell time, to as close to zero as possible. This is the primary metric by which the team judges its success.
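Measuring that objective can be sketched as an SLO attainment check: the fraction of events triaged within a target window. The 15-minute target and the sample timestamps below are illustrative assumptions, not Google’s actual figures.

```python
from datetime import datetime, timedelta

TARGET = timedelta(minutes=15)  # assumed SLO target, for illustration only

def slo_attainment(events):
    """events: list of (detected_at, triaged_at) datetime pairs.
    Returns the fraction triaged within the target window."""
    within = sum(1 for detected, triaged in events if triaged - detected <= TARGET)
    return within / len(events)

t0 = datetime(2024, 1, 1, 9, 0)
events = [
    (t0, t0 + timedelta(minutes=5)),
    (t0, t0 + timedelta(minutes=12)),
    (t0, t0 + timedelta(minutes=40)),  # misses the target
]
print(f"{slo_attainment(events):.0%}")  # 67%
```

Tracking attainment against a fixed target, rather than raw averages, is the standard SRE move: it surfaces the tail of slow cases instead of letting a few fast ones mask them.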
But how does SST ensure that its ability to identify and remediate threats, from malware introduction to physical attacks, is not only fast but also done well?
Nguyen readily admits that faster is not always better. Good, he said, is a reflection of manual review and scrutiny, punctuated by a weekly assessment of all cases fielded, during which team members examine latency and ask key questions: Is more training needed? Are enough compute resources available? What programmatic fixes can be introduced?
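The weekly latency review described above might boil down to something like this sketch: per-category triage latencies examined for outliers that warrant a closer look. The categories, numbers, and outlier rule are all made up for illustration.

```python
import statistics

# Hypothetical triage latencies in minutes, grouped by alert category.
latencies = {
    "malware": [4, 6, 5, 7, 5],
    "cloud_iam": [25, 40, 90, 35, 60],
}

for category, mins in latencies.items():
    p50 = statistics.median(mins)
    worst = max(mins)
    # Assumed review rule: flag a category when its worst case is far
    # above its median, suggesting a training or tooling gap.
    flag = "review" if worst > 2 * p50 else "ok"
    print(f"{category}: p50={p50}m max={worst}m -> {flag}")
```

Here the cloud IAM category would get flagged for review, echoing Nguyen’s point that cloud-stack investigations legitimately take longer and deserve scrutiny of their resourcing rather than pressure to close faster.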
The team also accounts for variance, and while no event investigation should languish, not every case is easily closeable, especially with the expanding risk surface brought on by massive cloud adoption. “Something that happens in the cloud IAM stack is more complicated than, say, a malware or Flash Helper alert,” Nguyen said. “We want to make sure folks are incentivized and know it’s OK for them to take the right amount of time to investigate.”
Nguyen closed the interview by sharing his biggest tip for other detection and response teams: Convince yourselves–and the broader business–of the value of automation and engineering so you can scale the capabilities of your security operations team. To accomplish that, you must hire the right people who can execute on that ethos.
“Easier said than done,” he admitted.
To hear the full Cloud Security Podcast interview, visit here.