DevOps is a cultural idea. Site Reliability Engineering (SRE) is a specific set of processes and practices that tries to implement that idea.

SRE was largely developed and popularized by Google. Here’s how Google defines its most important constituent parts:

Reduce Silos

Instead of having “ops” and “dev” people who work in segregated parts of the development lifecycle, have cross-functional teams that think, design, build, and troubleshoot together. Even within a single team, spread knowledge around so that no one becomes the lone “specialist” for a part of the system (with the little silo that this creates).

Accept Failure as Normal

Large, complex distributed systems experience failures even when things are “normal.” You will never ONLY have HTTP 200s being returned in your system. There are so many possible sources of failure that you can’t control or fix them all (many aren’t even inside your own system)! This mindset enables a few positive changes:

  • RATES become more important than raw numbers. This changes your monitoring and alerting to be smarter and more likely to catch actual problems (i.e. you only start caring about dramatic changes in error rate, instead of feeling like every error is something worth troubleshooting).
  • Incidents are just a normal part of life (usually as a result of developers pushing new code). Having blameless postmortems is a big part of this — the goal is not to find out who to blame for an incident, but how to change the system to make such failures less likely in the future.
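To make the "rates over raw numbers" point concrete, here is a minimal sketch of a rate-based alert check. Everything in it is an assumption for illustration: the `WindowStats` type, the `should_alert` function, and the thresholds (`max_ratio`, `min_rate`) are hypothetical and don't come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request counts for one time window (hypothetical shape)."""
    total_requests: int
    error_responses: int


def error_rate(stats: WindowStats) -> float:
    """Fraction of requests in the window that failed."""
    if stats.total_requests == 0:
        return 0.0
    return stats.error_responses / stats.total_requests


def should_alert(current: WindowStats, baseline: WindowStats,
                 max_ratio: float = 3.0, min_rate: float = 0.01) -> bool:
    """Alert only when the error rate is above a floor AND well above baseline.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    cur = error_rate(current)
    base = error_rate(baseline)
    if cur < min_rate:
        return False          # some errors are normal; ignore low rates
    if base == 0.0:
        return True           # any meaningful rate over a zero baseline
    return cur / base >= max_ratio


quiet_hour = WindowStats(total_requests=1_000, error_responses=5)      # 0.5%
busy_hour = WindowStats(total_requests=100_000, error_responses=500)   # 0.5%
incident = WindowStats(total_requests=100_000, error_responses=4_000)  # 4%

print(should_alert(busy_hour, quiet_hour))  # False
print(should_alert(incident, quiet_hour))   # True
```

The busy hour has 100x more raw errors than the quiet hour, but the rate is identical, so nothing fires. Only the jump from 0.5% to 4% counts as an actual problem.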

Build the System as a Set of Gradual Changes

Instead of rolling out a massive release once a month, make many small changes. This goal gets you to do SO MANY THINGS RIGHT:

  • It forces you to get your CI/Deployment system to the point where it’s safe to make multiple small changes every day.
  • It incentivizes writing feature, integration, and end-to-end tests so that multiple developers can safely make changes in quick succession.
  • It incentivizes small chunks of work that are easier to plan and less likely to go down a rabbit hole.
  • It ensures that rollbacks are generally painless and safe.
  • and a bunch of other good stuff!

Leverage Tooling and Automation

All of this is predicated on a large amount of tooling and automation throughout the system. From automatically running tests against code changes (BEFORE they get merged into git’s master branch), to automating safe deploys and rollbacks, to monitoring and alerting on potential issues, a massive amount of automation is needed to put together a reliable system like this. Thankfully, most of that automation is hand-tuned rather than hand-rolled: much of it is open source or paid third-party tooling that you can quickly hook up to your system and start using.
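As one possible illustration of "automating safe deploys and rollbacks," here is a rough sketch of a staged rollout that shifts traffic to a new version in steps and rolls back on its own if the error rate gets too high. This is just one common pattern, not something the Google material prescribes. The function name, the two callbacks, the step percentages, and the threshold are all hypothetical stand-ins for whatever your deploy and monitoring tooling actually provides.

```python
from typing import Callable, Iterable


def staged_rollout(
    set_traffic_percent: Callable[[int], None],   # hypothetical deploy hook
    observed_error_rate: Callable[[], float],     # hypothetical monitoring hook
    steps: Iterable[int] = (1, 10, 50, 100),      # assumed traffic steps
    max_error_rate: float = 0.01,                 # assumed rollback threshold
) -> bool:
    """Shift traffic to the new version step by step.

    Returns True if the rollout reached the last step, False if it
    was rolled back because the error rate exceeded the threshold.
    """
    for percent in steps:
        set_traffic_percent(percent)
        if observed_error_rate() > max_error_rate:
            set_traffic_percent(0)  # automatic rollback: no human in the loop
            return False
    return True


# Fake hooks so the sketch runs on its own: the second reading is unhealthy.
history = []
rates = iter([0.001, 0.05])
ok = staged_rollout(history.append, lambda: next(rates))
print(ok, history)  # False [1, 10, 0]
```

In a real system you would wait and gather enough traffic at each step before reading the error rate. That detail is left out here to keep the sketch short.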

Measure the System

Because so much of these large systems is automated, you need to observe them in an automated way. But the level of monitoring here is very different from a small system. Instead of caring about CPU utilization on a few nodes, you’re more likely to measure 95th percentile latency numbers. It’s a statistical approach, instead of a simple numerical approach. This dramatically changes WHAT you measure, and HOW you measure it. Part of this is the Service-Level Objectives (SLOs, basically ‘reliability goals’) and the Service-Level Indicators (SLIs) that you use to measure your fulfillment of those goals.
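Here is a small sketch of what that statistical approach can look like: a nearest-rank percentile over latency samples, plus an availability SLI checked against an SLO. The sample data, the function names, and the 99.5% target are made up for illustration. Real systems usually compute these from histograms in a metrics backend rather than from raw lists.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def availability_sli(status_codes):
    """SLI: fraction of requests that did not fail with a server error (5xx)."""
    good = sum(1 for code in status_codes if code < 500)
    return good / len(status_codes)


def error_budget_remaining(sli, slo):
    """Share of the error budget still unspent (negative if the SLO is missed)."""
    budget = 1.0 - slo
    spent = 1.0 - sli
    return (budget - spent) / budget


# Made-up samples: most requests are fast, a few are very slow.
latencies_ms = [100] * 94 + [120] + [900] * 5
print(percentile(latencies_ms, 50))  # 100
print(percentile(latencies_ms, 95))  # 120
print(percentile(latencies_ms, 99))  # 900

# Made-up traffic: 997 successes and 3 server errors, against a 99.5% SLO.
status_codes = [200] * 997 + [500] * 3
sli = availability_sli(status_codes)
print(sli >= 0.995)  # True: the SLO is currently met
```

The median looks perfectly healthy here while the 99th percentile shows the slow tail, which is why percentiles tell you more than a single average or a per-node CPU graph.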

References:

A YouTube video by Google defining the difference between DevOps and SRE: