Site Reliability Engineering Principles
An opinionated working SRE guide
Alex Meng, Dec 19
What follows are my personal opinions, principles, and guidelines for a mature (site reliability) engineering organization. These opinions are based on my years of experience as both a software engineer and a site reliability engineer. The lack of some Google SRE principles (SLA’s, SLO’s, etc.) is intentional. I believe those are sufficiently documented elsewhere, and more importantly, that those concepts require a solid foundation before it’s possible to implement them. This document is that foundation.
Everything should be completely automated.
- If an existing process cannot be automated, it will be replaced.
- If a proposed process cannot be automated, it will be rejected.
- The SRE’s job is to automate themselves out of a job. In practice this means constantly automating menial tasks and moving on to solve more interesting problems.
Servers are ephemeral. They can and will go away at any time.
- Servers live in auto-scaling groups that self-heal.
- Servers have health checks that assert the health of their process(es).
- Servers boot from images that are fully equipped and operational.
- Configuration management is not run against existing servers. It is used only to create images.
- Application servers are stateless.
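To make the health-check bullet above concrete, here is a minimal Python sketch, not a definitive implementation. The function name `evaluate_health`, the check names, and the choice of 200/503 status codes are all assumptions for illustration. The idea is that an auto-scaling group could poll an endpoint backed by something like this and replace any server that reports unhealthy.

```python
from typing import Callable, Dict, Tuple

# Hypothetical type: each check returns True when its process is healthy.
Check = Callable[[], bool]


def evaluate_health(checks: Dict[str, Check]) -> Tuple[int, Dict[str, str]]:
    """Run every check and return an HTTP-style status plus per-check detail."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failing"
        except Exception as exc:
            # A check that crashes is treated as unhealthy, not ignored.
            results[name] = f"error: {exc}"
    # Assumption: 200 means healthy, 503 tells the group to replace the server.
    status = 200 if all(value == "ok" for value in results.values()) else 503
    return status, results
```

A server would report healthy only when every check passes, so a single failing process is enough to have the instance replaced.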
Engineers are ephemeral. They can and will go away at any time.
- Engineering workloads are shared. There are no individual silos.
- Engineering practices are documented. Documentation is up to date.
- All engineers have access to all codebases.
All code changes are made via pull requests, verified, and approved.
- All code is functionally tested, unit tested, and linted.
- Linters are extremely opinionated. Engineers should feel empowered to propose changes to the rules in isolated discussions and pull requests.
- Unit tests and linters run on every pull request, preventing merges when the build fails.
- Functional tests run on every deploy, preventing (or rolling back) deploys when the build fails.
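The last bullet above, functional tests that gate or roll back a deploy, can be sketched roughly as follows. This is an illustration under assumptions: `deploy_with_rollback`, `release`, and `functional_tests_pass` are hypothetical names, and a real pipeline would call its own deploy tooling and test suite in their place.

```python
from typing import Callable


def deploy_with_rollback(
    new_version: str,
    current_version: str,
    release: Callable[[str], None],
    functional_tests_pass: Callable[[], bool],
) -> str:
    """Release new_version, run functional tests, and roll back on failure.

    Returns the version that is left running.
    """
    release(new_version)
    if functional_tests_pass():
        return new_version
    # The rollback goes through the same release path as a normal deploy.
    release(current_version)
    return current_version
```

Routing the rollback through the same `release` callable is a deliberate choice in this sketch: it keeps rollbacks on the same tested path as forward deploys.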
Deploys are easy, fast, safe, and frequent.
- Changes are deployed on every merge.
- Deploys do not require any human interaction or approval.
- Deploy time matters and engineers should strive to make it faster.
- Deploys can be started manually with a single button. As many engineers as possible should have access to the button.
- Rollbacks happen automatically when a failed deploy is automatically detected.
- Rollbacks are held to all the same standards as deploys.
- The master branch is the only branch that gets deployed. All git branching is for the benefit of the engineer prior to merging the changes into master.
- It is easy to tell which commit is deployed.
- There is no such thing as a code freeze.
- Features are released by feature flags. Flipping a flag does not require a deploy. A “flip freeze” is acceptable.
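As one way to picture the feature-flag bullet above, here is a small Python sketch. `FlagStore` and the flag name `new_checkout` are hypothetical, and the in-memory dictionary stands in for whatever shared store a real system would use. The point is only that code reads the flag at runtime, so flipping it needs no deploy, and that a "flip freeze" can block changes without blocking deploys.

```python
from typing import Dict


class FlagStore:
    """In-memory stand-in for a shared feature-flag service (an assumption)."""

    def __init__(self) -> None:
        self._flags: Dict[str, bool] = {}
        self._frozen = False

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so new code ships dark.
        return self._flags.get(name, False)

    def set_freeze(self, frozen: bool) -> None:
        # A "flip freeze" stops flag changes; deploys are unaffected.
        self._frozen = frozen

    def flip(self, name: str, enabled: bool) -> None:
        if self._frozen:
            raise RuntimeError("flip freeze in effect")
        self._flags[name] = enabled
```

Application code would then branch on `store.is_enabled("new_checkout")` rather than on which commit happens to be deployed.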
SRE’s operate as software engineers, not system administrators.
- Everything is managed in code. Any change to a system is a code change.
- Code is written to be read by other engineers. It is self-documenting.
- All processes are automated with software.
- CI/CD principles apply to all SRE code.
- The entire engineering team has access to all SRE code.
Services are small, well defined, and isolated.
- Services are reasonably small and single purpose. If a service cannot be summarized succinctly, it is too big.
- Services run in isolation. Excessive resource usage in one service does not affect other services.
- They are independently deployable to any environment.
- A service going down affects other services minimally or not at all.
- They do not share data stores.
- Their infrastructure is homogeneous.
- All services are deployed the same way, from the same interface.
- Services communicate with each other through APIs or well-defined pub/sub mechanisms.
- Implementing and deploying a new service is trivial.
- Service discovery is highly available and held to microservice standards.
All systems are monitored for critical metrics.
- Metrics are easily available and consumable in a single interface.
- Critical metrics are displayed on dashboards for each system.
- The system that does the monitoring is monitored by a separate system.
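The "monitor the monitor" bullet above is often done with a heartbeat check. The sketch below assumes the monitoring system records a timestamp each time it completes a cycle, and that a separate system calls this function. The name `monitor_is_stale` and the 120-second default are illustrative assumptions, not recommendations.

```python
def monitor_is_stale(
    last_heartbeat: float,
    now: float,
    max_silence_seconds: float = 120.0,
) -> bool:
    """Return True when the monitoring system has been silent for too long.

    Both timestamps are seconds since the epoch, e.g. from time.time().
    """
    return (now - last_heartbeat) > max_silence_seconds
```

The separate watcher would alert through a channel that does not depend on the monitored system.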
When self-healing fails, engineers are intelligently notified.
- Alerts summarize the problem succinctly and include suggested actions.
- Engineers are only paged off-hours for production. Other environments may alert engineers during business hours.
- After resolving the alert as quickly as possible, the next step (during business hours) is to ensure the same alert never fires again.
- Excessive alerting is unacceptable. It is addressed immediately.
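The alerting rules above can be expressed as a small routing function. This is a sketch under stated assumptions: the `Alert` fields, the environment name `"production"`, and the definition of business hours (Monday to Friday, 09:00 to 17:00 local time) are all hypothetical choices that a real team would set for itself.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Alert:
    summary: str                      # one-line description of the problem
    environment: str                  # e.g. "production" or "staging"
    suggested_actions: List[str] = field(default_factory=list)


def is_business_hours(when: datetime) -> bool:
    # Assumption: Monday-Friday, 09:00-17:00 in the engineer's local time.
    return when.weekday() < 5 and 9 <= when.hour < 17


def should_notify(alert: Alert, when: datetime) -> bool:
    """Production alerts always notify; other environments wait for business hours."""
    if alert.environment == "production":
        return True
    return is_business_hours(when)
```

Keeping `suggested_actions` on the alert itself means the engineer who is paged sees the next steps without having to search for a runbook.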
On-call engineers (both SRE’s and SE’s) feel empowered to respond in a timely manner.
- SE’s are on-call for the systems they create and own.
- SRE’s are on-call for low level systems and to assist developers.
- All escalation policies have backups or fallbacks.
- All escalation policies have rotations. No engineer is on-call for a system full time.
- Escalating is acceptable if needed. Escalation generates a follow-up task to understand why the on-call engineer could not solve the problem.
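The rotation and backup bullets above can be illustrated with a simple round-robin schedule. This is a minimal sketch: `on_call_for_week` is a hypothetical helper, weekly shifts are an assumption, and real scheduling tools handle time zones, swaps, and holidays that this ignores.

```python
from typing import List, Tuple


def on_call_for_week(engineers: List[str], week_index: int) -> Tuple[str, str]:
    """Return (primary, backup) for a given week using round-robin rotation.

    Requires at least two engineers so the backup is never the primary.
    """
    if len(engineers) < 2:
        raise ValueError("a rotation needs at least two engineers")
    primary = engineers[week_index % len(engineers)]
    # The backup is the next engineer in line, so they become primary next week.
    backup = engineers[(week_index + 1) % len(engineers)]
    return primary, backup
```

Because the backup becomes next week's primary in this scheme, they already have context on any ongoing problems when their shift starts.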
All user-facing incidents require a postmortem.
- Postmortems are blameless.
- The process for a postmortem is easy to conduct and has very little overhead. A few sentences are sometimes sufficient. A meeting is not always required.
- Postmortems are conducted reasonably soon after the incident is resolved.
- A repository of postmortems is easily accessible.
Security is automated and baked into everything.
- Security checks run as part of CI/CD.
- Intrusion detection systems are in place.
- Identity and access management is used to gate all actions.
- As few infrastructure components as possible are publicly accessible, ideally zero.
- Client applications only use public APIs.
- Engineers are trusted but verified.
- Credentials are not stored in plain text, especially not in code.
- Credentials can be easily rotated.
- Access is revoked in a single place, which propagates to all systems.
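For the credential bullets above, one common pattern is to inject secrets at runtime instead of committing them. The sketch below reads from environment variables. `load_credential` and the variable name `DB_PASSWORD` are hypothetical, and many teams use a managed secrets service rather than plain environment variables.

```python
import os


def load_credential(name: str) -> str:
    """Fetch a credential injected at runtime; never hard-code it in source."""
    value = os.environ.get(name)
    if not value:
        # Fail loudly at startup instead of running with a missing secret.
        raise RuntimeError(f"credential {name} is not set")
    return value
```

Because the code only knows the credential's name, rotating the value becomes a change to the environment and does not require a code change.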
Offload security to managed services.
- Servers receive requests through managed load balancers.
- All data stores receive requests from inside the network only.
- Static content is delivered through a CDN. Buckets are private.
SRE’s are financially conscious in all aspects of their work.
- Cost measurements include engineering time and effort.
- Tooling is used to monitor all engineering costs an SRE can affect.
An externally managed cloud is the default place to run services. Running services by any other means requires justification.
- Multi-region is appropriate when the cost of downtime has been properly measured against the cost of running in multiple regions.
- Multi-cloud (for redundancy) is almost never worth the effort and loss of features.
On-premise solutions are appropriate when:
- A modern cloud front-end is in place (OpenStack, etc).
- IT, capacity planning, and system administration are all top-notch.
- The increased overhead is clearly cost-effective once engineering time is considered, and is projected to remain so for the foreseeable future.
- SRE’s are not expected to physically interact with the data center.
Containerized orchestration is appropriate when:
- Services are shown to successfully run in containers.
- Services are in a healthy state and sufficiently modularized.
- The increased overhead is deemed acceptable.
- The company is willing to invest heavily in tooling.
Serverless solutions are appropriate when:
- Tooling and automation are used to manage serverless functions.
- Service owners are willing to accept the limitations of serverless.
The default option for supporting services (logging, monitoring, alerting, etc) is externally managed and hosted. Running these services internally requires justification.
- SRE’s are constantly evaluating supporting service options, new and old. The ability to consolidate is a factor.
- Supporting services are secure, cost effective, and useful to engineers.
SRE’s and SE’s are on the same team. They are all engineers.
- SRE’s are not blockers and allow access to as many systems as possible.
- SE’s own their services and do not “throw code over the wall.”
- SRE’s are willing and able to contribute to and debug application code.
- SRE’s use and contribute to open source, if possible.
- SE’s and SRE’s work together to plan new services and architectures.
- SRE’s strive to make the lives of all engineers better through automation.