Service Management with DevOps, SRE, ITIL…
DevOps is a culture and set of practices that foster collaboration between multiple delivery functions, such as development, architecture, operations, testing and security, in order to drive agility.
SRE can be considered an implementation of DevOps principles. It is the discipline that combines software engineering and operations to run large, mission-critical systems with a focus on reliability and availability.
You cannot be Digital without being Reliable
The cultural elements are the same as in the Agile revolution: a proactive, enabling, self-organised, trusting, frictionless, collaborative and blameless style of working together.
SRE Principles and Practices
The key principles are as follows (note the similarity to the DevOps CALMS principles):
- Break down organizational silos (Collaboration and Sharing)
- Leverage tooling and automation (Automation)
- Implement gradual changes (Lean and Agile)
- Consider failure as normal (Learn from Failure)
- Measure everything (Measure)
In terms of technical practices, SRE comes down to:
- Leverage Dev and Ops strengths to build robust systems through equal partnership and feedback mechanisms. Establish error budgets, and agree the consequences of exhausting them, to reduce friction. Software engineering meets systems engineering to build robust systems that are easy to run and manage.
- Release engineering: CI/CD with small and frequent changes, measured with deployment frequency, cycle time, lead time etc.
- Consider failure as normal by embracing risk. Use chaos engineering in production to find and fix gaps. Treat incidents as continuous feedback loops, with blameless postmortems, so that failures are not repeated.
- Manage repeated work (toil) with automation and tooling. As systems become complex and highly distributed, with growing volumes of data, traditional monitoring tools become ineffective. Sophisticated tooling with AI capabilities will be required to enable issue traceability, prediction and anomaly detection.
- SLOs and SLIs: measure and monitor everything in order to improve. Typical measures include availability, latency, MTTR, MTTD, MTTF and deployment frequency.
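To make the link between SLOs, SLIs and error budgets concrete, here is a minimal sketch in Python. It assumes a request-based availability SLI (successful requests divided by total requests) and a simple policy in which releases pause once the budget is spent. The class name `ErrorBudget`, its fields and the `can_release` policy are hypothetical, chosen only for illustration; real teams define their own SLIs, windows and consequences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Error budget for a request-based availability SLO (illustrative only)."""

    slo_target: float     # assumed target, e.g. 0.999 means 99.9% of requests succeed
    total_requests: int   # requests served in the SLO window
    failed_requests: int  # requests that counted against the SLI in that window

    def __post_init__(self) -> None:
        if not 0 < self.slo_target < 1:
            raise ValueError("slo_target must be strictly between 0 and 1")
        if self.total_requests <= 0:
            raise ValueError("total_requests must be positive")
        if not 0 <= self.failed_requests <= self.total_requests:
            raise ValueError("failed_requests must be between 0 and total_requests")

    @property
    def sli(self) -> float:
        """Measured availability: fraction of requests that succeeded."""
        return 1 - self.failed_requests / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """Number of failures the SLO tolerates in this window."""
        return (1 - self.slo_target) * self.total_requests

    @property
    def budget_remaining(self) -> float:
        """Fraction of the budget left; negative once the budget is overspent."""
        return 1 - self.failed_requests / self.allowed_failures

    def can_release(self) -> bool:
        """Assumed policy: keep shipping only while some budget remains."""
        return self.budget_remaining > 0


healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
exhausted = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_500)
print(healthy.can_release(), exhausted.can_release())
```

With a 99.9% target over one million requests, roughly 1,000 failures are tolerated. The first service has spent about 40% of that budget and may keep releasing, while the second has overspent it and, under the assumed policy, would shift its effort to reliability work.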
SRE and ITIL
Google considers SRE its approach to Service Management
SRE is the most innovative approach to ITSM since ITIL
SRE takes a different (systems engineering) approach to ITSM: lightweight, integrated, self-regulating, inclusive, accountable, proactive and automated.
Systems Engineering to Service Management
- Service Management: SLOs and SLIs are defined up front. The term SLA is overloaded and can take on different meanings.
- Change Management: error budgets, automation and increased release velocity. Removing humans from the loop minimises human errors caused by fatigue, over-familiarity and repetitive work.
- Event Management: latency, traffic, errors and saturation (the four golden signals), with internal and external perspectives on observability.
- Capacity Management: SRE looks at organic growth (natural usage) vs inorganic growth (event driven). SRE owns planning and provisioning.
- Incident Management: incident command system, recognised command post, live incident state document, clear/live handoff and on-call. An incident commander structures the incident response, assigns responsibility, removes roadblocks and keeps a living state document (a ticket or Google Doc that captures the shared understanding).
- Problem Management: blameless postmortems, with AI-driven action identification and closure.
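The four event-management signals listed above (latency, traffic, errors, saturation) can be derived from fairly simple raw data. The sketch below is one hypothetical way to compute them in Python from a window of request records. The `Request` type, the `golden_signals` function, the choice of p95 latency, the rule that HTTP status 500 and above counts as an error, and CPU as the saturation resource are all assumptions made for illustration, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    latency_ms: float  # time taken to serve the request
    status: int        # HTTP status code returned


def percentile(sorted_values: list[float], percent: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = (percent * len(sorted_values) + 99) // 100  # integer ceiling
    return sorted_values[max(rank, 1) - 1]


def golden_signals(
    requests: list[Request],
    window_seconds: float,
    cpu_used: float,
    cpu_capacity: float,
) -> dict[str, float]:
    """Summarise one observation window as the four golden signals."""
    if not requests:
        raise ValueError("at least one request is required")
    if window_seconds <= 0 or cpu_capacity <= 0:
        raise ValueError("window_seconds and cpu_capacity must be positive")

    latencies = sorted(r.latency_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumed error rule

    return {
        "latency_p95_ms": percentile(latencies, 95),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "saturation": cpu_used / cpu_capacity,
    }


# Hypothetical sample: 20 requests in a 10-second window, two of them failing.
sample = [
    Request(latency_ms=10.0 * (i + 1), status=500 if i < 2 else 200)
    for i in range(20)
]
signals = golden_signals(sample, window_seconds=10, cpu_used=6, cpu_capacity=8)
print(signals)
```

In practice these values would come from a metrics pipeline rather than an in-memory list, and each signal would be compared against an SLO threshold to decide whether to raise an alert.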