Consulting SRE Engagements

This post describes how I would shape a “consulting” style Site Reliability Engineering (SRE) engagement. While I believe this style is an anti-pattern, it makes sense in some circumstances.

SRE is seen as a high modernist project, intent on scientifically managing their systems, all techne and no metis; all SLOs and Kubernetes and no systems knowledge and craft.

Seeing Like an SRE: Site Reliability Engineering as High Modernism

1. Engagement Charter

Start by formalizing the scope and activities as a charter.

e.g. production readiness, operational responsibility, …

2. Critical User Journey Mapping

Discover and document user journeys prioritized by criticality to facilitate the remaining activities.

| Critical User Journey | Interaction | Valid Event | Impact |
| --- | --- | --- | --- |
| Checkout | GET /checkout/new | HTTP 200 | 100% |
| Checkout | POST /checkout | HTTP 301 | 100% |
| Checkout | GET /orders/[id] | HTTP 200, 404 | 0% |
| Add to cart | PUT /cart/[product_id] | HTTP 200 | 10-100% |
| View Product | GET /products/[id] | HTTP 200, 404 | 5-100% |
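
A journey map like this becomes more useful once it is encoded as data that monitoring can evaluate. The following is a minimal sketch, not a definitive implementation: the `Interaction` record, the regex encoding of the `[id]` placeholders, and the `classify` and `availability` helpers are all hypothetical names introduced for illustration. It maps each observed request to a journey and reports the fraction of valid events per journey.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    journey: str               # Critical User Journey name
    method: str                # HTTP method
    path_pattern: str          # regex for the request path (hypothetical encoding)
    valid_statuses: frozenset  # status codes that count as a valid event


# Rows from the table above; [id] placeholders become regex wildcards.
INTERACTIONS = [
    Interaction("Checkout", "GET", r"/checkout/new", frozenset({200})),
    Interaction("Checkout", "POST", r"/checkout", frozenset({301})),
    Interaction("Checkout", "GET", r"/orders/[^/]+", frozenset({200, 404})),
    Interaction("Add to cart", "PUT", r"/cart/[^/]+", frozenset({200})),
    Interaction("View Product", "GET", r"/products/[^/]+", frozenset({200, 404})),
]


def classify(method, path, status):
    """Return (journey, is_valid) for a request, or None if no journey matches."""
    for i in INTERACTIONS:
        if i.method == method and re.fullmatch(i.path_pattern, path):
            return i.journey, status in i.valid_statuses
    return None


def availability(requests):
    """Fraction of valid events per journey from (method, path, status) tuples."""
    totals, good = {}, {}
    for method, path, status in requests:
        result = classify(method, path, status)
        if result is None:
            continue  # request is not part of any mapped journey
        journey, ok = result
        totals[journey] = totals.get(journey, 0) + 1
        good[journey] = good.get(journey, 0) + int(ok)
    return {j: good[j] / totals[j] for j in totals}
```

Requests that match no mapped interaction are ignored, so the result only reflects the journeys that were deliberately prioritized.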

3. Risk Analysis

Capture concrete and systemic risks against the Critical User Journeys (CUJs). Examples of systemic risks are “production access” or “lack of monitoring”. Examples of concrete risks are “deployments cause downtime” or a “minor defect”.

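| Risk | ETTD | ETTR | Impact | ETTF | Incidents/Year | Bad Minutes/Year |
| --- | --- | --- | --- | --- | --- | --- |
| (unit / formula) | mins | mins | % | days | 365/ETTF | (ETTD + ETTR) * Impact * Incidents/Year |
| deployment downtime | 0 mins | 3 mins | 100% | 7 days | 52 | 156 mins |
| minor defect | 60 mins | 60 mins | 2% | 21 days | 17 | 41 mins |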
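
The arithmetic behind the table can be sketched in a few lines. This is an illustration under stated assumptions: the `Risk` class and its field names are hypothetical, I read ETTD, ETTR, and ETTF as estimated time to detect, to resolve, and between failures (the post does not expand them), and incidents per year are rounded to a whole number because that appears to be how the table arrives at 52 and 17.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    ettd_mins: float  # ETTD: assumed to mean estimated time to detect, minutes
    ettr_mins: float  # ETTR: assumed to mean estimated time to resolve, minutes
    impact: float     # fraction of the journey affected, 0.0 to 1.0
    ettf_days: float  # ETTF: assumed to mean estimated days between failures

    @property
    def incidents_per_year(self):
        # Rounded to a whole number, as the table appears to do.
        return round(365 / self.ettf_days)

    @property
    def bad_minutes_per_year(self):
        return (self.ettd_mins + self.ettr_mins) * self.impact * self.incidents_per_year


RISKS = [
    Risk("deployment downtime", ettd_mins=0, ettr_mins=3, impact=1.0, ettf_days=7),
    Risk("minor defect", ettd_mins=60, ettr_mins=60, impact=0.02, ettf_days=21),
]

# List the most expensive risks first.
for risk in sorted(RISKS, key=lambda r: r.bad_minutes_per_year, reverse=True):
    print(f"{risk.name}: {risk.bad_minutes_per_year:.0f} bad mins/year")
# deployment downtime: 156 bad mins/year
# minor defect: 41 bad mins/year
```

Sorting by bad minutes per year gives a rough priority order that can then be compared with the error budget implied by the SLO.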

4. Service Level Objective Development

Figure out which metrics to use as SLIs so that they most accurately track the user experience. Roughly 80% of the time, this is availability.

See Art of SLOs

| Availability % | Downtime per year | Downtime per quarter | Downtime per month | Downtime per week | Downtime per day (24 hours) |
| --- | --- | --- | --- | --- | --- |
| 90% (“one nine”) | 36.53 days | 9.13 days | 73.05 hours | 16.80 hours | 2.40 hours |
| 99% (“two nines”) | 3.65 days | 21.9 hours | 7.31 hours | 1.68 hours | 14.40 minutes |
| 99.9% (“three nines”) | 8.77 hours | 2.19 hours | 43.83 minutes | 10.08 minutes | 1.44 minutes |
| 99.99% (“four nines”) | 52.60 minutes | 13.15 minutes | 4.38 minutes | 1.01 minutes | 8.64 seconds |
| 99.999% (“five nines”) | 5.26 minutes | 1.31 minutes | 26.30 seconds | 6.05 seconds | 864.00 milliseconds |
| 99.9999% (“six nines”) | 31.56 seconds | 7.89 seconds | 2.63 seconds | 604.80 milliseconds | 86.40 milliseconds |
| 99.99999% (“seven nines”) | 3.16 seconds | 0.79 seconds | 262.98 milliseconds | 60.48 milliseconds | 8.64 milliseconds |
| 99.999999% (“eight nines”) | 315.58 milliseconds | 78.89 milliseconds | 26.30 milliseconds | 6.05 milliseconds | 864.00 microseconds |
| 99.9999999% (“nine nines”) | 31.56 milliseconds | 7.89 milliseconds | 2.63 milliseconds | 604.80 microseconds | 86.40 microseconds |
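
The figures in the table follow from one formula: allowed downtime equals the period length multiplied by (1 - availability). A minimal sketch, assuming a 365.25-day year (which reproduces the table's values) and a hypothetical `allowed_downtime` helper:

```python
from datetime import timedelta

# Assumption: a year is 365.25 days; quarters and months are fractions of it.
PERIODS = {
    "year": timedelta(days=365.25),
    "quarter": timedelta(days=365.25 / 4),
    "month": timedelta(days=365.25 / 12),
    "week": timedelta(days=7),
    "day": timedelta(days=1),
}


def allowed_downtime(availability_pct, period):
    """Downtime permitted in `period` at the given availability percentage."""
    unavailable = 1 - availability_pct / 100
    return PERIODS[period] * unavailable


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime(target, "month").total_seconds() / 60
        print(f"{target}%: {minutes:.2f} minutes of downtime per month")
```

For example, `allowed_downtime(99.9, "month")` works out to roughly 43.83 minutes, in line with the “three nines” row.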

5. Production Readiness Review

Verify that the service meets accepted standards of production setup and operational readiness.

See Evolving SRE Engagement Model

6. Review Periodically

Requirements will change and new information will become available. Here is some guidance from the SRE Workbook - Implementing SLOs on how to respond to your SLO measures.

| SLO | Toil | Customer satisfaction | Action |
| --- | --- | --- | --- |
| Met | Low | High | Choose to (a) relax release and deployment processes and increase velocity, or (b) step back from the engagement and focus engineering time on services that need more reliability. |
| Met | Low | Low | Tighten SLO. |
| Met | High | High | If alerting is generating false positives, reduce sensitivity. Otherwise, temporarily loosen the SLOs (or offload toil) and fix product and/or improve automated fault mitigation. |
| Met | High | Low | Tighten SLO. |
| Missed | Low | High | Loosen SLO. |
| Missed | Low | Low | Increase alerting sensitivity. |
| Missed | High | High | Loosen SLO. |
| Missed | High | Low | Offload toil and fix product and/or improve automated fault mitigation. |

See SRE Workbook - Implementing SLOs
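
For a recurring review, the decision table above can be kept as a simple lookup so the same inputs always produce the same recommendation. This is only a sketch: the `ACTIONS` mapping and `recommended_action` function are hypothetical names, and reducing toil and customer satisfaction to booleans is an assumption that a real review would need to back with agreed thresholds.

```python
# Keys are (slo_met, toil_high, customer_satisfaction_high); values are the
# actions from the table above.
ACTIONS = {
    (True, False, True): (
        "Choose to (a) relax release and deployment processes and increase "
        "velocity, or (b) step back from the engagement and focus engineering "
        "time on services that need more reliability."
    ),
    (True, False, False): "Tighten SLO.",
    (True, True, True): (
        "If alerting is generating false positives, reduce sensitivity. "
        "Otherwise, temporarily loosen the SLOs (or offload toil) and fix "
        "product and/or improve automated fault mitigation."
    ),
    (True, True, False): "Tighten SLO.",
    (False, False, True): "Loosen SLO.",
    (False, False, False): "Increase alerting sensitivity.",
    (False, True, True): "Loosen SLO.",
    (False, True, False): (
        "Offload toil and fix product and/or improve automated fault mitigation."
    ),
}


def recommended_action(slo_met, toil_high, satisfaction_high):
    """Look up the suggested action for one review period."""
    return ACTIONS[(bool(slo_met), bool(toil_high), bool(satisfaction_high))]


if __name__ == "__main__":
    # SLO missed, low toil, unhappy customers.
    print(recommended_action(slo_met=False, toil_high=False, satisfaction_high=False))
```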

Useful activities

weekly production in review, runbooks, “Wheel of Misfortune”, pre-mortem, causal maps, human factors, team building activities, …