Causal Diagrams

Incidents often result from several contributing factors rather than a single root cause. Causal diagrams are therefore an effective tool for illustrating how an incident came about.


This is an example of an incident impacting availability of a service endpoint:

graph TD
    A(Instance Terminated)
    B(New Instance Health Check Passes & Receives Traffic)
    C(#6 External Service Connection Fails)
    D(Instance Endpoint Returns Error)
    F(#5 External Service Connections In Use)
    G(Implemented Scaling Policy)
    E(Purchase External Service Plan With #5 External Service Connection Limit)
    A --> B --> C --> D
    E --> F
    G --> B
    G --> F
    F --> C

Tip: Causal diagrams should consist of a graph of linked events that contributed to the incident. These events should be things that happened as opposed to the absence of something.


From the above example, we can infer that the incident might have been avoided if we had removed a contributing factor:

Or broke a link in a sequence:

Or addressed systemic factors:

Tip: 5 Whys can be useful for finding preceding events.
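
The reasoning about removing a factor or breaking a link can also be checked mechanically. The following is a minimal Python sketch, not part of any tooling: it encodes the links of the example diagram and rests on a simplifying assumption that an event only happens if every one of its causes happened and is still linked to it. The names `EDGES`, `preceding_events`, and `occurs` are hypothetical, and the two interventions tried at the end are illustrative picks.

```python
# Cause -> effect links, copied from the example diagram (node letters as above).
EDGES = [
    ("A", "B"), ("B", "C"), ("C", "D"),
    ("E", "F"), ("G", "B"), ("G", "F"), ("F", "C"),
]


def preceding_events(edges, event):
    """Return every event that directly or indirectly leads to `event`.

    This mirrors asking "why?" repeatedly, starting from the outcome.
    """
    found = set()
    stack = [event]
    while stack:
        current = stack.pop()
        for cause, effect in edges:
            if effect == current and cause not in found:
                found.add(cause)
                stack.append(cause)
    return found


def occurs(edges, event, removed=frozenset(), broken=frozenset()):
    """Decide whether `event` still happens after an intervention.

    Assumption (ours, for illustration): an event happens only if it was not
    removed, and each of its causes happened and is still linked to it.
    Events without causes happen unless removed. The graph must be acyclic.
    """
    if event in removed:
        return False
    causes = [c for c, e in edges if e == event]
    return all(
        (c, event) not in broken and occurs(edges, c, removed, broken)
        for c in causes
    )


# Every event that precedes the endpoint error (D).
print(sorted(preceding_events(EDGES, "D")))

# No intervention: the error occurs.
print(occurs(EDGES, "D"))

# Remove a contributing factor, here E (the plan with the connection limit).
print(occurs(EDGES, "D", removed={"E"}))

# Break a link in a sequence, here F --> C.
print(occurs(EDGES, "D", broken={("F", "C")}))
```

Under the stated assumption, both interventions stop the chain before it reaches D. Real incidents rarely follow such strict all-or-nothing logic, so treat the result as a prompt for discussion rather than proof that a fix would have worked.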