-
1/ I read the Stella Report a while back. Finally getting around to sharing my highlights. If you haven't read it yet, do read the whole thing: snafucatchers.github.io/
-
2/ Without the continuous effort of engineers to keep [these systems] running they would stop working -- many in days, most in weeks, all within a year.
-
3/ These platforms remain alive and functioning because workers are able to detect anomalies, diagnose their sources, remediate their effect, and repair their flaws and do so ceaselessly
-
4/ The process of contrasting and shifting perspectives revealed what is otherwise hidden about resilient performances and what is essential to build and sustain the ability to be resilient in the face of surprise in the future
-
5/ Experts are typically much better at solving problems than at describing accurately how problems are solved
-
6/ The software and hardware (collectively, the technical artifacts) running below the line cannot be seen or controlled directly. Instead, every interaction crossing the line is mediated by a representation.
-
7/ When a technical system surprises us, it is most often because our mental models of that system are flawed
-
8/ Package maintenance routines for the system repository, Chef recipes and the Chef system, and the mistaken belief that installing a single server could not have system-wide side effects interacted to produce the anomaly.
-
9/ The irony that the system was able to 'limp along' on a handful of servers that continued to run because they were not 'properly' configured was not lost on the operators.
-
10/ The fact that experts can be surprised in this way is evidence of systemic complexity and also of operational variety.
-
11/ People are surprised when they find out that their own mental model of The System (in the Figure 1 or Figure 2 sense) doesn't match the behavior of the system
-
12/ The participants were engaged in a particularly complicated form of search: exploring the external world based on their internal representations of that world, available affordances, and multiple, interacting goals
-
13/ Experts demonstrated their ability to use their incomplete, fragmented models of the system as starting points for exploration and to quickly revise and expand their models during the anomaly response in order to understand the anomaly and develop and assess solutions
-
14/ Although automation and monitoring provide convenient and efficient ways of doing things and keeping track of nominal performance,
-
15/ when things are broken or confusing or when decisive actions are taken, tools that provide tight interaction with the operating system are commonly used
-
16/ This coordination effort is among the most interesting and potentially important aspects of the anomaly response
-
17/ The postmortem discussions revealed that organizations seek ways to avoid burdening their technical staff with demands for updates and projections, especially in the early stages of anomaly response
-
18/ One rationale for improving the quality of postmortems is to obtain better insight into the way that escalating consequences increase the pressure on IT staff and how to better inform their approach to these difficult situations.
-
19/ Under 'normal' operating conditions many goals can be active simultaneously and the workers need to do little to maintain a balance between competing or mutually exclusive goals
-
20/ sacrifice decisions are readily criticized afterwards and, ironically, this is especially the case when they are successful
-
21/ postmortems for events that produce large economic losses or engage regulatory bodies are more scripted, sometimes to the point of being little more than staged events at which carefully vetted statements are made and discussion of certain topics is deliberately avoided
-
22/ Anomalies are unambiguous but highly encoded messages about how systems really work. Postmortems represent an attempt to decode the messages and share them
-
23/ Collectively, our skill isn’t in having a good model of how the system works, our skill is in being able to update our model efficiently and appropriately
-
24/ There's the related, but different ‘how-did-this-ever-work?!’ experience that is even more troubling upon discovery.
-
25/ You make a change to restore function to a system but are unable to construct a mental model that would have ever allowed the system to work correctly before you fixed it -- in direct opposition to the observation that the system did appear to be functioning previously.
-
26/ They can also lead to deeper insights into the technical, organizational, economic, and even political factors that promote those conditions
-
27/ The presence and nature of postmortems serves as a signal about the health and focus of the organization and technical artifacts themselves
-
28/ The presence of skilled facilitators -- most often people with technical chops who have devoted time and effort to learn how to manage these meetings and have practiced doing so -- certainly contributes to success
-
29/ Although apparently technically focused, postmortems are inherently social events
-
30/ critical but non-judgmental review of events can produce useful insights
-
31/ Organizations often assert that their reviews are "blameless" although in many instances they are, in fact, sanctionless. As a practical matter, it is difficult to forgo sanctions entirely.
-
32/ A "no blame" approach to managing incidents and accidents is predicated on the idea that the knowledge obtained from open, rapid, and thorough examination of these events is worth more than the gain from castigating individuals
-
33/ The dilemma facing those already involved is whether they should stay focused on the anomaly in order to maximize their chances of quick diagnosis and repair or devote some of their effort to bringing others up to speed so that they can participate in that work.
-
34/ steps have been taken, lines of inquiry pursued, diagnostics and workarounds attempted. Coupled to an anomaly that is itself cascading, the activities of initial responders create a new situation that has its own history.
-
35/ The incoming expert usually needs to review that history
-
36/ It is far easier to imagine how automation could be useful than it is to produce working automation that functions as a genuine "team player" in anomaly response
-
37/ Deciding on a risky or expensive course of action, coping with the emotional nature of severe anomalies, and gauging fatigue may be more reliable, efficient, or nuanced with such meetings.
-
38/ Business critical software presents a unique opportunity for innovative visualizations that improve resilient performance.
-
39/ The interventions that responders make are experiments that test their mental models of the anomaly sources and the surrounding system
-
40/ What is not clear is how to manage the risks posed by strange loop dependencies in business-critical software
-
41/ object-oriented programming method created an opportunity to build systems quickly, to deploy them, and from their use to discover new abstractions that could then be incorporated into the software
-
42/ Refactoring is not itself productive because it does not change the software's external behavior. Thus refactoring "pays back" technical debt but does not produce immediate value for users
-
43/ Accepting too much technical debt in order to bring product features to the customer may doom the long-term viability of the product by making it impossible to revise in the future.
-
44/ In contrast, concentrating exclusively on keeping the software spotlessly clean may cause the enterprise to miss opportunities for improving the current product and make it less competitive.
-
45/ The organization has little idea of how much technical debt it 'carries' in its code and paying tech debt is notoriously difficult to make visible to those setting business level priorities.
-
46/ There is no specific countermeasure that can be used against dark debt because it is invisible until an anomaly reveals its presence.
-
47/ Critics of the notion of dark debt will argue that it is preventable by design, code review, thorough testing, etc. But these and many other preventative methods have already been used to create those systems where dark debt has created outages
-
48/ "Why are things done the way they are?" is seldom asked during internal analysis but was quite common during the workshop