piannaf’s Twitter Archive № 1,630

1/ I read the Stella Report a while back. Finally getting around to sharing my highlights. If you haven't read it yet, do read the whole thing: snafucatchers.github.io/
                                                                          2/ Without the continuous effort of engineers to keep them running they would stop working -- many in days, most in weeks, all within a year.
                                                                        3/ These platforms remain alive and functioning because workers are able to detect anomalies, diagnose their sources, remediate their effect, and repair their flaws and do so ceaselessly
                                                                      4/ The process of contrasting and shifting perspectives revealed what is otherwise hidden about resilient performances and what is essential to build and sustain the ability to be resilient in the face of surprise in the future
                                                                    5/ Experts are typically much better at solving problems than at describing accurately how problems are solved
                                                                  6/ The software and hardware (collectively, the technical artifacts) running below the line cannot be seen or controlled directly. Instead, every interaction crossing the line is mediated by a representation.
                                                                7/ When a technical system surprises us, it is most often because our mental models of that system are flawed
                                                              8/ Package maintenance routines for the system repository, Chef recipes and the Chef system, and the mistaken belief that installing a single server could not have system-wide side effects interacted to produce the anomaly.
                                                            9/ The irony that the system was able to 'limp along' on a handful of servers that continued to run because they were not 'properly' configured was not lost on the operators.
                                                          10/ The fact that experts can be surprised in this way is evidence of systemic complexity and also of operational variety.
                                                        11/ People are surprised when they find out that their own mental model of The System (in the Figure 1 or Figure 2 sense) doesn't match the behavior of the system
                                                      12/ The participants were engaged in a particularly complicated form of search: exploring the external world based on their internal representations of that world, available affordances, and multiple, interacting goals
                                                    13/ Experts demonstrated their ability to use their incomplete, fragmented models of the system as starting points for exploration and to quickly revise and expand their models during the anomaly response in order to understand the anomaly and develop and assess solutions
                                                  14/ Although automation and monitoring provide convenient and efficient ways of doing things and keeping track of nominal performance,
                                                15/ when things are broken or confusing or when decisive actions are taken, tools that provide tight interaction with the operating system are commonly used
                                              16/ This coordination effort is among the most interesting and potentially important aspects of the anomaly response
                                            17/ The postmortem discussions revealed that organizations seek ways to avoid burdening their technical staff with demands for updates and projections, especially in the early stages of anomaly response
                                          18/ One rationale for improving the quality of postmortems is to obtain better insight into the way that escalating consequences increase the pressure on IT staff and how to better inform their approach to these difficult situations.
                                        19/ Under 'normal' operating conditions many goals can be active simultaneously and the workers need to do little to maintain a balance between competing or mutually exclusive goals
                                      20/ sacrifice decisions are readily criticized afterwards and, this is ironically the case, especially when they are successful
                                    21/ postmortems for events that produce large economic losses or engage regulatory bodies are more scripted, sometimes to the point of being little more than staged events at which carefully vetted statements are made and discussion of certain topics is deliberately avoided
                                  22/ Anomalies are unambiguous but highly encoded messages about how systems really work. Postmortems represent an attempt to decode the messages and share them
                                23/ Collectively, our skill isn’t in having a good model of how the system works, our skill is in being able to update our model efficiently and appropriately
                              24/ There's the related, but different ‘how-did-this-ever-work?!’ experience that is even more troubling upon discovery.
                            25/ You make a change to restore function to a system but are unable to construct a mental model that would have ever allowed the system to work correctly before you fixed it -- in direct opposition to the observation that the system did appear to be functioning previously.
                          26/ They can also lead to deeper insights into the technical, organizational, economic, and even political factors that promote those conditions
                        27/ The presence and nature of postmortems serves as a signal about the health and focus of the organization and technical artifacts themselves
                      28/ The presence of skilled facilitators -- most often people with technical chops who have devoted time and effort to learn how to manage these meetings and have practiced doing so -- certainly contributes to success
                    29/ Although apparently technically focused, postmortems are inherently social events
                  30/ critical but non-judgmental review of events can produce useful insights
                31/ Organizations often assert that their reviews are "blameless" although in many instances they are, in fact, sanctionless. As a practical matter, it is difficult to forego sanctions entirely.
              32/ A "no blame" approach to managing incidents and accidents is predicated on the idea that the knowledge obtained from open, rapid, and thorough examination of these events is worth more than the gain from castigating individuals
            33/ The dilemma facing those already involved is whether they should stay focused on the anomaly in order to maximize their chances of quick diagnosis and repair or devote some of their effort to bringing others up to speed so that they can participate in that work.
          34/ steps have been taken, lines of inquiry pursued, diagnostics and workarounds attempted. Coupled to an anomaly that is itself cascading, the activities of initial responders create a new situation that has its own history.
        35/ The incoming expert usually needs to review that history
      36/ It is far easier to imagine how automation could be useful than it is to produce working automation that functions as a genuine "team player" in anomaly response
    37/ Deciding on a risky or expensive course of action, coping with the emotional nature of severe anomalies, and gauging fatigue may be more reliable, efficient, or nuanced with such meetings.
      38/ Business critical software presents a unique opportunity for innovative visualizations that improve resilient performance.
        39/ The interventions that responders make are experiments that test their mental models of the anomaly sources and the surrounding system
          40/ What is not clear is how to manage the risks posed by strange loop dependencies in business-critical software
            41/ object-oriented programming method created an opportunity to build systems quickly, to deploy them, and from their use to discover new abstractions that could then be incorporated into the software
              42/ Refactoring is not itself productive because it does not change the software's external behavior. Thus refactoring "pays back" technical debt but does not produce immediate value for users
                43/ Accepting too much technical debt in order to bring product features to the customer may doom the long-term viability of the product by making it impossible to revise in the future.
                  44/ In contrast, concentrating exclusively on keeping the software spotlessly clean may cause the enterprise to miss opportunities for improving the current product and make it less competitive.
                    45/ The organization has little idea of how much technical debt it 'carries' in its code and paying tech debt is notoriously difficult to make visible to those setting business level priorities.
                      46/ There is no specific countermeasure that can be used against dark debt because it is invisible until an anomaly reveals its presence.
                        47/ Critics of the notion of dark debt will argue that it is preventable by design, code review, thorough testing, etc. But these and many other preventative methods have already been used to create those systems where dark debt has created outages
                          48/ "Why are things done the way they are?" is seldom asked during internal analysis but was quite common during the workshop