Wednesday, October 5 • 9:00am - 9:50am
Orchestrated Chaos: Applying Failure Testing Research at Scale.

Large-scale distributed systems must be built to anticipate and mitigate a variety of hardware and software failures. To build confidence that fault-tolerant systems are correctly implemented, an increasing number of large-scale sites practice Chaos Engineering, running regular failure drills in which faults are deliberately injected into their production systems. While fault injection infrastructures are becoming relatively mature, the combinatorial space of failure scenarios is too large to search exhaustively, so existing approaches either explore the space of potential failures randomly or exploit the “hunches” of domain experts to guide the search. Random strategies waste resources testing “uninteresting” faults, while programmer-guided approaches are only as good as the programmer’s intuition and scale only with human effort.
In this talk, I will present intuition, experience, and research directions related to lineage-driven fault injection (LDFI), a novel approach to automating failure testing. LDFI uses existing tracing or logging infrastructures to work backwards from good outcomes, identifying redundant computations that allow it to aggressively prune the space of faults that must be explored via fault injection. I will describe LDFI’s theoretical roots in the database research notion of provenance, present results from the lab as well as the field, and issue a call to arms for the reliability community to improve our understanding of when and how our fault-tolerant systems actually tolerate faults.
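
To make the pruning idea concrete, here is a minimal sketch. It is not LDFI's actual implementation. It assumes the lineage of a good outcome has already been extracted from traces as a list of "supports", each one a set of components that together were sufficient to produce the outcome. The function name, the component names, and the brute-force search are all hypothetical choices made for illustration. The only fault sets worth injecting are those that break every known support, so the sketch enumerates the minimal such sets.

```python
from itertools import combinations


def minimal_fault_hypotheses(supports, max_faults):
    """Return minimal fault sets that intersect every known support.

    supports: list of sets of component names (hypothetical lineage data).
    max_faults: largest number of simultaneous faults to consider.
    """
    components = sorted(set().union(*supports))
    hypotheses = []
    for size in range(1, max_faults + 1):
        for candidate in combinations(components, size):
            faults = set(candidate)
            # Skip supersets of a smaller hypothesis already found.
            if any(h <= faults for h in hypotheses):
                continue
            # Keep the candidate only if it breaks every redundant path.
            if all(faults & support for support in supports):
                hypotheses.append(faults)
    return hypotheses


# Hypothetical lineage: a write succeeded via either of two replicas.
supports = [
    {"client", "replica_a"},
    {"client", "replica_b"},
]
hypotheses = minimal_fault_hypotheses(supports, max_faults=2)
print(hypotheses)
```

In this toy case there are six possible fault sets of size one or two, but only two survive: failing the client alone, or failing both replicas together. Failing a single replica is pruned because the other replica still supports the outcome. A real system would presumably inject each surviving hypothesis, and where the outcome still succeeds, fold the newly observed lineage back into the supports before searching again.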

Speakers
Peter Alvaro

Assistant Professor of Computer Science, UC Santa Cruz
Peter Alvaro is an Assistant Professor of Computer Science at the University of California Santa Cruz, where he leads the Disorderly Labs research group (disorderlylabs.github.io). His research focuses on using data-centric languages and analysis techniques to build and reason about...


Wednesday October 5, 2016 9:00am - 9:50am CDT
Zilker Ballroom 3+4