Tuesday, 2 January 2024

Resources on Fault Tolerance, Resilience, Self Healing in Software

A reading list on fault tolerance, resilience, and self-healing in software. The goal is to build systems that are fault tolerant by design, which lets me keep focusing on velocity.

  1. "Patterns for Fault Tolerant Software" by Robert S. Hanmer: A good patterns book.
  2. "Fault-Tolerant Design" by Elena Dubrova: This book provides an in-depth exploration of fault-tolerant design strategies, offering a comprehensive approach to building robust systems.
  3. "Fault Detection and Diagnosis in Engineering Systems" by Janos J. Gertler: While not exclusively software-focused, this book offers valuable insights into the theory and practice of fault detection and diagnosis, which can be applied to software systems.
  4. "Fault Tolerance: Principles and Practice" by Peter A. Lee, Thomas Anderson: This is a foundational text in the field, discussing the principles and practical considerations of building fault-tolerant systems.
  5. "Reliable Software Technologies - Ada-Europe" (Series of Conference Proceedings): These collections from the annual Ada-Europe conferences contain numerous papers and studies on reliable software technologies, many of which focus on fault tolerance.
  6. "Designing Reliable and Efficient Networks on Chips" by Srinivasan Murali: Tailored more towards network design on chips, this book includes principles that are also applicable to software systems, particularly in terms of reliability and fault tolerance.
  7. "Practical System Reliability" by Eric Bauer, Xuemei Zhang, Douglas A. Kimber: This guide offers practical tools and strategies for ensuring system reliability, with a focus on real-world applications.
  8. "Building Reliable Component-Based Software Systems" edited by Ivica Crnkovic, Magnus Larsson: This book provides a comprehensive overview of component-based software engineering, including aspects of reliability and fault tolerance.
  9. "Dependable Computing for Critical Applications" (Series of Volumes): A collection of works that focus on dependable and fault-tolerant computing, relevant to both hardware and software systems.
  10. "System Reliability Theory: Models, Statistical Methods, and Applications" by Marvin Rausand, Arnljot Høyland: This textbook covers a broad range of reliability theory topics, providing a statistical approach to system reliability which can be applied to software fault tolerance.
  11. "Robust Communications Software: Extreme Availability, Reliability, and Scalability for Carrier-Grade Systems" by Greg Utas: This book covers the development of communications software with a focus on robustness, availability, reliability, and scalability, all crucial for fault-tolerant systems.
  12. "Fault Tolerance by Design"

These books cover a range of topics from the theoretical foundations of fault tolerance to practical applications in software and systems engineering. They provide valuable resources for anyone looking to deepen their understanding and skills in building fault-tolerant systems.

Wikipedia is a good starting point as well:

https://en.wikipedia.org/wiki/Fault_tolerance

Grasping the basics gives me the nouns I need to find more and better resources.

Parse Wikipedia dump

""" This module processes Wikipedia dump files by extracting individual articles and parsing them into a structured format, ...