Practical System Reliability
Written by a panel of authors with a wealth of industry experience, this book presents methods and concepts that give readers a solid understanding of modeling and managing system and software availability and reliability through the development of real applications and products. The modeling and prediction techniques and tools are customer-focused and data-driven, and are aligned with industry standards (Telcordia, TL 9000, ISO, etc.). Readers will get a clear understanding of what real-world reliability and availability mean through step-by-step discussions of:
- System availability
- Conceptual model of reliability and availability
- Why availability varies between customers
- Modeling availability
- Estimating parameters and availability from field data
- Estimating input parameters from laboratory data
- Estimating input parameters in the architecture/design stage
- Prediction accuracy
- Connecting the dots
This book can be used by system architects, engineers, and developers to better understand and manage the reliability/availability of their products; quality engineers to grasp how software and hardware quality relate to system availability; and engineering students as part of a short course on system availability and software reliability.
Xuemei Zhang, PhD, is a principal member of the technical staff in the Network Design and Performance Analysis Department at AT&T Labs. She has been working on reliability and performance analysis of wireline and wireless communications systems and networks. Her major work and research areas are in system and architectural reliability and performance, product and solution reliability and performance modeling, and software reliability.
Douglas A. Kimber retired from Alcatel-Lucent as a staff reliability engineer. Throughout his career at Bell Labs, Lucent Technologies, and Alcatel-Lucent, he developed high-reliability hardware and software platforms, applications, and systems, and then transitioned to reliability engineering, where he performed reliability modeling and analysis.
Table of Contents
2 System Availability.
2.1 Availability, Service and Elements.
2.2 Classical View.
2.3 Customers’ View.
2.4 Standards View.
3 Conceptual Model of Reliability and Availability.
3.1 Concept of Highly Available Systems.
3.2 Conceptual Model of System Availability.
3.4 Outage Resolution.
3.5 Downtime Budgets.
4 Why Availability Varies Between Customers.
4.1 Causes of Variation in Outage Event Reporting.
4.2 Causes of Variation in Outage Duration.
5 Modeling Availability.
5.1 Overview of Modeling Techniques.
5.2 Modeling Definitions.
5.3 Practical Modeling.
5.4 Widget Example.
5.5 Alignment with Industry Standards.
6 Estimating Parameters and Availability from Field Data.
6.1 Self-Maintaining Customers.
6.2 Analyzing Field Outage Data.
6.3 Analyzing Performance and Alarm Data.
6.4 Coverage Factor and Failure Rate.
6.5 Uncovered Failure Recovery Time.
6.6 Covered Failure Detection and Recovery Time.
7 Estimating Input Parameters from Lab Data.
7.1 Hardware Failure Rate.
7.2 Software Failure Rate.
7.3 Coverage Factors.
7.4 Timing Parameters.
7.5 System-Level Parameters.
8 Estimating Input Parameters in the Architecture/Design Stage.
8.1 Hardware Parameters.
8.2 System-Level Parameters.
8.3 Sensitivity Analysis.
9 Prediction Accuracy.
9.1 How Much Field Data Is Enough?
9.2 How Does One Measure Sampling and Prediction Errors?
9.3 What Causes Prediction Errors?
10 Connecting the Dots.
10.1 Set Availability Requirements.
10.2 Incorporate Architectural and Design Techniques.
10.3 Modeling to Verify Feasibility.
10.5 Update Availability Prediction.
10.6 Periodic Field Validation and Model Update.
10.7 Building an Availability Roadmap.
10.8 Reliability Report.
Appendix A System Reliability Report outline.
1 Executive Summary.
2 Reliability Requirements.
3 Unplanned Downtime Model and Results.
Annex A Reliability Definitions.
Annex B References.
Annex C Markov Model State-Transition Diagrams.
Appendix B Reliability and Availability Theory.
1 Reliability and Availability Definitions.
2 Probability Distributions in Reliability Evaluation.
3 Estimation of Confidence Intervals.
Appendix C Software Reliability Growth Models.
1 Software Characteristic Models.
2 Nonhomogeneous Poisson Process Models.
Appendix D Acronyms and Abbreviations.
Appendix E Bibliography.
About the Authors.