By: M. Adams, J.O. Coplien, R. Gamoke, R. Hanmer, F. Keeve, K. Nicodemus
Published in: PLoPD2
Category: Fault-Tolerant Systems, Telecommunications
Summary: Addresses reliability and human factors issues in telecommunications software.
Telecommunications software must be highly reliable and run continuously.
Downtime, human-induced or otherwise, must be minimized. History has shown that people cause the majority of problems in these systems, so let the machine try to do everything, deferring to a human only as a last resort.
The system must try to recover from all error conditions on its own. To balance automation with human authority and responsibility, allow knowledgeable users to override automatic controls.
The human-machine interface is saturated with error reports. Display a message when taking the first action in a series that could lead to an excessive number of messages. If the abnormal condition ends, display a message that everything is back to normal. Don't display a message for every change in state: people can't do anything about the messages except watch them, so don't bother printing them. This pattern is expanded in Five Minutes of No Escalation Messages [Hanmer+99].
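The reporting rule above can be sketched in a few lines. This is only an illustration: the class name EscalationReporter, its method names, and the message wording are hypothetical and do not come from the pattern text.

```python
class EscalationReporter:
    """Report the start and the end of an abnormal condition, nothing between."""

    def __init__(self, emit=print):
        self.emit = emit        # hypothetical output hook; defaults to printing
        self.abnormal = False   # True while the abnormal condition persists

    def action_taken(self, description):
        # Only the first action in a series produces a message.
        if not self.abnormal:
            self.abnormal = True
            self.emit(f"ABNORMAL: {description}")

    def condition_cleared(self):
        # Announce the return to normal exactly once.
        if self.abnormal:
            self.abnormal = False
            self.emit("NORMAL: condition cleared")
```

Later actions in the same series change state silently, so the operator sees one message when trouble starts and one when it ends.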
Some errors may be transient. To determine whether a problem will work itself out, don't react immediately to detected conditions. Be sure a condition really exists by checking it several times, perhaps using Leaky Bucket Counters.
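A minimal sketch of checking a condition several times before reacting follows. The function name, the probe callable, and the default of three checks are assumptions made for illustration. A real system would also space the checks out in time.

```python
def condition_persists(probe, checks=3):
    """Return True only if probe() reports the condition on every check.

    probe is a hypothetical zero-argument callable that returns True
    while the condition is detected. The default of 3 is an assumption.
    """
    for _ in range(checks):
        if not probe():
            return False   # the condition cleared, so treat it as transient
    return True
```

The caller reacts only when the function returns True. A single clean reading is enough to ride over the event.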
To handle transient faults, keep a counter for each failure group. Initialize the counter to a predetermined value. Decrement the counter for each error or event and increment it periodically (but never beyond its initial value). If the leak rate (decrements caused by errors) is faster than the fill rate (the periodic increments), then an error condition is indicated.
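The counter described above might look like the following sketch. The class and method names are hypothetical. The sketch also assumes that an error condition is indicated when the counter reaches zero, a threshold the text does not state.

```python
class LeakyBucketCounter:
    """One counter per failure group."""

    def __init__(self, initial):
        self.initial = initial   # predetermined starting value
        self.count = initial

    def record_error(self):
        """Decrement for an error; return True if an error condition is indicated."""
        if self.count > 0:
            self.count -= 1
        # Assumption: an empty bucket signals the error condition.
        return self.count == 0

    def periodic_tick(self):
        """Increment on a timer, but never beyond the initial value."""
        if self.count < self.initial:
            self.count += 1
```

Isolated errors are absorbed because the periodic increments refill the counter. A burst of errors between ticks drains it and indicates the error condition.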
Give the System Integrity Control Program (SICO) the ability and power to reinitialize the system when system sanity is threatened by error conditions. This program should oversee both the initialization process and the normal application functions so initialization can be restarted if it runs into errors.
The central controller has several configurations with many paths through the subsystems depending on the configuration. To select a workable configuration when there is a faulty subsystem, maintain a configuration counter in hardware and a table that maps from that counter to a configuration state. When the system fails to get through a configuration to a predetermined level of stability, restart it with the configuration that corresponds to the next value of the counter.
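The counter-and-table scheme can be illustrated as follows. Everything specific here is an assumption: the table contents, the subsystem names, the reaches_stability callable, and returning None once the table is exhausted. The counter itself is a plain integer standing in for the hardware counter.

```python
# Hypothetical table mapping each counter value to a configuration state.
CONFIG_TABLE = [
    ("cpu0", "bus0", "mem0"),
    ("cpu0", "bus1", "mem0"),
    ("cpu1", "bus0", "mem1"),
    ("cpu1", "bus1", "mem1"),
]

def find_working_configuration(reaches_stability, table=CONFIG_TABLE):
    """Step through the table until one configuration reaches stability.

    reaches_stability is a hypothetical callable that restarts the system
    with the given configuration and reports whether it became stable.
    """
    counter = 0   # stands in for the hardware configuration counter
    while counter < len(table):
        config = table[counter]
        if reaches_stability(config):
            return counter, config
        counter += 1   # escalate to the next configuration
    return None        # assumption: give up once every entry has failed
```

Keeping the mapping in a table means the escalation order can change without touching the restart logic.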
You're using Try All Hardware Combos. A latent error can cause a system fault after the configuration counter has been reset. The system then no longer knows that it is in configuration escalation and retries the same configuration that has already failed. The first time the application tells the processor configuration that "all is well," believe it and reset the configuration counter. After that, ignore the request.
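The "believe it once" rule can be sketched like this. The class and method names are hypothetical. The sketch never re-arms the flag, because the text does not say when a later request should be honored again.

```python
class ConfigurationCounter:
    """Configuration counter that honors only the first 'all is well'."""

    def __init__(self):
        self.value = 0
        self.reset_honored = False   # set once the first request is believed

    def escalate(self):
        # Move on to the next configuration after a failure.
        self.value += 1

    def all_is_well(self):
        """Reset the counter on the first request only; return True if reset."""
        if self.reset_honored:
            return False             # later requests are ignored
        self.value = 0
        self.reset_honored = True
        return True
```

If a latent error causes a fault after the first reset, the counter keeps its escalated value, so the system does not retry a configuration that has already failed.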