Processor shadowing: Maximizing expected throughput in fault-tolerant systems.

*(English)*Zbl 0977.90012Summary: This paper studies parallel processing as a device for increasing fault tolerance. In the first of two basic models, a single job with a given running time is to be run on a finite set of processors; each processor is subject to failure but only while running a job. If a job is running on only one processor, and that processor fails, then the job must be restarted on another processor, assuming not all processors have already failed. To avoid such losses in accrued running time when at least two processors are available, it can be decided at any time to run the job synchronously on two processors in parallel, a replication technique we call shadowing. Clearly, shadowing has its own downside: while two processors are running, the failure rate is doubled. We show how to resolve this trade-off optimally; we devise a policy that schedules shadowing in such a way as to maximize the probability that the job finishes before all processors fail. We prove that the policy is of threshold type. That is, depending on the number of processors and the duration of the job, there is an optimal time to begin shadowing; once started, shadowing continues so long as neither processor fails and the job does not complete. We also show that the thresholds are monotone in the number of processors, i.e., if more processors are initially available, then shadowing should be started sooner.

In the second of our two models, we have the same set-up except that we have an unbounded number of jobs, each having the same running time, and the objective is to maximize the expected number of jobs completed before all processors fail. We show that the optimal policy is again of threshold type, but that the thresholds are, surprisingly, not monotone in the number of processors. The optimal thresholds have a curious oscillatory behavior that we study in detail.

Variants of the above problems are also analyzed using the same methods; several other variants are left as interesting open problems.

In the second of our two models, we have the same set-up except that we have an unbounded number of jobs, each having the same running time, and the objective is to maximize the expected number of jobs completed before all processors fail. We show that the optimal policy is again of threshold type, but that the thresholds are, surprisingly, not monotone in the number of processors. The optimal thresholds have a curious oscillatory behavior that we study in detail.

Variants of the above problems are also analyzed using the same methods; several other variants are left as interesting open problems.

##### MSC:

90B35 | Deterministic scheduling theory in operations research |

90B25 | Reliability, availability, maintenance, inspection in operations research |