
Soft Errors in Microprocessors

Tuesday May 11, 2004
Hamerschlag Hall D-210
4:00 pm



Shubu Mukherjee
Intel

With each technology generation, we are experiencing an increased rate of cosmically induced soft errors in our chips. In the past, the impact of such errors could be minimized through protection of large memory structures. Unfortunately, such techniques alone are becoming insufficient to maintain adequately low error rates. Although, to a very rough approximation, the fault rate per transistor is not changing much, the increasing number of transistors is resulting in an ever-increasing raw rate of bit upsets. Thus, we are starting to see a dark side to Moore's Law in which the increased functionality we get with our exponentially increasing number of transistors is being countered with an exponentially increasing soft error rate. Coping with this will take increasing effort and cost.

In this talk I will describe the severity of the soft error problem as well as techniques to estimate a processor's soft error rate. These estimates should help designers choose appropriate error protection schemes for various structures within a microprocessor. A key aspect of our soft error analysis is that some single-bit faults (such as those occurring in the branch predictor) will not produce an error in a program's output. We define a structure's architectural vulnerability factor (AVF) as the probability that a fault in that particular structure will result in an error in the final output of a program. A structure's error rate is the product of its raw error rate, as determined by process and circuit technology, and its AVF. Unfortunately, computing AVFs of complex structures, such as the instruction queue, can be quite involved. To guide such complex AVF calculation, we identify numerous cases, such as prefetches, dynamically dead code, and wrong-path instructions, in which a fault will not affect correct execution. Our simulations using these techniques show that the AVFs of a McKinley-like microprocessor's instruction queue and execution units are 29% and 9%, respectively.
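
As a rough illustration of the relationship stated above (structure error rate = raw error rate x AVF), the following Python sketch applies the two AVFs quoted in the abstract. The raw error rates are made-up placeholders chosen only to make the arithmetic easy to follow; they are not figures from the talk, and the function and variable names are hypothetical.

```python
def structure_error_rate(raw_rate: float, avf: float) -> float:
    """Effective error rate of one structure: raw error rate scaled by its AVF."""
    if not 0.0 <= avf <= 1.0:
        raise ValueError("AVF is a probability and must lie in [0, 1]")
    if raw_rate < 0.0:
        raise ValueError("raw error rate cannot be negative")
    return raw_rate * avf


# AVFs are the values quoted in the abstract.
# Raw rates (arbitrary units) are ASSUMED placeholders, not measured data.
structures = {
    "instruction_queue": {"raw_rate": 100.0, "avf": 0.29},
    "execution_units": {"raw_rate": 100.0, "avf": 0.09},
}

effective = {
    name: structure_error_rate(s["raw_rate"], s["avf"])
    for name, s in structures.items()
}

# Summing per-structure rates gives a simple estimate for these two structures.
total = sum(effective.values())

for name, rate in effective.items():
    print(f"{name}: {rate:.1f}")
print(f"total: {total:.1f}")
```

Under these assumed raw rates, a structure with a lower AVF contributes proportionally less to the overall error rate, which is why AVF estimates can help decide where protection is most worthwhile.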


Shubu Mukherjee is the Director of Intel's FACT group in Hudson, Massachusetts. The Fault Aware Computing Technology (FACT) group is involved with various aspects of soft error measurement, detection, and recovery techniques in current and future machines. In the past, he worked for Digital Equipment Corporation for ten days and Compaq Computer Corporation for three years. At Compaq, he worked on fault tolerance techniques for Alpha processors and was one of the architects of the Alpha 21364 interconnection network. He received his B.Tech. from the Indian Institute of Technology, Kanpur, and his M.S. and Ph.D. from the University of Wisconsin-Madison. He has received a number of outstanding achievement awards in the past few years.

 

Department of Electrical and Computer Engineering | Carnegie Mellon University | School of Computer Science