Effectiveness and failure modes of error correcting code in industrial 65 nm CMOS SRAMs exposed to heavy ions∗

2014-04-24 TONG Teng (童腾), WANG Xiao-Hui (王晓辉), ZHANG Zhan-Gang (张战刚), DING Peng-Cheng (丁朋程), LIU Jie (刘杰), LIU Tian-Qi (刘天奇), and SU Hong (苏弘)

Nuclear Science and Techniques, 2014, Issue 1

TONG Teng (童腾)^{1,2}, WANG Xiao-Hui (王晓辉)^{1,2}, ZHANG Zhan-Gang (张战刚)^{1,2}, DING Peng-Cheng (丁朋程)^{1,3}, LIU Jie (刘杰)^{1}, LIU Tian-Qi (刘天奇)^{1,2}, and SU Hong (苏弘)^{1,†}

^1 Institute of Modern Physics, Chinese Academy of Sciences, Lanzhou 730000, China

^2 University of Chinese Academy of Sciences, Beijing 100049, China

^3 Northwest Normal University, Lanzhou 730000, China

Single event upsets (SEUs) induced by heavy ions were observed in 65 nm SRAMs to quantitatively evaluate the applicability and effectiveness of single-bit error correcting code (ECC) utilizing Hamming code. The results show that the ECC improved the performance dramatically: the SEU cross sections of the SRAMs with ECC are on the order of $10^{-11}$ cm$^2$/bit, two orders of magnitude lower than those of the SRAMs without ECC (on the order of $10^{-9}$ cm$^2$/bit). Failures of the ECC module, appearing as 1-, 2- and 3-bit errors in a single word (not multiple bit upsets), were also detected. The ECC modules in the SRAMs, which utilize the (12,8) Hamming code, fail when a 2-bit upset accumulates in one codeword. Finally, the probabilities of the failure modes involving 1-, 2- and 3-bit errors were calculated as 39.39%, 37.88% and 22.73%, respectively, which agree well with the experimental results.

Keywords: Single event upsets (SEU), SRAM, Error correcting code (ECC), Hamming code, Effectiveness, Failure modes

I. INTRODUCTION

As technology scales down in modern integrated circuits such as SRAMs, the minimum charge needed to upset a memory cell decreases, while the influence of charge sharing on adjacent memory cells increases [1–5]. Therefore, advanced devices (especially deep-submicrometer ones) are much more sensitive to the energy deposited by heavy ion irradiation, which critically restricts their use in space.

Many methods have been proposed to mitigate the single event upsets (SEUs) occurring in advanced devices. Bit interleaving is a commonly accepted approach to mitigating multiple bit upsets (MBUs) in a data word. In this architecture, the bits of a data word are not physically adjacent but are interleaved with bits of other data words. In this way, every MBU of physically adjacent memory cells is transformed into multiple single bit upsets (SBUs) in different memory words.
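The effect of bit interleaving can be illustrated with a minimal sketch. The layout below (four 4-bit words sharing one physical row, with an interleaving distance of 4) is a hypothetical example chosen for illustration and is not the layout of the devices under test.

```python
# Hypothetical layout: 4 words of 4 bits each share one physical row,
# so column c holds bit (c // 4) of word (c % 4). Both numbers are assumptions.
INTERLEAVE = 4   # number of words interleaved in one row
WORD_BITS = 4    # bits per word in this toy layout

def physical_to_logical(col):
    """Map a physical column index to (word index, bit index)."""
    return col % INTERLEAVE, col // INTERLEAVE

def upsets_per_word(struck_columns):
    """Count how many flipped bits each logical word receives."""
    counts = {}
    for col in struck_columns:
        word, _bit = physical_to_logical(col)
        counts[word] = counts.get(word, 0) + 1
    return counts

# One ion strike that flips three physically adjacent cells (columns 5, 6, 7).
hits = upsets_per_word([5, 6, 7])
print(hits)  # every affected word sees a single flipped bit
```

With this mapping, an MBU wider than the interleaving distance would still place two upsets in one word, which is why the interleaving distance matters.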

Error correcting code (ECC) utilizing Hamming code is commonly found in high-reliability and high-performance applications. As a relatively simple yet powerful ECC, it corrects a single bit error anywhere within the codeword.

Therefore, MBUs, which are now a major reliability problem in commercial and industrial electronics, can be transformed into multiple SBUs that appear as uncorrelated events to the ECC algorithm, and can then be corrected [2,5–7].

In this hardening approach, the ECC module is combined with the bit interleaving architecture to resolve SBUs in advanced process node devices for high-reliability and high-performance applications.

To observe and compare the SEUs induced by heavy ions in SRAMs of different processes, and to quantitatively evaluate the applicability and effectiveness of single-bit ECC utilizing Hamming code in advanced process SRAMs, we used a $^{12}$C ion beam to irradiate four SRAMs from ISSI. Two of them, manufactured in 130 nm and 150 nm processes, are the most advanced process devices among the ISSI SRAMs without an ECC module, while the other two are 65 nm process SRAMs with an ECC module. Some interesting results were obtained.

II. EXPERIMENTAL BACKGROUND

Four industrial SRAMs, produced in high-performance CMOS technology, were irradiated at normal incidence in vacuum by $^{12}$C beams from the Heavy Ion Research Facility in Lanzhou (HIRFL). The $^{12}$C ions had an effective linear energy transfer (LET) value of 1.8 MeV·cm$^2$/mg. Table 1 shows the information on the SRAMs under test. The IS2ME is a 2M-bit SRAM organized as 131072 words by 16 bits with ECC, 65 nm process node; the IS4ME is a 4M-bit SRAM organized as 262144 words by 16 bits with ECC, 65 nm process node; the IS2M is a 2M-bit SRAM organized as 131072 words by 16 bits without ECC, 150 nm process node; and the IS4M is a 4M-bit SRAM organized as 262144 words by 16 bits without ECC, 130 nm process node. The first two SRAMs, with ECC, are the main objects of observation, and the other two are the reference devices. All four industrial SRAMs belong to the IS61WV series made by ISSI, and the ECC functions described here are implemented with Hamming code, a relatively simple yet powerful ECC that can correct all single bit errors in one codeword.

The SRAMs were tested using a data pattern of all "1" (blanket pattern) at a voltage of 3.3 V, and the operating frequency was set to 20 MHz throughout. Under the static test mode, the devices were written prior to the beam shot and read periodically throughout the beam shot (this technique is often referred to as multiple-read) [1,8,9]. The error data observed in the test were stored in another RAM working in the test system (referred to as the mirrored RAM relative to the SRAM under test), as reference data for the next read cycle. The test flow applied (Fig. 1) distinguishes SBU, MBU and single event latch-up (SEL). All upset events were recorded with a timestamp and bitmap location.
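The multiple-read procedure with a mirrored RAM can be sketched as follows. This is an illustrative model only: the memory is a Python list, the function `fake_beam` is a hypothetical stand-in for the ion beam, and ECC, MBU and SEL handling are left out.

```python
def static_test(memory, n_reads, inject, pattern=0xFFFF):
    """Write once, then read repeatedly and log only NEW differences.

    `mirror` plays the role of the mirrored RAM: it holds the data seen in
    the previous read cycle and serves as the reference for the next one.
    """
    for addr in range(len(memory)):
        memory[addr] = pattern           # single write before the beam shot
    mirror = [pattern] * len(memory)
    events = []
    for cycle in range(n_reads):
        inject(memory, cycle)            # hypothetical stand-in for irradiation
        for addr, data in enumerate(memory):
            diff = data ^ mirror[addr]
            if diff:
                # record (read cycle, address, number of newly flipped bits)
                events.append((cycle, addr, bin(diff).count("1")))
                mirror[addr] = data      # reference for the next read cycle
    return events

def fake_beam(memory, cycle):
    """Illustrative upsets: two SBUs hit the same word at different times."""
    if cycle == 1:
        memory[3] ^= 0x0004
    if cycle == 3:
        memory[3] ^= 0x0100

mem = [0] * 8
events = static_test(mem, n_reads=5, inject=fake_beam)
print(events)
```

Because the memory is never rewritten during the shot, the first upset is still present when the second one arrives, so upsets accumulate in the stored word even though each read logs only the new one.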

TABLE 1. Information on the SRAMs under test

Fig. 1. Static test flow.

III. RESULTS AND DISCUSSION

A. The high efficiency of the ECC module

SEU cross sections of the four SRAMs are shown in Fig. 2. The SRAMs without an ECC module are much more sensitive to the irradiation than the devices with an ECC module: the SEU cross sections are on the order of $10^{-9}$ cm$^2$/bit without ECC and $10^{-11}$ cm$^2$/bit with ECC. Note that the 65 nm process of the IS2ME and IS4ME is more advanced than that of the IS2M (150 nm process node) and IS4M (130 nm process node), and with technology scaling the number of upsets per chip increases due to higher circuit density and sensitivity. The 65 nm devices would therefore be expected to upset more readily, so the sharp contrast between the two groups of data should be attributed to the high efficiency of the ECC module.
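For readers unfamiliar with the quantity, a per-bit SEU cross section is conventionally the number of upsets divided by the product of ion fluence and the number of bits. The sketch below assumes that definition; the upset count and fluence are placeholder values chosen only to land on the reported order of magnitude, not measured data from this experiment.

```python
def seu_cross_section(n_upsets, fluence_per_cm2, n_bits):
    """Per-bit SEU cross section in cm^2/bit (conventional definition)."""
    return n_upsets / (fluence_per_cm2 * n_bits)

# Capacity of a 2M-bit device organized as 131072 words x 16 bits.
bits_2m = 131072 * 16

# Placeholder inputs (NOT experimental data): 2097 upsets at 1e6 ions/cm^2.
sigma = seu_cross_section(n_upsets=2097, fluence_per_cm2=1e6, n_bits=bits_2m)
print(f"{sigma:.2e} cm^2/bit")
```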

Fig. 2. Cross sections of the four SRAMs.

Fig. 3. Distribution of bits per upset event.

B. The ineffectiveness of the ECC module

Only 1-bit upsets in a data word were detected in the devices without an ECC module in this experiment. Upset events involving 1-, 2- and 3-bit errors occurred in the devices with an ECC module. Fig. 3 shows the measured and theoretical distributions of bits per upset event (as percentages of the total events). In the following discussion, special attention should be paid to the words "upset" and "error": an "upset" is the real change that occurs in a memory cell, and an "error" is the data finally read out from the memory.

1. The fundamental reason

To discuss the experimental results, we make the following assumptions:

1. Considering the beam energy of $^{12}$C and the bit interleaving architecture, the normally incident ions do not affect adjacent memory cells simultaneously. MBUs are therefore not expected to occur in a codeword at any time in this experiment [2,5–7].

2. The static mode used in this test means that only one write operation is performed in a test cycle, and the ECC module does not correct or rewrite the memory itself [1]; it only corrects the "error" bit(s) in the data being read out. After the data are read out through the ECC module, the memory remains in the upset state until a new write command arrives with new data. Therefore, if another bit upset occurs in the same word, the ECC module utilizing Hamming code, which can correct only a one-bit error, fails. The failure of the ECC module is thus an accumulation effect caused by several SBUs in a word at different times. In addition, as the ECC functional block diagram shows (Fig. 4, taken from the datasheet of the devices with an ECC module), the ECC module utilizes the (12,8) Hamming code.

Based on the time structure of the cyclotron and the upstream scanning magnets, the incident ions have a uniform temporal and spatial distribution in the flux range used, so each SBU can be regarded as an independent random event.

For independent random events, if the upset probability is $p$ ($p \ll 1$), the probability that $r$ bit upsets occur in an $n$-bit codeword is $P_n(r) = C_n^r p^r (1-p)^{n-r} \approx C_n^r p^r$. From the results of IS2M and IS4M, about 200 ions cause one bit upset, to an order of magnitude; assuming this probability also holds for IS2ME and IS4ME, we have $p = 5\times10^{-3}$. Then, for the 12-bit codeword, the probability of two SBUs occurring at different times in one codeword is

$$P_{12}(2) \approx C_{12}^{2}\,p^{2} = 66 \times (5\times10^{-3})^{2} \approx 1.65\times10^{-3}. \tag{1}$$

The probability of three SBUs occurring at different times in one codeword is

$$P_{12}(3) \approx C_{12}^{3}\,p^{3} = 220 \times (5\times10^{-3})^{3} \approx 2.75\times10^{-5}. \tag{2}$$

The results of Eq. (1) and Eq. (2) show a probability difference of about two orders of magnitude between $r=2$ and $r=3$. Three or more SBUs occurring at different times in one codeword therefore have a very low probability, and such events are neglected in this analysis.
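The binomial estimate can be checked numerically. The sketch below assumes $n = 12$ (the codeword length of the (12,8) code) and $p = 5\times10^{-3}$ as stated in the text; the function names are illustrative.

```python
from math import comb

def p_upsets_exact(n, r, p):
    """Binomial probability of exactly r upsets in an n-bit codeword."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

def p_upsets_approx(n, r, p):
    """Small-p approximation C(n, r) * p**r used in the text."""
    return comb(n, r) * p**r

n, p = 12, 5e-3
p2 = p_upsets_approx(n, 2, p)   # two accumulated SBUs in one codeword
p3 = p_upsets_approx(n, 3, p)   # three accumulated SBUs in one codeword

print(f"P(2) ~ {p2:.3e}, exact {p_upsets_exact(n, 2, p):.3e}")
print(f"P(3) ~ {p3:.3e}, exact {p_upsets_exact(n, 3, p):.3e}")
print(f"ratio P(2)/P(3) ~ {p2 / p3:.0f}")
```

The approximation overestimates the exact binomial value by about 5% for both cases, which does not change the order-of-magnitude comparison.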

Therefore, the fundamental reason for the problem is that a 2-bit upset in a codeword causes the ECC module utilizing the (12,8) Hamming code to fail.

2. Parsing the problem

Figure 5 shows a basic memory architecture of an ECC module utilizing Hamming code [10]. Table 2 gives the common relationship between the syndrome vector and the single-error location.

TABLE 2. The relationship between syndrome vector and single-error location

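Assuming the 12-bit codeword is $D_7D_6D_5D_4P_3D_3D_2D_1P_2D_0P_1P_0$, where the 8-bit data word is the vector $D$ and the 4-bit check word is the vector $P$, the syndrome vector $S$ can be generated from the data word and check word as [11]:

$$(S_3\,S_2\,S_1\,S_0)^{T} = H\,(D_7\,D_6\,D_5\,D_4\,P_3\,D_3\,D_2\,D_1\,P_2\,D_0\,P_1\,P_0)^{T} \tag{3}$$

which means

$$\begin{aligned}
S_0 &= P_0 \oplus D_0 \oplus D_1 \oplus D_3 \oplus D_4 \oplus D_6,\\
S_1 &= P_1 \oplus D_0 \oplus D_2 \oplus D_3 \oplus D_5 \oplus D_6,\\
S_2 &= P_2 \oplus D_1 \oplus D_2 \oplus D_3 \oplus D_7,\\
S_3 &= P_3 \oplus D_4 \oplus D_5 \oplus D_6 \oplus D_7,
\end{aligned} \tag{4}$$

and the corresponding (12,8) parity matrix, with rows ordered $S_3, S_2, S_1, S_0$ and columns ordered as the codeword bits, is

$$H = \begin{pmatrix}
1&1&1&1&1&0&0&0&0&0&0&0\\
1&0&0&0&0&1&1&1&1&0&0&0\\
0&1&1&0&0&1&1&0&0&1&1&0\\
0&1&0&1&0&1&0&1&0&1&0&1
\end{pmatrix}. \tag{5}$$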

Fig. 4. Functional block diagram of the SRAMs with ECC module.

Fig. 5. Basic memory architecture for an ECC module utilizing Hamming code.

When an 8-bit data word is written to the SRAM, the ECC module generates a 4-bit check word to compose the 12-bit codeword, which is stored in the memory cells. After irradiation, when the data word is read out from the memory cells through the ECC module, the module generates a syndrome vector $S=(S_3S_2S_1S_0)$ from the codeword.
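The write and read paths can be sketched as below. The sketch assumes the standard positional Hamming layout implied by the codeword order $D_7D_6D_5D_4P_3D_3D_2D_1P_2D_0P_1P_0$, that is, check bits $P_0$ to $P_3$ at positions 1, 2, 4 and 8 counted from the right; the function names are illustrative and the actual circuit of the devices may differ.

```python
# Bit positions 1..12, counted from the right end of the codeword (assumed).
DATA_POS = [3, 5, 6, 7, 9, 10, 11, 12]   # D0..D7
CHECK_POS = [1, 2, 4, 8]                  # P0..P3

def encode(data):
    """Build the 12-bit codeword as a list indexed by position (index 0 unused)."""
    cw = [0] * 13
    for k, pos in enumerate(DATA_POS):
        cw[pos] = (data >> k) & 1
    for c in CHECK_POS:
        # check bit = parity of the data bits whose position contains bit c
        parity = 0
        for pos in DATA_POS:
            if pos & c:
                parity ^= cw[pos]
        cw[c] = parity
    return cw

def syndrome(cw):
    """XOR of the positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, 13):
        if cw[pos]:
            s ^= pos
    return s

def read(cw):
    """ECC read: correct a copy of the codeword, leave the stored one untouched."""
    out = list(cw)
    s = syndrome(out)
    if 1 <= s <= 12:
        out[s] ^= 1
    data = 0
    for k, pos in enumerate(DATA_POS):
        data |= out[pos] << k
    return data, s

cw = encode(0xFF)
cw[7] ^= 1                # a single upset in D3 (position 7)
data, s = read(cw)
print(hex(data), s)       # data is corrected; s points to position 7
```

After the read, `cw[7]` is still flipped, which models the behaviour described in the static test: the read data are corrected but the stored upset remains until the next write.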

In Eq. (5), each column vector of the parity matrix corresponds to the position of one bit ($D_u$, $u=0,1,\ldots,7$, or $P_v$, $v=0,1,2,3$) in the codeword; 0 means that the bit does not participate in forming $S_k$ ($k=0,1,2,3$), and 1 means that it does. How, then, does a 2-bit change in the codeword generate $S \neq (0000)$, and how does $S$ point to an error in Table 2? The failure modes can be found as follows:

1. If neither of the 2 upset bits participates in $S_k$, then $S_k = 0 \oplus 0 = 0$, pointing to "no error".

2. If both of the 2 upset bits participate in $S_k$, then $S_k = 1 \oplus 1 = 0$, pointing to "no error".

3. If only one upset bit participates in $S_k$, then $S_k = 1 \oplus 0 = 1$ or $S_k = 0 \oplus 1 = 1$; the value of the corresponding $S_k$ is always 1, so the ECC module detects an "error" and performs a "correction".

Consequently, the value of $S_k$ depends on whether each of the 2 upset bits participates in $S_k$, the relationship being an XOR of the two participation flags.

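For example, if the 2-bit upset is in $D_3$ and $P_0$, the value of $S_0$ is unaffected (both bits participate in it), as is $S_3$ (neither participates in it). However, $S_1 = P_1 \oplus D_0 \oplus D_2 \oplus D_3' \oplus D_5 \oplus D_6$ and $S_2 = P_2 \oplus D_1 \oplus D_2 \oplus D_3' \oplus D_7$ each contain the upset bit $D_3'$ and become 1, resulting in $S=(S_3S_2S_1S_0)=(0110)$, which can be understood simply as the XOR of the columns of $D_3$ and $P_0$ in the parity matrix:

$$S = (0111) \oplus (0001) = (0110). \tag{6}$$

Eq. (6) points to an "error" position at $D_2$ according to Table 2, so the ECC module changes the correct value of $D_2$ to a wrong one, while the actually upset bit $D_3$ is read out as "correct" data, leading to a 2-bit error in $D_3D_2$. In other words, the data written are "FF" and the data read out are "F3", which is detected as an error.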
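The $D_3P_0$ example can be reproduced in a few lines. Because a Hamming code is linear, the syndrome depends only on which bits are flipped, so it equals the XOR of the flipped positions. The sketch assumes the positional layout given by the codeword order (positions 1 to 12 from the right) and that the ECC flips the bit the syndrome points to; the names are illustrative.

```python
DATA_POS = [3, 5, 6, 7, 9, 10, 11, 12]    # positions of D0..D7 (assumed layout)
NAMES = {1: "P0", 2: "P1", 3: "D0", 4: "P2", 5: "D1", 6: "D2",
         7: "D3", 8: "P3", 9: "D4", 10: "D5", 11: "D6", 12: "D7"}

def syndrome_of_upsets(positions):
    """Syndrome of a valid codeword with the given bit positions flipped."""
    s = 0
    for pos in positions:
        s ^= pos
    return s

def data_after_read(written, upset_positions):
    """Return (syndrome, data read out) after the ECC acts on the upset word."""
    flipped = set(upset_positions)
    s = syndrome_of_upsets(upset_positions)
    if 1 <= s <= 12:
        flipped ^= {s}                    # the ECC "corrects" the bit S points to
    data = written
    for k, pos in enumerate(DATA_POS):
        if pos in flipped:
            data ^= 1 << k
    return s, data

s, data = data_after_read(0xFF, [7, 1])  # upsets in D3 (pos 7) and P0 (pos 1)
print(format(s, "04b"), NAMES[s], hex(data))
```

The syndrome 0110 points to $D_2$, and the word written as FF is read out as F3, in agreement with the example in the text.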

TABLE 3. Failure modes of the ECC module for a 2-bit upset with both upsets occurring in the check word

3. Analysis results

Extracting two column vectors from the parity matrix of Eq. (5), the total number of error types is $C_{12}^{2}=66$. Tables 3–5 list the details of the failure modes and error types.

1. When both upset bits are in the check word (Table 3): in this case the ECC module performs a wrong operation; the number of error types is $C_{4}^{2}=6$, and every failure mode is 1-bit.

2. When one upset bit is in the check word and one is in the data word (Table 4): in this case the ECC module performs a wrong operation; the number of error types is $C_{4}^{1}C_{8}^{1}=32$, of which 20 are 1-bit and 12 are 2-bit, so the failure modes include 1-bit and 2-bit errors.

3. When both upset bits are in the data word (Table 5): in this case the ECC module performs a wrong operation; the number of error types is $C_{8}^{2}=28$, of which 13 are 2-bit and 15 are 3-bit, so the failure modes include 2-bit and 3-bit errors.

TABLE 4. Failure modes of the ECC module with 1 upset bit in the check word and 1 upset bit in the data word

Therefore, the total number of 1-bit error types is 6+20=26, with a probability among all error types of 26/66=39.39%; the total number of 2-bit error types is 12+13=25, with a probability of 25/66=37.88%; and the total number of 3-bit error types is 15, with a probability of 15/66=22.73%. Table 6 shows that the theoretical probabilities of the failure modes involving 1-, 2- and 3-bit errors agree well with the experimental results.
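The counts above can be reproduced by enumerating all 66 pairs of upset positions. The sketch assumes the positional layout given by the codeword order (check bits at positions 1, 2, 4, 8), that a syndrome of 13 to 15 points to no bit and leaves the word uncorrected, and that an "error" is counted only over the 8 data bits read out. These assumptions are not stated explicitly in the text, but they reproduce its numbers.

```python
from collections import Counter
from itertools import combinations

DATA_POS = {3, 5, 6, 7, 9, 10, 11, 12}   # D0..D7 (assumed layout)
CHECK_POS = {1, 2, 4, 8}                  # P0..P3

def data_errors(i, j):
    """Number of wrong data bits read out after upsets at positions i and j."""
    flipped = {i, j}
    s = i ^ j                             # syndrome of a double upset
    if 1 <= s <= 12:
        flipped ^= {s}                    # miscorrection of a third bit
    return len(flipped & DATA_POS)

total = Counter()
by_case = {0: Counter(), 1: Counter(), 2: Counter()}  # key: upsets in check word
for i, j in combinations(range(1, 13), 2):
    n_err = data_errors(i, j)
    total[n_err] += 1
    by_case[len({i, j} & CHECK_POS)][n_err] += 1

for n_err in sorted(total):
    print(f"{n_err}-bit: {total[n_err]}/66 = {100 * total[n_err] / 66:.2f}%")
```

Since each of the 66 pairs is weighted equally, the percentages rest on the assumption that every bit of the codeword is equally likely to be upset.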

Therefore, the underlying cause of the failure modes of the ECC module in this experiment is the failure of the (12,8) Hamming code when 2 bits are upset in one codeword.

IV. CONCLUSION

The results show both the effectiveness and the ineffectiveness of an ECC module utilizing the (12,8) Hamming code in 65 nm process node SRAMs. The ECC module is clearly effective in hardening the advanced process node SRAMs. The failure modes involving 1-, 2- and 3-bit errors in a data word have been analyzed, and their underlying cause is the failure of the (12,8) Hamming code when 2 bits are upset in one codeword. The measured distribution of bits per upset event agrees well with the theoretical calculation.

TABLE 5. Failure modes of the ECC module when both upset bits occur in the data word

TABLE 6. The measured and calculated probabilities of the failure modes involving 1-, 2- and 3-bit errors

Several mitigation approaches are available if much higher reliability is required. Periodic memory scrubbing is often used to improve the performance of the device, and a scrubbing operation will be conducted on the SRAMs exposed to heavy ions in our lab, so as to observe the relationship between the scrub rate and the bit error rate (BER). If more redundancy is acceptable, the triple-bit-correcting Golay code or Triple Modular Redundancy (TMR) may be employed.

The research on 65 nm SRAMs may provide a reference for manufacturers in their choice of hardening model and algorithm, and for users in their selection of device application environments and methods.

ACKNOWLEDGMENTS

The authors thank LIU Xin and ZHAO Fa-Zhan of the Institute of Microelectronics, Chinese Academy of Sciences, for discussions on the failure modes of Hamming code, and the staff of the HIRFL accelerator for their help with the experiment.

[1] Lawrence R K and Kelly A T. IEEE Trans Nucl Sci, 2008, 55: 3367–3374.

[2] Heidel D F, Marshall P W, Pellish J A, et al. IEEE Trans Nucl Sci, 2009, 56: 3499–3504.

[3] Schrimpf R D, Weller R A, Mendenhall M H, et al. Nucl Instrum Meth B, 2007, 261: 1133–1136.

[4] Liu J, Duan J L, Hou M D, et al. Nucl Instrum Meth B, 2006, 245: 342–345.

[5] Bajura M A, Boulghassoul Y, Naseer R, et al. IEEE Trans Nucl Sci, 2007, 54: 935–945.

[6] Radaelli D, Puchner H, Wong S, et al. IEEE Trans Nucl Sci, 2005, 52: 2433–2437.

[7] Maestro J A and Reviriego P. Study of the Effects of MBUs on the Reliability of a 150 nm SRAM Device. DAC'08 Proceedings of the 45th Annual Design Automation Conference, California, USA, June 8–13, 2008, p. 930–935.

[8] Measurement and Reporting of Alpha Particle and Terrestrial Cosmic Ray-Induced Soft Errors in Semiconductor Devices, JESD89A, 2006, p. 10.

[9] Palomo F R, Morilla Y, Mogollón J M, et al. Nucl Instrum Meth B, 2011, 269: 2210–2216.

[10] Nicolaidis M. Soft errors in modern electronic systems. Germany: Springer, 2011, p. 207.

[11] Tam S. Single error correction and double error detection. Xilinx, XAPP645 (v2.2), 2006.

DOI: 10.13538/j.1001-8042/nst.25.010405

(Received July 23, 2013; accepted in revised form October 8, 2013; published online February 20, 2014)

∗ Supported by the National Natural Science Foundation of China (Nos. 11079045 and 11179003) and the Important Direction Project of the CAS Knowledge Innovation Program (No. KJCX2-YW-N27)

† Corresponding author, suhong@impcas.ac.cn
