CN102681930A - Chip-level error recording method - Google Patents

Info

Publication number
CN102681930A
Authority
CN
China
Prior art keywords
error
register
mistake
local
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012101492112A
Other languages
Chinese (zh)
Other versions
CN102681930B (en)
Inventor
乔英良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN201210149211.2A
Publication of CN102681930A
Application granted
Publication of CN102681930B
Legal status: Active
Anticipated expiration

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention provides a chip-level error recording method. A chip designed with this method uses a hierarchically organized set of error registers to record errors, and reports them to the external system by sending interrupts to other components in the system, asserting an error pin, and similar means. When recording errors, the chip can distinguish and record errors of different severity levels, and can be configured to enable or mask the recording of particular error types as needed. The local and global error register sets used for recording are organized hierarchically: each local error register set records the errors of one specific component inside the chip, while the global error register set summarizes the error records of all local error register sets and reports them to the external system.

Description

A chip-level error recording method
Technical field
The present invention relates to the field of computer chip design, and specifically to a chip-level error recording method.
Background art
With the rapid development of applications such as scientific computing, commercial services, and e-government, users place ever higher demands on the performance, capacity, density, availability, and security of ICT infrastructure such as servers, storage, and network equipment. Chips are the basic building blocks of this infrastructure, so their importance is self-evident. During chip design, development, and tape-out, chip yield is the main factor affecting the development cycle, design cost, and production cost. Repeated tape-outs lengthen the design, verification, and test cycles, and the high cost of each tape-out adds to the cost of the whole development and production process. A prototype system should therefore be built quickly in the early stage of chip logic design to carry out chip-level and system-level verification and testing, so that the first tape-out succeeds. This calls for an efficient and flexible chip-level error recording method for locating chip-level problems during testing and verification.
Summary of the invention
The object of the present invention is to provide a chip-level error recording method.
The object of the invention is achieved as follows. The local error register sets and the global error register set record errors in a hierarchical structure: a local error register set records the errors of a specific component inside the chip; the global error register set summarizes the error records of all local error register sets and reports them to the external system;
The local error register set comprises 1) a local error status register, 2) a local error control register, 3) a local error severity register, 4) a local first-error log register, and 5) a local subsequent-error log register, wherein:
1) the local error status register identifies each type of error that occurs in the corresponding component; each error type is represented by 1 bit, and when an error of a given type occurs, the corresponding bit in the register is set to 1;
2) the local error control register controls whether errors of a given type produced by the corresponding component are recorded; its bits correspond one-to-one to the bits of the local error status register; if a control bit in the local error control register is set, the corresponding detected error is masked and is neither recorded nor processed;
3) the local error severity register provides a mechanism for mapping an error to one of several severity levels; when an error occurs, it is reported according to the error type-to-severity mapping defined in this register; suppose the following three severity levels must be supported: (1) correctable errors, from which the system can recover without loss of information and without software involvement, including link CRC errors, which can be corrected by link-layer retransmission; (2) recoverable errors, which cannot be corrected by hardware mechanisms and must be recovered by upper-layer software; (3) fatal errors, which may make specific transactions unreliable while the system can still run normally, including ECC-uncorrectable errors that affect only the data portion of a transaction, errors that cannot be corrected or recovered by hardware or software, and errors that may require a system reset to return to a reliable state, including cache multi-bit tag errors and permanent PCI-E link failures;
the severity level of each error type is represented by two bits: 00b denotes a correctable error, 01b a recoverable error, and 10b a fatal error, while 11b is reserved;
4) the local first-error log register records information captured when a given error of the corresponding component is detected for the first time, including the message content and the error address;
5) the local subsequent-error log register records information about occurrences of a given error of the corresponding component after the first, including the error count;
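The interaction between the local status and control registers can be sketched as follows. This is a minimal software model, not the chip logic itself; the error-type names and bit positions are hypothetical, since the text only states that each error type occupies one bit and that a set control bit masks the corresponding error.

```python
# Hypothetical error types for one component; the patent does not name them.
ERR_LINK_CRC = 0
ERR_ECC_UNCORRECTABLE = 1
ERR_CACHE_TAG = 2


class LocalErrorRegs:
    """Status and control registers of one component's local register set."""

    def __init__(self):
        self.status = 0   # bit i becomes 1 once error type i has been recorded
        self.control = 0  # bit i set to 1 masks error type i

    def mask(self, err_type):
        """Set the control bit so that err_type is no longer recorded."""
        self.control |= 1 << err_type

    def report(self, err_type):
        """Record err_type unless it is masked; return True if recorded."""
        if self.control & (1 << err_type):
            return False  # masked: neither recorded nor processed
        self.status |= 1 << err_type
        return True


regs = LocalErrorRegs()
regs.mask(ERR_ECC_UNCORRECTABLE)
regs.report(ERR_LINK_CRC)            # recorded
regs.report(ERR_ECC_UNCORRECTABLE)   # masked, ignored
print(f"{regs.status:03b}")          # prints 001
```

Keeping the control bits aligned one-to-one with the status bits means a single AND with the same bit mask decides whether an error is recorded.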
The global error register set comprises 1) a global error status register, 2) a global error control register, 3) a global first-error log register, 4) a global subsequent-error log register, 5) a system event status register, and 6) a system event control register, wherein:
1) the global error status register identifies whether an error has occurred in each component of the chip; the error state of each component is represented by 1 bit, and when a component produces an error, the corresponding bit in the register is set to 1;
2) the global error control register controls whether errors produced by a given component are recorded; its bits correspond one-to-one to the bits of the global error status register; if a control bit in the global error control register is set, detected errors of the corresponding component are masked and are neither recorded nor processed;
3) the global first-error log register and 4) the global subsequent-error log register record, respectively, the on-site information captured when each component produces its first error and when it produces subsequent errors;
5) the system event status register records the severity level of the errors produced by each component of the chip;
6) the system event control register defines the mapping from severity level to reporting method; it configures how an error of a given severity level is reported to other components of the system, including sending an interrupt and asserting the error pin;
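How the global status register summarizes the local sets can be illustrated with a small model. The component names and their bit positions below are assumptions made for the example; the text only specifies one status bit and one mask bit per component.

```python
# Hypothetical component-to-bit assignment; not taken from the patent.
COMPONENTS = {"pcie": 0, "mem_ctrl": 1, "cache": 2}


def summarize(local_status, global_control):
    """Build the global status value from per-component local status values.

    local_status maps a component name to its local error status register.
    A component's global bit is set when it has any recorded error and its
    bit in the global control register is not set (not masked).
    """
    global_status = 0
    for name, bit in COMPONENTS.items():
        has_error = local_status.get(name, 0) != 0
        masked = bool(global_control & (1 << bit))
        if has_error and not masked:
            global_status |= 1 << bit
    return global_status


# pcie and cache both have errors, but cache is masked at the global level.
result = summarize({"pcie": 0b001, "cache": 0b100}, global_control=0b100)
print(bin(result))  # prints 0b1
```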
The specific steps are as follows:
1) a component produces an error of a certain type;
2) determine whether this error type is masked in the "local error control register"; if it is masked, do not record the error and end; otherwise, set the corresponding bit in the "local error status register";
3) determine whether this is the first occurrence of this error type; if so, update the "local first-error log register"; otherwise, update the "local subsequent-error log register";
4) report the error to the global level according to the configuration in the "local error severity register";
5) determine whether errors from this component are masked in the "global error control register"; if they are masked, do not record the error and end; otherwise, set the corresponding bit in the "global error status register";
6) determine whether this is the first error from this component; if so, update the "global first-error log register"; otherwise, update the "global subsequent-error log register";
7) update the corresponding error severity level in the "system event status register";
8) report the error to the external system by sending an interrupt, asserting the error pin, or similar means, according to the severity-to-reporting-method mapping configured in the "system event control register". This completes the error recording and reporting process.
The beneficial effects of the invention are as follows: this hierarchically organized, flexibly configurable chip-level error recording method remedies the low efficiency and poor flexibility of traditional chip error recording methods. It can shorten the chip design, verification, and test cycles and help the first tape-out succeed, and it therefore has broad development prospects and high technical value.
Description of drawings
Figure 1 is a schematic diagram of the hierarchically organized error register sets;
Figure 2 is a flow chart of chip-level error recording.
Embodiment
With reference to the accompanying drawings, the implementation of the chip-level error recording method described in the invention is illustrated below with a specific example.
The present invention proposes a chip-level error recording method. A chip designed with this method uses a hierarchically organized set of error registers to record errors, and reports them to the external system by sending interrupts to other components in the system, asserting an error pin, and similar means. When recording errors, it can distinguish and record errors of different severity levels, and it can be configured to enable or mask the recording of particular error types as needed.
The local and global error register sets used for recording errors in the present invention are organized hierarchically: a local error register set records the errors of a specific component inside the chip; the global error register set summarizes the error records of all local error register sets and reports them to the external system, as shown in Figure 1:
The local error register set comprises a local error status register, a local error control register, a local error severity register, a local first-error log register, and a local subsequent-error log register, wherein:
The local error status register identifies each type of error that occurs in the corresponding component; each error type is represented by 1 bit. When an error of a given type occurs, the corresponding bit in the register is set to 1.
The local error control register controls whether errors of a given type produced by the corresponding component are recorded; its bits correspond one-to-one to the bits of the local error status register. If a control bit in the local error control register is set, the corresponding detected error is masked and is neither recorded nor processed.
The local error severity register provides a mechanism for mapping an error to one of several severity levels; when an error occurs, it is reported according to the error type-to-severity mapping defined in this register. Suppose the following three severity levels must be supported: correctable errors (the system can recover without loss of information and without software involvement, e.g. a link CRC error, which can be corrected by link-layer retransmission); recoverable errors (errors that cannot be corrected by hardware mechanisms and must be recovered by upper-layer software; they may make specific transactions unreliable, but the system can still run normally, e.g. an ECC-uncorrectable error that affects only the data portion of a transaction); and fatal errors (errors that cannot be corrected or recovered by hardware or software and may require a system reset to return to a reliable state, e.g. a cache multi-bit tag error or a permanent PCI-E link failure). The severity level of each error type then needs two bits; for example, 00b can denote a correctable error, 01b a recoverable error, and 10b a fatal error, with 11b reserved.
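The two-bit severity encoding lends itself to a packed register with one field per error type. The sketch below assumes that the field for error type i occupies bits 2i+1 and 2i; the text fixes the encoding values (00b, 01b, 10b, 11b reserved) but not the field layout, so that layout and the helper names are illustrative only.

```python
CORRECTABLE, RECOVERABLE, FATAL = 0b00, 0b01, 0b10  # 0b11 is reserved


def set_severity(reg, err_type, level):
    """Return reg with the 2-bit severity field of err_type set to level."""
    if level not in (CORRECTABLE, RECOVERABLE, FATAL):
        raise ValueError("0b11 is reserved")
    shift = 2 * err_type  # assumed layout: field i sits at bits 2i+1..2i
    return (reg & ~(0b11 << shift)) | (level << shift)


def get_severity(reg, err_type):
    """Read the 2-bit severity field of err_type."""
    return (reg >> (2 * err_type)) & 0b11


reg = 0
reg = set_severity(reg, 0, CORRECTABLE)  # e.g. a link CRC error
reg = set_severity(reg, 1, RECOVERABLE)  # e.g. an ECC-uncorrectable data error
reg = set_severity(reg, 2, FATAL)        # e.g. a cache multi-bit tag error
print(f"{reg:06b}")  # prints 100100
```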
The local first-error log register records information captured when a given error of the corresponding component is detected for the first time, such as the message content and the error address.
The local subsequent-error log register records information about occurrences of a given error of the corresponding component after the first, such as the error count.
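The split between the first-error log and the subsequent-error log can be modelled as below. The field names are hypothetical; the text names only message content and error address for the first occurrence and an error count for later ones.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FirstErrorLog:
    message: Optional[bytes] = None  # message content at first detection
    address: Optional[int] = None    # error address at first detection


@dataclass
class ErrorLogs:
    first: Optional[FirstErrorLog] = None  # filled exactly once
    subsequent_count: int = 0              # occurrences after the first

    def record(self, message, address):
        """Capture full detail for the first error, only count later ones."""
        if self.first is None:
            self.first = FirstErrorLog(message, address)
        else:
            self.subsequent_count += 1


logs = ErrorLogs()
logs.record(b"\x01", 0x1000)
logs.record(b"\x02", 0x2000)
logs.record(b"\x03", 0x3000)
print(hex(logs.first.address), logs.subsequent_count)  # prints 0x1000 2
```

Preserving the first occurrence untouched keeps the details of the original fault available even when many follow-on errors arrive afterwards.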
The global error register set comprises a global error status register, a global error control register, a global first-error log register, a global subsequent-error log register, a system event status register, and a system event control register, wherein:
The global error status register identifies whether an error has occurred in each component of the chip; the error state of each component is represented by 1 bit. When a component produces an error, the corresponding bit in the register is set to 1.
The global error control register controls whether errors produced by a given component are recorded; its bits correspond one-to-one to the bits of the global error status register. If a control bit in the global error control register is set, detected errors of the corresponding component are masked and are neither recorded nor processed.
The global first-error log register and the global subsequent-error log register record, respectively, the on-site information captured when each component produces its first error and when it produces subsequent errors.
The system event status register records the severity level of the errors produced by each component of the chip.
The system event control register defines the mapping from severity level to reporting method; it configures how an error of a given severity level is reported to other components of the system, for example by sending an interrupt or asserting the error pin.
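The severity-to-reporting-method mapping held in the system event control register might be modelled as a small lookup table. The particular routing chosen below (nothing for correctable, interrupt for recoverable, interrupt plus error pin for fatal) is an assumed configuration for illustration; the text leaves the mapping configurable.

```python
from enum import Flag, IntEnum


class Severity(IntEnum):
    CORRECTABLE = 0b00
    RECOVERABLE = 0b01
    FATAL = 0b10


class Report(Flag):
    NONE = 0
    INTERRUPT = 1
    ERROR_PIN = 2


# Hypothetical contents of the system event control register.
system_event_control = {
    Severity.CORRECTABLE: Report.NONE,
    Severity.RECOVERABLE: Report.INTERRUPT,
    Severity.FATAL: Report.INTERRUPT | Report.ERROR_PIN,
}


def actions_for(severity):
    """List the reporting methods configured for a severity level."""
    route = system_event_control[severity]
    actions = []
    if Report.INTERRUPT in route:
        actions.append("interrupt")
    if Report.ERROR_PIN in route:
        actions.append("error_pin")
    return actions


print(actions_for(Severity.FATAL))  # prints ['interrupt', 'error_pin']
```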
As described in the summary of the invention, the error recording process of the present invention is shown in detail in Figure 2. The specific steps are as follows:
1) a component produces an error of a certain type;
2) determine whether this error type is masked in the "local error control register"; if it is masked, do not record the error and end; otherwise, set the corresponding bit in the "local error status register";
3) determine whether this is the first occurrence of this error type; if so, update the "local first-error log register"; otherwise, update the "local subsequent-error log register";
4) report the error to the global level according to the configuration in the "local error severity register";
5) determine whether errors from this component are masked in the "global error control register"; if they are masked, do not record the error and end; otherwise, set the corresponding bit in the "global error status register";
6) determine whether this is the first error from this component; if so, update the "global first-error log register"; otherwise, update the "global subsequent-error log register";
7) update the corresponding error severity level in the "system event status register";
8) report the error to the external system by sending an interrupt, asserting the error pin, or similar means, according to the severity-to-reporting-method mapping configured in the "system event control register". This completes the error recording and reporting process.
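The eight steps above can be sketched end to end in software. This is an illustrative model under several assumptions that the text does not settle: "first occurrence" is detected by checking the status bit before it is set, log registers are modelled as dictionaries, severity fields are packed two bits per error type starting at bit 0, and reporting methods are plain strings.

```python
from dataclasses import dataclass, field

CORRECTABLE, RECOVERABLE, FATAL = 0b00, 0b01, 0b10


@dataclass
class LocalSet:
    status: int = 0
    control: int = 0
    severity: int = 0  # assumed layout: 2 bits per error type
    first_log: dict = field(default_factory=dict)    # err_type -> info
    later_count: dict = field(default_factory=dict)  # err_type -> count


@dataclass
class GlobalSet:
    status: int = 0
    control: int = 0
    first_log: dict = field(default_factory=dict)      # component -> info
    later_count: dict = field(default_factory=dict)    # component -> count
    event_status: dict = field(default_factory=dict)   # component -> severity
    event_control: dict = field(default_factory=dict)  # severity -> methods


def handle_error(local, glob, component, err_type, info):
    """Run steps 2 to 8 for one error; return the reporting methods used."""
    # Step 2: local mask check, then set the local status bit.
    if local.control & (1 << err_type):
        return []
    first_local = not (local.status & (1 << err_type))
    local.status |= 1 << err_type
    # Step 3: first-error log or subsequent-error log.
    if first_local:
        local.first_log[err_type] = info
    else:
        local.later_count[err_type] = local.later_count.get(err_type, 0) + 1
    # Step 4: look up the severity to pass to the global level.
    level = (local.severity >> (2 * err_type)) & 0b11
    # Step 5: global mask check, then set the global status bit.
    if glob.control & (1 << component):
        return []
    first_global = not (glob.status & (1 << component))
    glob.status |= 1 << component
    # Step 6: global first-error or subsequent-error log.
    if first_global:
        glob.first_log[component] = (err_type, info)
    else:
        glob.later_count[component] = glob.later_count.get(component, 0) + 1
    # Step 7: record the severity for this component.
    glob.event_status[component] = level
    # Step 8: report according to the configured mapping.
    return list(glob.event_control.get(level, []))


local = LocalSet(severity=FATAL << 2)  # error type 1 configured as fatal
glob = GlobalSet(event_control={FATAL: ["interrupt", "error_pin"]})
first = handle_error(local, glob, component=0, err_type=1, info={"addr": 0x1000})
print(first)  # prints ['interrupt', 'error_pin']
```

Note that a locally masked error never reaches the global level, while a globally masked component still keeps its local record.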
Apart from the technical features described in this specification, everything else is known to those skilled in the art.

Claims (1)

1. A chip-level error recording method, characterized in that the local error register sets and the global error register set record errors in a hierarchical structure: a local error register set records the errors of a specific component inside the chip; the global error register set summarizes the error records of all local error register sets and reports them to the external system;
the local error register set comprises 1) a local error status register, 2) a local error control register, 3) a local error severity register, 4) a local first-error log register, and 5) a local subsequent-error log register, wherein:
1) the local error status register identifies each type of error that occurs in the corresponding component; each error type is represented by 1 bit, and when an error of a given type occurs, the corresponding bit in the register is set to 1;
2) the local error control register controls whether errors of a given type produced by the corresponding component are recorded; its bits correspond one-to-one to the bits of the local error status register; if a control bit in the local error control register is set, the corresponding detected error is masked and is neither recorded nor processed;
3) the local error severity register provides a mechanism for mapping an error to one of several severity levels; when an error occurs, it is reported according to the error type-to-severity mapping defined in this register; suppose the following three severity levels must be supported: (1) correctable errors, from which the system can recover without loss of information and without software involvement, including link CRC errors, which can be corrected by link-layer retransmission; (2) recoverable errors, which cannot be corrected by hardware mechanisms and must be recovered by upper-layer software; (3) fatal errors, which may make specific transactions unreliable while the system can still run normally, including ECC-uncorrectable errors that affect only the data portion of a transaction, errors that cannot be corrected or recovered by hardware or software, and errors that may require a system reset to return to a reliable state, including cache multi-bit tag errors and permanent PCI-E link failures;
the severity level of each error type is represented by two bits: 00b denotes a correctable error, 01b a recoverable error, and 10b a fatal error, while 11b is reserved;
4) the local first-error log register records information captured when a given error of the corresponding component is detected for the first time, including the message content and the error address;
5) the local subsequent-error log register records information about occurrences of a given error of the corresponding component after the first, including the error count;
the global error register set comprises 1) a global error status register, 2) a global error control register, 3) a global first-error log register, 4) a global subsequent-error log register, 5) a system event status register, and 6) a system event control register, wherein:
1) the global error status register identifies whether an error has occurred in each component of the chip; the error state of each component is represented by 1 bit, and when a component produces an error, the corresponding bit in the register is set to 1;
2) the global error control register controls whether errors produced by a given component are recorded; its bits correspond one-to-one to the bits of the global error status register; if a control bit in the global error control register is set, detected errors of the corresponding component are masked and are neither recorded nor processed;
3) the global first-error log register and 4) the global subsequent-error log register record, respectively, the on-site information captured when each component produces its first error and when it produces subsequent errors;
5) the system event status register records the severity level of the errors produced by each component of the chip;
6) the system event control register defines the mapping from severity level to reporting method; it configures how an error of a given severity level is reported to other components of the system, including sending an interrupt and asserting the error pin;
the specific steps are as follows:
1) a component produces an error of a certain type;
2) determine whether this error type is masked in the "local error control register"; if it is masked, do not record the error and end; otherwise, set the corresponding bit in the "local error status register";
3) determine whether this is the first occurrence of this error type; if so, update the "local first-error log register"; otherwise, update the "local subsequent-error log register";
4) report the error to the global level according to the configuration in the "local error severity register";
5) determine whether errors from this component are masked in the "global error control register"; if they are masked, do not record the error and end; otherwise, set the corresponding bit in the "global error status register";
6) determine whether this is the first error from this component; if so, update the "global first-error log register"; otherwise, update the "global subsequent-error log register";
7) update the corresponding error severity level in the "system event status register";
8) report the error to the external system by sending an interrupt, asserting the error pin, or similar means, according to the severity-to-reporting-method mapping configured in the "system event control register"; this completes the error recording and reporting process.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210149211.2A CN102681930B (en) 2012-05-15 2012-05-15 A kind of chip-scale error logging method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210149211.2A CN102681930B (en) 2012-05-15 2012-05-15 A kind of chip-scale error logging method

Publications (2)

Publication Number Publication Date
CN102681930A true CN102681930A (en) 2012-09-19
CN102681930B CN102681930B (en) 2016-08-17

Family

ID=46813895

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210149211.2A Active CN102681930B (en) 2012-05-15 2012-05-15 A kind of chip-scale error logging method

Country Status (1)

Country Link
CN (1) CN102681930B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104133751A (en) * 2014-08-06 2014-11-05 浪潮(北京)电子信息产业有限公司 Chip debugging method and chip
CN104407952A (en) * 2014-11-12 2015-03-11 浪潮(北京)电子信息产业有限公司 Method and system for debugging through multi-CPU (central processing unit) node controller chip
WO2015196941A1 (en) * 2014-06-24 2015-12-30 华为技术有限公司 Method for processing error catalogs of node in cc-numa system and node
CN110399317A (en) * 2019-07-15 2019-11-01 西安微电子技术研究所 A kind of multifunctional controller that the software of embedded system is adaptive
CN113832663A (en) * 2021-09-18 2021-12-24 珠海格力电器股份有限公司 Control chip fault recording method and device and control chip fault reading method
US11385952B2 (en) * 2018-11-28 2022-07-12 Intel Corporation Apparatus and method for scalable error detection and reporting

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050219886A1 (en) * 2003-11-06 2005-10-06 Kyoji Marumoto Memory device with built-in test function and method for controlling the same
CN101599812A (en) * 2008-06-04 2009-12-09 富士通株式会社 Data transmission set
CN102073533A (en) * 2011-01-14 2011-05-25 中国人民解放军国防科学技术大学 Multicore architecture supporting dynamic binary translation

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015196941A1 (en) * 2014-06-24 2015-12-30 华为技术有限公司 Method for processing error catalogs of node in cc-numa system and node
US9652407B2 (en) 2014-06-24 2017-05-16 Huawei Technologies Co., Ltd. Method for processing error directory of node in CC-NUMA system, and node
CN104133751A (en) * 2014-08-06 2014-11-05 浪潮(北京)电子信息产业有限公司 Chip debugging method and chip
CN104407952A (en) * 2014-11-12 2015-03-11 浪潮(北京)电子信息产业有限公司 Method and system for debugging through multi-CPU (central processing unit) node controller chip
US11385952B2 (en) * 2018-11-28 2022-07-12 Intel Corporation Apparatus and method for scalable error detection and reporting
US11704181B2 (en) 2018-11-28 2023-07-18 Intel Corporation Apparatus and method for scalable error detection and reporting
CN110399317A (en) * 2019-07-15 2019-11-01 西安微电子技术研究所 A kind of multifunctional controller that the software of embedded system is adaptive
CN110399317B (en) * 2019-07-15 2020-12-25 西安微电子技术研究所 Software self-adaptive multifunctional controller of embedded system
CN113832663A (en) * 2021-09-18 2021-12-24 珠海格力电器股份有限公司 Control chip fault recording method and device and control chip fault reading method
CN113832663B (en) * 2021-09-18 2022-08-16 珠海格力电器股份有限公司 Control chip fault recording method and device and control chip fault reading method

Also Published As

Publication number Publication date
CN102681930B (en) 2016-08-17

Similar Documents

Publication Publication Date Title
CN102681930A (en) Chip-level error recording method
Gunawi et al. Fail-slow at scale: Evidence of hardware performance faults in large production systems
US9495233B2 (en) Error framework for a microprocesor and system
US11501800B2 (en) Hard disk fault handling method, array controller, and hard disk
CN102272731A (en) Apparatus, system, and method for predicting failures in solid-state storage
CN102216904B (en) Programmable error actions for a cache in a data processing system
CN107015890B (en) Storage device, server system having the same, and method of operating the same
CN103680639B (en) The periodicity of a kind of random access memory is from error detection restoration methods
CN105468484A (en) Method and apparatus for determining fault location in storage system
CN102521058A (en) Disk data pre-migration method of RAID (Redundant Array of Independent Disks) group
US10095570B2 (en) Programmable device, error storage system, and electronic system device
EP3054626B1 (en) Data processing method and device for storage unit
CN102915260B (en) The method that solid state hard disc is fault-tolerant and solid state hard disc thereof
CN103092728A (en) Recovery method and recovery device of abrasion errors of nonvolatile memory
CN102819480A (en) Computer and method for monitoring memory thereof
CN100429626C (en) Information processing apparatus and error detecting method
US20160132382A1 (en) Computing system with debug assert mechanism and method of operation thereof
CN203097882U (en) High-precision pressure gauge for underground data acquisition
CN107807862A (en) Detect the method, apparatus and server of hard disk failure point
CN106354580A (en) Data recovery method and device
CN102750194A (en) Large-scale integrated circuit level error recording and responding method
CN103390429B (en) The online test method of a kind of hard disk and server
CN101814046A (en) Dual redundant bus synchronizing and voting circuit based on programmable device
CN110618891B (en) Solid state disk fault online processing method and solid state disk
CN104407952A (en) Method and system for debugging through multi-CPU (central processing unit) node controller chip

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant