CN104102563A - Method and device for finding MCA (machine check architecture) errors of server system - Google Patents

Info

Publication number
CN104102563A
Authority
CN
China
Prior art keywords
mca
error message
server system
mca error
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410327862.5A
Other languages
Chinese (zh)
Inventor
白秀杨
叶丰华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Beijing Electronic Information Industry Co Ltd
Original Assignee
Inspur Beijing Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Beijing Electronic Information Industry Co Ltd
Priority to CN201410327862.5A
Publication of CN104102563A

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a method and a device for finding MCA (machine check architecture) errors of a server system. The method includes: reading and storing first MCA error information; reading second MCA error information according to a preset test period; comparing the second MCA error information with the first MCA error information; when the two are inconsistent, exiting the process and reporting an error; otherwise, continuing to read the second MCA error information and perform the comparison according to the preset test period. The method and the device avoid repeated error reporting and find and report MCA errors in time, thereby improving the stability of the server.

Description

Method and device for finding MCA errors of a server system
Technical Field
The present invention relates to the technical field of server testing, and in particular to a method and device for finding machine check architecture (MCA) errors of a server system.
Background
Machine check architecture (MCA) is an error self-detection mechanism proposed by Intel Corporation through which the central processing unit (CPU) reports hardware errors to the operating system (OS). Current mainstream Intel processors, such as the Pentium 4, the Xeon series, and the Itanium series, all support the MCA mechanism. The MCA mechanism mainly detects and reports hardware errors, such as system bus errors, memory error checking and correction (ECC) errors, parity errors, and cache errors. Inside the processor, the MCA mechanism is implemented mainly through a set of model-specific registers (MSRs).
During a server system stability test, the processor and memory are more likely to fail, which produces MCA error information describing the hardware errors that the MCA mechanism detects and reports. The Linux operating system stores the MCA error information obtained from the CPU in the mcelog file in the /var/log system log folder. MCA error information mainly comprises core (CPU CORE) errors and non-core (CPU UNCORE) errors. If the CPU encounters an error covered by the MCA mechanism, the user can find it in the mcelog file and resolve it. This prevents the system from crashing or restarting because the error was not resolved in time, and so prevents irrecoverable loss of important customer data. How to find MCA errors in time is a problem that urgently needs to be solved.
Intel Corporation's debugging tool (ITP) can be connected to the mainboard to read the CPU's register information directly and check whether it contains MCA errors. However, using ITP to find MCA errors leads to repeated error reports, and MCA errors cannot be found in time.
Summary of the Invention
To solve the above technical problem, the invention provides a method and device for finding machine check architecture (MCA) errors of a server system, which avoid repeated error reporting, find and report MCA errors in time, and considerably improve the stability of the server.
To achieve this object, the invention discloses a method for finding MCA errors of a server system, comprising the following steps:
Reading first MCA error information and saving it;
Reading second MCA error information according to a preset test period;
Comparing the second MCA error information with the first MCA error information; when the second MCA error information is inconsistent with the first MCA error information, exiting the flow and reporting an error; otherwise, continuing to perform, according to the preset test period, the step of reading the second MCA error information and comparing.
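By way of illustration only, the three steps above can be sketched in Python as follows. The function name `watch_for_new_mca_errors` and the injected `read_info`, `report`, and `sleep` callables are assumptions made for this sketch and do not appear in the application; how the error information is actually read is left to the caller.

```python
import time
from typing import Callable


def watch_for_new_mca_errors(
    read_info: Callable[[], str],
    period_seconds: float,
    report: Callable[[str], None] = print,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the MCA error information differs from the first reading.

    read_info, report and sleep are hypothetical hooks chosen for this sketch.
    """
    first_info = read_info()           # read and save the first MCA error information
    while True:
        sleep(period_seconds)          # wait one preset test period
        second_info = read_info()      # read the second MCA error information
        if second_info != first_info:  # inconsistent: the log has changed
            report("MCA error information changed")
            return second_info         # exit the flow after reporting once
        # consistent: loop and read again after the next period
```

Because the flow exits as soon as a difference is reported, the same change is not reported again on later periods, which is one plausible reading of how the method avoids repeated error reports.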
Further, before the first MCA error information is read and saved, the method also comprises: running server system stability test software.
Further, the server system runs a Linux operating system.
Further, the server system uses an Intel processor that supports the MCA mechanism.
Further, the first MCA error information and the second MCA error information are stored in the mcelog file, and the mcelog file is stored in the /var/log system log folder.
The invention also discloses a device for finding machine check architecture (MCA) errors of a server system, comprising:
A first acquisition module, configured to read first MCA error information and save it;
A second acquisition module, configured to read second MCA error information according to a preset test period;
A comparison module, configured to compare the second MCA error information with the first MCA error information;
An error reporting module, configured to stop and report an error when the second MCA error information is inconsistent with the first MCA error information;
The second acquisition module is further configured to read the second MCA error information according to the preset test period when the second MCA error information is consistent with the first MCA error information.
Further, the device also comprises:
A test module, configured to run server system stability test software before the first MCA error information is read and saved.
Further, the server system runs a Linux operating system.
Further, the server system uses an Intel processor that supports the MCA mechanism.
Further, the first MCA error information and the second MCA error information are stored in the mcelog file, and the mcelog file is stored in the /var/log system log folder.
The technical solution of the present application comprises: reading first MCA error information and saving it; reading second MCA error information according to a preset test period; comparing the second MCA error information with the first MCA error information; when the two are inconsistent, exiting the flow and reporting an error; otherwise, continuing to perform, according to the preset test period, the step of reading the second MCA error information and comparing. The technical solution of the present application avoids repeated error reporting, finds and reports MCA errors in time, and considerably improves the stability of the server.
Brief Description of the Drawings
The accompanying drawings described herein provide a further understanding of the invention and form a part of the present application. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of the method for finding MCA errors of a server system according to the invention;
Fig. 2 is a schematic structural diagram of the device for finding MCA errors of a server system according to the invention.
Detailed Description
The invention is described in detail below with reference to the drawings and specific embodiments.
Fig. 1 is a flowchart of the method for finding MCA errors of a server system according to the invention. As shown in Fig. 1, the method comprises the following steps:
Step 101: read first MCA error information and save it.
MCA error information is stored in the mcelog file, and the mcelog file is stored in the /var/log system log folder. To distinguish it from MCA error information read later, the MCA error information read when the flow executes for the first time is called the first MCA error information, and MCA error information read subsequently is called the second MCA error information.
In this step, if the first MCA error information that is read contains an MCA error, an error is reported and the first MCA error information is saved; if it contains no MCA error, the first MCA error information is simply saved.
It should be noted that, after the server system stability test software is run, the first MCA error information that is read may contain no MCA error.
Those skilled in the art know how to write a script that reads the MCA error information in the mcelog file under the /var/log system log folder, so the details are not repeated here.
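As a minimal sketch of such a script, and not the applicant's implementation, step 101 might read the file as shown below. The path `/var/log/mcelog` follows the description; the function names, the handling of a missing file, and the rule that any non-blank content counts as an MCA error are assumptions made for illustration.

```python
from pathlib import Path

# Assumed default location; the description places the mcelog file under /var/log.
MCELOG_PATH = Path("/var/log/mcelog")


def read_mca_error_info(path: Path = MCELOG_PATH) -> str:
    """Return the current contents of the mcelog file, or "" if it does not exist."""
    try:
        # errors="replace" keeps the read from failing on undecodable bytes.
        return path.read_text(errors="replace")
    except FileNotFoundError:
        # Assumption: a missing file is treated like an empty log.
        return ""


def has_mca_error(info: str) -> bool:
    """Assumption: any non-blank content means at least one MCA error was logged."""
    return bool(info.strip())
```

A caller would save the string returned by the first call as the first MCA error information and, if `has_mca_error` is true for it, report an error before continuing, as step 101 describes.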
Before this step, the method also comprises: running server system stability test software.
The first MCA error information is read shortly after the server system stability test software is started. How to run server system stability test software is well known to those skilled in the art and is not repeated here. The stability test software is run first because the processor and memory are more likely to fail during a server system stability test, which makes it easier for the subsequent steps to find errors.
Step 102: read second MCA error information according to a preset test period.
The preset test period can be determined according to actual conditions (such as how quickly errors need to be found) and the performance of the server system. For example, it can be half an hour, one hour, or two hours; it is not limited here.
Step 103: compare whether the second MCA error information is consistent with the first MCA error information. If the two are inconsistent, exit the flow and report an error; if the two are consistent, return to step 102.
Because any newly added MCA error information is recorded in the mcelog file in the system log folder, an inconsistency between the second MCA error information and the first MCA error information indicates that the system has a new MCA error.
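The comparison in step 103 can be illustrated with the following sketch. The application only requires detecting that the two readings are inconsistent; returning the newly appended text, and treating a rewritten log as entirely new, are additions assumed here for illustration, as is the function name `compare_mca_info`.

```python
from typing import Tuple


def compare_mca_info(first_info: str, second_info: str) -> Tuple[bool, str]:
    """Return (consistent, added_text) for two readings of the log."""
    if second_info == first_info:
        return True, ""
    if second_info.startswith(first_info):
        # The log only grew: the tail is the newly recorded MCA error information.
        return False, second_info[len(first_info):]
    # Assumption: the log was rotated or rewritten, so treat everything as new.
    return False, second_info
```

Reporting only the appended tail keeps the error report limited to the new MCA error rather than repeating records that were already present in the first reading.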
In the method of the present application, the server system runs a Linux operating system.
Further, the server system uses an Intel processor that supports the MCA mechanism.
Fig. 2 is a schematic structural diagram of the device for finding MCA errors of a server system according to the invention. As shown in Fig. 2, the device comprises a first acquisition module, a second acquisition module, a comparison module, and an error reporting module, wherein:
The first acquisition module is configured to read first MCA error information and save it.
If the first MCA error information that is read contains an MCA error, an error is reported and the first MCA error information is saved; if it contains no MCA error, the first MCA error information is simply saved.
The second acquisition module is configured to read second MCA error information from the mcelog file according to a preset test period.
The first MCA error information and the second MCA error information are stored in the mcelog file, and the mcelog file is stored in the /var/log system log folder. Those skilled in the art know how to write a script that reads the MCA error information in the mcelog file under the /var/log system log folder, so the details are not repeated here.
The preset test period can be determined according to actual conditions (such as how quickly errors need to be found) and the performance of the server system. For example, it can be half an hour, one hour, or two hours; it is not limited here.
The comparison module is configured to compare the second MCA error information with the first MCA error information.
The error reporting module is configured to report an error when the second MCA error information is inconsistent with the first MCA error information.
The second acquisition module is further configured to read the second MCA error information according to the preset test period when the second MCA error information is consistent with the first MCA error information.
The device also comprises a test module, configured to run server system stability test software before the first MCA error information is read and saved.
Further, the server system runs a Linux operating system.
Further, the server system uses an Intel processor that supports the MCA mechanism.
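One possible software arrangement of the modules of Fig. 2 is sketched below, with each module mapped to a method of a single class. The class name `McaErrorFinder`, its fields, the `run_once` helper, and the report messages are hypothetical and chosen only for illustration; the application does not prescribe this structure.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class McaErrorFinder:
    read_info: Callable[[], str]                      # hypothetical log reader
    reports: List[str] = field(default_factory=list)  # collected error reports
    first_info: str = ""

    def first_acquire(self) -> None:
        """First acquisition module: read and save the first MCA error information."""
        self.first_info = self.read_info()
        if self.first_info.strip():
            # Assumption: non-blank content means errors are already present.
            self.report("MCA errors already present at first read")

    def second_acquire(self) -> str:
        """Second acquisition module: read the second MCA error information."""
        return self.read_info()

    def compare(self, second_info: str) -> bool:
        """Comparison module: True when both readings are consistent."""
        return second_info == self.first_info

    def report(self, message: str) -> None:
        """Error reporting module: record the error report."""
        self.reports.append(message)

    def run_once(self) -> bool:
        """One test period: returns True if the flow should continue."""
        if self.compare(self.second_acquire()):
            return True
        self.report("new MCA error found")
        return False
```

A driver would call `first_acquire()` once and then call `run_once()` after each preset test period until it returns `False`.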
One of ordinary skill in the art will appreciate that all or some of the steps in the above method can be completed by a program instructing related hardware, and the program can be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disc. Alternatively, all or some of the steps of the above embodiments can be implemented with one or more integrated circuits. Correspondingly, each module/unit in the above embodiments can be implemented in the form of hardware or in the form of a software functional module. The present application is not limited to any particular combination of hardware and software.
The above are only preferred embodiments of the invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (10)

1. A method for finding machine check architecture (MCA) errors of a server system, comprising:
Reading first MCA error information and saving it;
Reading second MCA error information according to a preset test period;
Comparing the second MCA error information with the first MCA error information; when the second MCA error information is inconsistent with the first MCA error information, exiting the flow and reporting an error; otherwise, continuing to perform, according to the preset test period, the step of reading the second MCA error information and comparing.
2. The method according to claim 1, characterized in that, before the reading and saving of the first MCA error information, the method also comprises: running stability test software of the server system.
3. The method according to claim 1, characterized in that the server system runs a Linux operating system.
4. The method according to claim 1 or 3, characterized in that the server system uses an Intel processor that supports the MCA mechanism.
5. The method according to claim 1 or 3, characterized in that the first MCA error information and the second MCA error information are stored in the mcelog file, and the mcelog file is stored in the /var/log system log folder.
6. A device for finding machine check architecture (MCA) errors of a server system, comprising:
A first acquisition module, configured to read first MCA error information and save it;
A second acquisition module, configured to read second MCA error information according to a preset test period;
A comparison module, configured to compare the second MCA error information with the first MCA error information;
An error reporting module, configured to stop and report an error when the second MCA error information is inconsistent with the first MCA error information;
The second acquisition module is further configured to read the second MCA error information according to the preset test period when the second MCA error information is consistent with the first MCA error information.
7. The device according to claim 6, characterized in that the device also comprises:
A test module, configured to run stability test software of the server system before the reading and saving of the first MCA error information.
8. The device according to claim 6, characterized in that the server system runs a Linux operating system.
9. The device according to claim 6 or 8, characterized in that the server system uses an Intel processor that supports the MCA mechanism.
10. The device according to claim 6 or 8, characterized in that the first MCA error information and the second MCA error information are stored in the mcelog file, and the mcelog file is stored in the /var/log system log folder.
CN201410327862.5A 2014-07-10 2014-07-10 Method and device for finding MCA (machine check architecture) errors of server system Pending CN104102563A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410327862.5A CN104102563A (en) 2014-07-10 2014-07-10 Method and device for finding MCA (machine check architecture) errors of server system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410327862.5A CN104102563A (en) 2014-07-10 2014-07-10 Method and device for finding MCA (machine check architecture) errors of server system

Publications (1)

Publication Number Publication Date
CN104102563A true CN104102563A (en) 2014-10-15

Family

ID=51670733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410327862.5A Pending CN104102563A (en) 2014-07-10 2014-07-10 Method and device for finding MCA (machine check architecture) errors of server system

Country Status (1)

Country Link
CN (1) CN104102563A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095007A (en) * 2015-08-21 2015-11-25 上海联影医疗科技有限公司 Hardware equipment error processing method and system
CN105302686A (en) * 2015-12-09 2016-02-03 浪潮电子信息产业股份有限公司 Memory target testing method
CN106126381A (en) * 2016-06-28 2016-11-16 浪潮(北京)电子信息产业有限公司 A kind of cpu fault event collection method and system based on Linux system
CN108920314A (en) * 2018-06-26 2018-11-30 郑州云海信息技术有限公司 A kind of faulty hardware localization method, device, system and readable storage medium storing program for executing
CN109324917A (en) * 2018-09-03 2019-02-12 郑州云海信息技术有限公司 A kind of acquisition method of server hardware fault log

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1752930A (en) * 2004-09-23 2006-03-29 华为技术有限公司 Chip program loading method
CN101354667A (en) * 2007-07-24 2009-01-28 英业达股份有限公司 Method for testing peripheral component interconnect bus level pressure
US20090138740A1 (en) * 2007-11-22 2009-05-28 Inventec Corporation Method and computer device capable of dealing with power fail
CN101509783A (en) * 2009-03-24 2009-08-19 北京四维图新科技股份有限公司 Data checking method and device applying to navigation electronic map production
CN101911740A (en) * 2007-11-18 2010-12-08 高通股份有限公司 Be used for the contact person of stores synchronized on smart card and the method and apparatus that is stored in the contact person of internal storage
CN102135875A (en) * 2010-01-21 2011-07-27 腾讯科技(深圳)有限公司 Method and server for prompting update of state content
CN102968362A (en) * 2012-11-21 2013-03-13 浪潮电子信息产业股份有限公司 Method for detecting integrity of PCIE (peripheral component interface express) equipment in system start-up process
CN103198000A (en) * 2013-04-02 2013-07-10 浪潮电子信息产业股份有限公司 Method for positioning faulted memory in linux system

Similar Documents

Publication Publication Date Title
DK3121726T3 (en) PROCEDURE FOR TROUBLESHOOTING, RELATED DEVICE AND COMPUTER
CN109783262B (en) Fault data processing method, device, server and computer readable storage medium
CN108388489B (en) Server fault diagnosis method, system, equipment and storage medium
CN104102563A (en) Method and device for finding MCA (machine check architecture) errors of server system
CN103198000A (en) Method for positioning faulted memory in linux system
US10983887B2 (en) Validation of multiprocessor hardware component
CN104375910A (en) Automated startup and shutdown testing method
CN107025224B (en) Method and equipment for monitoring task operation
US10528110B2 (en) Method for diagnosing power supply failure in a wireless communication device
CN108572895B (en) Stability test method for automatically checking software and hardware configuration under Linux
CN103984613A (en) Method for automatically testing floating point calculation performance of CPU (Central Processing Unit)
US20090307526A1 (en) Multi-cpu failure detection/recovery system and method for the same
JP2018092333A (en) Failure information management program, boot test method, and parallel processing apparatus
CN111124809B (en) Test method and device for server sensor system
JP5495310B2 (en) Information processing apparatus, failure analysis method, and failure analysis program
CN107357700A (en) A kind of method and system of test NVME hard disk order stability
JP2012058782A (en) Abnormality inspection method in computer device and computer device using the same
CN105786668A (en) Method for memory error detection based on Redhat system
CN102023916A (en) Computer system detection method
CN117407207B (en) Memory fault processing method and device, electronic equipment and storage medium
CN109687929B (en) Method for realizing HOST-BOX multi-stage cascade server time synchronization
JP2014059685A (en) Programmable logic device, information processor, suspect place pointing-out method and program
JP2014146110A (en) Information processing device, method for diagnosing error detection function, and computer program
CN115686962A (en) Server link detection method and device and electronic equipment
CN116662095A (en) Hard disk testing method, system, device, electronic equipment, storage medium and product

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20141015