CN103970661A - Method for batched server memory fault detection through IPMI tool - Google Patents

Info

Publication number
CN103970661A
Authority
CN
China
Prior art keywords
result
machine
txt
memory
echo
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410211110.2A
Other languages
Chinese (zh)
Inventor
李双星
任华进
陈彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2014-05-19
Filing date
2014-05-19
Publication date
2014-08-06
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN201410211110.2A
Publication of CN103970661A
Legal status: Pending

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a method for batch server memory fault detection using an IPMI tool and belongs to the field of fault detection. In the method, the IPMI tool scans and records the BMC logs of all servers in a network, machines with memory problems are identified from the results, and a script checks the servers in the network in batches so that machines with memory ECC errors are determined quickly, achieving a batch memory check of batch-deployed machines. The method shortens testing time and improves working efficiency.

Description

A method for batch server memory fault detection using an IPMI tool
Technical field
The present invention relates to a method for detecting memory problems in batches on batch-deployed servers whose BMC is able to record memory faults, and specifically to a method for batch server memory fault detection using an IPMI tool.
Background
In computing, the Machine Check Architecture (MCA) is a mechanism by which the CPU reports hardware errors to the operating system; it is a RAS (reliability, availability, serviceability) feature of the CPU. For example, when an ECC error such as a memory error occurs, the model-specific registers (MSRs) in the CPU detect the error and trigger the MCA mechanism. A system interrupt is then raised, and the status information at that moment is recorded in the MSRs and handed to the BMC chip for logging. The BMC chip integrated on current mainboards can therefore record run-time memory errors, ECC errors in particular. The BMC has an independent network configuration and can be assigned its own IP address, and the BMC IP addresses of all machines can be placed on the same network segment for centralized management.
At present, Internet customers order servers in large quantities, and as remote management technology matures, server management no longer depends on local administration in the machine room where the servers are located but is carried out remotely over the network. When a server develops a memory fault such as an ECC error, the problem cannot be found in time unless the BMC is checked, and it may affect the stability of the server later on. The BMC logs of all servers therefore need to be checked regularly. For machines deployed in batches, however, testing them one at a time takes too long and is too inefficient.
Summary of the invention
The present invention checks and collects the IPMI interface data of each server in batches, gathers all the information in one place, and filters out the problematic machines so that they can be repaired promptly.
In this method for batch server memory fault detection using an IPMI tool, the IPMI tool scans and records the BMC logs of all servers in the network, and machines with memory problems are identified from the results. A script checks the servers in the network in batches so that machines reporting memory ECC errors are confirmed quickly, achieving a batch memory check of batch-deployed machines. The steps are:
1) Find a machine running Windows, configure its IP address and connect it to the network, making sure it can reach the customer's server management network.
2) Modify the default script to match the actual network environment.
3) Run the script on the Windows machine together with the tool files ipmitool.exe and libeay32.dll; the final results are written to the file result.txt in the current directory.
4) Carry out memory fault handling on the machines found to have problems.
The default implementation script, sel.bat, is as follows:
@echo off
rem Scan BMC addresses 10.7.12.82 through 10.7.12.90 and log every SEL entry that mentions ECC.
for /L %%i in (82,1,90) do (
echo ##############################################################################################>> result.txt
echo 10.7.12.%%i >> result.txt
ipmitool.exe -H 10.7.12.%%i -U admin -P admin sel list | find /i "ecc" >> result.txt
echo **********************************************************************************************>> result.txt
)
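For readers who want a cross-platform variant, the loop in sel.bat can be sketched in Python. This is a minimal sketch and not the script claimed by the patent: the names `run_ipmitool` and `scan_hosts`, the injectable `runner` parameter, the 60-second timeout, and the separator width are all assumptions introduced for illustration. It presumes an `ipmitool` binary on the PATH and uses only the options that already appear in sel.bat (`-H`, `-U`, `-P`, `sel list`).

```python
import subprocess

# Separator rows; the width is arbitrary and only approximates sel.bat.
SEPARATOR_TOP = "#" * 94
SEPARATOR_BOTTOM = "*" * 94


def run_ipmitool(host, user="admin", password="admin"):
    """Return the SEL listing of one BMC as text (empty string on failure)."""
    cmd = ["ipmitool", "-H", host, "-U", user, "-P", password, "sel", "list"]
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired):
        # Tool missing or BMC unreachable: treat as "nothing logged".
        return ""
    return done.stdout


def scan_hosts(prefix, first, last, runner=run_ipmitool):
    """Build a report in the layout that sel.bat appends to result.txt."""
    lines = []
    for octet in range(first, last + 1):
        host = f"{prefix}.{octet}"
        lines.append(SEPARATOR_TOP)
        lines.append(host)
        # Counterpart of `find /i "ecc"`: keep lines containing "ecc",
        # ignoring case, with surrounding whitespace trimmed.
        for entry in runner(host).splitlines():
            if "ecc" in entry.lower():
                lines.append(entry.strip())
        lines.append(SEPARATOR_BOTTOM)
    return "\n".join(lines) + "\n"
```

Calling `scan_hosts("10.7.12", 82, 90)` and appending the returned text to result.txt mirrors the default range of the script; a site using 192.168.1.1 to 192.168.1.200 would call `scan_hosts("192.168.1", 1, 200)` instead. Passing a different `runner` makes it possible to try the report logic without touching a real BMC.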
The beneficial effects of the invention are:
1. Checking is automatic and done in batches, which improves efficiency.
2. The script is customizable and suits different network configurations.
3. The implementation is simple and easy to operate.
Embodiment
Implementation process:
The default implementation script, sel.bat, is as follows:
@echo off
rem Scan BMC addresses 10.7.12.82 through 10.7.12.90 and log every SEL entry that mentions ECC.
for /L %%i in (82,1,90) do (
echo ##############################################################################################>> result.txt
echo 10.7.12.%%i >> result.txt
ipmitool.exe -H 10.7.12.%%i -U admin -P admin sel list | find /i "ecc" >> result.txt
echo **********************************************************************************************>> result.txt
)
1. Find a machine running Windows, configure its IP address and connect it to the network, making sure it can reach the customer's server management network.
2. Modify the default script to match the actual network environment.
If the on-site network segment is 192.168.1.1-192.168.1.200, make the corresponding changes in sel.bat:
Change
for /L %%i in (82,1,90) do (
to
for /L %%i in (1,1,200) do (
Change
echo 10.7.12.%%i >> result.txt
to
echo 192.168.1.%%i >> result.txt
Change
ipmitool.exe -H 10.7.12.%%i -U admin -P admin sel list | find /i "ecc" >> result.txt
to
ipmitool.exe -H 192.168.1.%%i -U admin -P admin sel list | find /i "ecc" >> result.txt
3. Run the script on the Windows machine together with the tool files ipmitool.exe and libeay32.dll. The final results are written to the file result.txt in the current directory in the format below. In this example the server at 10.7.12.82 has ECC errors; the servers whose sections are empty have none:
##############################################################################################
10.7.12.82
1 | Pre-Init Time-stamp | Memory #0x16 | uncorrected-ECC Assert
3 | Pre-Init Time-stamp | Memory #0x16 | uncorrected-ECC Assert
**********************************************************************************************
##############################################################################################
10.7.12.83
**********************************************************************************************
##############################################################################################
10.7.12.84
**********************************************************************************************
4. Carry out memory fault handling on the machines found to have problems.
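To pick out the machines that need the handling in step 4, result.txt can be post-processed rather than read by eye. The following is a minimal sketch, assuming the report layout shown in step 3 (a row of `#`, the BMC address, zero or more log lines, a row of `*`); the function name `hosts_with_ecc_errors` is hypothetical and not part of the original method.

```python
def hosts_with_ecc_errors(report_text):
    """Map each host in a result.txt-style report to its logged ECC lines.

    Hosts whose section is empty are left out of the result.
    """
    faulty = {}
    current_host = None
    for raw in report_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if set(line) in ({"#"}, {"*"}):
            # A row made only of '#' or only of '*' delimits a host section.
            current_host = None
        elif current_host is None:
            # The first line after the opening separator is the BMC address.
            current_host = line
        else:
            # Any further line in the section is a matching SEL entry.
            faulty.setdefault(current_host, []).append(line)
    return faulty
```

Applied to the example report above, the function would return a single key, 10.7.12.82, holding its two uncorrected-ECC entries, while 10.7.12.83 and 10.7.12.84 would be absent because their sections are empty.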

Claims (3)

1. A method for batch server memory fault detection using an IPMI tool, characterized in that the IPMI tool scans and records the BMC logs of all servers in the network, machines with memory problems are identified from the results, and a script checks the servers in the network in batches so that machines reporting memory ECC errors are confirmed quickly, achieving a batch memory check of batch-deployed machines.
2. The method according to claim 1, characterized by the following steps:
1) Find a machine running Windows, configure its IP address and connect it to the network, making sure it can reach the customer's server management network.
2) Modify the default script to match the actual network environment.
3) Run the script on the Windows machine together with the tool files ipmitool.exe and libeay32.dll; the final results are written to the file result.txt in the current directory.
4) Carry out memory fault handling on the machines found to have problems.
3. The method according to claim 1, characterized in that the default implementation script sel.bat is as follows:
@echo off
for /L %%i in (82,1,90) do (
echo ##############################################################################################>> result.txt
echo 10.7.12.%%i >> result.txt
ipmitool.exe -H 10.7.12.%%i -U admin -P admin sel list | find /i "ecc" >> result.txt
echo **********************************************************************************************>> result.txt
)
CN201410211110.2A 2014-05-19 2014-05-19 Method for batched server memory fault detection through IPMI tool Pending CN103970661A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410211110.2A CN103970661A (en) 2014-05-19 2014-05-19 Method for batched server memory fault detection through IPMI tool

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410211110.2A CN103970661A (en) 2014-05-19 2014-05-19 Method for batched server memory fault detection through IPMI tool

Publications (1)

Publication Number Publication Date
CN103970661A (en) 2014-08-06

Family

ID=51240190

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410211110.2A Pending CN103970661A (en) 2014-05-19 2014-05-19 Method for batched server memory fault detection through IPMI tool

Country Status (1)

Country Link
CN (1) CN103970661A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268045A (en) * 2014-09-29 2015-01-07 浪潮电子信息产业股份有限公司 Testing method for startup and shutdown in remote control system
CN104333617A (en) * 2014-11-18 2015-02-04 浪潮电子信息产业股份有限公司 Method for automatically setting static state IP for rack cabinet in Linux system
CN104360922A (en) * 2014-10-20 2015-02-18 浪潮电子信息产业股份有限公司 Method for automatically monitoring BMC working state based on ipmitool
CN104714863A (en) * 2015-02-06 2015-06-17 浪潮电子信息产业股份有限公司 Method for completely storing Raid card logs on basis of Linux operation system after system crashes
CN105045689A (en) * 2015-06-25 2015-11-11 浪潮电子信息产业股份有限公司 Method for monitoring and alarming hard disks by using RAID card batch detection
CN106126368A (en) * 2016-08-22 2016-11-16 浪潮电子信息产业股份有限公司 Method for analyzing memory fault address under LINUX
CN106484639A (en) * 2016-10-10 2017-03-08 郑州云海信息技术有限公司 A kind of method that CPU register information is obtained by ipmi agreement
CN106603343A (en) * 2017-01-11 2017-04-26 郑州云海信息技术有限公司 A method for testing stability of servers in batch
CN106991026A (en) * 2017-04-28 2017-07-28 郑州云海信息技术有限公司 It is a kind of to pass through the method that network carries out server memory Rank margin test in batches
CN106997323A (en) * 2017-04-05 2017-08-01 广东浪潮大数据研究有限公司 A kind of recording method of server B MC problem repetition steps
CN107092549A (en) * 2017-04-26 2017-08-25 郑州云海信息技术有限公司 A kind of automatic monitoring and the instrument and method for parsing memory failure
CN107463455A (en) * 2017-08-01 2017-12-12 联想(北京)有限公司 A kind of method and device for detecting memory failure
CN108763005A (en) * 2018-05-30 2018-11-06 郑州云海信息技术有限公司 A kind of memory ECC failures error-reporting method and system
CN109032807A (en) * 2018-08-08 2018-12-18 郑州云海信息技术有限公司 A kind of batch monitors the method and system of internal storage state and limitation power consumption of internal memory
CN110032486A (en) * 2019-03-06 2019-07-19 平安科技(深圳)有限公司 Server test method, device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020144177A1 (en) * 1998-12-10 2002-10-03 Kondo Thomas J. System recovery from errors for processor and associated components
CN102799506A (en) * 2012-06-29 2012-11-28 浪潮电子信息产业股份有限公司 Method for positioning fault memory
CN103473141A (en) * 2013-09-13 2013-12-25 浪潮电子信息产业股份有限公司 Method for out-of-band check and modification of BIOS (basic input/output system) setting options
CN103593211A (en) * 2013-11-01 2014-02-19 浪潮电子信息产业股份有限公司 Method for refreshing and writing firmware programs through out-of-band isolation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
乐晨: "ipmitool对linux服务器进行IPMI管理", 《HTTP://MY.OSCHINA.NET/DAVEHE/BLOG/88801》 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268045A (en) * 2014-09-29 2015-01-07 浪潮电子信息产业股份有限公司 Testing method for startup and shutdown in remote control system
CN104360922A (en) * 2014-10-20 2015-02-18 浪潮电子信息产业股份有限公司 Method for automatically monitoring BMC working state based on ipmitool
CN104333617B (en) * 2014-11-18 2018-05-25 浪潮电子信息产业股份有限公司 A kind of method that rack cabinets set static IP automatically under linux system
CN104333617A (en) * 2014-11-18 2015-02-04 浪潮电子信息产业股份有限公司 Method for automatically setting static state IP for rack cabinet in Linux system
CN104714863A (en) * 2015-02-06 2015-06-17 浪潮电子信息产业股份有限公司 Method for completely storing Raid card logs on basis of Linux operation system after system crashes
CN105045689A (en) * 2015-06-25 2015-11-11 浪潮电子信息产业股份有限公司 Method for monitoring and alarming hard disks by using RAID card batch detection
CN106126368A (en) * 2016-08-22 2016-11-16 浪潮电子信息产业股份有限公司 Method for analyzing memory fault address under LINUX
CN106484639A (en) * 2016-10-10 2017-03-08 郑州云海信息技术有限公司 A kind of method that CPU register information is obtained by ipmi agreement
CN106603343A (en) * 2017-01-11 2017-04-26 郑州云海信息技术有限公司 A method for testing stability of servers in batch
CN106997323A (en) * 2017-04-05 2017-08-01 广东浪潮大数据研究有限公司 A kind of recording method of server B MC problem repetition steps
CN107092549A (en) * 2017-04-26 2017-08-25 郑州云海信息技术有限公司 A kind of automatic monitoring and the instrument and method for parsing memory failure
CN106991026A (en) * 2017-04-28 2017-07-28 郑州云海信息技术有限公司 It is a kind of to pass through the method that network carries out server memory Rank margin test in batches
CN107463455A (en) * 2017-08-01 2017-12-12 联想(北京)有限公司 A kind of method and device for detecting memory failure
CN107463455B (en) * 2017-08-01 2020-10-30 联想(北京)有限公司 Method and device for detecting memory fault
CN108763005A (en) * 2018-05-30 2018-11-06 郑州云海信息技术有限公司 A kind of memory ECC failures error-reporting method and system
CN108763005B (en) * 2018-05-30 2021-07-27 郑州云海信息技术有限公司 Memory ECC fault error reporting method and system
CN109032807A (en) * 2018-08-08 2018-12-18 郑州云海信息技术有限公司 A kind of batch monitors the method and system of internal storage state and limitation power consumption of internal memory
CN110032486A (en) * 2019-03-06 2019-07-19 平安科技(深圳)有限公司 Server test method, device, computer equipment and storage medium
CN110032486B (en) * 2019-03-06 2022-08-09 平安科技(深圳)有限公司 Server testing method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN103970661A (en) Method for batched server memory fault detection through IPMI tool
US9824002B2 (en) Tracking of code base and defect diagnostic coupling with automated triage
US6944796B2 (en) Method and system to implement a system event log for system manageability
CN108388489B (en) Server fault diagnosis method, system, equipment and storage medium
US9569325B2 (en) Method and system for automated test and result comparison
CN107077412B (en) Automated root cause analysis for single or N-tier applications
CN103198000A (en) Method for positioning faulted memory in linux system
WO2020087954A1 (en) Method, apparatus, device and system for grabbing trace of nvme hard disk
US7464301B1 (en) Method and apparatus for capturing and logging activities of a state machine prior to error
CN106155883B (en) A kind of virtual machine method for testing reliability and device
KR20160044484A (en) Cloud deployment infrastructure validation engine
CN109976959A (en) A kind of portable device and method for server failure detection
WO2015116064A1 (en) End user monitoring to automate issue tracking
WO2015080742A1 (en) Production sampling for determining code coverage
US20190317875A1 (en) Electronic device and method for event logging
CN106126368A (en) Method for analyzing memory fault address under LINUX
US11429574B2 (en) Computer system diagnostic log chain
CN105743707A (en) Method for testing BMC log analysis function based on Redhat system
CN109408361A (en) Monkey tests restored method, device, electronic equipment and computer readable storage medium
WO2020087956A1 (en) Method, apparatus, device and system for capturing trace of nvme hard disc
JP2011145824A (en) Information processing apparatus, fault analysis method, and fault analysis program
CN106201753A (en) A kind of based on the processing method of PCIE mistake in linux and system
CN107562565A (en) A kind of method for verifying internal memory Patrol Scurb functions
Chuah et al. Using message logs and resource use data for cluster failure diagnosis
Chuah et al. Enabling dependability-driven resource use and message log-analysis for cluster system diagnosis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140806