CN103970661A - Method for batched server memory fault detection through IPMI tool - Google Patents
- Publication number
- CN103970661A (application CN201410211110.2A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention provides a method for batch detection of server memory faults using an IPMI tool, and belongs to the field of fault detection. The method uses the IPMI tool to scan the recorded BMC logs of all servers in a network and identifies, from the results, the machines that have memory problems. A script runs the check across all servers in the network in one batch, so that machines reporting memory ECC errors are identified quickly. The method shortens testing time and improves working efficiency.
Description
Technical field
The present invention relates to a method for batch detection of memory problems on servers that are deployed in batches and whose BMC can record memory faults; specifically, it is a method for batch detection of server memory faults using an IPMI tool.
Background technology
In computing, the Machine Check Architecture (MCA) is a mechanism by which the CPU reports hardware errors to the operating system; it is one of the CPU's RAS features. For example, when an ECC error such as a memory error occurs, the model-specific registers (MSRs) in the CPU detect the error and trigger the MCA mechanism. A system interrupt is then raised, the status information at that moment is recorded in the MSRs, and the information is passed to the BMC chip to be logged. The BMC chip integrated on current motherboards can therefore record memory run-time errors, ECC errors in particular. The BMC has its own independent network configuration and can be given its own IP address, and the BMC IP addresses of all machines can be placed on the same network segment to allow centralized management.
Internet customers now order servers in large quantities, and as remote management technology has matured, servers are no longer managed locally in the machine room where they are installed but remotely over the network. When a server then suffers a memory fault such as an ECC error, the problem cannot be discovered in time unless the BMC is checked, and this may affect the stability of the server in later operation. The BMC logs of all servers therefore need to be checked periodically. For machines deployed in batches, however, testing each one individually takes too long and is too inefficient.
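Because the BMC addresses sit on one network segment, the list of addresses to check can be generated rather than typed by hand. The following Python sketch is illustrative only and is not part of the patent: the function name `bmc_hosts` is hypothetical, and the example range 10.7.12.82 to 10.7.12.90 is simply the one used by the default script.

```python
import ipaddress
from typing import List


def bmc_hosts(prefix: str, first: int, last: int) -> List[str]:
    """Return the addresses prefix.first .. prefix.last, inclusive.

    `prefix` is the first three octets of the management segment,
    for example "10.7.12" (an assumed input format).
    """
    hosts = []
    for last_octet in range(first, last + 1):
        # IPv4Address validates the result and raises ValueError
        # for an impossible address such as 10.7.12.300.
        hosts.append(str(ipaddress.IPv4Address(f"{prefix}.{last_octet}")))
    return hosts


if __name__ == "__main__":
    for host in bmc_hosts("10.7.12", 82, 90):
        print(host)
```

Keeping the segment and the range as parameters means a different site only needs different arguments, not edits to several lines of a script.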
Summary of the invention
The present invention checks and collects the IPMI interface data of each server in batches, gathers all the information in one place, and filters out the problematic machines so that faults can be repaired promptly.
A method for batch detection of server memory faults using an IPMI tool: the IPMI tool scans the recorded BMC logs of all servers in the network; machines with memory problems are identified from the results; and a script checks the servers in the network in batches, so that machines reporting memory ECC errors are quickly confirmed and the memory of a whole batch of machines is checked in one pass.
1) Find a machine running Windows, configure its IP address and connect it to the network, and make sure it can reach the customer's server management network.
2) Modify the default script to match the actual network environment.
3) Run the script on the Windows machine, together with the tool files ipmitool.exe and libeay32.dll; the final results are written to the file result.txt in the current directory.
4) Carry out memory fault handling on the machines found to have problems.
The default implementation script, sel.bat, is as follows:
@echo off
rem Query the BMC of every server from 10.7.12.82 to 10.7.12.90.
for /L %%i in (82,1,90) do (
@echo ##############################################################################################>> result.txt
echo 10.7.12.%%i >>result.txt
rem Keep only the SEL entries that mention ECC, ignoring case.
ipmitool.exe -H 10.7.12.%%i -U admin -P admin sel list | find /i "ecc" >> result.txt
@echo **********************************************************************************************>> result.txt
)
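The loop in sel.bat can also be sketched in Python, which makes the filtering step easy to exercise without a real BMC. This is a minimal sketch under stated assumptions, not the patent's implementation: the function names are hypothetical, `ipmitool` is assumed to be on the PATH, and the `admin`/`admin` credentials are the defaults from the script. Only the `-H`, `-U`, `-P` options and the `sel list` subcommand that appear in the script are used.

```python
import subprocess
from typing import Callable, Dict, Iterable, List


def build_command(host: str, user: str = "admin", password: str = "admin") -> List[str]:
    """Build the same ipmitool invocation that sel.bat issues for one host."""
    return ["ipmitool", "-H", host, "-U", user, "-P", password, "sel", "list"]


def run_ipmitool(command: List[str]) -> str:
    """Run ipmitool and return its standard output.

    An unreachable BMC or a missing executable yields an empty string, so,
    as in the batch script, such a host looks the same as a clean one.
    """
    try:
        done = subprocess.run(command, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return ""
    return done.stdout


def ecc_lines(sel_output: str) -> List[str]:
    """Keep the SEL lines that mention 'ecc' in any case, like find /i "ecc"."""
    return [line for line in sel_output.splitlines() if "ecc" in line.lower()]


def scan(hosts: Iterable[str],
         runner: Callable[[List[str]], str] = run_ipmitool) -> Dict[str, List[str]]:
    """Map every host to the ECC entries found in its BMC event log."""
    return {host: ecc_lines(runner(build_command(host))) for host in hosts}
```

Calling `scan(["10.7.12.82", "10.7.12.83"])` would query both BMCs; passing a different `runner` lets the filtering be checked against canned log text instead of live hardware.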
The beneficial effects of the invention are:
1. Automatic batch checking, which improves efficiency.
2. A customizable script that suits different network configurations.
3. Simple to implement and easy to operate.
Embodiment
Implementation procedure:
The default implementation script, sel.bat, is as follows:
@echo off
rem Query the BMC of every server from 10.7.12.82 to 10.7.12.90.
for /L %%i in (82,1,90) do (
@echo ##############################################################################################>> result.txt
echo 10.7.12.%%i >>result.txt
rem Keep only the SEL entries that mention ECC, ignoring case.
ipmitool.exe -H 10.7.12.%%i -U admin -P admin sel list | find /i "ecc" >> result.txt
@echo **********************************************************************************************>> result.txt
)
1. Find a machine running Windows, configure its IP address and connect it to the network, and make sure it can reach the customer's server management network.
2. Modify the default script to match the actual network environment:
If the on-site network segment is 192.168.1.1-192.168.1.200, make the corresponding changes in sel.bat:
for /L %%i in (82,1,90) do (
is changed to for /L %%i in (1,1,200) do (
echo 10.7.12.%%i >>result.txt
is changed to echo 192.168.1.%%i >>result.txt
ipmitool.exe -H 10.7.12.%%i -U admin -P admin sel list | find /i "ecc" >> result.txt is changed to ipmitool.exe -H 192.168.1.%%i -U admin -P admin sel list | find /i "ecc" >> result.txt
3. Run the script on the Windows machine, together with the tool files ipmitool.exe and libeay32.dll. The final results are written to the file result.txt in the current directory in the format shown below. In this example the server 10.7.12.82 has ECC errors; servers whose blocks are empty have none:
##############################################################################################
10.7.12.82
1 | Pre-Init Time-stamp | Memory #0x16 | uncorrected-ECC Assert
3 | Pre-Init Time-stamp | Memory #0x16 | uncorrected-ECC Assert
**********************************************************************************************
##############################################################################################
10.7.12.83
**********************************************************************************************
##############################################################################################
10.7.12.84
**********************************************************************************************
4. Carry out memory fault handling on the machines found to have problems.
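With many servers, reading result.txt by eye becomes tedious, so the file can be reduced to a list of faulty machines. The Python sketch below is an illustration, not part of the patent; it assumes the layout shown in the sample output (a line of `#`, the address, zero or more ECC lines, a line of `*`), and the function names `parse_result` and `faulty_hosts` are hypothetical.

```python
from typing import Dict, List

SAMPLE = """\
####
10.7.12.82
1 | Pre-Init Time-stamp | Memory #0x16 | uncorrected-ECC Assert
3 | Pre-Init Time-stamp | Memory #0x16 | uncorrected-ECC Assert
****
####
10.7.12.83
****
"""


def parse_result(text: str) -> Dict[str, List[str]]:
    """Map each scanned address to the ECC log lines recorded under it."""
    results: Dict[str, List[str]] = {}
    current = None        # address of the block being read, if any
    expect_host = False   # True right after a line made only of '#'
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if set(line) == {"#"}:
            expect_host, current = True, None
        elif set(line) == {"*"}:
            expect_host, current = False, None
        elif expect_host:
            current = line
            results[current] = []
            expect_host = False
        elif current is not None:
            results[current].append(line)
    return results


def faulty_hosts(text: str) -> List[str]:
    """Return the addresses that have at least one ECC entry."""
    return [host for host, lines in parse_result(text).items() if lines]


if __name__ == "__main__":
    print(faulty_hosts(SAMPLE))  # prints ['10.7.12.82']
```

In practice the text would come from reading result.txt; an empty block only means that no ECC line was captured, which also happens when a BMC could not be reached.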
Claims (3)
1. A method for batch detection of server memory faults using an IPMI tool, characterized in that the IPMI tool scans the recorded BMC logs of all servers in the network; machines with memory problems are identified from the results; and a script checks the servers in the network in batches, so that machines reporting memory ECC errors are quickly confirmed and the memory of a whole batch of machines is checked in one pass.
2. The method according to claim 1, characterized in that:
1) Find a machine running Windows, configure its IP address and connect it to the network, and make sure it can reach the customer's server management network.
2) Modify the default script to match the actual network environment.
3) Run the script on the Windows machine, together with the tool files ipmitool.exe and libeay32.dll; the final results are written to the file result.txt in the current directory.
4) Carry out memory fault handling on the machines found to have problems.
3. The method according to claim 1, characterized in that the default implementation script sel.bat is as follows:
@echo off
for /L %%i in (82,1,90) do (
@echo ##############################################################################################>> result.txt
echo 10.7.12.%%i >>result.txt
ipmitool.exe -H 10.7.12.%%i -U admin -P admin sel list | find /i "ecc" >> result.txt
@echo **********************************************************************************************>> result.txt
)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410211110.2A CN103970661A (en) | 2014-05-19 | 2014-05-19 | Method for batched server memory fault detection through IPMI tool |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410211110.2A CN103970661A (en) | 2014-05-19 | 2014-05-19 | Method for batched server memory fault detection through IPMI tool |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103970661A (en) | 2014-08-06 |
Family
ID=51240190
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410211110.2A Pending CN103970661A (en) | 2014-05-19 | 2014-05-19 | Method for batched server memory fault detection through IPMI tool |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103970661A (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104268045A (en) * | 2014-09-29 | 2015-01-07 | 浪潮电子信息产业股份有限公司 | Testing method for startup and shutdown in remote control system |
CN104333617A (en) * | 2014-11-18 | 2015-02-04 | 浪潮电子信息产业股份有限公司 | Method for automatically setting static state IP for rack cabinet in Linux system |
CN104360922A (en) * | 2014-10-20 | 2015-02-18 | 浪潮电子信息产业股份有限公司 | Method for automatically monitoring BMC working state based on ipmitool |
CN104714863A (en) * | 2015-02-06 | 2015-06-17 | 浪潮电子信息产业股份有限公司 | Method for completely storing Raid card logs on basis of Linux operation system after system crashes |
CN105045689A (en) * | 2015-06-25 | 2015-11-11 | 浪潮电子信息产业股份有限公司 | Method for monitoring and alarming hard disks by using RAID card batch detection |
CN106126368A (en) * | 2016-08-22 | 2016-11-16 | 浪潮电子信息产业股份有限公司 | Method for analyzing memory fault address under LINUX |
CN106484639A (en) * | 2016-10-10 | 2017-03-08 | 郑州云海信息技术有限公司 | A kind of method that CPU register information is obtained by ipmi agreement |
CN106603343A (en) * | 2017-01-11 | 2017-04-26 | 郑州云海信息技术有限公司 | A method for testing stability of servers in batch |
CN106991026A (en) * | 2017-04-28 | 2017-07-28 | 郑州云海信息技术有限公司 | It is a kind of to pass through the method that network carries out server memory Rank margin test in batches |
CN106997323A (en) * | 2017-04-05 | 2017-08-01 | 广东浪潮大数据研究有限公司 | A kind of recording method of server B MC problem repetition steps |
CN107092549A (en) * | 2017-04-26 | 2017-08-25 | 郑州云海信息技术有限公司 | A kind of automatic monitoring and the instrument and method for parsing memory failure |
CN107463455A (en) * | 2017-08-01 | 2017-12-12 | 联想(北京)有限公司 | A kind of method and device for detecting memory failure |
CN108763005A (en) * | 2018-05-30 | 2018-11-06 | 郑州云海信息技术有限公司 | A kind of memory ECC failures error-reporting method and system |
CN109032807A (en) * | 2018-08-08 | 2018-12-18 | 郑州云海信息技术有限公司 | A kind of batch monitors the method and system of internal storage state and limitation power consumption of internal memory |
CN110032486A (en) * | 2019-03-06 | 2019-07-19 | 平安科技(深圳)有限公司 | Server test method, device, computer equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020144177A1 (en) * | 1998-12-10 | 2002-10-03 | Kondo Thomas J. | System recovery from errors for processor and associated components |
CN102799506A (en) * | 2012-06-29 | 2012-11-28 | 浪潮电子信息产业股份有限公司 | Method for positioning fault memory |
CN103473141A (en) * | 2013-09-13 | 2013-12-25 | 浪潮电子信息产业股份有限公司 | Method for out-of-band check and modification of BIOS (basic input/output system) setting options |
CN103593211A (en) * | 2013-11-01 | 2014-02-19 | 浪潮电子信息产业股份有限公司 | Method for refreshing and writing firmware programs through out-of-band isolation |
Non-Patent Citations (1)
Title |
---|
乐晨: "ipmitool对linux服务器进行IPMI管理", 《HTTP://MY.OSCHINA.NET/DAVEHE/BLOG/88801》 * |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104268045A (en) * | 2014-09-29 | 2015-01-07 | 浪潮电子信息产业股份有限公司 | Testing method for startup and shutdown in remote control system |
CN104360922A (en) * | 2014-10-20 | 2015-02-18 | 浪潮电子信息产业股份有限公司 | Method for automatically monitoring BMC working state based on ipmitool |
CN104333617B (en) * | 2014-11-18 | 2018-05-25 | 浪潮电子信息产业股份有限公司 | A kind of method that rack cabinets set static IP automatically under linux system |
CN104333617A (en) * | 2014-11-18 | 2015-02-04 | 浪潮电子信息产业股份有限公司 | Method for automatically setting static state IP for rack cabinet in Linux system |
CN104714863A (en) * | 2015-02-06 | 2015-06-17 | 浪潮电子信息产业股份有限公司 | Method for completely storing Raid card logs on basis of Linux operation system after system crashes |
CN105045689A (en) * | 2015-06-25 | 2015-11-11 | 浪潮电子信息产业股份有限公司 | Method for monitoring and alarming hard disks by using RAID card batch detection |
CN106126368A (en) * | 2016-08-22 | 2016-11-16 | 浪潮电子信息产业股份有限公司 | Method for analyzing memory fault address under LINUX |
CN106484639A (en) * | 2016-10-10 | 2017-03-08 | 郑州云海信息技术有限公司 | A kind of method that CPU register information is obtained by ipmi agreement |
CN106603343A (en) * | 2017-01-11 | 2017-04-26 | 郑州云海信息技术有限公司 | A method for testing stability of servers in batch |
CN106997323A (en) * | 2017-04-05 | 2017-08-01 | 广东浪潮大数据研究有限公司 | A kind of recording method of server B MC problem repetition steps |
CN107092549A (en) * | 2017-04-26 | 2017-08-25 | 郑州云海信息技术有限公司 | A kind of automatic monitoring and the instrument and method for parsing memory failure |
CN106991026A (en) * | 2017-04-28 | 2017-07-28 | 郑州云海信息技术有限公司 | It is a kind of to pass through the method that network carries out server memory Rank margin test in batches |
CN107463455A (en) * | 2017-08-01 | 2017-12-12 | 联想(北京)有限公司 | A kind of method and device for detecting memory failure |
CN107463455B (en) * | 2017-08-01 | 2020-10-30 | 联想(北京)有限公司 | Method and device for detecting memory fault |
CN108763005A (en) * | 2018-05-30 | 2018-11-06 | 郑州云海信息技术有限公司 | A kind of memory ECC failures error-reporting method and system |
CN108763005B (en) * | 2018-05-30 | 2021-07-27 | 郑州云海信息技术有限公司 | Memory ECC fault error reporting method and system |
CN109032807A (en) * | 2018-08-08 | 2018-12-18 | 郑州云海信息技术有限公司 | A kind of batch monitors the method and system of internal storage state and limitation power consumption of internal memory |
CN110032486A (en) * | 2019-03-06 | 2019-07-19 | 平安科技(深圳)有限公司 | Server test method, device, computer equipment and storage medium |
CN110032486B (en) * | 2019-03-06 | 2022-08-09 | 平安科技(深圳)有限公司 | Server testing method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20140806 |