CN107451051A - A kind of method that server memory diagnosis is carried out under Linux - Google Patents

Publication number: CN107451051A
Authority: CN (China)
Prior art keywords: diagnosis, error, data, row, under linux
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201710518822.2A
Other languages: Chinese (zh)
Inventor: 王为
Current Assignee: Zhengzhou Yunhai Information Technology Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Zhengzhou Yunhai Information Technology Co Ltd
Priority date: 2017-06-29 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2017-06-29
Publication date: 2017-12-08
Application filed by Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201710518822.2A
Publication of CN107451051A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/362 Software debugging
    • G06F11/366 Software debugging using diagnostics

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

The present invention relates to the field of server fault detection, and in particular to a method for diagnosing server memory under Linux. The method locates a detected memory fault while the Linux system remains powered on, which greatly shortens fault diagnosis time and reduces the impact of memory faults on the services running on the server. The method can be used on all kinds of server systems and adapts well to them.

Description

A method for diagnosing server memory under Linux
Technical field
The present invention relates to the field of server fault detection, and in particular to a method for diagnosing server memory under Linux. The method locates a detected memory fault without powering off the Linux system, which greatly shortens fault diagnosis time and reduces the impact of memory faults on the services running on the server. The method can be used on all kinds of server systems and adapts well to them.
Background
With the rapid development of the Internet, demand for servers keeps growing and servers are used ever more widely, so the requirements on every server performance metric keep rising. Servers are expected to run for long periods with stable performance. However, the longer a server runs, the more likely it is to fail, and any fault must be located and cleared promptly. Traditionally, locating a server memory fault requires shutting the server down, removing the memory modules, and checking them one by one. Powering off interrupts customer services and causes economic loss.
In view of this problem, the present application provides a method that directly locates the faulty memory DQ (data line) without interrupting running services.
Summary of the invention
Specifically, the application claims a method for diagnosing and locating memory faults in a Linux environment, characterized in that the method comprises the following steps:
1) read the SAD information to identify the erroring socket;
2) read the TAD information to identify the erroring channel;
3) read the RIR information to identify the erroring rank;
4) use the address mapping table to identify the column, row, bank group, and bank;
5) write data to the address and read it back, then XOR the read data with the written data to obtain the erroring data line and generate a LOG file.
The method described above is further characterized in that the diagnosis and locating steps can be repeated multiple times.
The method described above is further characterized in that the diagnosis and locating steps run automatically under the Linux system.
The method described above is further characterized in that, by analyzing the LOG file, the memory error can be located to a specific channel, CPU, Home Agent, and DIMM rank.
Detailed description
The present invention provides a method for diagnosing and locating memory faults in a Linux environment, implemented as follows:
1. Read the SAD (Source Address Decoder) information to identify the erroring socket;
2. Read the TAD (Target Address Decoder) information to identify the erroring channel;
3. Read the RIR (Rank Interleave Range) information to identify the erroring rank;
4. Use the address mapping table to identify the col (column), row, bank group, and bank;
5. Write data to the address and read it back, then XOR the read data with the written data to obtain the erroring DQ and generate a LOG file.
The specific steps for running the diagnostic test are as follows:
1. Copy the EccMon program to the Linux system
The copied files are as follows:
EccMon EFI help license.rtf src syslinux
2. Run the EccMon program
The command to run is as follows:
./EccMon i=1000/f=error.xml
i=1000 sets the test to loop 1000 times.
f=error.xml saves the test results as an XML file.
When the program finishes, it generates the error-record LOG file error.xml:
<ErrorData PhysAddress="0000000000000000" UnkMask="3FFFFFFFFFE0" ErrBits="0000000000020000" Node="0" HA="0" Chan="0" Rank="1" Count="128" Overflow="0"/>
<ErrorData PhysAddress="0000000000000008" UnkMask="3FFFFFFFFFE0" ErrBits="0000000000020000" Node="0" HA="0" Chan="0" Rank="1" Count="128" Overflow="0"/>
<ErrorData PhysAddress="0000000000000010" UnkMask="3FFFFFFFFFE0" ErrBits="0000000000020000" Node="0" HA="0" Chan="0" Rank="1" Count="128" Overflow="0"/>
Parse the error.xml file:
Convert ErrBits to a binary number; its bits correspond to DQ0-63. Node is the CPU, HA is the home agent, Chan is the channel, and Rank is the DIMM rank.
For example, the error information in the LOG file above is DQ 17 (ErrBits 0000000000020000 has only bit 17 set), CPU 0, HA 0, Channel 0, Rank 1.
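The parsing rule above can be sketched in Python with the standard library XML parser. This is an illustration, not part of the patented tool. The `<Errors>` root element in the sample is an assumption, since the source shows only the `ErrorData` records; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample record copied from the LOG file. The <Errors> root element is an
# ASSUMPTION; the source shows only the <ErrorData .../> records.
SAMPLE = """<Errors>
<ErrorData PhysAddress="0000000000000000" UnkMask="3FFFFFFFFFE0" ErrBits="0000000000020000" Node="0" HA="0" Chan="0" Rank="1" Count="128" Overflow="0"/>
</Errors>"""


def failing_dqs(err_bits_hex):
    """Return the indices of the set bits in ErrBits (bit n means DQ n)."""
    mask = int(err_bits_hex, 16)
    return [bit for bit in range(64) if (mask >> bit) & 1]


def parse_records(xml_text):
    """Turn every ErrorData element into a dict of located fault coordinates."""
    root = ET.fromstring(xml_text)
    records = []
    for elem in root.iter("ErrorData"):
        records.append({
            "dq": failing_dqs(elem.get("ErrBits")),
            "cpu": int(elem.get("Node")),
            "ha": int(elem.get("HA")),
            "channel": int(elem.get("Chan")),
            "rank": int(elem.get("Rank")),
        })
    return records


if __name__ == "__main__":
    for record in parse_records(SAMPLE):
        print(record)
```

For the sample record, the decoded fields match the example in the text: DQ 17, CPU 0, HA 0, channel 0, rank 1.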
Obviously, the above is only one embodiment of the invention. Other technical solutions that a person of ordinary skill in the art can derive from this embodiment without creative effort also fall within the scope of protection of the invention.
The invention provides a method for diagnosing and locating memory faults in a Linux environment. It locates a detected memory fault without powering off the Linux system, which greatly shortens fault diagnosis time and reduces the impact of memory faults on the services running on the server. With adjustment, the technical solution can also be applied to ordinary personal computers. The method is simple and feasible, has a clear technical effect, and can be widely applied in practice.

Claims (4)

1. the method for row diagnosis positioning is internally deposited under a kind of Linux environment, it is characterised in that the diagnosis localization method specifically wraps Include following steps:
1) SAD information is read, confirms the slot position that reports an error;
2) TAD information is read, confirms the channel position that reports an error;
3) RIR information is read, confirms the arrangement position that reports an error;
4) by address mapping table, column and row, Ku Ji and the position in storehouse are confirmed;
5) data are write to the address and reads the data, the data and the data of write-in are subjected to XOR, reported an error Data channel simultaneously generates LOG files.
2. internally depositing into the method for row diagnosis positioning under Linux environment as claimed in claim 1, it is further characterized in that, it is above-mentioned The step of diagnosis positioning, can be repeatedly.
3. internally depositing into the method for row diagnosis positioning under Linux environment as claimed in claim 2, it is further characterized in that, it is above-mentioned The step of diagnosis positions automatic running under linux system.
4. internally depositing into the method for row diagnosis positioning under Linux environment as claimed in claim 3, it is further characterized in that, passes through To LOG file analyses, EMS memory error can be navigated to specific passage, CPU, Home agant and DIMM Rank.
CN201710518822.2A 2017-06-29 2017-06-29 A kind of method that server memory diagnosis is carried out under Linux Pending CN107451051A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710518822.2A CN107451051A (en) 2017-06-29 2017-06-29 A kind of method that server memory diagnosis is carried out under Linux

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710518822.2A CN107451051A (en) 2017-06-29 2017-06-29 A kind of method that server memory diagnosis is carried out under Linux

Publications (1)

Publication Number Publication Date
CN107451051A true CN107451051A (en) 2017-12-08

Family

ID=60488164

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710518822.2A Pending CN107451051A (en) 2017-06-29 2017-06-29 A kind of method that server memory diagnosis is carried out under Linux

Country Status (1)

Country Link
CN (1) CN107451051A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6671837B1 (en) * 2000-06-06 2003-12-30 Intel Corporation Device and method to test on-chip memory in a production environment
CN102135925A (en) * 2010-12-27 2011-07-27 西安锐信科技有限公司 Method and device for detecting error check and correcting memory
CN104391753A (en) * 2014-12-16 2015-03-04 浪潮电子信息产业股份有限公司 Failure-free operation method of server mainboard memory system
CN105589770A (en) * 2015-07-20 2016-05-18 杭州昆海信息技术有限公司 Fault detection method and apparatus
CN106126368A (en) * 2016-08-22 2016-11-16 浪潮电子信息产业股份有限公司 A kind of method of memory failure address resolution under LINUX

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804252A (en) * 2018-06-15 2018-11-13 郑州云海信息技术有限公司 A kind of server memory fault detection method, device, equipment and storage medium
CN109669830A (en) * 2018-12-25 2019-04-23 上海创功通讯技术有限公司 A kind of physical detection methods and terminal device for memory
CN109669830B (en) * 2018-12-25 2022-04-22 上海创功通讯技术有限公司 Physical detection method for memory and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20171208