CN112948160B

CN112948160B - Method and device for positioning and repairing memory ECC problem

Info

Publication number: CN112948160B
Application number: CN202110219990.8A
Authority: CN
Inventors: 许雪雪; 姜庆臣
Original assignee: Shandong Yingxin Computer Technology Co Ltd
Current assignee: Shandong Yingxin Computer Technology Co Ltd
Priority date: 2021-02-26
Filing date: 2021-02-26
Publication date: 2023-02-28
Anticipated expiration: 2041-02-26
Also published as: CN112948160A

Abstract

The invention provides a method for locating and repairing memory ECC problems, comprising: communicatively connecting the client servers to be tested with a server-side server, the server-side server controlling a plurality of client servers to be tested to perform memory ECC problem testing; and obtaining, by the server-side server, a test log of each client server to be tested, and dividing the memory ECC problems into repairable ECC problems and unrepairable ECC problems according to the test log. The invention also provides a device for locating and repairing memory ECC problems. Memory banks with repairable ECC problems are repaired by modifying a BIOS option, which improves the quality of the server memory.

Description

Method and device for positioning and repairing memory ECC problem

Technical Field

The present invention relates to the field of memory ECC problems, and in particular to a method and a device for locating and repairing memory ECC problems.

Background

In the current server field, the use of RAMOS (a memory operating system, i.e. a diskless system) is becoming more and more common, especially in development, testing, and production. Because of the particular nature of RAMOS, problems or probabilistic events that occur during stress testing are difficult to locate, and even when a problem is found, time is needed to work out the hardware location on the corresponding server; recurring problems take up a great deal of time and labor.

In the prior art, when an ECC (Error Correcting Code) problem occurs and the machine is not down, the slot corresponding to the ECC problem may be located by directly capturing log information; if the machine crashes, the problem has to be reproduced and the log information captured in real time over a serial cable connection.

However, capturing log information over a serial cable takes a long time. Capturing the log directly requires reproducing the problem, which also takes a long time, and a probabilistic problem is hard to reproduce even though it can still occur. Moreover, once an existing memory ECC problem has been located, the affected memory can only be masked, not repaired, which does little to resolve server memory problems.

Disclosure of Invention

In order to solve the problems in the prior art, the invention innovatively provides a method and a device for positioning and repairing the ECC problem of the memory, so that the memory bank with the repairable ECC problem is repaired, the quality of the memory of the server is improved, and the reliability of the ECC problem test of the memory of the server is effectively improved.

The first aspect of the present invention provides a method for locating and repairing an ECC problem in a memory, including:

communicatively connecting the client servers to be tested with a server-side server, the server-side server controlling a plurality of client servers to be tested to perform memory ECC problem testing;

obtaining, by the server-side server, a test log of each client server to be tested, and dividing the memory ECC problems into repairable ECC problems and unrepairable ECC problems according to the test log;

and locating the memory with the repairable ECC problem, counting the number of error reports for that memory, and, if the number of error reports exceeds a preset value, automatically repairing the located memory bank with the repairable ECC problem by modifying a BIOS option.

Optionally, communicatively connecting the client servers to be tested with the server-side server, and controlling, by the server-side server, the plurality of client servers to be tested to perform the memory ECC problem test, specifically includes:

building a network test environment, and connecting each client server to be tested and the server-side server to the same switch, so that they are all in the same network segment;

configuring an operating system and a kernel on the server-side server, and establishing a connection between each client server to be tested and the server-side server through PXE (Preboot eXecution Environment) booting;

setting each client server to start automatically on boot, simulating the actual usage scenario of a user, and performing the memory ECC problem test;

during the test, if a memory MCE error occurs, terminating the test and recording a test log.

Optionally, dividing the memory ECC problems into repairable ECC problems and unrepairable ECC problems according to the test log specifically includes:

detecting whether a repairable flag field exists in the test log, and if so, determining that the ECC problem is a repairable ECC problem;

and detecting whether an unrepairable flag field exists in the test log, and if so, determining that the ECC problem is an unrepairable ECC problem.

Further, the repairable flag field is 0xa0, and the unrepairable flag field is 0xa1.

Optionally, automatically repairing the located memory bank with the repairable ECC problem by modifying the BIOS option specifically includes:

using a BIOS tool to export the BIOS options, and changing the memory enhancement test option in the BIOS options to the test and repair option;

and after the modification is finished, importing the options back into the BIOS, and, after confirming that the BIOS option was modified successfully, restarting the server so that the repair is performed automatically.

Further, the memory enhancement test option is Enhanced Memory Test, and the test and repair option is Test and Repair.

Optionally, the BIOS supports memory enhancement functions.

Optionally, the method further comprises: and if the error reporting times of the memory do not exceed the preset value, restarting the client server and carrying out the memory ECC problem test again.

Optionally, the method further comprises: locating the memory with the unrepairable ECC problem and analyzing the cause of the problem.

The second aspect of the present invention provides a device for locating and repairing memory ECC problems, comprising:

a test module, used for communicatively connecting the client servers to be tested with the server-side server, the server-side server controlling a plurality of client servers to be tested to perform memory ECC problem testing;

a dividing module, used for the server-side server to obtain a test log of each client server to be tested and to divide the memory ECC problems into repairable ECC problems and unrepairable ECC problems according to the test log;

and a locating and repairing module, used for locating the memory with the repairable ECC problem, counting the number of error reports for that memory, and, if the number of error reports exceeds a preset value, automatically repairing the located memory bank with the repairable ECC problem by modifying a BIOS option.

The technical scheme adopted by the invention comprises the following technical effects:

1. The ECC problems that arise during memory testing are classified into repairable ECC problems and unrepairable ECC problems, and memory banks with repairable ECC problems are repaired by modifying a BIOS option, which improves the quality of the server memory and effectively improves the reliability of server memory ECC problem testing.

2. According to the technical scheme, the memory bank with ECC problems can be automatically repaired in the restarting process of the server by modifying the BIOS option, and the repairing efficiency of the memory bank is improved.

3. In the technical scheme of the invention, if the number of memory error reports does not exceed the preset value, the client server is restarted and the memory ECC problem test is run again, which avoids mislocating ECC problems that are not actually caused by the memory and improves the reliability of the memory ECC problem test.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below; those skilled in the art can obtain other drawings from these without creative effort.

FIG. 1 is a schematic flow diagram of a method according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of step S1 in one embodiment of the method of the present invention;

FIG. 3 is another schematic flow diagram of the method in accordance with an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of the apparatus according to the second embodiment of the present invention.

Detailed Description

In order to clearly explain the technical features of the present invention, the present invention will be explained in detail by the following embodiments and the accompanying drawings. The following disclosure provides many different embodiments, or examples, for implementing different features of the invention. To simplify the disclosure of the present invention, specific example components and arrangements are described below. Moreover, the present disclosure may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. It should be noted that the components illustrated in the figures are not necessarily drawn to scale. Descriptions of well-known components and processing techniques and procedures are omitted so as to not unnecessarily limit the invention.

Example one

As shown in FIG. 1, the present invention provides a method for locating and repairing memory ECC problems, including:

S1, communicatively connecting the client servers to be tested with a server-side server, the server-side server controlling a plurality of client servers to be tested to perform memory ECC problem testing;

S2, obtaining, by the server-side server, a test log of each client server to be tested, and dividing the memory ECC problems into repairable ECC problems and unrepairable ECC problems according to the test log;

S3, locating the memory with the repairable ECC problem, and counting the number of error reports for that memory;

S4, judging whether the counted number exceeds a preset value; if so, executing step S5;

S5, automatically repairing the located memory bank with the repairable ECC problem by modifying a BIOS option;

S6, restarting the client server and running the memory ECC problem test again.

As shown in FIG. 2, step S1 specifically includes:

S11, building a network test environment, and connecting each client server to be tested and the server-side server to the same switch, so that they are all in the same network segment;

S12, configuring an operating system and a kernel on the server-side server, and establishing a connection between each client server to be tested and the server-side server through PXE (Preboot eXecution Environment) booting;

S13, setting each client server to start automatically on boot, simulating the actual usage scenario of a user, and performing the memory ECC problem test;

S14, during the test, if a memory MCE error occurs, terminating the test and recording a test log.

In step S11, the client servers to be tested and the server-side server are placed in the same network segment by automatically allocating the IP address, subnet mask, broadcast address, and so on.
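The same-segment condition in step S11 can be checked programmatically. The following is a minimal sketch using Python's standard ipaddress module; the function names and the example addresses are illustrative assumptions and do not come from the patent.

```python
import ipaddress


def same_segment(server_cidr, client_ips):
    """Return True if every client IP lies inside the server's network.

    server_cidr is the server address with its prefix length,
    e.g. "192.168.10.1/24" (an assumed example value).
    """
    network = ipaddress.ip_interface(server_cidr).network
    return all(ipaddress.ip_address(ip) in network for ip in client_ips)


def broadcast_of(server_cidr):
    """Return the broadcast address of the server's network as a string."""
    network = ipaddress.ip_interface(server_cidr).network
    return str(network.broadcast_address)


if __name__ == "__main__":
    clients = ["192.168.10.21", "192.168.10.22"]
    print(same_segment("192.168.10.1/24", clients))
    print(broadcast_of("192.168.10.1/24"))
```

A check like this could run on the server-side server before the test starts, so that a client that received an address outside the segment is caught early.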

In step S12, an operating system (OS), a kernel, and a boot file (mac) are configured on the server-side server, and each client server to be tested is booted via PXE (Preboot eXecution Environment) to establish a connection with the server-side server.

In step S13, self-start on boot may be implemented by a program, and the program may specifically be an etc/rc. After the self-start setting is complete, the client server to be tested automatically enters the operating system whenever it is started or restarted, and a test program is run automatically or manually to simulate the actual usage scenario of a user.

In step S14, if a memory MCE (Machine Check Exception, an exception triggered when the CPU detects a hardware error) occurs during the test, the test is terminated and a test log is recorded. The system test log is recorded through messages, the built-in log file in the /var/log directory of a Linux system.
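Detecting the MCE condition of step S14 amounts to scanning the system log for machine-check reports. The sketch below is one possible way to do that; the regular expression is an assumption about how such reports are worded (kernel messages typically carry an "mce:" prefix or the phrase "Machine check"), and the sample lines are invented for illustration.

```python
import re

# Assumed pattern: matches an "mce:" prefix or the phrase "machine check".
MCE_PATTERN = re.compile(r"\bmce:|machine check", re.IGNORECASE)


def first_mce_line(lines):
    """Return (line_number, text) of the first MCE report, or None."""
    for number, line in enumerate(lines, start=1):
        if MCE_PATTERN.search(line):
            return number, line.rstrip("\n")
    return None


def scan_messages(path="/var/log/messages"):
    """Scan a log file (default: the messages file named in the text)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return first_mce_line(handle)


if __name__ == "__main__":
    sample = [
        "Feb 26 10:15:01 node1 kernel: eth0: link up\n",
        "Feb 26 10:15:03 node1 kernel: mce: [Hardware Error]: Machine check events logged\n",
    ]
    print(first_mce_line(sample))
```

A test harness could call scan_messages periodically and stop the stress test as soon as it returns a hit.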

In step S2, the client server to be tested returns the test log to the server-side server in an NTFS (log file system) manner; the server-side server obtains the test log and uses ipmitool to save and export the SEL log (a part of the test log). If the client is powered on and running, the log can be saved with a first command: ipmitool sel save sel.log. If the client is down, it can be saved with a second command: ipmitool -I lanplus -H ip -U user -P password sel save sel.log.

The SEL log is the system event log and is obtained with ipmitool, a management tool under Linux; exceptions during the test are recorded in the SEL log, which makes it the key log for events triggered by the system. The first command, ipmitool sel save sel.log, saves the SEL log locally so that it can be viewed; in the second command, -H is followed by the system IP address, -U by the user name, and -P by the password, and sel save sel.log again saves the log to a local file.
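The two ways of saving the SEL log (locally when the system is up, over the network when it is down) differ only in the connection options passed to ipmitool. The sketch below builds the argument list for either case without running anything; the helper name, the default output file name sel.log, and the example credentials are assumptions made for illustration.

```python
def sel_save_command(outfile="sel.log", host=None, user=None, password=None):
    """Build the ipmitool argument list for saving the SEL log.

    With no host, the command targets the local BMC (system is up).
    With a host, the lanplus interface options are added (system is down).
    """
    command = ["ipmitool"]
    if host is not None:
        command += ["-I", "lanplus", "-H", host, "-U", user, "-P", password]
    return command + ["sel", "save", outfile]


if __name__ == "__main__":
    print(" ".join(sel_save_command()))
    print(" ".join(sel_save_command(host="192.168.10.21",
                                    user="admin", password="secret")))
```

The returned list can be passed directly to subprocess.run on a machine where ipmitool is installed; keeping the command as a list avoids shell quoting issues with passwords.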

Dividing the memory ECC problems into repairable ECC problems and unrepairable ECC problems according to the test log is done specifically as follows:

The fifth field in sel.log distinguishes a repairable ECC problem from an unrepairable one. The repairable flag field is 0xa0, and the unrepairable flag field is 0xa1.
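The classification rule can be sketched as a small parser. The patent gives only the flag values (0xa0 and 0xa1) and says the flag is the fifth field; the pipe-delimited record layout and the sample records below are assumptions made for illustration, and a real SEL export may be laid out differently.

```python
REPAIRABLE_FLAG = "0xa0"    # flag value stated in the text
UNREPAIRABLE_FLAG = "0xa1"  # flag value stated in the text


def classify_sel_record(record, delimiter="|"):
    """Classify one SEL record by its fifth field.

    The delimiter and field layout are assumed, not taken from the patent.
    Returns "repairable", "unrepairable", or "unknown".
    """
    fields = [field.strip().lower() for field in record.split(delimiter)]
    if len(fields) < 5:
        return "unknown"
    flag = fields[4]  # fifth field, zero-based index 4
    if flag == REPAIRABLE_FLAG:
        return "repairable"
    if flag == UNREPAIRABLE_FLAG:
        return "unrepairable"
    return "unknown"


if __name__ == "__main__":
    # Hypothetical records for demonstration only.
    print(classify_sel_record("1 | 02/26/2021 | 10:15:03 | Memory | 0xa0 | DIMM1"))
    print(classify_sel_record("2 | 02/26/2021 | 10:16:40 | Memory | 0xA1 | DIMM3"))
```

Returning "unknown" for short or unrecognized records keeps unrelated SEL events from being counted as ECC problems.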

In step S3, the memory with the repairable ECC problem is located and the number of its error reports is counted. The problem is located to a specific memory bank (DIMM); this may be done with a third command that locates the DIMM and lights its indicator, for example: ipmitool raw 0x3a 0xb1 dimm 1; if the system is down, a fourth command can be sent to locate the DIMM and light its indicator: ipmitool -I lanplus -H ip -U user -P password raw 0x3a 0xb1 dimm 1.
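Step S3 combines two operations: counting error reports per memory bank and issuing the locate command. The sketch below illustrates both under stated assumptions: the DIMM labels are invented, the bytes 0x3a and 0xb1 are taken from the example command in the text (they appear to be vendor-specific), and the "dimm 1" part of that command is treated here as a placeholder for a single DIMM number argument.

```python
from collections import Counter


def count_error_reports(dimm_ids):
    """Count how many error reports were seen for each DIMM label."""
    return Counter(dimm_ids)


def locate_command(dimm, host=None, user=None, password=None):
    """Build the ipmitool argument list that locates (lights) a DIMM.

    Assumption: the DIMM number is passed as the single data argument
    after the raw bytes given in the text.
    """
    command = ["ipmitool"]
    if host is not None:  # system is down, go through the LAN interface
        command += ["-I", "lanplus", "-H", host, "-U", user, "-P", password]
    return command + ["raw", "0x3a", "0xb1", str(dimm)]


if __name__ == "__main__":
    counts = count_error_reports(["DIMM1", "DIMM3", "DIMM1"])
    print(counts.most_common(1))
    print(" ".join(locate_command(1)))
```

The per-DIMM counts produced here are the values that the later threshold comparison would consume.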

In steps S4 to S6, for a repairable ECC problem, it is determined whether the number of memory error reports exceeds a preset value. If the number does not exceed the preset value, the error may have a cause other than the memory bank itself, so the client server system is restarted and tested again to see whether the memory error recurs: if it does not recur, the repairable ECC problem is ignored; if it recurs, the memory bank with the repairable ECC problem can be repaired by modifying the BIOS option. If the number exceeds the preset value, a BIOS tool is used to export the BIOS options, and the Enhanced Memory Test option is changed to Test and Repair; the modified options are then imported back into the BIOS, and once the change is confirmed to have taken effect the server is restarted, with the repair carried out automatically during the restart. After the repair is complete, the server automatically enters the operating system (OS).
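The branching in steps S4 to S6 can be summarized as a small decision function. This is a sketch of the logic as described in the paragraph above, not the patented implementation; the function name, the action labels, and the use of None to mean "retest not yet performed" are illustrative choices.

```python
def next_action(error_count, preset_value, recurs_after_restart=None):
    """Decide what to do with a DIMM that reported repairable ECC errors.

    error_count          -- number of error reports counted for the DIMM
    preset_value         -- threshold configured by the tester
    recurs_after_restart -- None if no retest has run yet, otherwise
                            whether the error showed up again
    """
    if error_count > preset_value:
        return "repair"              # modify the BIOS option and reboot
    if recurs_after_restart is None:
        return "restart_and_retest"  # cause may lie outside the DIMM
    return "repair" if recurs_after_restart else "ignore"


if __name__ == "__main__":
    print(next_action(12, preset_value=10))
    print(next_action(3, preset_value=10))
    print(next_action(3, preset_value=10, recurs_after_restart=False))
```

Keeping the decision separate from the commands that carry it out makes the threshold easy to tune per test campaign.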

Specifically, the BIOS tool may be the SCELNX_64 tool or another tool; the invention is not limited in this respect. The BIOS (Basic Input Output System) needs to support the memory enhancement function.
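The export, modify, and re-import cycle can be illustrated on the text of an exported options file. The sketch below assumes a deliberately simplified "Name = Value" line format, which is hypothetical; the real file produced by a BIOS tool such as SCELNX_64 has its own layout, and the export and import commands themselves are not shown. Only the option name and target value come from the text.

```python
def set_bios_option(exported_text, name="Enhanced Memory Test",
                    value="Test and Repair"):
    """Return the exported text with the named option set to value.

    Assumes one "Name = Value" pair per line (hypothetical format).
    Raises KeyError if the option is absent, e.g. when the BIOS does
    not support the memory enhancement function.
    """
    lines, changed = [], False
    for line in exported_text.splitlines():
        key, separator, _old_value = line.partition("=")
        if separator and key.strip() == name:
            lines.append(f"{key.strip()} = {value}")
            changed = True
        else:
            lines.append(line)
    if not changed:
        raise KeyError(name)
    return "\n".join(lines)


def option_is_set(exported_text, name, value):
    """Confirm that a fresh export shows the option at the expected value."""
    for line in exported_text.splitlines():
        key, separator, current = line.partition("=")
        if separator and key.strip() == name:
            return current.strip() == value
    return False


if __name__ == "__main__":
    exported = "Boot Mode = UEFI\nEnhanced Memory Test = Disabled"
    print(set_bios_option(exported))
```

Running option_is_set on a second export after the import corresponds to the "confirm the modification succeeded" check that precedes the restart.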

As shown in FIG. 3, the method for locating and repairing memory ECC problems according to the present invention further includes:

S7, locating the memory with the unrepairable ECC problem, and analyzing the cause of the problem.

In step S7, the memory with the unrepairable ECC problem is located and it is determined which component is causing the problem; after the corresponding hardware is replaced, the test is run again to confirm the cause of the unrepairable ECC problem.

It should be noted that, in the technical solution of the present invention, steps S1 to S7 can all be implemented in a programming language, with the program logic following the steps of the invention; they can also be implemented in other ways, and the invention is not limited in this respect.

The ECC problem generated in the memory test is classified into the recoverable ECC problem and the unrepairable ECC problem, the memory bank with the recoverable ECC problem is repaired by modifying the BIOS option, the quality of the server memory is improved, and the reliability of the ECC problem test of the server memory is effectively improved.

According to the technical scheme, the memory bank with ECC problems can be automatically repaired in the restarting process of the server by modifying the BIOS option, and the repairing efficiency of the memory bank is improved.

According to the technical scheme, if the number of memory error reports does not exceed the preset value, the client server is restarted and the memory ECC problem test is run again, which avoids mislocating ECC problems that are not actually caused by the memory and improves the reliability of the memory ECC problem test.

Example two

As shown in fig. 4, the technical solution of the present invention further provides a device for locating and repairing an ECC problem in a memory, including:

the testing module 101 is used for connecting the client server to be tested with the server in a communication way, and the server controls a plurality of client servers to be tested to test the memory ECC problem;

the dividing module 102 is used for the server to obtain a test log of the client server to be tested, and dividing the memory ECC problem into a recoverable ECC problem and an unrepairable ECC problem according to the test log;

and the positioning and repairing module 103 is used for positioning the memory capable of repairing the ECC problem, counting the error reporting times of the memory, and automatically repairing the positioned memory bank capable of repairing the ECC problem by modifying the BIOS option if the error reporting times of the memory exceed a preset value.

According to the technical scheme, the BIOS option is modified, the memory bank with ECC problems can be automatically repaired in the restarting process of the server, and the efficiency of repairing the memory bank is improved.

Although the embodiments of the present invention have been described with reference to the accompanying drawings, this is not intended to limit the scope of the present invention; those skilled in the art should understand that various modifications and variations can be made without inventive effort on the basis of the technical solution of the present invention.

Claims

1. A method for locating and repairing memory ECC problems is characterized by comprising the following steps:

the method comprises the steps that a server-side server obtains a test log of a client-side server to be tested, and the memory ECC problem is divided into a recoverable ECC problem and a non-recoverable ECC problem according to the test log;

positioning a memory capable of repairing the ECC problem, counting the error reporting times of the memory, and automatically repairing the positioned memory bank capable of repairing the ECC problem by modifying the BIOS option if the error reporting times of the memory exceed a preset value; wherein, through modifying the BIOS option, automatically repairing the located memory bank that can repair the ECC problem specifically includes:

2. The method of claim 1, wherein the step of communicatively connecting the client server to be tested with the server, the step of controlling the plurality of client servers to be tested by the server to perform the testing of the memory ECC problem specifically comprises:

each client server is set to start automatically on boot, the actual usage scenario of a user is simulated, and the memory ECC problem test is performed;

3. The method of claim 1, wherein the dividing of the memory ECC problem into repairable ECC problem and unrepairable ECC problem according to the test log comprises:

4. The method of claim 3, wherein the repairable flag field is 0xa0 and the non-repairable flag field is 0xa1.

5. The method for locating and repairing the memory ECC problem of claim 1, wherein the memory enhancement test option is Enhanced Memory Test, and the test and repair option is Test and Repair.

6. The method of claim 1, wherein the BIOS supports memory enhancement.

7. The method of claim 1, further comprising: and if the error reporting times of the memory do not exceed the preset value, restarting the client server and carrying out the memory ECC problem test again.

8. The method of claim 1, further comprising: and positioning the memory with the uncorrectable ECC problem and analyzing the cause of the problem.

9. A positioning repair device for memory ECC problem is characterized by comprising:

the server side server acquires a test log of the client side server to be tested, and divides the memory ECC problem into a repairable ECC problem and an unrepairable ECC problem according to the test log;

the positioning and repairing module is used for positioning the memory capable of repairing the ECC problem, counting the error reporting times of the memory, and automatically repairing the positioned memory bank capable of repairing the ECC problem by modifying the BIOS option if the error reporting times of the memory exceed a preset value; wherein, through modifying the BIOS option, automatically repairing the located memory bank that can repair the ECC problem specifically includes:

using a BIOS tool to export the BIOS options, and changing the memory enhancement test option in the BIOS options to the test and repair option;