CN117992286A - Memory fault handling method, system, storage medium and terminal - Google Patents

Memory fault handling method, system, storage medium and terminal

Info

Publication number
CN117992286A
Authority
CN
China
Prior art keywords
memory
fault
uncorrectable
target host
faults
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211329257.2A
Other languages
Chinese (zh)
Inventor
高仲于
李诗逸
刁家庆
丁辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd filed Critical Sangfor Technologies Co Ltd
Priority to CN202211329257.2A priority Critical patent/CN117992286A/en
Publication of CN117992286A publication Critical patent/CN117992286A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766 Error or fault reporting or storing
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G06F11/0793 Remedial or corrective actions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computer Hardware Design (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

The application provides a method for handling uncorrectable memory faults, comprising: when an uncorrectable memory fault is detected on a target host, determining the corresponding hardware fault log; parsing the error report content for the uncorrectable memory fault in the hardware fault log to obtain the physical address of the memory fault; determining, according to that physical address, the slot position of the memory bank in which the uncorrectable memory fault occurred; and generating uncorrectable memory fault alarm information containing the slot position, the alarm information being used to handle the uncorrectable memory fault. The application can locate the slot of the faulty memory bank promptly, improve fault handling efficiency when uncorrectable memory faults occur, reduce the risk of system downtime that memory bank faults may cause, and keep the host and the system running stably. The application also provides a handling system, a computer-readable storage medium and a terminal for uncorrectable memory faults, which have the same beneficial effects.

Description

Memory fault handling method, system, storage medium and terminal
Technical Field
The present application relates to the field of network security, and in particular to a method, a system, a storage medium and a terminal for handling memory faults.
Background
Currently, when handling uncorrectable-error memory faults (also called UE memory faults; UE stands for Uncorrected Error), the operating system is unaware of the memory isolation and recovery policy of the BIOS firmware layer. As a result, the operating system and the BIOS layer cannot cooperate closely and may even conflict, and the best early window for isolating and preventing memory faults is missed, which raises the probability that a memory fault brings down the operating system or abnormally stops upper-layer applications. In addition, when an uncorrectable-error memory fault is detected, the memory bank in which the UE occurred cannot be located quickly, so the risk of system downtime is difficult to eliminate in time.
Therefore, how to handle uncorrectable-error memory faults promptly and effectively is a technical problem that those skilled in the art need to solve.
Disclosure of Invention
The application aims to provide a method for handling uncorrectable memory faults, a system for handling uncorrectable memory faults, a storage medium and a terminal, which can quickly locate the faulty memory bank and eliminate the memory fault in time.
In order to solve the above technical problem, the application provides a method for handling uncorrectable memory faults, with the following specific technical scheme:
when an uncorrectable memory fault is detected on the target host, determining the corresponding hardware fault log;
parsing the error report content for the uncorrectable memory fault in the hardware fault log to obtain a memory fault physical address;
writing the memory fault physical address into an EDAC debugging interface, and mapping it to obtain the slot position of the memory bank in which the uncorrectable memory fault occurred;
generating alarm information containing the slot position; the alarm information is used for handling the uncorrectable memory fault.
Optionally, the method further comprises:
and if the memory fault is a correctable memory fault which does not reach a preset response threshold, executing silent isolation recovery on the memory fault.
Optionally, when detecting that the target host has an uncorrectable memory failure, the method further includes:
performing memory fault analysis on the target host to obtain an analysis result at least comprising the number of memory faults;
And if the number of the memory faults is larger than a preset threshold, refusing to start the new service on the target host according to the analysis result, and migrating the current service on the target host to a normal host.
Optionally, the method further comprises:
acquiring the equipment model of a target host;
If the target host of the equipment model supports the configuration information to be exported, the firmware configuration information of the target host is exported, and the firmware memory fault handling option is optimized according to the firmware configuration information;
if the target host of the equipment model does not support configuration information to be exported, analyzing a register value of the target host, and optimizing the firmware memory fault handling option according to the register value;
Wherein the firmware memory failure handling options include at least one of whether the server supports UE recovery, whether the server masks a correctable memory failure signal, and whether the server turns on a failure scan.
Optionally, after determining the corresponding hardware fault log, the method further includes:
If the target host has a downtime record or a restarting record, acquiring a system event log in the baseboard management controller;
Analyzing the system event log, marking the target host as a fault state if an internal error fault of the processor exists, adjusting an application layer service deployment strategy of the target host, and generating prompt information for replacing the memory bank.
Optionally, determining, according to the physical address of the memory failure, a slot position corresponding to the memory bank of the uncorrectable memory failure includes:
writing the physical address of the memory fault into an EDAC debugging interface, and mapping to obtain the slot position corresponding to the memory bank with the uncorrectable memory fault.
The application also provides a disposal system for uncorrectable memory failures, comprising:
the log determining module is used for determining a corresponding hardware fault log when the uncorrectable memory fault of the target host is detected;
the address analysis module is used for analyzing the error reporting content of the uncorrectable memory faults in the hardware fault log to obtain the physical address of the memory faults;
The slot determining module is used for writing the memory fault physical address into an EDAC debugging interface, and mapping to obtain a slot position corresponding to the memory bank with the uncorrectable memory fault;
The alarm module is used for generating alarm information containing the slot position; the alarm information is used for disposing of the uncorrectable memory failure.
Optionally, the system further comprises:
The fault type detection module is used for determining the fault type of the memory fault when the memory fault is detected;
The host marking module is used for marking a target host with the memory fault as a non-health state if the memory fault is a correctable memory fault reaching a preset response threshold or is the uncorrectable memory fault;
And the application layer strategy adjustment module is used for adjusting the application layer service deployment strategy of the target host in the unhealthy state.
The application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the method as described above.
The application also provides a terminal comprising a memory in which a computer program is stored and a processor which when calling the computer program in the memory implements the steps of the method as described above.
The application provides a method for handling uncorrectable memory faults, comprising: when an uncorrectable memory fault is detected on the target host, determining the corresponding hardware fault log; parsing the error report content for the uncorrectable memory fault in the hardware fault log to obtain a memory fault physical address; determining, according to the memory fault physical address, the slot position of the memory bank in which the uncorrectable memory fault occurred; and generating alarm information containing the slot position, the alarm information being used for handling the uncorrectable memory fault.
When an uncorrectable memory fault occurs, the application parses the error report content from the hardware fault log to obtain the memory fault physical address. This is necessary because error detection and correction information is usually not recorded in the kernel log when an uncorrectable memory fault occurs, so the memory bank in which the fault occurred cannot be determined directly. By mapping the physical address to a slot position, the application can locate the slot of the faulty memory bank in time, improve fault handling efficiency, and reduce the risk of system downtime that memory bank faults may cause.
The application also provides a handling system, a computer-readable storage medium and a terminal for uncorrectable memory faults, which have the same beneficial effects and are not described again here.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for handling uncorrectable memory failures according to an embodiment of the present application;
FIG. 2 is a flow chart of service deployment adjustment in the event of a memory failure according to the present application;
FIG. 3 is a flowchart of another method for handling uncorrectable memory failures according to an embodiment of the present application;
FIG. 4 is a flowchart of another method for handling uncorrectable memory failures according to an embodiment of the present application;
FIG. 5 is a flowchart of a method for configuring firmware of a target host according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a handling system for uncorrectable memory failures according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a terminal according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to fig. 1, fig. 1 is a flowchart of a method for handling uncorrectable memory failures according to an embodiment of the present application, where the method includes:
S101: when the uncorrectable memory faults of the target host are detected, determining corresponding hardware fault logs;
This step determines the hardware fault log, i.e. the mcelog log, recorded by the target host once an uncorrectable memory fault is found to have occurred on it. The target host generates the hardware fault log automatically when a memory fault occurs. Note that an uncorrectable memory fault is one of two kinds of memory fault; it is usually a multi-bit data error and easily causes system downtime or stops applications. The other kind is a correctable-error memory fault (also called a CE memory fault; CE stands for Corrected Error), which is usually a single-bit data error: the system or application does not normally stop because of it, and existing fault-tolerant correction measures can recover the single-bit error. This step does not limit how a memory fault is identified as uncorrectable; the steps of this embodiment are executed once an uncorrectable memory fault is determined to have occurred, and the way that determination is made has no influence on the triggering and execution of this embodiment.
S102: analyzing the error reporting content of the uncorrectable memory faults in the hardware fault log to obtain a memory fault physical address;
This step parses the hardware fault log to obtain the error report content and the memory fault physical address contained in it. The physical address is an address in the physical memory of the current host and does not by itself indicate which memory bank (DIMM) is affected. In particular, for an uncorrectable memory fault of the SRAR (software recoverable action required) type, EDAC (Error Detection And Correction) information is not recorded in the kernel log by default, so the slot position of the faulty memory bank must be determined further from the memory fault physical address.
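As a rough illustration of the parsing in S102, the following sketch pulls physical addresses out of log text. It assumes the log carries the address in an `ADDR <hex>` field; the sample excerpt, the regular expression and the function name are illustrative assumptions rather than anything specified by the application, and real log layouts vary by tool version and platform.

```python
import re
from typing import List

# Assumed field layout: the token "ADDR" followed by a hexadecimal value.
ADDR_PATTERN = re.compile(r"\bADDR\s+(?:0x)?([0-9a-fA-F]+)\b")


def extract_fault_addresses(log_text: str) -> List[int]:
    """Return every physical address found in the log text, in order of appearance."""
    return [int(match.group(1), 16) for match in ADDR_PATTERN.finditer(log_text)]


# Hypothetical log excerpt written for this example, not taken from a real host.
SAMPLE_LOG = """\
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 7
MISC 200001c0 ADDR 3f7e5b040
STATUS bd80000000100134 MCGSTATUS 7
"""

addresses = extract_fault_addresses(SAMPLE_LOG)
print([hex(address) for address in addresses])
```

Returning integers rather than strings keeps later steps, such as formatting the address for another interface, independent of how the log happened to print it.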
S103: determining the slot position corresponding to the memory bank with the uncorrectable memory fault according to the memory fault physical address;
This step calculates the slot position of the faulty memory bank: the slot of the memory bank in which the uncorrectable memory fault occurred can be determined from the memory fault physical address obtained in the previous step. Specifically, the memory fault physical address from the error report content is written into the EDAC debugging interface, and the EDAC calculation process maps it to the slot position of the faulty memory bank.
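A minimal sketch of the write described in S103 might look like the following. The application does not name the debugging file or the format it accepts, so the path is left as a parameter and the hexadecimal text format is an assumption; both depend on the kernel version and the EDAC driver in use. A temporary file stands in for the real interface here.

```python
import os
import tempfile


def write_fault_address(debug_file: str, physical_address: int) -> str:
    """Write the address as hexadecimal text to the debug file and return what was written.

    The text format ("0x..." plus newline) is an assumption for illustration.
    """
    payload = f"0x{physical_address:x}\n"
    with open(debug_file, "w") as handle:
        handle.write(payload)
    return payload


# Stand-in for the real debugging interface, which would normally need root access.
with tempfile.TemporaryDirectory() as scratch_dir:
    stand_in_path = os.path.join(scratch_dir, "addr")
    written = write_fault_address(stand_in_path, 0x3F7E5B040)
    print(written.strip())
```

On a real host the decoded slot would then have to be read back from wherever the driver reports it, which this sketch does not attempt to model.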
How the slot position is output is not limited here. An output format similar to that used for correctable memory faults may be used, for example cpu_srcid#1_mc#1_chan#0_dimm#1, which includes the slot number information.
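To show how a label in the example format could be turned into structured fields, here is a small sketch. The field names (`socket`, `controller`, `channel`, `dimm`) are an interpretation chosen for this example; the application only gives the label string itself.

```python
import re
from typing import Dict

# Pattern for labels shaped like the example "cpu_srcid#1_mc#1_chan#0_dimm#1".
LABEL_PATTERN = re.compile(
    r"^cpu_srcid#(?P<socket>\d+)_mc#(?P<controller>\d+)"
    r"_chan#(?P<channel>\d+)_dimm#(?P<dimm>\d+)$",
    re.IGNORECASE,
)


def parse_slot_label(label: str) -> Dict[str, int]:
    """Split a slot label into integer fields; raise ValueError on any other shape."""
    match = LABEL_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized slot label: {label!r}")
    return {name: int(value) for name, value in match.groupdict().items()}


slot = parse_slot_label("cpu_srcid#1_mc#1_chan#0_dimm#1")
print(slot)
```

Rejecting unknown shapes with an exception, rather than guessing, avoids raising an alarm that points at the wrong slot.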
S104: generating alarm information containing the slot position; the alarm information is used for disposing of the uncorrectable memory failure.
After the slot position is determined, uncorrectable memory fault alarm information containing the slot position is generated. The alarm information may also include other content, for example the number and types of memory faults present on the target host, or the correctable memory faults currently present on the target host, so that the user can judge the current memory state of the target host and take corresponding handling measures. The handling measures are not limited here and may include, but are not limited to, inspecting and replacing the memory bank.
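One way the alarm information of S104 could be represented is as a small JSON document. The field names and the suggested-action text below are hypothetical; the application only requires that the alarm contain the slot position and allows other content such as fault counts.

```python
import json
from typing import Optional


def build_alarm(host: str, slot: str, physical_address: int,
                ue_count: int = 1, ce_count: Optional[int] = None) -> str:
    """Serialize an uncorrectable-fault alarm; all field names are illustrative."""
    alarm = {
        "host": host,
        "fault_type": "uncorrectable",
        "slot": slot,
        "physical_address": hex(physical_address),
        "ue_count": ue_count,
        "suggested_action": "inspect or replace the memory bank in this slot",
    }
    # Correctable-fault statistics are optional extra context.
    if ce_count is not None:
        alarm["ce_count"] = ce_count
    return json.dumps(alarm, sort_keys=True)


alarm_text = build_alarm("host-01", "cpu_srcid#1_mc#1_chan#0_dimm#1", 0x3F7E5B040, ce_count=12)
print(alarm_text)
```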
When an uncorrectable memory fault occurs, this embodiment parses the error report content from the hardware fault log to obtain the memory fault physical address. This is necessary because error detection and correction information is usually not recorded in the kernel log when an uncorrectable memory fault occurs, so the memory bank in which the fault occurred cannot be determined directly.
On the basis of the above embodiment, as a preferred implementation, the detected memory fault may be analyzed and judged before the corresponding hardware fault log is determined, so that the handling policy for the memory fault is adjusted at the operating system and application layer. Referring to fig. 2, fig. 2 is a flowchart of service deployment adjustment in the event of a memory fault provided by the present application; the specific process is as follows:
S201: when a memory fault is detected, determining the fault type of the memory fault;
S202: if the memory fault is a correctable memory fault reaching a preset response threshold or is the uncorrectable memory fault, marking the target host with the memory fault as being in an unhealthy state;
S203: adjusting the application layer service deployment strategy of the target host in the unhealthy state.
The preset response threshold is not specifically limited here. It is usually a count threshold, and may also be another threshold that reflects the severity of the current memory fault. Because correctable and uncorrectable memory faults differ greatly in severity, a preset response threshold is usually set for correctable memory faults, while no threshold need be set for uncorrectable memory faults.
In addition, if the memory fault is a correctable memory fault that has not reached the preset response threshold, silent isolation recovery is performed on the memory fault, and the application layer service deployment policy does not need to be adjusted.
Once a correctable memory fault reaches the preset response threshold, or the memory fault is uncorrectable, the corresponding target host is marked as unhealthy and its application layer service deployment policy is adjusted.
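The decision described for S201 to S203, including the silent isolation case, can be sketched as a small function. This assumes the response threshold is a count and that "reaching" the threshold means greater than or equal to it; the enum, the function name and the returned action strings are hypothetical.

```python
from enum import Enum


class FaultType(Enum):
    CORRECTABLE = "CE"
    UNCORRECTABLE = "UE"


def triage(fault_type: FaultType, ce_count: int, ce_threshold: int) -> str:
    """Return the action for a detected memory fault."""
    # Uncorrectable faults need no threshold: the host is unhealthy immediately.
    if fault_type is FaultType.UNCORRECTABLE:
        return "mark_unhealthy"
    # Correctable faults only escalate once the count reaches the threshold.
    if ce_count >= ce_threshold:
        return "mark_unhealthy"
    return "silent_isolation"


print(triage(FaultType.CORRECTABLE, ce_count=3, ce_threshold=10))
print(triage(FaultType.CORRECTABLE, ce_count=10, ce_threshold=10))
print(triage(FaultType.UNCORRECTABLE, ce_count=0, ce_threshold=10))
```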
The application layer service deployment policy mainly affects the services currently deployed on the target host and the application services still to be deployed on it.
One possible process for adjusting the application layer service deployment policy is as follows. First, memory fault analysis is performed on the target host to obtain an analysis result that includes at least the number of memory faults. If the number of memory faults is greater than a preset threshold, the start of new services on the target host is refused according to the analysis result; that is, the target host no longer accepts new application service processes, which reduces memory operations and avoids the adverse effects on the target host and the system of a worsening memory fault. At the same time, the services currently on the target host may be migrated to a normal host.
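The adjustment just described can be modelled with a toy in-memory scheduler view. The `Host` class, its fields and the migration-by-list-move are simplifications assumed for this sketch; a real cluster would perform the migration through its own orchestration layer.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Host:
    name: str
    fault_count: int = 0
    accepts_new_services: bool = True
    services: List[str] = field(default_factory=list)


def adjust_deployment(target: Host, healthy: Host, fault_threshold: int) -> bool:
    """Apply the policy; return True if the target host was restricted."""
    # Only act when the fault count is greater than the preset threshold.
    if target.fault_count <= fault_threshold:
        return False
    # Refuse new services on the faulty host.
    target.accepts_new_services = False
    # Migrate the current services to the normal host.
    healthy.services.extend(target.services)
    target.services.clear()
    return True


faulty = Host("host-a", fault_count=5, services=["db", "web"])
normal = Host("host-b", services=["cache"])
print(adjust_deployment(faulty, normal, fault_threshold=3), normal.services)
```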
This embodiment addresses the defect that the current operating system is unaware of the memory isolation and recovery policy of the BIOS firmware layer. That defect easily prevents the operating system and the BIOS firmware layer from cooperating closely, or even brings them into conflict, so that the system cannot reach its optimal memory fault recovery policy and misses the best early window for isolating and preventing memory faults, which raises the probability that a memory fault brings down the operating system or abnormally stops upper-layer applications. The embodiment evaluates the health state of the target host based on memory faults, so that the application layer service deployment policy of the target host can be adjusted in time and the high reliability of the services of the cluster in which the target host is located is ensured.
Based on the foregoing embodiments, as a preferred embodiment, after the corresponding hardware fault log is determined, the target host may also be checked for downtime or restarts. If the target host has a downtime record or a restart record, the system event log in the baseboard management controller is obtained and analyzed; if a processor internal error fault exists, the target host is marked as being in a fault state, its application layer service deployment policy is adjusted, and prompt information for replacing the memory bank is generated.
The complete detection process for this embodiment may be as shown in fig. 3, which is a flowchart of another method for handling uncorrectable memory failures provided in the embodiment of the present application; it specifically includes the following steps:
S301: when an uncorrectable memory fault is detected on the target host, determining the corresponding hardware fault log; if the target host has a downtime record or a restart record, proceeding to S302; if not, proceeding to S303;
S302: acquiring the system event log in the baseboard management controller, analyzing the system event log, marking the target host as being in a fault state if a processor internal error fault exists, adjusting the application layer service deployment strategy of the target host, generating prompt information for replacing the memory bank, and ending the flow;
S303: parsing the error report content for the uncorrectable memory fault in the hardware fault log to obtain a memory fault physical address;
S304: determining, according to the memory fault physical address, the slot position of the memory bank in which the uncorrectable memory fault occurred;
S305: generating alarm information containing the slot position.
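The branch between S302 and S303 can be sketched as follows. The check for a processor internal error simply looks for the words "processor" and "IERR" in each system event log line, and the sample lines are hypothetical; actual log wording differs between baseboard management controllers. The application does not say what happens in S302 when no processor internal error is found, so the sketch only reports that case.

```python
from typing import Iterable, List


def has_processor_internal_error(sel_lines: Iterable[str]) -> bool:
    """True if any log line mentions both a processor and an internal error (IERR)."""
    for line in sel_lines:
        lowered = line.lower()
        if "processor" in lowered and "ierr" in lowered:
            return True
    return False


def choose_branch(has_crash_record: bool, sel_lines: List[str]) -> str:
    """Pick the next step after the hardware fault log has been determined (S301)."""
    if not has_crash_record:
        return "S303"  # continue with address parsing and slot lookup
    if has_processor_internal_error(sel_lines):
        return "S302: mark host faulty, adjust deployment, prompt replacement"
    # Assumption: the source leaves this case unspecified.
    return "S302: no processor internal error found"


# Hypothetical system event log lines.
SAMPLE_SEL = [
    "1 | 10/27/2022 | 03:14:07 | Memory #0x02 | Uncorrectable ECC | Asserted",
    "2 | 10/27/2022 | 03:14:08 | Processor #0x01 | IERR | Asserted",
]

print(choose_branch(True, SAMPLE_SEL))
print(choose_branch(False, SAMPLE_SEL))
```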
Based on the above embodiments, referring to fig. 4, fig. 4 is a flowchart of another method for handling uncorrectable memory failures according to an embodiment of the present application, where a complete memory failure handling process is obtained by combining the embodiments as follows:
S401: when a memory fault is detected, determining the fault type of the memory fault;
S402: if the memory fault is a correctable memory fault reaching a preset response threshold or is the uncorrectable memory fault, marking the target host with the memory fault as being in an unhealthy state;
S403: adjusting the application layer service deployment strategy of the target host in the unhealthy state;
S404: determining the corresponding hardware fault log; if the target host has a downtime record or a restart record, proceeding to S405; if not, proceeding to S406;
S405: acquiring the system event log in the baseboard management controller, analyzing the system event log, marking the target host as being in a fault state if a processor internal error fault exists, adjusting the application layer service deployment strategy of the target host, generating prompt information for replacing the memory bank, and ending the flow;
S406: parsing the error report content for the uncorrectable memory fault in the hardware fault log to obtain a memory fault physical address;
S407: determining, according to the memory fault physical address, the slot position of the memory bank in which the uncorrectable memory fault occurred;
S408: generating alarm information containing the slot position.
Based on the above embodiments, in order to further reduce the influence of memory faults on the host and to reduce memory handling conflicts between the operating system and the BIOS layer on the target host, a memory fault recovery policy may be configured for the target host, so that memory faults are prevented and are handled efficiently when they occur. Referring to fig. 5, fig. 5 is a flowchart of a firmware configuration method of a target host according to an embodiment of the present application; a possible configuration process is as follows:
S501: acquiring the device model of the target host;
S502: if target hosts of the device model support exporting configuration information, exporting the firmware configuration information of the target host, and optimizing the firmware memory fault handling options according to the firmware configuration information;
S503: if target hosts of the device model do not support exporting configuration information, analyzing the register values of the target host, and optimizing the firmware memory fault handling options according to the register values.
This embodiment analyzes the target host and optimizes its memory configuration from the viewpoint of the operating system. Specifically, the device model of the target host is obtained first (it can be obtained directly through the operating system of the target host), and it is then judged whether hosts of this device model support the export of configuration information. The configuration information mainly refers to the firmware configuration information of the target host, including but not limited to the firmware configuration related to memory.
If configuration information export is supported, the firmware configuration information can be exported directly, and the firmware memory fault handling options are optimized according to it. The firmware memory fault handling options include at least one of: whether the server supports UE recovery, whether the server masks the correctable memory fault signal, and whether the server enables fault scanning; other handling options may also be included, and these examples are not limiting. Taking the above three options as examples, whether the server supports UE recovery is preferably set to yes, whether the server masks the correctable memory fault signal is preferably set to no, and whether the server enables fault scanning is preferably set to yes.
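A comparison of exported settings against the preferred values named above might be sketched like this. The option keys are hypothetical names invented for the example, since real firmware exports use vendor-specific names; the preferred values themselves (yes, no, yes) come from the text.

```python
from typing import Dict

# Preferred values from the text; the key names are illustrative assumptions.
RECOMMENDED: Dict[str, bool] = {
    "ue_recovery_supported": True,   # server supports UE recovery: yes
    "ce_signal_masked": False,       # server masks the CE fault signal: no
    "fault_scan_enabled": True,      # server enables fault scanning: yes
}


def options_to_change(exported: Dict[str, bool]) -> Dict[str, bool]:
    """Return the options whose exported value differs from, or lacks, the preferred value."""
    return {
        key: wanted
        for key, wanted in RECOMMENDED.items()
        if exported.get(key) != wanted
    }


exported_config = {"ue_recovery_supported": True, "ce_signal_masked": True}
print(options_to_change(exported_config))
```

Treating a missing option the same as a wrong one means an incomplete export still produces a complete list of settings to review.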
If the target host does not support configuration information export, the register values that hold the configuration in the current state are read and set to recommended values. For example, the values of the CPU MSR registers and the IMC registers may be checked to determine what memory fault recovery the server supports. A recommended value is an option value that facilitates timely response to and handling of memory faults. After a recommended value is set, whether the modification succeeded can be checked; if it failed, the user is guided to configure the device manually. If it succeeded, the firmware fault recovery configuration of the target host is optimal and the process ends.
This embodiment is independent of the above embodiments, and the target host may be configured at any time.
The following are several exemplary registers and the memory fault recovery functions they reflect:
(1) The Poison_en bit of the MSR_MCG_CONTAIN (0x178) register indicates whether the server supports UE recovery.
(2) The CMCI_EN bit of the IA32_MCG_CAP (0x179) register and the CMCI_EN bit [30:30] of the MCx_CTL2 MSRs (0x280-0x29F) indicate whether the server masks the memory CE fault signal.
(3) The Scrub_en bit of the IMC register (0x914) indicates whether the server enables active scanning.
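As an illustration of checking one of these bits, the sketch below reads a 64-bit register value and tests bit 30, the CMCI_EN position given for the MCx_CTL2 MSRs. It relies on the Linux `msr` driver convention that reading 8 bytes at a file offset equal to the register address returns that register; on a real host the device would be a file such as `/dev/cpu/0/msr`, which requires the `msr` kernel module and root privileges. The path is a parameter here, and a temporary file stands in for the device so the example runs anywhere `os.pread` is available. Bit positions for the other registers are not given in the text and are not assumed.

```python
import os
import struct
import tempfile


def read_msr(device_path: str, msr_address: int) -> int:
    """Read one 64-bit register: 8 bytes at an offset equal to the register address."""
    fd = os.open(device_path, os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, msr_address)
    finally:
        os.close(fd)
    if len(raw) != 8:
        raise OSError(f"short read from {device_path} at {msr_address:#x}")
    # Register values are little-endian on x86.
    return struct.unpack("<Q", raw)[0]


def bit_is_set(value: int, bit: int) -> bool:
    """Test a single bit of a register value."""
    return bool((value >> bit) & 1)


# Stand-in file with bit 30 set at offset 0x280 (first MCx_CTL2 address in the text).
with tempfile.TemporaryDirectory() as scratch_dir:
    fake_device = os.path.join(scratch_dir, "msr")
    with open(fake_device, "wb") as handle:
        handle.seek(0x280)
        handle.write(struct.pack("<Q", 1 << 30))
    ctl2_value = read_msr(fake_device, 0x280)
    print(hex(ctl2_value), bit_is_set(ctl2_value, 30))
```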
Of course, on the basis of the embodiment, when the target host is detected to need to execute firmware configuration optimization, corresponding prompt information or guide document can be generated, so that the user can manually adjust the firmware configuration.
The following describes a handling system for uncorrectable memory faults provided by an embodiment of the present application; the handling system described below and the handling method described above may be referred to in correspondence with each other.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a handling system for uncorrectable memory faults according to an embodiment of the present application. The system includes:
the log determining module is used for determining a corresponding hardware fault log when the uncorrectable memory fault of the target host is detected;
the address analysis module is used for analyzing the error reporting content of the uncorrectable memory faults in the hardware fault log to obtain the physical address of the memory faults;
The slot determining module is used for determining the slot position corresponding to the memory bank where the uncorrectable memory fault occurs according to the memory fault physical address;
The alarm module is used for generating alarm information containing the slot position; the alarm information is used for disposing of the uncorrectable memory failure.
Based on the above embodiment, as a preferred embodiment, the system further comprises:
The fault type detection module is used for determining the fault type of the memory fault when the memory fault is detected;
The host marking module is used for marking a target host with the memory fault as a non-health state if the memory fault is a correctable memory fault reaching a preset response threshold or is the uncorrectable memory fault;
And the application layer strategy adjustment module is used for adjusting the application layer service deployment strategy of the target host in the unhealthy state.
Based on the above embodiment, as a preferred embodiment, further comprising:
And the isolation recovery module is used for executing silent isolation recovery on the memory fault when the memory fault is a correctable memory fault which does not reach a preset response threshold.
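The decision made by the fault type detection, host marking, and isolation recovery modules can be sketched as a small classification function. In this illustrative Python fragment, the enum, the action names, and the interpretation of "reaching the preset response threshold" as a correctable-error count are assumptions, not details given in the application.

```python
from enum import Enum


class FaultType(Enum):
    CORRECTABLE = "CE"
    UNCORRECTABLE = "UE"


def decide_action(fault_type: FaultType, ce_count: int, response_threshold: int) -> str:
    """Choose how to respond to a detected memory fault.

    Assumption: the response threshold is compared against a count of
    correctable errors observed on the host.
    """
    if fault_type is FaultType.UNCORRECTABLE:
        return "mark_unhealthy"
    if ce_count >= response_threshold:
        return "mark_unhealthy"
    return "silent_isolation_recovery"
```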
On the basis of the above embodiment, as a preferred embodiment, the application layer policy adjustment module is configured to perform memory fault analysis on the target host to obtain an analysis result that includes at least the number of memory faults, and, if the number of memory faults is greater than a preset threshold, to refuse to start new services on the target host according to the analysis result and to migrate the current services on the target host to a normal host.
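A minimal sketch of this policy adjustment follows. The `Host` class, its fields, and the in-memory "migration" are hypothetical simplifications; an actual deployment would call into whatever scheduler or orchestration layer manages the services.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Host:
    name: str
    accepts_new_services: bool = True
    services: List[str] = field(default_factory=list)


def adjust_deployment(target: Host, normal_host: Host,
                      fault_count: int, threshold: int) -> bool:
    """Stop scheduling onto the target and move its services if faults exceed the threshold.

    Returns True if the policy was adjusted, False otherwise.
    """
    if fault_count <= threshold:
        return False
    target.accepts_new_services = False           # refuse to start new services
    normal_host.services.extend(target.services)  # migrate current services
    target.services.clear()
    return True
```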
On the basis of the above embodiment, as a preferred embodiment, the system further includes:
a firmware configuration module, configured to acquire the device model of the target host; if a target host of that device model supports exporting configuration information, export the firmware configuration information of the target host and optimize the firmware memory fault handling options according to the firmware configuration information; and if a target host of that device model does not support exporting configuration information, parse the register values of the target host and optimize the firmware memory fault handling options according to the register values; wherein the firmware memory fault handling options include at least one of: whether the server supports UE recovery, whether the server masks the correctable memory fault signal, and whether the server has enabled fault scanning.
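The branch on device model can be sketched as choosing between two data sources and then listing the options that still need attention. Everything named here is hypothetical: the set of exportable models, the two callables, and the convention that each option maps to `True` when it is already in the desired state.

```python
from typing import Callable, Dict, List, Set


def options_to_optimize(
    model: str,
    exportable_models: Set[str],
    export_config: Callable[[], Dict[str, bool]],
    read_registers: Callable[[], Dict[str, bool]],
) -> List[str]:
    """List firmware memory fault handling options that are not yet in the desired state.

    Uses exported firmware configuration when the model supports export,
    and falls back to parsed register values otherwise.
    """
    source = export_config if model in exportable_models else read_registers
    options = source()
    return sorted(name for name, in_desired_state in options.items()
                  if not in_desired_state)
```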
On the basis of the above embodiment, as a preferred embodiment, the system further includes:
a device abnormality detection module, configured to acquire the system event log from the baseboard management controller if the target host has a downtime record or a restart record; and to analyze the system event log and, if a processor internal error fault exists, mark the target host as being in a fault state, adjust the application layer service deployment policy of the target host, and generate prompt information for replacing the memory bank.
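The check performed by this module can be sketched as follows. The system event log is represented here as a list of text lines, and a processor internal error is recognized by the substring "IERR"; both are assumptions, since log wording varies between baseboard management controller vendors.

```python
from typing import List, Optional


def analyze_system_event_log(has_crash_or_reboot: bool,
                             sel_entries: List[str]) -> Optional[dict]:
    """Decide follow-up actions from system event log entries.

    Returns None when there is no downtime or restart record, because the
    log is only examined in that case.
    """
    if not has_crash_or_reboot:
        return None
    # Assumption: a processor internal error appears as "IERR" in the entry text.
    if any("IERR" in entry.upper() for entry in sel_entries):
        return {
            "host_state": "fault",
            "adjust_deployment": True,
            "prompt": "replace the memory bank",
        }
    return {"host_state": "unchanged", "adjust_deployment": False, "prompt": None}
```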
On the basis of the foregoing embodiments, as a preferred embodiment, the slot determining module includes:
a slot position mapping unit, configured to write the memory fault physical address into the EDAC debugging interface and obtain, by mapping, the slot position corresponding to the memory bank in which the uncorrectable memory fault occurred.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed, can implement the steps of the method provided by the above-described embodiments. The storage medium may include a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
The application also provides a terminal, which may include a memory and a processor, wherein the memory stores a computer program, and the processor, when calling the computer program in the memory, can implement the steps of the uncorrectable memory fault handling method provided by the above embodiments. The terminal may of course also include various network interfaces, a power supply, and other components. Referring to fig. 7, which is a schematic structural diagram of a terminal according to an embodiment of the present application, the terminal in this embodiment may include a processor 2101 and a memory 2102.
Optionally, the terminal may also include a communication interface 2103, an input unit 2104, a display 2105, and a communication bus 2106.
The processor 2101, the memory 2102, the communication interface 2103, the input unit 2104, and the display 2105 all communicate with each other via the communication bus 2106.
In an embodiment of the present application, the processor 2101 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field-programmable gate array (FPGA), or another programmable logic device.
The processor may call a program stored in the memory 2102. In particular, the processor may perform the operations performed by the terminal in the above embodiments.
The memory 2102 is used to store one or more programs, which may include program code comprising computer operation instructions. In an embodiment of the present application, the memory stores at least programs for implementing the following functions:
when an uncorrectable memory fault is detected on the target host, determining the corresponding hardware fault log;
parsing the error report content of the uncorrectable memory fault in the hardware fault log to obtain the memory fault physical address;
determining, according to the memory fault physical address, the slot position corresponding to the memory bank in which the uncorrectable memory fault occurred;
generating alarm information containing the slot position, the alarm information being used for handling the uncorrectable memory fault.
In one possible implementation, the memory 2102 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and at least one application program required for a function (such as a topic detection function), and the data storage area may store data created during use of the computer.
In addition, the memory 2102 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device or another non-volatile solid-state storage device.
The communication interface 2103 may be an interface of a communication module, such as an interface of a GSM module.
The terminal may also include a display 2105, an input unit 2104, and the like.
The structure shown in fig. 7 does not limit the terminal in the embodiment of the present application; in practical applications, the terminal may include more or fewer components than those shown in fig. 7, or may combine some of the components.
The embodiments in this description are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be referred to one another. Because the system provided by the embodiments corresponds to the method provided by the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
The principles and embodiments of the present application have been described herein with reference to specific examples, and the description of the embodiments is intended only to facilitate an understanding of the method of the present application and its core ideas. It should be noted that those skilled in the art may make various modifications and adaptations to the application without departing from its principles, and such modifications and adaptations are intended to fall within the scope of the application as defined in the following claims.
It should also be noted that in this specification, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

Claims (10)

1. A method for handling a memory fault, comprising:
when an uncorrectable memory fault is detected on a target host, determining a corresponding hardware fault log;
parsing the error report content of the uncorrectable memory fault in the hardware fault log to obtain a memory fault physical address;
determining, according to the memory fault physical address, the slot position corresponding to the memory bank in which the uncorrectable memory fault occurred;
generating alarm information containing the slot position, the alarm information being used for handling the uncorrectable memory fault.
2. The handling method according to claim 1, further comprising:
if the memory fault is a correctable memory fault that has not reached a preset response threshold, performing silent isolation recovery on the memory fault.
3. The handling method according to claim 2, wherein, when it is detected that the target host has an uncorrectable memory fault, the method further comprises:
performing memory fault analysis on the target host to obtain an analysis result that includes at least the number of memory faults;
if the number of memory faults is greater than a preset threshold, refusing to start new services on the target host according to the analysis result, and migrating the current services on the target host to a normal host.
4. The handling method according to claim 1, further comprising:
acquiring the device model of the target host;
if a target host of the device model supports exporting configuration information, exporting the firmware configuration information of the target host, and optimizing the firmware memory fault handling options according to the firmware configuration information;
if a target host of the device model does not support exporting configuration information, parsing the register values of the target host, and optimizing the firmware memory fault handling options according to the register values;
wherein the firmware memory fault handling options include at least one of: whether the server supports UE recovery, whether the server masks the correctable memory fault signal, and whether the server has enabled fault scanning.
5. The handling method according to claim 2, further comprising, after determining the corresponding hardware fault log:
if the target host has a downtime record or a restart record, acquiring the system event log from the baseboard management controller;
analyzing the system event log and, if a processor internal error fault exists, marking the target host as being in a fault state, adjusting the application layer service deployment policy of the target host, and generating prompt information for replacing the memory bank.
6. The handling method according to claim 1, wherein determining, according to the memory fault physical address, the slot position corresponding to the memory bank in which the uncorrectable memory fault occurred comprises:
writing the memory fault physical address into an EDAC debugging interface, and obtaining, by mapping, the slot position corresponding to the memory bank in which the uncorrectable memory fault occurred.
7. An uncorrectable memory fault handling system, comprising:
a log determining module, configured to determine a corresponding hardware fault log when an uncorrectable memory fault is detected on a target host;
an address analysis module, configured to parse the error report content of the uncorrectable memory fault in the hardware fault log to obtain a memory fault physical address;
a slot determining module, configured to determine, according to the memory fault physical address, the slot position corresponding to the memory bank in which the uncorrectable memory fault occurred;
an alarm module, configured to generate alarm information containing the slot position, the alarm information being used for handling the uncorrectable memory fault.
8. The handling system according to claim 7, further comprising:
a fault type detection module, configured to determine the fault type of a memory fault when the memory fault is detected;
a host marking module, configured to mark the target host on which the memory fault occurred as being in an unhealthy state if the memory fault is a correctable memory fault that has reached a preset response threshold or is the uncorrectable memory fault;
an application layer policy adjustment module, configured to adjust the application layer service deployment policy of the target host in the unhealthy state.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of handling uncorrectable memory failures according to any one of claims 1-6.
10. A terminal comprising a memory and a processor, wherein the memory has a computer program stored therein, and wherein the processor, when calling the computer program in the memory, implements the steps of the method for handling uncorrectable memory failures according to any one of claims 1-6.
CN202211329257.2A 2022-10-27 2022-10-27 Memory fault handling method, system, storage medium and terminal Pending CN117992286A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211329257.2A CN117992286A (en) 2022-10-27 2022-10-27 Memory fault handling method, system, storage medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211329257.2A CN117992286A (en) 2022-10-27 2022-10-27 Memory fault handling method, system, storage medium and terminal

Publications (1)

Publication Number Publication Date
CN117992286A true CN117992286A (en) 2024-05-07

Family

ID=90894680

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211329257.2A Pending CN117992286A (en) 2022-10-27 2022-10-27 Memory fault handling method, system, storage medium and terminal

Country Status (1)

Country Link
CN (1) CN117992286A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination