CN112231128A - Memory error processing method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN112231128A
CN112231128A CN202010951988.5A CN202010951988A CN112231128A CN 112231128 A CN112231128 A CN 112231128A CN 202010951988 A CN202010951988 A CN 202010951988A CN 112231128 A CN112231128 A CN 112231128A
Authority
CN
China
Prior art keywords
memory
error
correctable
server
errors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010951988.5A
Other languages
Chinese (zh)
Other versions
CN112231128B (en)
Inventor
胡金富
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Controllable Information Industry Co Ltd
Original Assignee
Zhongke Controllable Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongke Controllable Information Industry Co Ltd
Priority to CN202010951988.5A
Publication of CN112231128A
Application granted
Publication of CN112231128B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073 Error or fault processing not based on redundancy, the processing taking place in a memory management context, e.g. virtual memory or cache management
    • G06F11/0766 Error or fault reporting or storing
    • G06F11/0793 Remedial or corrective actions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application relates to a memory error processing method, a memory error processing device, computer equipment and a storage medium. The method comprises the following steps: when a correctable memory error is detected in the server, the correctable memory errors are classified and summarized to obtain memory error information; a target error processing mode of the server is obtained; if the target error processing mode is a reporting error mode, the memory error information is displayed; and if the target error processing mode is an isolating error mode, the memory unit where the correctable memory error is located is isolated. When the user selects the isolating error mode, the server can isolate the memory unit where the correctable memory error is located, which avoids the server crashing because accumulated correctable memory errors turn into uncorrectable memory errors when correctable memory errors are triggered continuously.

Description

Memory error processing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a memory error processing method and apparatus, a computer device, and a storage medium.
Background
With the growing adoption of cloud computing and big data and the marked increase in data computing capacity, the requirements on server stability and reliability in complex environments keep rising. Central Processing Unit (CPU) manufacturers therefore provide RAS (Reliability, Availability and Serviceability) functions to improve the stability of server products. The RAS functions provide the capability of detecting errors, correcting errors and reconfiguring the system for components such as CPU internal components, memory and PCIe.
Among the indicators used to evaluate whether a server is reliable, memory problems are always an important factor affecting server stability and reliability. The main sources of memory errors in a server are the effect of changes in the server operating environment on memory signal quality and differences in memory quality. Memory errors are divided into correctable errors and uncorrectable errors. However, a large number of correctable errors may degrade system performance and may also develop into uncorrectable errors, causing the server to crash and restart. Therefore, the RAS function of the server provides monitoring of correctable memory errors so that they can be found and handled in time.
However, the correctable memory error monitoring provided by conventional RAS only lets the user know that a correctable memory error has occurred. When correctable memory errors are triggered continuously, they easily accumulate and turn into uncorrectable memory errors, so that the stability and reliability of the server are greatly reduced.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a memory error processing method, device, computer device, and storage medium that can effectively improve the reliability of a server.
In a first aspect, a method for processing a memory error includes:
under the condition that a correctable memory error occurs in a server, classifying and summarizing the correctable memory error to obtain memory error information;
acquiring a target error processing mode of the server; the target error processing mode is determined according to a user selection instruction when the server is started;
if the target error processing mode is a reporting error mode, displaying the memory error information;
and if the target error processing mode is an error isolation mode, isolating the memory unit where the correctable memory error is located.
In the prior art, the server reports a correctable memory error directly after detecting it, so the report only informs the user that a memory error has occurred. In contrast, the memory error processing method provided by this embodiment further analyzes the correctable memory errors when they are detected, and classifies and summarizes them to obtain more detailed memory error information, for example the location, number and occurrence time of the correctable memory errors, so that the server can give the user more detailed memory error information when reporting. In addition, when the user selects the error isolation mode, the server can isolate the memory unit where the correctable memory error is located, which avoids the server crashing because accumulated correctable memory errors turn into uncorrectable memory errors when correctable memory errors are triggered continuously. The memory error processing method provided by the application therefore greatly improves the reliability of the server.
In one embodiment, the memory error information includes a location of a memory cell, a number of correctable memory errors occurring in the memory cell, and an occurrence time of the correctable memory errors.
In one embodiment, the classifying and summarizing the correctable memory errors to obtain the memory error information when the correctable memory errors are monitored by the server includes:
under the condition that a correctable memory error occurs in the server, acquiring a memory unit where the correctable memory error is located and occurrence time;
counting the number of correctable memory errors occurring in each memory unit according to the memory unit where the correctable memory errors are located;
and generating the memory error information according to the position of each memory unit, the number of correctable memory errors occurring in each memory unit and the occurrence time of each correctable memory error.
The embodiment described above implements the classification and aggregation of correctable memory errors, so as to obtain the memory error information after the classification and aggregation, and enable a user to more clearly know which memory cell the correctable memory error occurs in, the number of correctable memory errors occurring in each memory cell, and the occurrence time of the correctable memory errors.
In one embodiment, the isolating the memory cell in which the correctable memory error is located includes:
judging whether the number of correctable memory errors occurring on each memory unit is greater than a preset isolation number threshold value or not;
and determining the memory units with the number of the memory errors larger than the isolation number threshold value as problem memory units, and isolating the problem memory units.
This method prevents a sharp increase in the number of correctable memory errors in a problem memory unit from raising the probability that correctable memory errors turn into uncorrectable memory errors, and therefore improves the reliability of the server.
In one embodiment, the isolating the problem memory cell includes:
migrating data in the problem memory unit to an idle memory unit; the idle memory unit is an idle memory unit reserved by the server after the target error processing mode is determined to be the error isolation mode;
and isolating the problem memory unit after the migration.
In the method, after the server determines the problem memory unit, the data in the problem memory unit can be migrated to the spare idle memory unit, so that the safe storage of the useful data in the problem memory unit is ensured, the data loss or damage of the server caused by the problem of the problem memory unit is avoided, and the operation reliability of the server is improved.
In one embodiment, after the isolating the migrated problem memory cell, the method further includes:
and identifying and displaying the isolated problem memory cells.
In the method, after the server identifies and displays the isolated memory units, the server can inform the user of which problem units are isolated, so that the user can clearly know the process of the server for processing the correctable memory errors.
In one embodiment, the displaying the memory error information includes:
and displaying the memory error information on a display screen of the server, and controlling a memory warning lamp on the server to light up.
In this method, the lit warning lamp informs the user that memory error information can be viewed on the display screen of the server, which both warns the user and prompts the user to view the error information.
In one embodiment, the method further comprises:
when detecting that an isolation key on the server is triggered, detecting whether the memory warning lamp is lighted;
if the memory warning lamp is on, isolating the memory unit where the correctable memory error is located, and turning off the memory warning lamp;
and if the memory warning lamp is not lightened, continuously detecting whether the memory warning lamp is lightened.
This method lets the server process correctable memory errors through interaction with the user: it allows the user to intervene and assist the server in processing correctable memory errors, which improves the efficiency with which the server processes them.
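The isolation-key interaction described above can be sketched as follows. This is a minimal illustration only: `ServerState`, `on_isolation_key` and the field names are hypothetical, and real isolation would be performed by platform firmware rather than by updating a Python set.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ServerState:
    """Hypothetical in-memory model of the lamp and the affected units."""
    lamp_on: bool
    error_units: List[str]
    isolated: Set[str] = field(default_factory=set)


def on_isolation_key(state: ServerState) -> bool:
    """Handle one press of the isolation key.

    Returns True if isolation was performed, False if the warning lamp
    was not lit (the caller is then expected to keep checking the lamp).
    """
    if not state.lamp_on:
        return False
    # Lamp is lit: isolate every unit with a correctable error, then
    # turn the lamp off.
    state.isolated.update(state.error_units)
    state.lamp_on = False
    return True
```

A second key press with the lamp already off returns `False` and leaves the state unchanged.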
In a second aspect, a memory error handling apparatus, the apparatus comprising:
a classification and collection module, used for classifying and summarizing the correctable memory errors to obtain memory error information when a correctable memory error is detected in the server;
an acquisition module, used for acquiring a target error processing mode of the server, the target error processing mode being determined according to a user selection instruction when the server is started;
a display module, used for displaying the memory error information when the target error processing mode is a reporting error mode;
and an isolation module, used for isolating the memory unit where the correctable memory error is located when the target error processing mode is an error isolation mode.
In a third aspect, a computer device comprises a memory storing a computer program and a processor implementing the method of the first aspect when the processor executes the computer program.
In a fourth aspect, a computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method of the first aspect described above.
Drawings
FIG. 1 is a diagram of the internal structure of a server in one embodiment;
FIG. 2 is a flow diagram illustrating a method for memory error handling according to an embodiment;
FIG. 3 is a schematic flow chart illustrating an implementation manner of S101 in the embodiment of FIG. 2;
FIG. 4 is a flowchart illustrating an implementation manner of S104 in the embodiment of FIG. 2;
FIG. 5 is a flowchart illustrating an implementation manner of S302 in the embodiment of FIG. 4;
FIG. 6 is a flow diagram illustrating a method for memory error handling in accordance with an embodiment;
FIG. 7 is a flowchart illustrating an implementation manner of S103 in the embodiment of FIG. 2;
FIG. 8 is a flowchart illustrating a method for memory error handling according to an embodiment;
FIG. 9 is a diagram of an application environment in one embodiment;
FIG. 10 is a block diagram of an embodiment of a memory error handling device;
FIG. 11 is a block diagram of an embodiment of a memory error handling device;
FIG. 12 is a block diagram of an embodiment of a memory error handling device;
FIG. 13 is a block diagram of an embodiment of a memory error handling device;
FIG. 14 is a block diagram of a memory error handling device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The memory error processing method provided by the present application may be applied to a server shown in fig. 1, where the server may be a computer device, and its internal structure diagram may be as shown in fig. 1. The server comprises a processor, a memory, a network interface, a display screen and an input device which are connected through a system bus. The processor of the server is configured to provide computing and control capabilities. The memory of the server comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface of the server is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement a memory error processing method. The display screen of the server can be a liquid crystal display screen or an electronic ink display screen, and the input device of the server can be a touch layer covering the display screen, a key, a track ball or a touch pad arranged on the housing of the server, or an external keyboard, touch pad or mouse.
Those skilled in the art will appreciate that the architecture shown in fig. 1 is a block diagram of only a portion of the architecture associated with the subject application, and does not constitute a limitation on the servers to which the subject application applies, as a particular server may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, as shown in fig. 2, a memory error handling method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
S101, when it is monitored that a correctable memory error occurs in the server, classifying and summarizing the correctable memory errors to obtain memory error information.
Specifically, once started, the server monitors in real time whether a correctable memory error occurs in the memory. If a correctable memory error occurs, the server can extract at least one correctable memory error from the memory or database in which memory errors are recorded and analyze the location and time of each extracted error. The correctable memory errors are then classified and summarized according to the analysis result, so that the memory units in which correctable memory errors occurred are distinguished, which facilitates later reporting or processing. It should be noted that, when performing the classification and summarization step, the server may first create a data structure and then pass the relevant information of the correctable memory errors to be classified and summarized into the data structure as parameters, so as to obtain the classified and summarized memory error information.
S102, acquiring a target error processing mode of the server; and the target error processing mode is determined according to the user selection instruction when the server is started.
The error processing method is a processing method adopted by the server when processing the generated correctable memory errors, for example, reporting errors, isolating errors, correcting errors, and the like. The target error processing mode comprises a reporting error mode or an isolating error mode, wherein the reporting error mode is used for indicating the server to show the correctable memory errors to a user or report the correctable memory errors to the user in a voice mode, and the isolating error mode is used for indicating the server to isolate the memory units where the correctable memory errors are located. The user selection instruction is an instruction input by a user when the server is started, and the user selection instruction is used for instructing the server to select a corresponding error processing mode to process the current correctable memory errors.
Specifically, the user may input a user selection instruction on a selection interface popped up when the server is started, and the server may determine the target error handling manner by analyzing the error handling manner identifier included in the user selection instruction, so as to determine which error handling manner the user has selected to handle the upcoming correctable memory error. For example, the user selection instruction may include a report error mode flag indicating that the user selects to process the correctable memory error in a report error mode, or the user selection instruction may include an isolation error mode flag indicating that the user selects to process the correctable memory error in an isolation error mode. It should be noted that, when the user selects to process the correctable memory error in the isolated error manner, the isolated error number threshold may be further input on the selection interface, so that the server may correctly evaluate whether to isolate the memory unit where the correctable memory error is located according to the isolated error number threshold.
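As a rough illustration of how a user selection instruction might be resolved into a target error processing mode, consider the sketch below. The dictionary layout, the identifier strings `"report"` and `"isolate"`, the key `isolation_threshold` and the default value are all assumptions made for this example; the source does not specify an encoding.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorHandlingMode(Enum):
    REPORT = "report"    # reporting error mode
    ISOLATE = "isolate"  # error isolation mode


# Hypothetical fallback used when the user enters no threshold.
DEFAULT_ISOLATION_THRESHOLD = 10


@dataclass(frozen=True)
class HandlingConfig:
    mode: ErrorHandlingMode
    isolation_threshold: Optional[int] = None  # only set in ISOLATE mode


def parse_user_selection(instruction: dict) -> HandlingConfig:
    """Resolve the target mode from a (hypothetical) instruction dict."""
    # Enum lookup raises ValueError for an unknown mode identifier.
    mode = ErrorHandlingMode(instruction["mode"])
    if mode is ErrorHandlingMode.ISOLATE:
        threshold = int(
            instruction.get("isolation_threshold", DEFAULT_ISOLATION_THRESHOLD)
        )
        if threshold < 0:
            raise ValueError("isolation threshold must be non-negative")
        return HandlingConfig(mode, threshold)
    return HandlingConfig(mode)
```

Keeping the threshold inside the same configuration object means the later isolation step does not need to consult the selection interface again.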
S103, if the target error processing mode is a reporting error mode, displaying the memory error information.
This embodiment relates to the specific processing performed by the server when the target error processing mode is determined to be the reporting error mode. That is, when the server determines that correctable memory errors are to be processed in the reporting error mode, it can directly display the previously classified and summarized memory error information on a display interface of the server. Optionally, the server can also notify the user that a correctable memory error has occurred by lighting a warning lamp, or announce the occurrence of the correctable memory error and the content of the memory error information by voice broadcast, which is not limited herein.
S104, if the target error processing mode is the error isolation mode, isolating the memory unit where the correctable memory error is located.
This embodiment relates to the specific processing performed by the server when the target error processing mode is determined to be the error isolation mode. That is, when the server determines that correctable memory errors are to be processed in the error isolation mode, it may first determine where the correctable memory error occurred, namely in which memory unit, and then directly isolate that memory unit.
In the above embodiment, when a correctable memory error is detected in the server, the correctable memory errors are classified and summarized to obtain memory error information, and the target error processing mode of the server is obtained. If the target error processing mode is the reporting error mode, the memory error information is displayed; if it is the isolating error mode, the memory unit where the correctable memory error is located is isolated. In existing memory error processing methods, the server reports a correctable memory error directly after detecting it, so the report only informs the user that a memory error has occurred. In contrast, the memory error processing method provided by this embodiment further analyzes the correctable memory errors when they are detected, and classifies and summarizes them to obtain more detailed memory error information, for example the location, number and occurrence time of the correctable memory errors, so that the server can give the user more detailed memory error information when reporting. In addition, when the user selects the isolating error mode, the server can isolate the memory unit where the correctable memory error is located, which avoids the server crashing because accumulated correctable memory errors turn into uncorrectable memory errors when correctable memory errors are triggered continuously. The memory error processing method provided by the application therefore greatly improves the stability and reliability of the server.
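The overall flow of S101 to S104 can be sketched as a small dispatch function. This is an illustrative sketch under stated assumptions: errors are modeled as `(unit, timestamp)` tuples, the mode is a plain string, and `report` and `isolate` are hypothetical callbacks standing in for the display and isolation mechanisms of a real server.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def handle_correctable_errors(
    errors: Iterable[Tuple[str, float]],
    mode: str,
    report: Callable[[Counter], None],
    isolate: Callable[[str], None],
) -> Counter:
    """Summarize errors per memory unit, then act according to the mode."""
    # S101: classify and summarize by the memory unit where each error occurred.
    summary = Counter(unit for unit, _timestamp in errors)
    # S102 is assumed to have happened at start-up; `mode` is its result.
    if mode == "report":
        report(summary)            # S103: show the summarized information
    elif mode == "isolate":
        for unit in summary:       # S104: isolate each affected unit
            isolate(unit)
    else:
        raise ValueError(f"unknown error processing mode: {mode!r}")
    return summary
```

Passing the two actions in as callbacks keeps the summarization logic testable without any hardware access.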
In practical application, after the correctable memory errors are classified and summarized by the server, the memory error information after the classification and the summarization can be obtained. The classified and summarized memory error information includes the location of the memory cell, the number of correctable memory errors occurring in the memory cell, and the occurrence time of the correctable memory errors. The location of the memory unit refers to which memory unit the correctable memory error occurs, for example, memory unit 1, memory unit 2, memory unit 3, memory unit 4, and memory unit 5 exist in the server, where the correctable memory error occurs on memory unit 1 and memory unit 2.
Based on the memory error information, in an embodiment, the present application further provides an implementation manner of the S101, as shown in fig. 3, where the S101 "classifies and summarizes the correctable memory errors to obtain the memory error information when it is monitored that the correctable memory errors occur in the server," includes:
S201, when it is monitored that a correctable memory error occurs in the server, acquiring the memory unit where the correctable memory error is located and its occurrence time.
When the server monitors that a plurality of correctable memory errors occur, the memory unit where each correctable memory error is located and the occurrence time of each correctable memory error can be obtained in the memory or the database in which the information related to each correctable memory error is recorded.
S202, according to the memory unit where the correctable memory errors are located, the number of the correctable memory errors occurring in each memory unit is counted.
When the server obtains the memory unit where each correctable memory error is located and the occurrence time, the number of correctable memory errors occurring in each memory unit and the occurrence time of the correctable memory errors in each memory unit can be further counted.
S203, memory error information is generated according to the position of each memory cell, the number of correctable memory errors occurring in each memory cell and the occurrence time of each correctable memory error.
When the server counts the number of correctable memory errors occurring in each memory unit and the occurrence time of each correctable memory error, the position of each memory unit, the number of correctable memory errors occurring in each memory unit and the occurrence time of each correctable memory error are determined as memory error information, and then the memory error information can be stored. Optionally, the server may store the memory error information in a form of a data structure, so that the memory error information is directly obtained through the data structure when used later.
The embodiment described above implements the classification and aggregation of correctable memory errors, so as to obtain the memory error information after the classification and aggregation, and enable a user to more clearly know which memory cell the correctable memory error occurs in, the number of correctable memory errors occurring in each memory cell, and the occurrence time of the correctable memory errors occurring in each memory cell.
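Steps S201 to S203 amount to grouping error records by memory unit. A minimal sketch follows, assuming each raw error record is a `(location, occurrence_time)` tuple; the class `UnitErrorInfo` and the function `summarize` are hypothetical names, not taken from the source.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class UnitErrorInfo:
    """Summarized memory error information for one memory unit."""
    location: str
    count: int = 0
    times: List[float] = field(default_factory=list)


def summarize(errors: Iterable[Tuple[str, float]]) -> Dict[str, UnitErrorInfo]:
    """Group correctable-error records by the memory unit they occurred in."""
    info: Dict[str, UnitErrorInfo] = {}
    for location, occurred_at in errors:       # S201: unit and occurrence time
        entry = info.setdefault(location, UnitErrorInfo(location))
        entry.count += 1                       # S202: per-unit error count
        entry.times.append(occurred_at)
    return info                                # S203: memory error information
```

With the five-unit example above, errors on memory unit 1 and memory unit 2 yield exactly two entries; units without errors do not appear in the result.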
In an embodiment, the present application further provides an implementation manner of the foregoing S104. As shown in fig. 4, "isolating the memory unit where the correctable memory error is located" in the foregoing S104 includes:
S301, judging whether the number of correctable memory errors occurring on each memory unit is greater than a preset isolation number threshold.
The isolation number threshold can be determined by the server in advance according to actual application requirements, customized by the user, or determined by the user based on the memory error information reported by the server. In this embodiment, once the server has obtained the number of correctable memory errors occurring in each memory unit, it compares that number with the preset isolation number threshold to determine the memory units in which the number of correctable memory errors is greater than the threshold.
S302, the memory units with the number of the memory errors larger than the isolation number threshold are determined as problem memory units, and isolation processing is carried out on the problem memory units.
After the server determines the memory units in which the number of correctable memory errors is greater than the isolation number threshold, it can determine those memory units as problem memory units and isolate them directly. This prevents a sharp increase in the number of correctable memory errors in the problem memory units from raising the probability that correctable memory errors turn into uncorrectable memory errors, so the method provided by this embodiment can improve the reliability of the server. It should be noted that the server does not perform any processing on a memory unit in which the number of memory errors is not greater than the isolation number threshold.
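The threshold comparison in S301 and S302 can be illustrated with a short filter. The function name and the dictionary input are assumptions for this sketch; note the strict "greater than" comparison, so a unit whose count equals the threshold is left alone.

```python
from typing import Dict, List


def find_problem_units(
    error_counts: Dict[str, int], isolation_threshold: int
) -> List[str]:
    """Return the units whose correctable-error count exceeds the threshold."""
    return [
        unit
        for unit, count in error_counts.items()
        if count > isolation_threshold  # strictly greater, per S301
    ]
```

For example, with a threshold of 10, a unit with 12 errors is selected while units with 3 or exactly 10 errors are not.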
Further, the present application provides a specific implementation manner for isolating the problem memory unit. As shown in fig. 5, "performing isolation processing on the problem memory unit" in S302 includes:
S401, migrating the data in the problem memory unit to an idle memory unit; the idle memory unit is a memory unit reserved by the server after the target error processing mode is determined to be the error isolation mode.
Specifically, after determining that the target error processing mode is the error isolation mode, the server may further forcibly set aside, through a memory function (e.g., RANK_SPARE), a memory unit on the server that may or may not have a correctable memory error, so that the memory unit serves as a spare idle memory unit. After the server determines the problem memory unit, the data in the problem memory unit can then be migrated to the spare idle memory unit. This ensures that the useful data in the problem memory unit is stored safely, avoids data loss or corruption on the server caused by the problem memory unit, and improves the operational reliability of the server.
S402, isolating the problem memory unit after the migration.
Specifically, after the server migrates the data in the problem memory unit, the problem memory unit can be isolated to prevent it from affecting the normal operation of other memory units.
Optionally, after the server isolates the migrated problem memory unit, the isolated problem memory unit may be identified and displayed. For example, the server may mark the isolated memory units in red or label them with the word "isolated", and display the result on the display screen to inform the user of which problem memory units have been isolated, so that the user can clearly follow how the server processes correctable memory errors.
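The ordering of S401 and S402, followed by the optional identification step, can be sketched as below. This is a simplified model under stated assumptions: the `MemoryUnit` class, its fields, and the text labels are hypothetical, and real sparing is performed by firmware and the memory controller rather than by copying a Python list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryUnit:
    name: str
    data: List[int] = field(default_factory=list)
    isolated: bool = False


def isolate_problem_unit(problem: MemoryUnit, spare: MemoryUnit) -> None:
    """Migrate data to the reserved idle unit first, then isolate the problem unit."""
    if spare.isolated:
        raise ValueError("the idle unit must not itself be isolated")
    # S401: migrate the data so nothing is lost when the unit is taken offline.
    spare.data = list(problem.data)
    problem.data.clear()
    # S402: isolate only after the migration has completed.
    problem.isolated = True


def render_status(units: List[MemoryUnit]) -> List[str]:
    """Label isolated units so a user can see which ones were taken offline."""
    return [f"{u.name}: ISOLATED" if u.isolated else f"{u.name}: ok" for u in units]


problem = MemoryUnit("rank1", [1, 2, 3])
spare = MemoryUnit("spare0")
isolate_problem_unit(problem, spare)
status = render_status([problem, spare])
```

Migrating before isolating is the point of the sketch: reversing the two steps would leave the data unreachable.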
In an embodiment, the present application further provides a specific implementation manner of the foregoing S103, where the "displaying memory error information" in the foregoing S103 includes: and displaying the memory error information on a display screen of the server, and controlling a memory warning lamp on the server to light up.
The memory warning lamp may be arranged on the front panel of the server, or may be a warning lamp displayed on an application program interface on the server; this is not limited herein. Specifically, when the server monitors that a correctable memory error occurs and determines that the target error processing mode is the reporting error mode, it may display the memory error information on the display screen and control the memory warning lamp to light up. The lit lamp indicates that a correctable memory error has occurred in a memory unit of the server and informs the user that the memory error information can be viewed on the display screen of the server, thereby serving both as a warning and as a prompt to view the error information.
In practical application, the server is provided with an isolation key besides the memory warning lamp, and the isolation key is used for isolating the memory unit according to the user requirement. Therefore, as shown in fig. 6, the memory error processing method provided by the present application further includes:
S501, when detecting that the isolation key on the server is triggered, detecting whether the memory warning lamp is on; if the memory warning lamp is on, executing step S502, and if the memory warning lamp is not on, executing step S503.
The isolation key may be arranged on the front panel of the server, or may be an isolation key displayed on an application program interface on the server; this is not limited herein. Specifically, when the server detects that the isolation key is triggered, it indicates that a user is intervening to assist the server in memory error processing; in other words, the server may also isolate a problematic memory unit through user intervention. In this process, when the server detects that the isolation key is triggered, it further detects whether the memory warning lamp is lit. If the memory warning lamp is lit, a correctable memory error has occurred in a memory unit of the server and the number of correctable memory errors has reached the number that calls for isolation; if the server has not automatically isolated the problematic memory unit, the user needs to trigger the isolation key to forcibly isolate it, thereby improving the operation reliability of the server. If the memory warning lamp is not lit, the isolation key may have been operated by mistake, and no memory unit in the server needs to be isolated.
S502, isolating the memory unit where the correctable memory error is located, and turning off the memory warning lamp.
This embodiment relates to an application scenario in which the server detects that the isolation key is triggered and the memory warning lamp is lit. In this scenario, the server may call the interrupt program of the error isolation mode to isolate the memory unit where the correctable memory error is located, and turn off the memory warning lamp after the processing to indicate that the processing is completed. It should be noted that, in this scenario, the server may return to the method of any of the above embodiments to process the correctable memory error in the error isolation mode.
S503, continuously detecting whether the memory warning lamp is on.
This embodiment relates to an application scenario in which the server detects that the isolation key is triggered but the memory warning lamp is not lit. In this scenario, the server continues to detect whether the memory warning lamp is lit, so that a problematic memory unit can subsequently be found and isolated.
The above method lets the server process correctable memory errors through interaction with the user: the user is allowed to intervene and assist the server in processing correctable memory errors, which improves the efficiency with which the server processes them.
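The key-press handling of S501 to S503 can be sketched as a small state check. This is a hedged illustration only: the `FrontPanel` class, the `isolate` callback, and the returned strings are assumptions made for the example and do not come from the source.

```python
from typing import Callable


class FrontPanel:
    """Hypothetical model of the warning lamp and isolation key."""

    def __init__(self, isolate: Callable[[], None]) -> None:
        self.warning_lamp_on = False
        self._isolate = isolate  # stands in for the isolation routine

    def on_isolation_key(self) -> str:
        # S501: a key press is only acted on if the warning lamp is lit.
        if self.warning_lamp_on:
            # S502: isolate the affected unit, then turn the lamp off.
            self._isolate()
            self.warning_lamp_on = False
            return "isolated"
        # S503: lamp not lit, likely an accidental press; keep watching the lamp.
        return "monitoring"


calls = []
panel = FrontPanel(isolate=lambda: calls.append("isolate"))
first = panel.on_isolation_key()   # lamp off
panel.warning_lamp_on = True
second = panel.on_isolation_key()  # lamp on
```

Gating the isolation on the lamp state is what makes an accidental key press harmless in this model.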
In an embodiment, an implementation manner of the foregoing S103 is provided, and as shown in fig. 7, the "displaying the memory error information" in the foregoing S103 includes:
S601, checking whether invalid error information exists in the memory error information; if so, executing step S602, and if not, executing step S603; the invalid error information is memory error information that the server monitored earlier than a preset time period before the current time.
The preset time period may be determined in advance by the server according to actual application requirements. For example, the invalid error information may be memory error information that the server monitored more than 12 hours before the current time, or more than 24 hours before the current time. Specifically, before reporting the memory error information, the server may check whether invalid error information exists in it. Because invalid error information occurred long before the current time, it does not need to be processed, and the server may skip it in subsequent processing.
S602, deleting the invalid error information, and displaying the memory error information remaining after the deletion.
This embodiment relates to an application scenario in which the memory error information contains invalid error information. In this scenario, the server deletes the invalid error information from the memory error information and then displays only the valid memory error information, that is, the memory error information remaining after the deletion.
S603, the memory error information is displayed.
This embodiment relates to an application scenario in which the memory error information does not contain invalid error information. In this scenario, the server directly displays the memory error information.
This embodiment updates the memory error information so that it always reflects the most recently monitored memory errors, which avoids the resource waste caused by the server unnecessarily processing memory error information from long ago.
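The check-and-delete step of S601 to S603 can be sketched as a filter on occurrence time. The sketch assumes, in line with the rationale given above, that a record is invalid when it is older than the preset time period; the record layout, the function name `drop_invalid_records`, and the 12-hour value are illustrative assumptions.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Hypothetical record layout: (memory unit, occurrence time).
ErrorRecord = Tuple[str, datetime]


def drop_invalid_records(records: List[ErrorRecord], now: datetime,
                         max_age: timedelta) -> List[ErrorRecord]:
    """Keep only records that occurred within max_age of now (assumed validity rule)."""
    cutoff = now - max_age
    return [record for record in records if record[1] >= cutoff]


now = datetime(2020, 9, 11, 12, 0, 0)
records = [
    ("rank0", datetime(2020, 9, 10, 6, 0, 0)),   # 30 hours old, treated as invalid
    ("rank1", datetime(2020, 9, 11, 9, 30, 0)),  # 2.5 hours old, kept
]
to_display = drop_invalid_records(records, now, max_age=timedelta(hours=12))
```

If no record is older than the cutoff, the function returns the list unchanged, which corresponds to displaying the memory error information directly.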
With reference to all the above embodiments, the present application further provides a memory error handling method, as shown in fig. 8, the method includes:
S701, the server obtains a user selection instruction.
S702, under the condition that the correctable memory errors of the server are monitored, the server acquires memory cells where the correctable memory errors are located and occurrence time.
S703, counting the number of correctable memory errors occurring in each memory cell according to the memory cell in which the correctable memory error exists.
S704, generating memory error information according to the position of each memory cell, the number of correctable memory errors occurring in each memory cell and the occurrence time of each correctable memory error.
S705, the server determines a target error processing mode according to the user selection instruction, if the target error processing mode is a report error mode, the step S706 is executed, and if the target error processing mode is an isolation error mode, the steps S709-S713 are executed.
S706, checking whether invalid error information exists in the memory error information; if yes, step S707 is executed, and if no, step S708 is executed.
S707, deleting the invalid error information, displaying the memory error information remaining after the deletion on a display screen of the server, and controlling a memory warning lamp on the server to light up.
S708, displaying the memory error information on the display screen of the server, and controlling the memory warning lamp on the server to light up.
S709, determine whether the number of correctable memory errors occurring in each memory cell is greater than a preset isolation number threshold.
S710, the memory cells with the number of the memory errors larger than the isolation number threshold are determined as problem memory cells.
S711, data in the problem memory cell is migrated to the idle memory cell.
S712, isolating the migrated problematic memory cell.
S713, identifying and displaying the isolated problem memory cells.
S714, detecting whether the isolation key on the server is triggered, detecting whether the memory warning lamp is on when the isolation key on the server is triggered, if yes, executing step S715, and if not, executing step S716.
S715, return to step S709-S713, perform isolation processing on the memory unit where the correctable memory error is located, and turn off the memory warning lamp.
S716, continue to detect whether the memory warning lamp is on.
The descriptions of the above steps are all embodied in the foregoing description, and please refer to the foregoing description for details, which are not repeated herein.
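The branch taken at S705 can be sketched as a single dispatch function. This is a condensed illustration and not the full flow: the invalid-record check (S706 to S707), data migration (S711), and key handling (S714 to S716) are left out, and the names `Mode`, `Outcome`, and `handle_correctable_errors` are assumptions introduced for the example.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class Mode(Enum):
    REPORT = "report"    # reporting error mode
    ISOLATE = "isolate"  # error isolation mode


@dataclass
class Outcome:
    shown: Dict[str, int]  # per-unit counts that would be displayed
    lamp_on: bool          # whether this branch lights the warning lamp
    isolated: List[str]    # units isolated by this branch


def handle_correctable_errors(mode: Mode, events: List[Tuple[str, float]],
                              threshold: int) -> Outcome:
    # S702-S704: summarize the events into a per-unit error count.
    counts = Counter(unit for unit, _ in events)
    if mode is Mode.REPORT:
        # S708: display the information and light the warning lamp.
        return Outcome(shown=dict(counts), lamp_on=True, isolated=[])
    # S709-S713: isolate units above the threshold and show which ones.
    problem = sorted(u for u, n in counts.items() if n > threshold)
    return Outcome(shown={u: counts[u] for u in problem}, lamp_on=False, isolated=problem)


events = [("rank0", 1.0), ("rank1", 2.0), ("rank1", 3.0)]
reported = handle_correctable_errors(Mode.REPORT, events, threshold=1)
isolated = handle_correctable_errors(Mode.ISOLATE, events, threshold=1)
```

The mode is passed in as a value because, in the source, it is fixed by the user selection instruction before any error is handled.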
It should be noted that the memory error processing method provided in the present application may also be applied to the application environment shown in fig. 9, where a Basic Input Output System (BIOS) and a Baseboard Management Controller (BMC) in a server exchange data to implement the memory error processing method. Specifically, the BIOS may execute any of the steps S701 to S716; in particular, when the memory error information is to be displayed, the BIOS may transmit the classified and summarized memory error information to the BMC, and the BMC displays the memory error information on the display screen. In addition, when the BIOS stores the memory error information after acquiring it, the BIOS may also transfer the memory error information to the BMC, so that the BMC stores it as a backup, thereby preventing the memory error information on the BIOS from being damaged and causing an uncorrectable memory error that cannot be processed normally. The BIOS and the BMC can synchronously update the stored memory error information, so that the memory error information stored on the BIOS and on the BMC is always up to date and kept in sync.
It should be understood that although the steps in the flow charts of fig. 2-9 are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and described, and may be performed in other orders. Moreover, at least some of the steps in fig. 2-9 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
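The BIOS-to-BMC backup described above can be sketched as a log that mirrors every update to a second store. This is only a conceptual model: the classes `BmcBackup` and `BiosErrorLog` are hypothetical, and the actual transport between BIOS and BMC (which the source does not specify) is replaced by a direct method call.

```python
import copy
from typing import Dict


class BmcBackup:
    """Hypothetical stand-in for the BMC-side copy of the error information."""

    def __init__(self) -> None:
        self.snapshot: Dict[str, int] = {}

    def store(self, info: Dict[str, int]) -> None:
        # Keep an independent copy so later damage on the BIOS side
        # does not affect the backup.
        self.snapshot = copy.deepcopy(info)


class BiosErrorLog:
    """Hypothetical stand-in for the BIOS-side error information."""

    def __init__(self, bmc: BmcBackup) -> None:
        self.info: Dict[str, int] = {}
        self._bmc = bmc

    def record(self, unit: str) -> None:
        self.info[unit] = self.info.get(unit, 0) + 1
        # Push every update so both copies stay in sync.
        self._bmc.store(self.info)


bmc = BmcBackup()
log = BiosErrorLog(bmc)
log.record("rank0")
log.record("rank0")
log.record("rank1")
```

Copying on every update trades some overhead for the guarantee that the backup never lags behind the primary record.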
In one embodiment, as shown in fig. 10, there is provided a memory error handling apparatus, including: a classification and summary module 11, an obtaining module 12, a display module 13 and an isolation module 14, wherein:
The classification and summary module 11 is configured to classify and summarize the correctable memory errors to obtain memory error information when it is monitored that the correctable memory errors occur in the server.
An obtaining module 12, configured to obtain a target error handling manner of the server; and the target error processing mode is determined according to a user selection instruction when the server is started.
A display module 13, configured to display the memory error information when the target error handling manner is a reporting error manner;
the isolation module 14 is configured to, if the target error handling manner is an isolation error manner, perform isolation processing on the memory unit where the correctable memory error is located.
In one embodiment, the memory error information includes a location of a memory cell, a number of correctable memory errors occurring in the memory cell, and a time of occurrence of the correctable memory errors.
In an embodiment, the aforementioned categorical aggregation module 11, as shown in fig. 11, includes:
an obtaining unit 111, configured to obtain a memory unit and occurrence time of a correctable memory error when it is monitored that the correctable memory error occurs in the server;
a counting unit 112, configured to count, according to the memory cell in which the correctable memory error is located, the number of correctable memory errors occurring in each memory cell;
a generating unit 113, configured to generate the memory error information according to the location of each memory cell, the number of correctable memory errors occurring in each memory cell, and the occurrence time of each correctable memory error.
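The work of the obtaining, counting, and generating units can be sketched as one grouping pass over the monitored events. This is an illustrative sketch: the event layout `(unit, timestamp)`, the dictionary keys, and the function name `summarize_errors` are assumptions, chosen to carry the three fields the source names (location, number, occurrence time).

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def summarize_errors(events: List[Tuple[str, float]]) -> Dict[str, dict]:
    """Group correctable-error events by memory unit and build per-unit records."""
    times_by_unit: Dict[str, List[float]] = defaultdict(list)
    for unit, timestamp in events:
        # Obtaining step: note which unit the error hit and when.
        times_by_unit[unit].append(timestamp)
    # Counting and generating steps: one record per unit with location,
    # error count, and occurrence times.
    return {
        unit: {"location": unit, "count": len(times), "times": sorted(times)}
        for unit, times in times_by_unit.items()
    }


events = [("dimm0/rank1", 30.0), ("dimm0/rank0", 12.5), ("dimm0/rank1", 18.0)]
info = summarize_errors(events)
```

Keeping the occurrence times alongside the count lets later steps both compare against a threshold and discard records that are too old.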
In one embodiment, the isolation module 14, as shown in fig. 12, includes:
a judging unit 141, configured to judge whether the number of correctable memory errors occurring in each memory unit is greater than a preset isolation number threshold;
a determining unit 142, configured to determine, as a problem memory cell, a memory cell in which the number of memory errors is greater than the isolation number threshold, and perform isolation processing on the problem memory cell.
In one embodiment, the determining unit 142 is specifically configured to migrate data in the problem memory cell to an idle memory cell; the idle memory unit is a memory unit reserved by the server after the target error processing mode is determined to be the error isolation mode; and isolating the problem memory unit after the migration.
In an embodiment, the determining unit 142 is further specifically configured to identify and display the isolated problem memory unit after isolating the migrated problem memory unit.
In an embodiment, the display module 13 is specifically configured to display the memory error information on a display screen of the server, and control a memory warning lamp on the server to light up.
In an embodiment, as shown in fig. 13, the memory error handling apparatus further includes:
the first detection module 15 is configured to detect whether the memory warning lamp is turned on when it is detected that the isolation key on the server is triggered;
the processing module 16 is configured to, when the memory warning lamp is turned on, perform isolation processing on the memory unit where the correctable memory error is located, and turn off the memory warning lamp;
a second detecting module 17, configured to continue to detect whether the memory warning lamp is turned on when the memory warning lamp is not turned on.
In one embodiment, the display module 13, as shown in fig. 14, includes:
a checking unit 131, configured to check whether invalid error information exists in the memory error information; the invalid error information is memory error information that the server monitored earlier than a preset time period before the current time;
a deleting unit 132, configured to delete the invalid error information when the invalid error information exists in the memory error information, and display the memory error information remaining after the deletion.
For the specific limitations of the memory error handling apparatus, reference may be made to the above limitations on the memory error handling method, which are not described herein again. All or part of the modules in the memory error processing device can be implemented by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or can be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
under the condition that a correctable memory error occurs in a server, classifying and summarizing the correctable memory error to obtain memory error information;
acquiring a target error processing mode of the server; the target error processing mode is determined according to a user selection instruction when the server is started;
if the target error processing mode is a reporting error mode, displaying the memory error information;
and if the target error processing mode is an error isolation mode, isolating the memory unit where the correctable memory error is located.
The implementation principle and technical effect of the computer device provided by the above embodiment are similar to those of the above method embodiment, and are not described herein again.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
under the condition that a correctable memory error occurs in a server, classifying and summarizing the correctable memory error to obtain memory error information;
acquiring a target error processing mode of the server; the target error processing mode is determined according to a user selection instruction when the server is started;
if the target error processing mode is a reporting error mode, displaying the memory error information;
and if the target error processing mode is an error isolation mode, isolating the memory unit where the correctable memory error is located.
The implementation principle and technical effect of the computer-readable storage medium provided by the above embodiments are similar to those of the above method embodiments, and are not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described, but any combination should be considered within the scope of the present specification as long as the combined features do not contradict each other.
The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method for memory error handling, the method comprising:
under the condition that a correctable memory error occurs in a server, classifying and summarizing the correctable memory error to obtain memory error information;
acquiring a target error processing mode of the server; the target error processing mode is determined according to a user selection instruction when the server is started;
if the target error processing mode is a reporting error mode, displaying the memory error information;
and if the target error processing mode is an error isolation mode, isolating the memory unit where the correctable memory error is located.
2. The method of claim 1, wherein the memory error information comprises a location of a memory cell, a number of correctable memory errors occurring in the memory cell, and a time of occurrence of the correctable memory errors.
3. The method according to claim 2, wherein classifying and summarizing the correctable memory errors to obtain memory error information when it is monitored that the correctable memory errors occur in the server comprises:
under the condition that a correctable memory error occurs in the server, acquiring a memory unit where the correctable memory error is located and occurrence time;
counting the number of correctable memory errors occurring in each memory unit according to the memory unit where the correctable memory errors are located;
and generating the memory error information according to the position of each memory unit, the number of correctable memory errors occurring in each memory unit and the occurrence time of each correctable memory error.
4. The method according to claim 2 or 3, wherein the isolating the memory cell in which the correctable memory error is located comprises:
judging whether the number of correctable memory errors occurring on each memory unit is greater than a preset isolation number threshold value or not;
and determining the memory units with the number of the memory errors larger than the isolation number threshold value as problem memory units, and isolating the problem memory units.
5. The method of claim 4, wherein the isolating the problem memory cell comprises:
migrating data in the problem memory unit to an idle memory unit; the idle memory unit is a memory unit reserved by the server after the target error processing mode is determined to be the error isolation mode;
and isolating the problem memory unit after the migration.
6. The method of claim 5, wherein after isolating the migrated problem memory cell, the method further comprises:
and identifying and displaying the isolated problem memory cells.
7. The method of claim 1, wherein said presenting the memory error information comprises:
and displaying the memory error information on a display screen of the server, and controlling a memory warning lamp on the server to light up.
8. The method of claim 7, further comprising:
when detecting that an isolation key on the server is triggered, detecting whether the memory warning lamp is lighted;
if the memory warning lamp is on, isolating the memory unit where the correctable memory error is located, and turning off the memory warning lamp;
and if the memory warning lamp is not lightened, continuously detecting whether the memory warning lamp is lightened.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 8 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8.
CN202010951988.5A 2020-09-11 2020-09-11 Memory error processing method, device, computer equipment and storage medium Active CN112231128B (en)


Publications (2)

Publication Number Publication Date
CN112231128A true CN112231128A (en) 2021-01-15
CN112231128B CN112231128B (en) 2024-06-21

Family

ID=74115656

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010951988.5A Active CN112231128B (en) 2020-09-11 2020-09-11 Memory error processing method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112231128B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024036473A1 (en) * 2022-08-16 2024-02-22 Micron Technology, Inc. Selectable error handling modes in memory systems
WO2024066500A1 (en) * 2022-09-26 2024-04-04 华为技术有限公司 Memory error processing method and apparatus


Also Published As

Publication number Publication date
CN112231128B (en) 2024-06-21

CN112306744B (en) Log storage backup method, device, server and medium
CN110851316B (en) Abnormality early warning method, abnormality early warning device, abnormality early warning system, electronic equipment and storage medium
US20230305917A1 (en) Operation management apparatus and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant