CN117149490A - Method, device, equipment and storage medium for early warning of memory faults of server - Google Patents



Publication number
CN117149490A
Authority
CN
China
Prior art keywords
target
funnel
memory
correctable error
correctable
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311103756.4A
Other languages
Chinese (zh)
Inventor
龚树青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Shandong Computer Technology Co Ltd
Original Assignee
Inspur Shandong Computer Technology Co Ltd
Application filed by Inspur Shandong Computer Technology Co Ltd
Priority to CN202311103756.4A
Publication of CN117149490A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0754 Error or fault detection not based on redundancy by exceeding limits
    • G06F11/076 Error or fault detection not based on redundancy by exceeding limits by exceeding a count or rate limit, e.g. word- or bit count limit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a memory management context, e.g. virtual memory or cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766 Error or fault reporting or storing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application discloses a server memory fault early warning method, device, equipment and storage medium, relating to the technical field of computers. The method comprises: obtaining correctable error funnel parameter configuration information of a target memory to obtain target configuration information; monitoring correctable errors triggered by the target memory, and recording the number of correctable errors into a target funnel counter; calculating a current actual count value of the target funnel counter based on the number of correctable errors and the target configuration information; judging whether the current actual count value is greater than a preset correctable error funnel threshold and, if so, recording one correctable error storm event; counting all correctable error storm events recorded within a preset time to obtain the number of target storm events; and judging whether the number of target storm events is greater than a preset number threshold and, if so, generating corresponding memory fault early warning information to perform fault early warning. The application can improve the accuracy of server memory fault early warning and reduce server maintenance cost.

Description

Method, device, equipment and storage medium for early warning of memory faults of server
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for early warning of a memory failure of a server.
Background
With the rapid development of server technology, servers are widely used in many fields. Memory, as one of a server's main components, is among the components most prone to faults that affect system stability. How to identify memory faults in advance and handle them before a serious failure occurs is therefore a technical problem to be solved in this field.
Currently, to ensure the stability and reliability of a server system, a server typically handles correctable errors in memory through a memory funnel mechanism, such as an ECC (Error Correction Code) mechanism, which identifies errors that occur in the memory system and corrects them (correctable errors, or CE errors). For example, when data in the memory system experiences a bit flip or another hardware failure, the ECC mechanism detects and corrects the error. Specifically, the memory funnel mechanism records the number of correctable errors of each memory in a funnel counter and polls the funnel counter periodically. When the number of correctable errors is found to have reached a preset threshold, fault early warning is triggered to prompt the responsible server administrators to handle the fault.
However, some servers, such as servers based on the Hygon platform, can support a memory correctable error count of at most 4095 because of a hardware limitation of the registers, and this low threshold reduces the accuracy of memory fault early warning. In addition, some memory faults are soft faults (for example, bit flips caused by cosmic rays or sudden electromagnetic interference). Such faults recover automatically within a certain time and are not uncorrectable errors. If fault early warning still relies purely on counting, false alarms are likely, which leads to memory being replaced by mistake and lowers the operation and maintenance efficiency of the server.
Disclosure of Invention
Accordingly, the present application aims to provide a method, a device and a storage medium for early warning of memory failure of a server, which can improve the accuracy of early warning of memory failure of the server, avoid error replacement of the memory, and reduce the maintenance cost of the server. The specific scheme is as follows:
In a first aspect, the application discloses a server memory fault early warning method, which comprises the following steps:
obtaining correctable error funnel parameter configuration information of a target memory in a server to obtain target configuration information;
monitoring the correctable errors triggered by the target memory, and recording the number of the correctable errors into a target funnel counter;
calculating a current actual count value of the target funnel counter based on the number of correctable errors and the target configuration information;
judging whether the current actual count value is larger than a preset correctable error funnel threshold value, if so, recording a correctable error storm event once;
counting all the correctable error storm events recorded in the preset time to obtain the target storm event times;
and judging whether the number of the target storm events is greater than a preset number threshold, if so, generating corresponding memory fault early warning information to perform fault early warning on the target memory.
Optionally, the obtaining the configuration information of the correctable error funnel parameter of the target memory in the server to obtain the configuration information of the target includes:
obtaining correctable error funnel parameter configuration information of a target memory from a basic input/output system of a server to obtain target configuration information; the correctable error funnel parameter configuration information comprises a correctable error funnel period, a correctable error funnel frequency and the correctable error funnel threshold.
Optionally, the monitoring the correctable errors triggered by the target memory and recording the number of the correctable errors into a target funnel counter includes:
and monitoring the correctable errors triggered by the target memory through the basic input output system, and recording the monitored number of the correctable errors into a target funnel counter positioned in the basic input output system.
Optionally, the calculating the current actual count value of the target funnel counter based on the number of correctable errors and the target configuration information includes:
counting the number of the correctable errors in a single correctable error funnel period through the target funnel counter to obtain a counting result;
calculating the product of the correctable error funnel period and the correctable error funnel frequency to obtain a target product result;
and calculating the difference value between the statistical result and the target product result to obtain the current actual count value of the target funnel counter in the single correctable error funnel period.
Optionally, the determining whether the current actual count value is greater than a preset correctable error funnel threshold, if yes, recording a correctable error storm event, including:
judging whether the current actual count value is larger than a preset correctable error funnel threshold value or not;
if the current actual count value is larger than the correctable error funnel threshold value, recording a correctable error storm event once, and acquiring the recording time of the correctable error storm event;
reporting the correctable error storm event and the recording time to a baseboard management controller;
correspondingly, the determining whether the target storm event number is greater than a preset number threshold value, if so, generating corresponding memory fault early warning information to perform fault early warning on the target memory, including:
judging whether the number of the target storm events is larger than a preset number threshold value or not through the baseboard management controller;
if the number of the target storm events is larger than the preset number threshold, judging that the target memory is likely to fail, and generating corresponding memory failure early warning information so as to perform failure early warning on the target memory.
Optionally, after reporting the correctable error storm event and the recording time to the baseboard management controller, the method further includes:
binding the correctable error storm event, the recording time and the corresponding target memory through the baseboard management controller.
Optionally, after the determining whether the current actual count value is greater than a preset correctable error funnel threshold, the method further includes:
and if the current actual count value is not greater than the correctable error funnel threshold value, resetting the current actual count value of the target funnel counter, and re-executing the step of obtaining the correctable error funnel parameter configuration information of the target memory in the server to obtain target configuration information.
In a second aspect, the present application discloses a server memory fault early warning device, including:
the information acquisition module is used for acquiring the correctable error funnel parameter configuration information of the target memory in the server to obtain target configuration information;
the monitoring module is used for monitoring the correctable errors triggered by the target memory;
the number recording module is used for recording the number of the correctable errors into a target funnel counter;
a calculation module for calculating a current actual count value of the target funnel counter based on the number of correctable errors and the target configuration information;
the first judging module is used for judging whether the current actual count value is larger than a preset correctable error funnel threshold value or not;
the event recording module is used for recording a correctable error storm event once if the current actual count value is larger than the correctable error funnel threshold value;
the event statistics module is used for counting all the correctable error storm events recorded in the preset time to obtain the times of the target storm events;
the second judging module is used for judging whether the number of the target storm events is larger than a preset number threshold value or not;
and the information generation module is used for generating corresponding memory fault early warning information to perform fault early warning on the target memory if the target storm event times are greater than the preset times threshold value.
In a third aspect, the application discloses an electronic device comprising a processor and a memory; the processor realizes the server memory fault early warning method when executing the computer program stored in the memory.
In a fourth aspect, the present application discloses a computer-readable storage medium for storing a computer program; the computer program realizes the server memory fault early warning method when being executed by the processor.
It can be seen that the present application first obtains the correctable error funnel parameter configuration information of the target memory in the server to obtain the target configuration information, monitors the correctable errors triggered by the target memory, and records the number of correctable errors into the target funnel counter. It then calculates the current actual count value of the target funnel counter based on the number of correctable errors and the target configuration information, and judges whether the current actual count value is greater than the preset correctable error funnel threshold; if so, one correctable error storm event is recorded. All correctable error storm events recorded within the preset time are then counted to obtain the number of target storm events. Finally, it judges whether the number of target storm events is greater than the preset number threshold; if so, corresponding memory fault early warning information is generated to perform fault early warning on the target memory. By jointly considering the time factor and the number of correctable error storm events, the application lengthens the time span over which fault early warning is evaluated, which is equivalent to raising the threshold of the memory correctable error count. This improves the accuracy of server memory fault early warning, avoids mistaken memory replacement, and reduces server maintenance cost.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for early warning of memory failure of a server disclosed by the application;
FIG. 2 is a flowchart of a specific method for early warning of memory failure in a server according to the present application;
FIG. 3 is a schematic structural diagram of a server memory failure early warning device disclosed in the present application;
FIG. 4 is a block diagram of an electronic device according to the present disclosure.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The embodiment of the application discloses a server memory fault early warning method, which is shown in fig. 1 and comprises the following steps:
Step S11: Obtain the correctable error funnel parameter configuration information of the target memory in the server to obtain the target configuration information.
In this embodiment, the correctable error funnel parameter configuration information (that is, the CE funnel parameter configuration information) of the target memory in the server that is to be subject to fault early warning is read first, yielding the corresponding target configuration information.
Specifically, obtaining the correctable error funnel parameter configuration information of the target memory in the server to obtain the target configuration information may include: obtaining the correctable error funnel parameter configuration information of the target memory from the basic input/output system of the server to obtain the target configuration information; the correctable error funnel parameter configuration information comprises a correctable error funnel period, a correctable error funnel frequency and a correctable error funnel threshold. That is, the memory correctable error funnel parameter configuration information is stored in the basic input/output system (BIOS, Basic Input Output System) of the server and specifically includes a correctable error funnel period (T), a correctable error funnel frequency (F) and a correctable error funnel threshold (S). For example, reading the memory CE funnel parameter configuration information under the BIOS yields a CE funnel period T = 60 s, a CE funnel frequency F = 60/s and a CE funnel threshold S = 4095.
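As an illustration of how the three parameters relate, the following minimal Python sketch holds them in one structure. The class and field names (`CeFunnelConfig`, `period_s`, `leak_per_s`, `threshold`) are assumptions made for this example and do not come from the application; only the values T = 60 s, F = 60/s and S = 4095 are taken from the example in this embodiment. How the values are actually read from the BIOS is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CeFunnelConfig:
    """CE funnel parameters (hypothetical field names, for illustration only)."""
    period_s: float    # T: correctable error funnel period, in seconds
    leak_per_s: float  # F: correctable error funnel frequency, counts per second
    threshold: int     # S: correctable error funnel threshold

    def leaked_per_period(self) -> float:
        """Amount drained from the funnel over one period, i.e. F * T."""
        return self.leak_per_s * self.period_s


# Values from the example in the text: T = 60 s, F = 60/s, S = 4095.
example = CeFunnelConfig(period_s=60, leak_per_s=60, threshold=4095)
```

With these values the funnel drains F × T = 3600 counts per period, so a period only exceeds the threshold when more than 3600 + 4095 correctable errors are recorded in it.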
Step S12: Monitor the correctable errors triggered by the target memory, and record the number of correctable errors into a target funnel counter.
In this embodiment, the correctable errors triggered by the target memory are monitored in real time, and the number of monitored correctable errors is recorded into the target funnel counter.
Specifically, monitoring the correctable errors triggered by the target memory and recording the number of correctable errors into the target funnel counter may include: monitoring the correctable errors triggered by the target memory through the basic input/output system, and recording the monitored number of correctable errors into a target funnel counter located in the basic input/output system. In this embodiment, a funnel counter is added to the BIOS of the server to record, in real time, the number N of CE errors generated within the monitored CE funnel period. Each time one CE error is detected, N, which is the input value of the target funnel counter, increases by 1.
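The counting behaviour described here can be sketched in a few lines of Python. This is only a model of the bookkeeping, not BIOS firmware code; the class name `FunnelCounter` and its methods are hypothetical.

```python
class FunnelCounter:
    """Minimal model of the per-period CE input count N (illustrative only)."""

    def __init__(self) -> None:
        self.raw_count = 0  # N: CE errors seen in the current funnel period

    def on_correctable_error(self) -> None:
        """Called once per detected CE error; N increases by 1."""
        self.raw_count += 1

    def reset(self) -> None:
        """Clear the count at the start of a new funnel period."""
        self.raw_count = 0
```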
Step S13: a current actual count value of the target funnel counter is calculated based on the number of correctable errors and the target configuration information.
In this embodiment, after the number of correctable errors is recorded in the target funnel counter, further, the current actual count value of the target funnel counter, that is, the actual display value of the target funnel counter, may be calculated based on the number of correctable errors and the target configuration information.
In a specific embodiment, the calculating the current actual count value of the target funnel counter based on the number of correctable errors and the target configuration information may specifically include: counting the number of the correctable errors in a single correctable error funnel period through the target funnel counter to obtain a counting result; calculating the product of the correctable error funnel period and the correctable error funnel frequency to obtain a target product result; and calculating the difference value between the statistical result and the target product result to obtain the current actual count value of the target funnel counter in the single correctable error funnel period. In this embodiment, when calculating the current actual count value D of the target funnel counter, the number of correctable errors in a single correctable error funnel period may be counted by the target funnel counter to obtain a corresponding statistical result, then the product of the correctable error funnel period and the correctable error funnel frequency is calculated to obtain a target product result, that is, the data leaked from the target funnel counter in unit time, and then the difference between the statistical result and the target product result is calculated, so as to obtain the current actual count value of the target funnel counter in the correctable error funnel period. The specific calculation formula is as follows: the funnel counter value d=n-f×t in each funnel period.
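The calculation can be written as a one-line function. The sketch below is a direct transcription of the formula D = N - F × T (N is the per-period error count, F the funnel frequency, T the funnel period); the function and parameter names are assumptions, and the input counts of 8000 and 5000 are made-up values chosen only to show both outcomes. The source formula does not clamp D at zero, so the sketch does not either.

```python
def actual_count(n_errors: int, period_s: float, leak_per_s: float) -> float:
    """Return D = N - F * T for one funnel period (names are illustrative)."""
    return n_errors - leak_per_s * period_s


# With T = 60 s and F = 60/s, the funnel leaks 3600 counts per period.
high = actual_count(n_errors=8000, period_s=60, leak_per_s=60)  # 8000 - 3600
low = actual_count(n_errors=5000, period_s=60, leak_per_s=60)   # 5000 - 3600
```

Against a threshold S = 4095, the first period (D = 4400) would count as a storm and the second (D = 1400) would not.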
Step S14: Judge whether the current actual count value is greater than a preset correctable error funnel threshold and, if so, record one correctable error storm event.
In this embodiment, after the current actual count value of the target funnel counter is calculated, it is judged whether the current actual count value D is greater than the preset correctable error funnel threshold S. If D is greater than S, one correctable error storm event, that is, a CE storm event, is recorded.
In addition, after judging whether the current actual count value is greater than the preset correctable error funnel threshold, the method may further include: if the current actual count value is not greater than the correctable error funnel threshold, resetting the current actual count value of the target funnel counter and re-executing the step of obtaining the correctable error funnel parameter configuration information of the target memory in the server to obtain the target configuration information. In this embodiment, if D is not greater than S, the target memory is currently running normally and no fault early warning is required. The current actual count value D of the target funnel counter is cleared directly and the judgment for the next correctable error funnel period T begins, that is, the step of obtaining the correctable error funnel parameter configuration information of the target memory in the server is re-executed.
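The per-period decision can be sketched as a loop over successive funnel periods. This is a simplified model under stated assumptions: each list entry is the CE count N of one period, every period starts from a cleared counter (the text only specifies clearing when the threshold is not exceeded), and the function name `storm_periods` is hypothetical.

```python
from typing import List


def storm_periods(counts_per_period: List[int], period_s: float,
                  leak_per_s: float, threshold: int) -> List[int]:
    """Return the indices of funnel periods in which D > S."""
    storms = []
    for index, n_errors in enumerate(counts_per_period):
        d = n_errors - leak_per_s * period_s  # D = N - F * T
        if d > threshold:
            storms.append(index)  # record one CE storm event
        # Otherwise the counter is cleared and the next period begins.
    return storms


# D per period: -3500, 4400, 4095 (equal to S, so not a storm), 4096.
result = storm_periods([100, 8000, 7695, 7696], 60, 60, 4095)
```

Note that the comparison is strict: a period whose D equals S exactly does not record a storm event.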
Step S15: Count all the correctable error storm events recorded within the preset time to obtain the number of target storm events.
Further, all correctable error storm events recorded within the preset time are counted to obtain the number of target storm events. It should be noted that the preset time may be set according to actual application requirements; for example, all correctable error storm events recorded within a one-month period may be counted.
Step S16: Judge whether the number of target storm events is greater than a preset number threshold and, if so, generate corresponding memory fault early warning information to perform fault early warning on the target memory.
In this embodiment, after the number of target storm events is obtained by counting all correctable error storm events recorded within the preset time, it may be further judged whether that number is greater than the preset number threshold. If it is, a piece of memory fault early warning information is generated, and fault early warning is performed on the target memory through that information. After obtaining the memory fault early warning information, the user can inspect the target memory accordingly and, if a fault exists, replace the target memory or take other action to clear the fault.
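Steps S15 and S16 together amount to a sliding-window count followed by a strict comparison. The following Python sketch shows one way to express that; the function name `should_warn`, its parameters, and the use of a closed window ending at `now` are assumptions for illustration, since the text does not say how the preset time is anchored.

```python
from datetime import datetime, timedelta
from typing import Iterable


def should_warn(storm_times: Iterable[datetime], now: datetime,
                window: timedelta, count_threshold: int) -> bool:
    """True if more than count_threshold storms fall within [now - window, now]."""
    recent = sum(1 for t in storm_times if now - window <= t <= now)
    return recent > count_threshold
```

A call such as `should_warn(times, now, timedelta(days=30), 3)` would then report whether more than three storm events were recorded in the last 30 days.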
It can be seen that, in the embodiment of the present application, the correctable error funnel parameter configuration information of the target memory in the server is first obtained to give the target configuration information, the correctable errors triggered by the target memory are monitored, and the number of correctable errors is recorded in the target funnel counter. The current actual count value of the target funnel counter is then calculated based on the number of correctable errors and the target configuration information, and it is judged whether the current actual count value is greater than the preset correctable error funnel threshold; if so, one correctable error storm event is recorded. All correctable error storm events recorded within the preset time are then counted to obtain the number of target storm events. Finally, it is judged whether the number of target storm events is greater than the preset number threshold; if so, corresponding memory fault early warning information is generated to perform fault early warning on the target memory. By jointly considering the time factor and the number of correctable error storm events, the embodiment lengthens the time span over which fault early warning is evaluated, which is equivalent to raising the threshold of the memory correctable error count. This improves the accuracy of server memory fault early warning, avoids mistaken memory replacement, and reduces server maintenance cost.
The embodiment of the application discloses a specific server memory fault early warning method, which is shown in fig. 2 and comprises the following steps:
Step S21: Obtain the correctable error funnel parameter configuration information of the target memory from the basic input/output system of the server to obtain the target configuration information; the correctable error funnel parameter configuration information comprises a correctable error funnel period, a correctable error funnel frequency and a correctable error funnel threshold.
Step S22: Monitor the correctable errors triggered by the target memory through the basic input/output system, and record the monitored number of correctable errors into a target funnel counter located in the basic input/output system.
Step S23: Calculate a current actual count value of the target funnel counter based on the number of correctable errors and the target configuration information.
Step S24: Judge whether the current actual count value is greater than a preset correctable error funnel threshold.
In this embodiment, after the current actual count value of the target funnel counter is calculated based on the number of correctable errors and the target configuration information, it may be further judged whether the current actual count value is greater than the preset correctable error funnel threshold.
Step S25: If the current actual count value is greater than the correctable error funnel threshold, record one correctable error storm event and acquire the recording time of the correctable error storm event.
In this embodiment, if the current actual count value is greater than the correctable error funnel threshold, one correctable error storm event is recorded and its recording time, that is, the time at which the correctable error storm event occurred, is acquired.
Step S26: Report the correctable error storm event and the recording time to a baseboard management controller.
Further, the correctable error storm event is reported to the baseboard management controller (BMC, Baseboard Management Controller) together with the recording time.
In a specific embodiment, after the correctable error storm event and the recording time are reported to the baseboard management controller, the method further includes: binding the correctable error storm event, the recording time and the corresponding target memory through the baseboard management controller. That is, when the correctable error storm event is reported to the BMC, the BMC may record the time of the event and the memory it corresponds to.
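The binding of event, recording time and memory can be pictured as a small per-memory log. The Python sketch below is an illustrative model only: `CeStormEvent`, `BmcStormLog` and the memory identifiers are hypothetical names, and the sketch does not represent any real BMC firmware interface or event log format.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import DefaultDict, List


@dataclass(frozen=True)
class CeStormEvent:
    """One CE storm event as reported to the BMC (illustrative fields)."""
    memory_id: str         # identifier of the target memory, e.g. a DIMM slot
    recorded_at: datetime  # recording time of the storm event


class BmcStormLog:
    """Keeps storm recording times grouped by the memory they belong to."""

    def __init__(self) -> None:
        self._events: DefaultDict[str, List[datetime]] = defaultdict(list)

    def report(self, event: CeStormEvent) -> None:
        """Bind the event and its recording time to the corresponding memory."""
        self._events[event.memory_id].append(event.recorded_at)

    def times_for(self, memory_id: str) -> List[datetime]:
        """Return a copy of the recording times bound to one memory."""
        return list(self._events.get(memory_id, []))
```

Grouping by memory at report time keeps the later per-memory count in steps S27 and S28 a simple lookup.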
Step S27: Count all the correctable error storm events recorded within the preset time to obtain the number of target storm events.
Step S28: Judge, through the baseboard management controller, whether the number of target storm events is greater than a preset number threshold.
In this embodiment, after the number of target storm events is obtained by counting all correctable error storm events recorded within the preset time, the BMC may judge whether that number is greater than the preset number threshold.
Step S29: If the number of target storm events is greater than the preset number threshold, judge that the target memory is likely to fail, and generate corresponding memory fault early warning information so as to perform fault early warning on the target memory.
For example, when 5 CE storm events occur within a one-month period, exceeding the preset number threshold of 3, the target memory is judged to be a memory that may fail, and a piece of memory fault early warning information is generated to perform fault early warning on the target memory.
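The numbers in this example can be checked with a short Python sketch that evaluates several memories at once. Everything apart from the figures 5 and 3 is an assumption: "one month" is approximated as 30 days, the memory identifiers and timestamps are invented sample data, and `memories_to_warn` is a hypothetical helper name.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def memories_to_warn(storms_by_memory: Dict[str, List[datetime]],
                     now: datetime,
                     window: timedelta = timedelta(days=30),
                     count_threshold: int = 3) -> List[str]:
    """Return memories whose storm count in the window exceeds the threshold."""
    flagged = []
    for memory_id, times in storms_by_memory.items():
        recent = [t for t in times if now - window <= t <= now]
        if len(recent) > count_threshold:
            flagged.append(memory_id)
    return sorted(flagged)


now = datetime(2023, 8, 30)
storms = {
    # Five storms inside the window: 5 > 3, so a warning is generated.
    "DIMM_A0": [now - timedelta(days=d) for d in (1, 4, 9, 15, 22)],
    # Only one storm inside the window: no warning.
    "DIMM_B1": [now - timedelta(days=d) for d in (2, 40)],
}
flagged = memories_to_warn(storms, now)
```

Here only `DIMM_A0` is flagged, which matches the outcome described in the example.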
For more specific processing procedures in the steps S21, S22, S23, and S27, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and no further description is given here.
It can be seen that, in this embodiment of the present application, the correctable error funnel parameter configuration information of the target memory, comprising the correctable error funnel period, the correctable error funnel frequency and the correctable error funnel threshold, is first obtained from the basic input/output system of the server to obtain the target configuration information. The correctable errors triggered by the target memory are then monitored through the basic input/output system, and the number of monitored correctable errors is recorded into the target funnel counter located in the basic input/output system. Next, the current actual count value of the target funnel counter is calculated based on the number of correctable errors and the target configuration information, and it is judged whether the current actual count value is greater than the correctable error funnel threshold. If so, a correctable error storm event is recorded once, its recording time is obtained, and the event and the recording time are reported to the baseboard management controller. All correctable error storm events recorded within the preset time are then counted to obtain the number of target storm events. Finally, the baseboard management controller judges whether the number of target storm events is greater than the preset number threshold; if so, the target memory is judged to be likely to fail, and corresponding memory fault early warning information is generated to perform fault early warning on the target memory.
The embodiment of the application comprehensively considers both the time factor and the number of correctable error storm events and extends the time span of fault early warning, which is equivalent to raising the threshold of the memory correctable error count. This not only improves the accuracy of server memory fault early warning but also avoids mistaken memory replacement caused by false fault reports, thereby reducing server maintenance costs.
Correspondingly, the embodiment of the application also discloses a server memory fault early warning device. As shown in Fig. 3, the device comprises:
the information acquisition module 11 is configured to acquire correctable error funnel parameter configuration information of a target memory in the server, and obtain target configuration information;
a monitoring module 12, configured to monitor the target memory for triggered correctable errors;
a number recording module 13, configured to record the number of correctable errors into a target funnel counter;
a calculation module 14 for calculating a current actual count value of the target funnel counter based on the number of correctable errors and the target configuration information;
a first judging module 15, configured to judge whether the current actual count value is greater than a preset correctable error funnel threshold;
an event recording module 16, configured to record a correctable error storm event once if the current actual count value is greater than the correctable error funnel threshold;
the event statistics module 17 is configured to count all the correctable error storm events recorded in the preset time, so as to obtain the number of target storm events;
a second judging module 18, configured to judge whether the number of times of the target storm event is greater than a preset number threshold;
the information generating module 19 is configured to generate corresponding memory failure early warning information to perform failure early warning on the target memory if the target storm event number is greater than the preset number threshold.
The specific workflow of each module may refer to the corresponding content disclosed in the foregoing embodiment, and will not be described herein.
It can be seen that, in this embodiment of the present application, the correctable error funnel parameter configuration information of the target memory in the server is first acquired to obtain the target configuration information, and the correctable errors triggered by the target memory are monitored. The number of correctable errors is then recorded into the target funnel counter, and the current actual count value of the target funnel counter is calculated based on the number of correctable errors and the target configuration information. It is then judged whether the current actual count value is greater than the preset correctable error funnel threshold; if so, a correctable error storm event is recorded once. All correctable error storm events recorded within the preset time are counted to obtain the number of target storm events. Finally, it is judged whether the number of target storm events is greater than the preset number threshold; if so, corresponding memory fault early warning information is generated to perform fault early warning on the target memory. The application comprehensively considers both the time factor and the number of correctable error storm events and extends the time span of fault early warning, which is equivalent to raising the threshold of the memory correctable error count. This improves the accuracy of server memory fault early warning, avoids mistaken memory replacement and reduces server maintenance costs.
In some specific embodiments, the information acquisition module 11 may specifically include:
the information acquisition unit is used for acquiring the correctable error funnel parameter configuration information of the target memory from the basic input/output system of the server to obtain target configuration information; the correctable error funnel parameter configuration information comprises a correctable error funnel period, a correctable error funnel frequency and the correctable error funnel threshold.
In some embodiments, the monitoring module 12 may specifically include:
and the monitoring unit is used for monitoring the correctable errors triggered by the target memory through the basic input/output system.
In some specific embodiments, the number recording module 13 may specifically include:
and the quantity recording unit is used for recording the monitored quantity of the correctable errors into a target funnel counter of the basic input/output system.
In some embodiments, the calculation module 14 may specifically include:
the quantity counting unit is used for counting the number of the correctable errors in a single correctable error funnel period through the target funnel counter to obtain a statistical result;
the product calculation unit is used for calculating the product of the correctable error funnel period and the correctable error funnel frequency to obtain a target product result;
and the difference value calculation unit is used for calculating the difference value between the statistical result and the target product result to obtain the current actual count value of the target funnel counter in the single correctable error funnel period.
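The three units above amount to a single subtraction, which the following Python sketch illustrates. The function names and the example numbers are assumptions chosen for illustration, as are the units (a funnel period of 10 time units and a funnel frequency of 2 errors per time unit); the application does not fix them in this passage.

```python
def current_actual_count(errors_in_period: int, funnel_period: int,
                         funnel_frequency: int) -> int:
    """Statistical result minus (funnel period x funnel frequency)."""
    target_product = funnel_period * funnel_frequency
    return errors_in_period - target_product


def exceeds_funnel_threshold(actual_count: int, funnel_threshold: int) -> bool:
    """A storm is recorded only if the count is strictly above the threshold."""
    return actual_count > funnel_threshold


# Assumed configuration: period = 10, frequency = 2, threshold = 10.
busy = current_actual_count(errors_in_period=35, funnel_period=10,
                            funnel_frequency=2)
quiet = current_actual_count(errors_in_period=18, funnel_period=10,
                             funnel_frequency=2)

print(busy, exceeds_funnel_threshold(busy, 10))    # 15 True
print(quiet, exceeds_funnel_threshold(quiet, 10))  # -2 False
```

The sketch returns the raw difference, which can be negative when fewer errors arrive than the funnel drains. Whether an implementation clamps that value at zero is not stated here, so no clamping is applied.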
In some specific embodiments, the first judging module 15 may specifically include:
and the first judging unit is used for judging whether the current actual count value is larger than a preset correctable error funnel threshold value or not.
In some embodiments, the event recording module 16 may specifically include:
the event recording unit is used for recording a correctable error storm event once if the current actual count value is larger than the correctable error funnel threshold value;
a time acquisition unit, configured to acquire a recording time of the correctable error storm event;
the information reporting unit is used for reporting the correctable error storm event and the recording time to the baseboard management controller;
correspondingly, the second judging module 18 may specifically include:
the second judging unit is used for judging, through the baseboard management controller, whether the number of target storm events is greater than a preset number threshold;
correspondingly, the information generating module 19 may specifically include:
and the fault early warning unit is used for judging that the target memory is likely to have faults if the target storm event times are larger than the preset times threshold value, and generating corresponding memory fault early warning information so as to perform fault early warning on the target memory.
In some specific embodiments, the device may further include, after the information reporting unit:
and the information binding unit is used for binding the correctable error storm event, the recording time and the corresponding target memory through the baseboard management controller.
In some embodiments, the device may further include, after the first judging module 15:
a count value zero clearing unit, configured to zero-clear the current actual count value of the target funnel counter if the current actual count value is not greater than the correctable error funnel threshold;
and the execution unit is used for re-executing the step of acquiring the correctable error funnel parameter configuration information of the target memory in the server to obtain target configuration information.
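The zero-clearing and re-execution behaviour can be sketched as a simple loop over funnel periods. This is a hedged illustration only: `FunnelConfig`, `FunnelCounter`, `end_of_period`, `run` and `load_config` are hypothetical names, and clearing the counter after a storm has been recorded is an assumption of the sketch, since this passage only specifies clearing when the threshold is not exceeded.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class FunnelConfig:
    period: int      # correctable error funnel period
    frequency: int   # correctable error funnel frequency
    threshold: int   # correctable error funnel threshold


@dataclass
class FunnelCounter:
    value: int = 0   # correctable errors recorded in the current period


def end_of_period(counter: FunnelCounter, config: FunnelConfig) -> bool:
    """Evaluate one funnel period and report whether a storm occurred."""
    actual = counter.value - config.period * config.frequency
    is_storm = actual > config.threshold
    # The source clears the counter when the threshold is not exceeded.
    # Clearing it after a storm as well is an assumption of this sketch.
    counter.value = 0
    return is_storm


def run(errors_per_period: Iterable[int],
        load_config: Callable[[], FunnelConfig]) -> int:
    """Return the number of storm events over the given periods."""
    counter = FunnelCounter()
    storms = 0
    for errors in errors_per_period:
        config = load_config()   # re-acquire the configuration each round
        counter.value += errors  # record the monitored correctable errors
        if end_of_period(counter, config):
            storms += 1
    return storms


# Assumed configuration: period = 10, frequency = 2, threshold = 10.
print(run([35, 18, 40], lambda: FunnelConfig(10, 2, 10)))  # 2
```

Re-reading the configuration at the start of every round means that a change to the funnel parameters takes effect from the next period onward.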
Further, the embodiment of the present application further discloses an electronic device, and fig. 4 is a block diagram of an electronic device 20 according to an exemplary embodiment, where the content of the diagram is not to be considered as any limitation on the scope of use of the present application.
Fig. 4 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present application. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input output interface 25, and a communication bus 26. The memory 22 is configured to store a computer program, where the computer program is loaded and executed by the processor 21 to implement relevant steps in the server memory failure early warning method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in the present embodiment may be specifically an electronic computer.
In this embodiment, the power supply 23 is configured to provide an operating voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and the communication protocol to be followed is any communication protocol applicable to the technical solution of the present application, which is not specifically limited herein; the input/output interface 25 is used for acquiring external input data or outputting external output data, and the specific interface type thereof may be selected according to the specific application requirement, which is not limited herein.
The memory 22 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk, or an optical disk, and the resources stored thereon may include an operating system 221, a computer program 222, and the like, and the storage may be temporary storage or permanent storage.
The operating system 221 is used for managing and controlling the hardware devices on the electronic device 20 and the computer program 222, and may be Windows Server, NetWare, Unix, Linux, etc. In addition to the computer program used to perform the server memory fault early warning method disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs used to perform other specific tasks.
Further, the application also discloses a computer readable storage medium for storing a computer program; the computer program realizes the server memory fault early warning method when being executed by the processor. For specific steps of the method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and no further description is given here.
In this specification, the embodiments are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be understood by reference to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, it is described relatively briefly; for the relevant points, refer to the description of the method.
Those skilled in the art will further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in Random Access Memory (RAM), internal memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing describes in detail the method, device, equipment and storage medium for early warning of server memory faults provided by the present application. Specific examples are used herein to illustrate the principles and embodiments of the application, and the description of the above embodiments is only intended to help in understanding the method and core idea of the application. Meanwhile, those skilled in the art may, in accordance with the ideas of the application, make changes to the specific embodiments and the scope of application. In view of the above, the content of this description should not be construed as limiting the present application.

Claims (10)

1. A server memory fault early warning method, characterized by comprising the following steps:
obtaining correctable error funnel parameter configuration information of a target memory in a server to obtain target configuration information;
monitoring the correctable errors triggered by the target memory, and recording the number of the correctable errors into a target funnel counter;
calculating a current actual count value of the target funnel counter based on the number of correctable errors and the target configuration information;
judging whether the current actual count value is larger than a preset correctable error funnel threshold value, if so, recording a correctable error storm event once;
counting all the correctable error storm events recorded within the preset time to obtain the number of target storm events;
and judging whether the number of the target storm events is greater than a preset number threshold, if so, generating corresponding memory fault early warning information to perform fault early warning on the target memory.
2. The method for early warning of a server memory failure according to claim 1, wherein the obtaining the configuration information of the correctable error funnel parameters of the target memory in the server to obtain the target configuration information includes:
obtaining correctable error funnel parameter configuration information of a target memory from a basic input/output system of a server to obtain target configuration information; the correctable error funnel parameter configuration information comprises a correctable error funnel period, a correctable error funnel frequency and the correctable error funnel threshold.
3. The server memory failure pre-warning method according to claim 2, wherein the monitoring the target memory triggered correctable errors and recording the number of correctable errors into a target funnel counter comprises:
and monitoring the correctable errors triggered by the target memory through the basic input output system, and recording the monitored number of the correctable errors into a target funnel counter positioned in the basic input output system.
4. The server memory failure warning method according to claim 2, wherein the calculating the current actual count value of the target funnel counter based on the number of correctable errors and the target configuration information includes:
counting the number of the correctable errors in a single correctable error funnel period through the target funnel counter to obtain a statistical result;
calculating the product of the correctable error funnel period and the correctable error funnel frequency to obtain a target product result;
and calculating the difference value between the statistical result and the target product result to obtain the current actual count value of the target funnel counter in the single correctable error funnel period.
5. The method for early warning of a memory failure in a server according to claim 4, wherein the determining whether the current actual count value is greater than a preset correctable error funnel threshold, if so, recording a correctable error storm event comprises:
judging whether the current actual count value is larger than a preset correctable error funnel threshold value or not;
if the current actual count value is larger than the correctable error funnel threshold value, recording a correctable error storm event once, and acquiring the recording time of the correctable error storm event;
reporting the correctable error storm event and the recording time to a baseboard management controller;
correspondingly, the determining whether the target storm event number is greater than a preset number threshold value, if so, generating corresponding memory fault early warning information to perform fault early warning on the target memory, including:
judging whether the number of the target storm events is larger than a preset number threshold value or not through the baseboard management controller;
if the number of the target storm events is larger than the preset number threshold, judging that the target memory is likely to fail, and generating corresponding memory failure early warning information so as to perform failure early warning on the target memory.
6. The method for pre-warning of a memory failure of a server according to claim 5, wherein after reporting the correctable error storm event and the recording time to a baseboard management controller, further comprising:
binding the correctable error storm event, the recording time and the corresponding target memory through the baseboard management controller.
7. The method for early warning of a memory failure in a server according to any one of claims 1 to 6, further comprising, after the determining whether the current actual count value is greater than a preset correctable error funnel threshold:
and if the current actual count value is not greater than the correctable error funnel threshold value, zero-clearing the current actual count value of the target funnel counter, and re-executing the step of obtaining the correctable error funnel parameter configuration information of the target memory in the server to obtain target configuration information.
8. A server memory fault early warning device, characterized by comprising:
the information acquisition module is used for acquiring the correctable error funnel parameter configuration information of the target memory in the server to obtain target configuration information;
the monitoring module is used for monitoring the correctable errors triggered by the target memory;
the number recording module is used for recording the number of the correctable errors into a target funnel counter;
a calculation module for calculating a current actual count value of the target funnel counter based on the number of correctable errors and the target configuration information;
the first judging module is used for judging whether the current actual count value is larger than a preset correctable error funnel threshold value or not;
the event recording module is used for recording a correctable error storm event once if the current actual count value is larger than the correctable error funnel threshold value;
the event statistics module is used for counting all the correctable error storm events recorded within the preset time to obtain the number of target storm events;
the second judging module is used for judging whether the number of target storm events is greater than a preset number threshold;
and the information generation module is used for generating corresponding memory fault early warning information to perform fault early warning on the target memory if the number of target storm events is greater than the preset number threshold.
9. An electronic device comprising a processor and a memory; wherein the processor, when executing the computer program stored in the memory, implements the server memory fault early warning method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program; wherein the computer program when executed by a processor implements the server memory failure warning method according to any one of claims 1 to 7.
CN202311103756.4A 2023-08-30 2023-08-30 Method, device, equipment and storage medium for early warning of memory faults of server Pending CN117149490A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311103756.4A CN117149490A (en) 2023-08-30 2023-08-30 Method, device, equipment and storage medium for early warning of memory faults of server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311103756.4A CN117149490A (en) 2023-08-30 2023-08-30 Method, device, equipment and storage medium for early warning of memory faults of server

Publications (1)

Publication Number Publication Date
CN117149490A true CN117149490A (en) 2023-12-01

Family

ID=88898183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311103756.4A Pending CN117149490A (en) 2023-08-30 2023-08-30 Method, device, equipment and storage medium for early warning of memory faults of server

Country Status (1)

Country Link
CN (1) CN117149490A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination