CN110674005A - Method and device for monitoring server memory and readable medium - Google Patents

Info

Publication number
CN110674005A
Authority
CN
China
Prior art keywords
log
information
error checking
memory
continuous
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910812159.6A
Other languages
Chinese (zh)
Inventor
岳远斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Wave Intelligent Technology Co Ltd
Original Assignee
Suzhou Wave Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Wave Intelligent Technology Co Ltd
Priority to CN201910812159.6A
Publication of CN110674005A
Current legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3037 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a memory, e.g. virtual memory, cache
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1044 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices with specific ECC/EDC distribution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3072 Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a method, a device, and a medium for monitoring server memory. The method comprises the following steps: receiving error checking and correcting (ECC) information and counting it; judging whether the number of times error checking and correcting information is received within a continuous first preset duration reaches a first threshold; in response to that number reaching the first threshold, recording log information into a black box log; judging whether the log information in the black box log reaches a predetermined standard within a continuous second preset duration; and, in response to the log information in the black box log reaching the predetermined standard within the continuous second preset duration, recording the log information into a BMC system log to remind a user to replace the memory. The scheme provided by the invention implements secondary filtering of ECC monitoring. It avoids the waste and the impact on customer services caused by replacing memory because of a small number of ECC error reports, while ensuring timely handling when a real fault occurs, thereby ensuring stable system operation.

Description

Method and device for monitoring server memory and readable medium
Technical Field
The present invention relates to the field of servers, and in particular, to a method, an apparatus, and a readable medium for monitoring a server memory.
Background
With the advent of the era of big data, cloud computing, and artificial intelligence, internet traffic and data volumes have increased dramatically, and data has become the foundation of many current key technologies. In server systems, as the amount of data to be stored and computed grows, the load on servers increases, which places ever higher demands on the stability of the entire data link. During data transmission, a small number of errors on a link can be corrected by a checking mechanism. If link stability is poor, or hardware on the data transmission link is damaged, erroneous data accumulates frequently and continuously. In mild cases this reduces the computing and storage performance of the server; in severe cases it causes hard disks to drop offline and the system to crash, resulting in service interruption for customers, data loss, and unpredictable losses.
In a server system, to make operation more reliable and stable, the memory used must offer high stability and error-correction capability; the relevant technology is memory with an ECC (Error Correction Code) function. With ECC, an error in a single data unit (for example, a single bit) can be corrected automatically and the memory can continue to be used, whereas two simultaneous data errors cannot be corrected and can cause system downtime. As memory frequencies rise, the requirements on signal transmission become ever more demanding, which inevitably increases the number of single data errors. Such errors are tolerable to a certain extent, because a small number of them does not affect system performance or normal operation. However, if the number of ECC errors is large, for example because the memory itself is damaged, an ECC storm occurs within a short time. Handling these ECC errors then occupies considerable resources in a short period, which can reduce the computing and storage performance of the server and, in severe cases, cause hard disks to drop offline and the system to go down. In current designs, ECC errors are monitored, and when their number exceeds the BIOS threshold an alarm is raised and the memory is replaced; a small number of ECC errors receives no special handling, so this mechanism incurs a high maintenance cost. In fact, only an ECC storm, which degrades performance, requires the memory to be replaced promptly. Therefore, ECC storms need to be actively monitored and identified, and their occurrence reflected in the monitoring log in good time, so as to avoid system-level impact caused by failing to replace the memory in time.
Disclosure of Invention
In view of this, an object of the embodiments of the present invention is to provide a method and an apparatus for monitoring a server memory, which mainly implement secondary filtering of ECC monitoring by adding a recording link of black box logs, achieve the purpose of ECC storm monitoring, ensure that a true failure can be identified, avoid waste and influence on customer services caused by memory replacement due to a small number of ECC errors, ensure timely processing when a true failure occurs, ensure stability of system operation, and improve customer satisfaction and product competitiveness.
Based on the above object, one aspect of the embodiments of the present invention provides a method for monitoring server memory, comprising the following steps: receiving error checking and correcting information and counting it; judging whether the number of times error checking and correcting information is received within a continuous first preset duration reaches a first threshold; in response to the number of times error checking and correcting information is received within the continuous first preset duration reaching the first threshold, recording log information into a black box log; judging whether the log information in the black box log reaches a predetermined standard within a continuous second preset duration; and, in response to the log information in the black box log reaching the predetermined standard within the continuous second preset duration, recording the log information into a Baseboard Management Controller (BMC) system log to remind a user to replace the memory.
In some embodiments, judging whether the log information in the black box log reaches the predetermined standard within the continuous second preset duration comprises: judging whether the number of times error checking and correcting information is received within the continuous second preset duration reaches a second threshold.
In some embodiments, receiving error checking and correcting information and counting it comprises: the system repairing the memory based on the received error checking and correcting information.
In some embodiments, recording the log information into the black box log in response to the number of times error checking and correcting information is received within the continuous first preset duration reaching the first threshold further comprises: in response to that number reaching the first threshold, triggering an interrupt and sending the interrupt to the BMC.
In some embodiments, the method further comprises: judging whether the memory has been replaced; and, in response to the memory having been replaced, resetting to zero the number of times error checking and correcting information has been received.
In another aspect of the embodiments of the present invention, a computer device is also provided, comprising: at least one processor; and a memory storing computer instructions executable on the processor, the instructions, when executed by the processor, performing the following steps: receiving error checking and correcting information and counting it; judging whether the number of times error checking and correcting information is received within a continuous first preset duration reaches a first threshold; in response to the number of times error checking and correcting information is received within the continuous first preset duration reaching the first threshold, recording log information into a black box log; judging whether the log information in the black box log reaches a predetermined standard within a continuous second preset duration; and, in response to the log information in the black box log reaching the predetermined standard within the continuous second preset duration, recording the log information into a BMC system log to remind a user to replace the memory.
In some embodiments, judging whether the log information in the black box log reaches the predetermined standard within the continuous second preset duration comprises: judging whether the number of times error checking and correcting information is received within the continuous second preset duration reaches a second threshold.
In some embodiments, receiving error checking and correcting information and counting it comprises: the system repairing the memory based on the received error checking and correcting information.
In some embodiments, recording the log information into the black box log in response to the number of times error checking and correcting information is received within the continuous first preset duration reaching the first threshold further comprises: in response to that number reaching the first threshold, triggering an interrupt and sending the interrupt to the BMC.
In yet another aspect of the embodiments of the present invention, a computer-readable storage medium is also provided, which stores a computer program for executing the above method when executed by a processor.
The invention has the following beneficial technical effects: by adding a black box log recording stage, it implements secondary filtering of ECC monitoring and achieves ECC storm monitoring, ensuring that real faults are identified. It avoids the waste and the impact on customer services caused by replacing memory because of a small number of ECC error reports, while ensuring timely handling when a real fault occurs. This ensures stable system operation and improves customer satisfaction and product competitiveness.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other embodiments from these drawings without creative effort.
Fig. 1 is a schematic diagram illustrating an embodiment of a method for monitoring a server memory according to the present invention;
Fig. 2 is a flowchart of a method for monitoring a server memory according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that the expressions "first" and "second" in the embodiments of the present invention are used to distinguish two entities or parameters that share the same name but are not identical. "First" and "second" are used merely for convenience of description, should not be construed as limiting the embodiments of the present invention, and are not explained again in the following embodiments.
In view of the foregoing, a first aspect of the embodiments of the present invention provides an embodiment of a method for monitoring a server memory. Fig. 1 is a schematic diagram illustrating an embodiment of a method for monitoring a server memory according to the present invention. As shown in fig. 1, the embodiment of the present invention includes the following steps:
S1, receiving error checking and correcting information and counting it;
S2, judging whether the number of times error checking and correcting information is received within a continuous first preset duration reaches a first threshold;
S3, in response to the number of times error checking and correcting information is received within the continuous first preset duration reaching the first threshold, recording log information into a black box log;
S4, judging whether the log information in the black box log reaches a predetermined standard within a continuous second preset duration; and
S5, in response to the log information in the black box log reaching the predetermined standard within the continuous second preset duration, recording the log information into a BMC system log to remind a user to replace the memory.
Error checking and correcting information is received and counted. A first threshold is set in the BIOS (Basic Input Output System). The memory of the system is monitored; when an error occurs in a single data unit (e.g., one bit), error checking and correcting information is generated, and the BIOS receives and counts it. In some embodiments, receiving error checking and correcting information and counting it comprises: the system repairing the memory based on the received error checking and correcting information.
In some embodiments, the counter may be reset to zero periodically. Because some errors can be repaired within the system, the system memory can be used normally when data errors are infrequent. If the counter were never cleared, however, the number of times error checking and correcting information has been received would eventually reach the first threshold regardless. To avoid this, the counter may be cleared at regular intervals.
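By way of illustration only, the periodic clearing described above might be sketched as follows. The class name `EccCounter`, its fields, and the default threshold and window length are hypothetical and do not come from the disclosure; restarting the count after the threshold is reported is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class EccCounter:
    """First-stage counter for ECC events (all names and values are assumed)."""
    first_threshold: int = 10      # hypothetical BIOS threshold
    window_seconds: float = 60.0   # hypothetical "first preset duration"
    count: int = 0
    window_start: float = 0.0

    def record(self, now: float) -> bool:
        """Count one ECC event; return True when the threshold is reached."""
        # Clear the counter once the current window has expired, so that
        # sparse, harmless errors never accumulate up to the threshold.
        if now - self.window_start >= self.window_seconds:
            self.count = 0
            self.window_start = now
        self.count += 1
        if self.count >= self.first_threshold:
            # Assumption: counting starts afresh after the threshold is reported.
            self.count = 0
            self.window_start = now
            return True
        return False
```

With a threshold of 3 and a 10-second window, three events within two seconds reach the threshold, whereas events spaced 20 seconds apart never do, because each one starts a fresh window.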
It is judged whether the number of times error checking and correcting information is received within the continuous first preset duration reaches the first threshold. In response to that number reaching the first threshold, log information is recorded into the black box log. In some embodiments, this recording further comprises: in response to the number reaching the first threshold, triggering an interrupt and sending the interrupt to the BMC. After receiving the interrupt, the BMC records it in the black box log.
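The hand-off from the BIOS to the BMC can be pictured with a small simulation. This is a sketch under stated assumptions, not firmware code: `Bios`, `Bmc`, `handle_interrupt`, and the shape of a log entry are all hypothetical, the interrupt is modelled as a plain callback, and the first preset duration is left out to keep the example short.

```python
from typing import Callable, List, Tuple

LogEntry = Tuple[float, str]  # (timestamp, description); hypothetical shape


class Bmc:
    """Stand-in for the BMC: interrupts land in the black box log only."""

    def __init__(self) -> None:
        self.black_box_log: List[LogEntry] = []
        self.system_log: List[LogEntry] = []  # not written at this stage

    def handle_interrupt(self, timestamp: float, description: str) -> None:
        self.black_box_log.append((timestamp, description))


class Bios:
    """Stand-in for the BIOS: counts ECC events and raises interrupts."""

    def __init__(self, first_threshold: int,
                 send_interrupt: Callable[[float, str], None]) -> None:
        self.first_threshold = first_threshold
        self.send_interrupt = send_interrupt
        self.count = 0

    def on_ecc_event(self, timestamp: float) -> None:
        self.count += 1
        if self.count >= self.first_threshold:
            self.send_interrupt(timestamp, "ECC threshold reached")
            self.count = 0  # assumption: restart counting after each interrupt


bmc = Bmc()
bios = Bios(first_threshold=3, send_interrupt=bmc.handle_interrupt)
for t in range(7):
    bios.on_ecc_event(float(t))
```

After seven events with a threshold of 3, the black box log in this sketch holds two entries and the system log is still empty, which mirrors the point that the user does not yet see anything at this stage.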
It is then judged whether the log information in the black box log reaches the predetermined standard within the continuous second preset duration. In some embodiments, this comprises: judging whether the number of times error checking and correcting information is received within the continuous second preset duration reaches a second threshold. The number of entries recorded in the black box log within a given period can be counted; for example, when more than 5 errors are reported within half an hour, an ECC storm is considered to exist. In response to the log information in the black box log reaching the predetermined standard, the log information is recorded into the BMC system log to remind the user to replace the memory.
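The second-stage check can be sketched as a count over a trailing time window. The function name `is_ecc_storm` and its parameters are hypothetical; the defaults follow the example in the text (more than 5 entries within half an hour), and treating the window as the interval ending at the current time is an assumption.

```python
from typing import Iterable


def is_ecc_storm(entry_times: Iterable[float], now: float,
                 window_seconds: float = 1800.0, max_entries: int = 5) -> bool:
    """Return True if the black box log holds more than `max_entries`
    entries whose timestamps fall within the last `window_seconds`."""
    recent = [t for t in entry_times if 0 <= now - t <= window_seconds]
    return len(recent) > max_entries


# Six entries, one every five minutes: more than 5 within half an hour.
times = [0.0, 300.0, 600.0, 900.0, 1200.0, 1500.0]
storm_now = is_ecc_storm(times, now=1500.0)
# Only five entries: the example criterion ("more than 5") is not met.
storm_five = is_ecc_storm(times[:5], now=1200.0)
# The same six entries viewed much later have all aged out of the window.
storm_later = is_ecc_storm(times, now=4000.0)
```

Only the first of the three checks reports a storm under these assumptions; in that case the entry would be promoted to the BMC system log.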
In some embodiments, the above method further comprises: judging whether the memory has been replaced; and, in response to the memory having been replaced, resetting to zero the number of times error checking and correcting information has been received.
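The text does not say how a replacement is detected. One possible approach, assumed here purely for illustration, is to compare the serial number of the installed module with the one recorded earlier; the class `DimmErrorCount` and the method `sync_serial` are hypothetical names.

```python
class DimmErrorCount:
    """Tracks the ECC count for one memory module (illustrative only)."""

    def __init__(self, serial: str) -> None:
        self.serial = serial  # serial number seen when counting began
        self.count = 0

    def record_error(self) -> None:
        self.count += 1

    def sync_serial(self, current_serial: str) -> bool:
        """Zero the count if the installed module has changed.

        Returns True when a replacement was detected.
        """
        if current_serial != self.serial:
            self.serial = current_serial
            self.count = 0
            return True
        return False
```

Zeroing the count on replacement keeps errors attributed to the old module from pushing the new one towards the threshold.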
The invention provides a method for implementing storm monitoring through the BIOS and a black box log. The method optimizes the original mechanism and achieves storm monitoring through dual monitoring by the BIOS and the black box log. Specifically: first, an ECC threshold is set in the BIOS, and when the threshold is reached an interrupt is triggered and sent to the BMC; second, after receiving the interrupt, the BMC records it in the black box log, and at this point the user is not aware that ECC errors have occurred; finally, the number of entries recorded in the black box log within a given period is counted, and when, for example, more than 5 errors occur within half an hour, an ECC storm is considered to exist and is recorded in the BMC system log, at which point the end user can see the ECC errors. Whether the memory should be replaced can therefore be decided by checking whether the BMC system log contains a corresponding error report. The method avoids the waste and the impact on customer services caused by replacing memory because of a small number of Error Correction Code (ECC) errors, while ensuring timely handling when a real fault occurs; this ensures stable system operation and improves customer satisfaction and product competitiveness.
Fig. 2 is a flowchart of an embodiment of the method for monitoring server memory according to the present invention. As shown in Fig. 2, the flow starts at block 101 and proceeds to block 102, where error checking and correcting information is received and counted. It continues to block 103, where the system repairs the memory based on the received error checking and correcting information. It then proceeds to block 104, where it is judged whether the number of times error checking and correcting information has been received reaches the first threshold: if so, the flow proceeds to block 105 and records the log information into the black box log; if not, it returns to block 102. After the log information is recorded into the black box log, the flow proceeds to block 106, where it is judged whether the log information in the black box log reaches the predetermined standard: if so, the flow proceeds to block 107, records the log information into the BMC system log, and then ends at block 108; if not, it returns to block 105.
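The flow of Fig. 2 can be traced end to end with a short simulation over a list of event timestamps. The function `run_monitor` and its default values are hypothetical. Two simplifications are assumed: the repair in block 103 and the first preset duration are not modelled, and the "no" branch of block 106 is interpreted as waiting for further black box entries, which in this loop means continuing to receive events.

```python
from typing import List, Optional


def run_monitor(ecc_times: List[float],
                first_threshold: int = 3,
                storm_window: float = 1800.0,
                storm_entries: int = 5) -> Optional[float]:
    """Return the time at which the BMC system log entry is written, or None."""
    count = 0
    black_box: List[float] = []
    for t in ecc_times:                  # block 102: receive and count
        count += 1
        if count < first_threshold:      # block 104: first threshold reached?
            continue                     # no: keep receiving
        count = 0
        black_box.append(t)              # block 105: record in black box log
        recent = [b for b in black_box if t - b <= storm_window]
        if len(recent) > storm_entries:  # block 106: predetermined standard met?
            return t                     # block 107: BMC system log, then end
    return None


# One event per second: every third event adds a black box entry.
burst = run_monitor([float(i) for i in range(18)])
# The same event count spread out over hours never forms a storm.
sparse = run_monitor([i * 1000.0 for i in range(30)])
```

In the burst case the sixth black box entry, written at the eighteenth event, is what triggers the system log record; in the sparse case no two black box entries ever fall inside the same half-hour window.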
It should be particularly noted that the steps in the embodiments of the method for monitoring server memory may be interleaved, replaced, added, or deleted. Methods for monitoring server memory obtained through such reasonable permutations, combinations, and transformations therefore also fall within the scope of protection of the present invention, and the scope of protection of the present invention should not be limited to the embodiments.
In view of the above object, a second aspect of the embodiments of the present invention provides a computer device, comprising: at least one processor; and a memory storing computer instructions executable on the processor, the instructions, when executed by the processor, performing the following steps: S1, receiving error checking and correcting information and counting it; S2, judging whether the number of times error checking and correcting information is received within a continuous first preset duration reaches a first threshold; S3, in response to the number of times error checking and correcting information is received within the continuous first preset duration reaching the first threshold, recording log information into a black box log; S4, judging whether the log information in the black box log reaches a predetermined standard within a continuous second preset duration; and S5, in response to the log information in the black box log reaching the predetermined standard within the continuous second preset duration, recording the log information into a BMC system log to remind a user to replace the memory.
In some embodiments, judging whether the log information in the black box log reaches the predetermined standard within the continuous second preset duration comprises: judging whether the number of times error checking and correcting information is received within the continuous second preset duration reaches a second threshold.
In some embodiments, receiving error checking and correcting information and counting it comprises: the system repairing the memory based on the received error checking and correcting information.
In some embodiments, recording the log information into the black box log in response to the number of times error checking and correcting information is received within the continuous first preset duration reaching the first threshold further comprises: in response to that number reaching the first threshold, triggering an interrupt and sending the interrupt to the BMC.
In some embodiments, the steps further comprise: judging whether the memory has been replaced; and, in response to the memory having been replaced, resetting to zero the number of times error checking and correcting information has been received.
The invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, performs the method as above.
Finally, it should be noted that, as those of ordinary skill in the art can appreciate, all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing related hardware. The program of the method for monitoring server memory can be stored in a computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. The storage medium of the program may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like. The embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments.
Furthermore, the methods disclosed according to embodiments of the present invention may also be implemented as a computer program executed by a processor, which may be stored in a computer-readable storage medium. When executed by a processor, the computer program performs the above-described functions defined in the methods disclosed in the embodiments of the invention.
Further, the above method steps and system elements may also be implemented using a controller and a computer readable storage medium for storing a computer program for causing the controller to implement the functions of the above steps or elements.
Further, it should be appreciated that the computer-readable storage media (e.g., memory) herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. By way of example, and not limitation, nonvolatile memory can include Read Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM), which can act as external cache memory. By way of example and not limitation, RAM is available in a variety of forms such as Synchronous RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to comprise, without being limited to, these and other suitable types of memory.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.
The various illustrative logical blocks, modules, and circuits described in connection with the disclosure herein may be implemented or performed with the following components designed to perform the functions herein: a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of these components. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP, and/or any other such configuration.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The serial numbers of the embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disk, or the like.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to these examples. Within the idea of an embodiment of the invention, technical features in the above embodiment or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (10)

1. A method for monitoring server memory, comprising:
receiving error checking and correcting information and counting;
judging whether the number of times the error checking and correcting information is received within a continuous first preset duration reaches a first threshold;
recording log information into a black box log in response to the number of times the error checking and correcting information is received within the continuous first preset duration reaching the first threshold;
judging whether the log information in the black box log reaches a preset standard within a continuous second preset duration; and
recording the log information into a BMC system log to remind a user to replace the memory in response to the log information in the black box log reaching the preset standard within the continuous second preset duration.
2. The method of claim 1, wherein judging whether the log information in the black box log reaches the preset standard within the continuous second preset duration comprises:
judging whether the number of times the error checking and correcting information is received within the continuous second preset duration reaches a second threshold.
3. The method of claim 1, wherein receiving error checking and correcting information and counting comprises:
repairing, by the system, the memory based on the received error checking and correcting information.
4. The method of claim 1, wherein recording log information into the black box log in response to the number of times the error checking and correcting information is received within the continuous first preset duration reaching the first threshold further comprises:
triggering an interrupt and sending the interrupt to the BMC in response to the number of times the error checking and correcting information is received within the continuous first preset duration reaching the first threshold.
5. The method of claim 1, further comprising:
judging whether the memory has been replaced; and
resetting to zero the number of times the error checking and correcting information has been received in response to the memory having been replaced.
6. A computer device, comprising:
at least one processor; and
a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing the steps of:
receiving error checking and correcting information and counting;
judging whether the number of times the error checking and correcting information is received within a continuous first preset duration reaches a first threshold;
recording log information into a black box log in response to the number of times the error checking and correcting information is received within the continuous first preset duration reaching the first threshold;
judging whether the log information in the black box log reaches a preset standard within a continuous second preset duration; and
recording the log information into a BMC system log to remind a user to replace the memory in response to the log information in the black box log reaching the preset standard within the continuous second preset duration.
7. The computer device of claim 6, wherein judging whether the log information in the black box log reaches the preset standard within the continuous second preset duration comprises:
judging whether the number of times the error checking and correcting information is received within the continuous second preset duration reaches a second threshold.
8. The computer device of claim 6, wherein receiving error checking and correcting information and counting comprises:
repairing, by the system, the memory based on the received error checking and correcting information.
9. The computer device of claim 6, wherein recording log information into the black box log in response to the number of times the error checking and correcting information is received within the continuous first preset duration reaching the first threshold further comprises:
triggering an interrupt and sending the interrupt to the BMC in response to the number of times the error checking and correcting information is received within the continuous first preset duration reaching the first threshold.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, carries out the method of any one of claims 1 to 5.
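
For illustration only, the two-stage escalation recited in claims 1, 2 and 5 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the disclosed embodiments: the class name `EccMonitor`, its field names, the use of seconds as the time unit, and the returned status strings are all hypothetical, and the black box log and BMC system log are modeled as plain in-memory lists. Timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class EccMonitor:
    """Hypothetical two-stage ECC monitor (all names are assumptions)."""
    first_window: float      # first preset duration, in seconds (assumed unit)
    first_threshold: int     # first threshold
    second_window: float     # second preset duration, in seconds (assumed unit)
    second_threshold: int    # second threshold (criterion of claim 2)
    ecc_times: deque = field(default_factory=deque)
    black_box_log: list = field(default_factory=list)   # stand-in for the black box log
    bmc_system_log: list = field(default_factory=list)  # stand-in for the BMC system log

    def _count_since(self, start: float) -> int:
        # Number of stored ECC events strictly newer than `start`.
        return sum(1 for t in self.ecc_times if t > start)

    def on_ecc_event(self, now: float) -> str:
        """Count one ECC event and escalate if a threshold is reached."""
        self.ecc_times.append(now)
        # Forget events that fall outside the longer of the two windows.
        horizon = now - max(self.first_window, self.second_window)
        while self.ecc_times and self.ecc_times[0] <= horizon:
            self.ecc_times.popleft()

        first_count = self._count_since(now - self.first_window)
        if first_count < self.first_threshold:
            return "counted"
        # Stage 1: first threshold reached, record into the black box log.
        self.black_box_log.append((now, first_count))

        second_count = self._count_since(now - self.second_window)
        if second_count < self.second_threshold:
            return "black_box"
        # Stage 2: second threshold reached, record into the BMC system log.
        self.bmc_system_log.append((now, "ECC threshold reached: replace memory"))
        return "bmc_system_log"

    def on_memory_replaced(self) -> None:
        # Claim 5: reset the ECC count once the memory has been replaced.
        self.ecc_times.clear()
```

With, say, `EccMonitor(first_window=60, first_threshold=3, second_window=3600, second_threshold=5)`, a short burst of events only reaches the black box log, and the BMC system log is written once enough events have accumulated within the longer window. Keeping the second check behind the first means the user-visible log stays quiet for isolated bursts, which appears to be the intent of the staged design.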
CN201910812159.6A 2019-08-30 2019-08-30 Method and device for monitoring server memory and readable medium Withdrawn CN110674005A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910812159.6A CN110674005A (en) 2019-08-30 2019-08-30 Method and device for monitoring server memory and readable medium

Publications (1)

Publication Number Publication Date
CN110674005A true CN110674005A (en) 2020-01-10

Family

ID=69075903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910812159.6A Withdrawn CN110674005A (en) 2019-08-30 2019-08-30 Method and device for monitoring server memory and readable medium

Country Status (1)

Country Link
CN (1) CN110674005A (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111290918A (en) * 2020-02-26 2020-06-16 苏州浪潮智能科技有限公司 Server running state monitoring method and device and computer readable storage medium
CN111367699A (en) * 2020-02-23 2020-07-03 苏州浪潮智能科技有限公司 Method, system, device and medium for processing error information
CN111475361A (en) * 2020-04-17 2020-07-31 苏州浪潮智能科技有限公司 Method, system, equipment and readable medium for testing server video card
CN111930553A (en) * 2020-07-15 2020-11-13 烽火通信科技股份有限公司 Server memory fault early warning method and system
CN112131039A (en) * 2020-09-18 2020-12-25 苏州浪潮智能科技有限公司 Memory ECC information reporting control method, device, equipment and medium
CN112231130A (en) * 2020-09-18 2021-01-15 苏州浪潮智能科技有限公司 Method, system, equipment and medium for positioning fault according to log
CN112306732A (en) * 2020-11-19 2021-02-02 山东云海国创云计算装备产业创新中心有限公司 Automatic error correction control method, device, equipment and medium in server
CN113010338A (en) * 2021-02-19 2021-06-22 山东英信计算机技术有限公司 Error leakage threshold value adjusting method, device, equipment and medium of memory CE
CN113204763A (en) * 2021-04-22 2021-08-03 山东英信计算机技术有限公司 Method, system, storage medium and device for receiving data by CPLD
CN113297038A (en) * 2020-04-07 2021-08-24 阿里巴巴集团控股有限公司 Monitoring and data processing method, device and equipment, monitoring system and data center
CN115543678A (en) * 2022-11-30 2022-12-30 苏州浪潮智能科技有限公司 Method, system, storage medium and equipment for monitoring DDR5 memory granule errors
CN116841836A (en) * 2023-09-01 2023-10-03 四川华鲲振宇智能科技有限责任公司 One-key log collecting tool
CN117076212A (en) * 2023-10-17 2023-11-17 北京卡普拉科技有限公司 Consistency check method, device, medium and equipment for MPI communication data content
WO2024066589A1 (en) * 2022-09-28 2024-04-04 超聚变数字技术有限公司 Processing method for hardware error reporting, and related device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108958965A (en) * 2018-06-28 2018-12-07 郑州云海信息技术有限公司 A kind of BMC monitoring can restore the method, device and equipment of ECC error
CN109328340A (en) * 2017-09-30 2019-02-12 华为技术有限公司 Detection method, device and the server of memory failure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200110