CN110674005A - A method, device and readable medium for monitoring server memory - Google Patents
A method, device and readable medium for monitoring server memory
- Publication number
- CN110674005A CN201910812159.6A
- Authority
- CN
- China
- Prior art keywords
- log
- correction information
- information
- time period
- error checking
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3037—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a memory, e.g. virtual memory, cache
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
- G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
- G06F11/1008—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
- G06F11/1044—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices with specific ECC/EDC distribution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3065—Monitoring arrangements determined by the means or processing involved in reporting the monitored data
- G06F11/3072—Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Mathematical Physics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a method, device and medium for monitoring server memory, comprising: receiving and counting error checking and correction (ECC) information; determining whether the number of times error checking and correction information is received within a continuous first preset time period reaches a first threshold; in response to that number reaching the first threshold within the continuous first preset time period, recording log information in a black-box log; determining whether the log information in the black-box log reaches a predetermined standard within a continuous second preset time period; and, in response to the log information in the black-box log reaching the predetermined standard within the continuous second preset time period, recording the log information in the BMC system log to remind the user to replace the memory. The proposed scheme applies a second stage of filtering to ECC monitoring. It avoids the waste and the impact on customer business caused by replacing memory because of a small number of ECC errors, while ensuring that real faults are handled promptly and that the system keeps running stably.
Description
Technical field
The present invention relates to the field of servers and, more particularly, to a method, a device and a readable medium for monitoring server memory.
Background
With the arrival of the era of big data, cloud computing and artificial intelligence, Internet traffic and data volumes have grown sharply, and data has become the foundation of many key technologies. In server systems, as the amount of data used for storage and computation increases, the load on servers grows, which places ever higher demands on the stability of the entire data link. During data transmission, a small number of link errors can be corrected by the checking mechanism. If the link is unstable, or hardware on the data path is damaged, transmission errors become frequent and accumulate. In mild cases this degrades the server's computing and storage performance; in severe cases it causes hard disks to drop offline and the system to go down. The consequences are interruption of the customer's business and data loss, with losses that cannot be estimated.
In a server system, to keep the system running safely and stably, the memory used must offer high stability and error-correction capability. The relevant technology is memory with ECC (Error Correcting Code, error checking and correction). With ECC, a single data error is corrected automatically and operation continues, whereas two data errors occurring at the same time cannot be corrected and cause the system to go down. As memory frequencies rise, the requirements on signal transmission become stricter, which inevitably increases the number of single data errors. Such errors are tolerated to a certain extent, because a small number of them does not affect system performance or normal operation. If the number of ECC errors is large, however, for example because the memory itself is damaged, an ECC storm occurs within a short time. Handling these errors then consumes considerable resources in a short period, which degrades computing and storage performance and, in severe cases, causes hard disks to drop offline and the system to go down. Current designs monitor ECC and raise an alarm and replace the memory once the ECC count exceeds the BIOS threshold. A small number of ECC errors needs no special handling, so this mechanism incurs high maintenance costs. It is sufficient to monitor for and detect ECC storms, because it is the storm that degrades performance and calls for timely memory replacement. We therefore need to actively monitor for and detect ECC storms and report them promptly in the monitoring log, so that a delayed replacement does not affect the system.
Summary of the invention
In view of this, the purpose of the embodiments of the present invention is to provide a method and apparatus for monitoring server memory. By adding a black-box logging step, the method applies a second stage of filtering to ECC monitoring and thereby achieves ECC storm monitoring, ensuring that real faults are identified. This avoids the waste and the impact on customer business caused by replacing memory because of a small number of ECC errors, while ensuring that real faults are handled promptly, keeping the system stable and improving customer satisfaction and product competitiveness.
To this end, one aspect of the embodiments of the present invention provides a method for monitoring server memory, comprising the following steps: receiving and counting error checking and correction information; determining whether the number of times error checking and correction information is received within a continuous first preset time period reaches a first threshold; in response to that number reaching the first threshold within the continuous first preset time period, recording log information in a black-box log; determining whether the log information in the black-box log reaches a predetermined standard within a continuous second preset time period; and, in response to the log information in the black-box log reaching the predetermined standard within the continuous second preset time period, recording the log information in the BMC (Baseboard Management Controller) system log to remind the user to replace the memory.
In some embodiments, determining whether the log information in the black-box log reaches the predetermined standard within the continuous second preset time period comprises: determining whether the number of times error checking and correction information is received within the continuous second preset time period reaches a second threshold.
In some embodiments, receiving and counting error checking and correction information comprises: the system repairing the memory according to the received error checking and correction information.
In some embodiments, recording log information in the black-box log in response to the number of times error checking and correction information is received within the continuous first preset time period reaching the first threshold further comprises: in response to that number reaching the first threshold within the continuous first preset time period, triggering an interrupt and sending the interrupt to the BMC.
In some embodiments, the method further comprises: determining whether the memory has been replaced; and, in response to the memory having been replaced, resetting to zero the number of times error checking and correction information has been received.
Another aspect of the embodiments of the present invention provides a computer device, comprising: at least one processor; and a memory storing computer instructions executable on the processor, the instructions being executed by the processor to implement the following steps: receiving and counting error checking and correction information; determining whether the number of times error checking and correction information is received within a continuous first preset time period reaches a first threshold; in response to that number reaching the first threshold within the continuous first preset time period, recording log information in a black-box log; determining whether the log information in the black-box log reaches a predetermined standard within a continuous second preset time period; and, in response to the log information in the black-box log reaching the predetermined standard within the continuous second preset time period, recording the log information in the BMC system log to remind the user to replace the memory.
In some embodiments, determining whether the log information in the black-box log reaches the predetermined standard within the continuous second preset time period comprises: determining whether the number of times error checking and correction information is received within the continuous second preset time period reaches a second threshold.
In some embodiments, receiving and counting error checking and correction information comprises: the system repairing the memory according to the received error checking and correction information.
In some embodiments, recording log information in the black-box log in response to the number of times error checking and correction information is received within the continuous first preset time period reaching the first threshold further comprises: in response to that number reaching the first threshold within the continuous first preset time period, triggering an interrupt and sending the interrupt to the BMC.
A further aspect of the embodiments of the present invention provides a computer-readable storage medium storing a computer program that, when executed by a processor, performs the method described above.
The invention has the following beneficial technical effects: by adding a black-box logging step, it applies a second stage of filtering to ECC monitoring and achieves ECC storm monitoring, ensuring that real faults are identified. This avoids the waste and the impact on customer business caused by replacing memory because of a small number of ECC errors, while ensuring that real faults are handled promptly, keeping the system stable and improving customer satisfaction and product competitiveness.
Description of the drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can derive other embodiments from these drawings without creative effort.
FIG. 1 is a schematic diagram of an embodiment of the method for monitoring server memory provided by the present invention;
FIG. 2 is a flowchart of an embodiment of the method for monitoring server memory provided by the present invention.
Detailed description
To make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that all uses of "first" and "second" in the embodiments of the present invention serve to distinguish two non-identical entities or parameters that share the same name. "First" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments of the present invention; subsequent embodiments do not repeat this explanation.
To this end, a first aspect of the embodiments of the present invention provides an embodiment of a method for monitoring server memory. FIG. 1 shows a schematic diagram of an embodiment of the method for monitoring server memory provided by the present invention. As shown in FIG. 1, the embodiment includes the following steps:
S1. Receive and count error checking and correction information;
S2. Determine whether the number of times error checking and correction information is received within a continuous first preset time period reaches a first threshold;
S3. In response to that number reaching the first threshold within the continuous first preset time period, record log information in a black-box log;
S4. Determine whether the log information in the black-box log reaches a predetermined standard within a continuous second preset time period; and
S5. In response to the log information in the black-box log reaching the predetermined standard within the continuous second preset time period, record the log information in the BMC system log to remind the user to replace the memory.
Error checking and correction information is received and counted. A first threshold is set in the BIOS (Basic Input Output System) and the system memory is monitored. When a single data error (for example, one bit) occurs, error checking and correction information is generated; the BIOS receives this information and counts it. In some embodiments, receiving and counting the information includes the system repairing the memory according to the received error checking and correction information.
In some embodiments, the counter may be reset to zero at regular intervals. Because the system can repair some errors internally, the memory remains usable when data errors are infrequent. Without clearing, however, the accumulated count would eventually reach the first threshold. To avoid this, the counter can be cleared once every so often.
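One way to picture the periodic clearing described above is a counter that zeroes itself at a fixed interval, so that only errors arriving close together can reach the first threshold. The sketch below is illustrative only: the class name `PeriodicEccCounter`, the 60-second interval and the threshold of 3 are assumptions, not values from the patent, and timestamps are plain seconds supplied by the caller.

```python
class PeriodicEccCounter:
    """Counts ECC notifications and clears the count at a fixed interval."""

    def __init__(self, reset_interval: float, first_threshold: int) -> None:
        self.reset_interval = reset_interval    # assumed clearing period, in seconds
        self.first_threshold = first_threshold  # assumed "first threshold"
        self.count = 0
        self._interval_start = None

    def record(self, now: float) -> bool:
        """Register one notification; return True once the threshold is reached."""
        # Start a fresh interval (and clear the count) when the old one has expired.
        if self._interval_start is None or now - self._interval_start >= self.reset_interval:
            self._interval_start = now
            self.count = 0
        self.count += 1
        return self.count >= self.first_threshold


# Widely spaced errors never accumulate, because each lands in a new interval.
sparse_counter = PeriodicEccCounter(reset_interval=60.0, first_threshold=3)
sparse = [sparse_counter.record(t) for t in (0.0, 100.0, 200.0, 300.0)]

# A burst inside one interval reaches the threshold on the third error.
burst_counter = PeriodicEccCounter(reset_interval=60.0, first_threshold=3)
burst = [burst_counter.record(t) for t in (0.0, 1.0, 2.0)]
```

A fixed interval is the simplest reading of "cleared once every so often"; a sliding window would avoid missing bursts that straddle an interval boundary, at the cost of storing timestamps.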
Next, it is determined whether the number of times error checking and correction information is received within the continuous first preset time period reaches the first threshold. In response to that number reaching the first threshold, log information is recorded in the black-box log. In some embodiments, this further includes: in response to the number reaching the first threshold within the continuous first preset time period, triggering an interrupt and sending the interrupt to the BMC. After receiving the interrupt, the BMC records it in the black-box log.
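The hand-off from BIOS to BMC can be modelled as a callback: the BIOS side counts errors and "raises an interrupt" when the first threshold is reached, and the BMC side writes only to the black-box log. This is a minimal sketch under stated assumptions. The `Bios` and `Bmc` classes, the record fields, the slot name `DIMM_A0`, and the choice to restart counting after each interrupt are all hypothetical; a real interrupt path is firmware-specific and is not described in the patent.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Bmc:
    black_box_log: List[Dict] = field(default_factory=list)  # not shown to the user
    system_log: List[Dict] = field(default_factory=list)     # user-visible log

    def on_ecc_interrupt(self, timestamp: float, dimm: str) -> None:
        # Stage one: record in the black-box log only; the system log is untouched.
        self.black_box_log.append({"time": timestamp, "dimm": dimm})


@dataclass
class Bios:
    first_threshold: int
    raise_interrupt: Callable[[float, str], None]  # stands in for the real interrupt
    count: int = 0

    def on_correctable_error(self, timestamp: float, dimm: str) -> None:
        self.count += 1
        if self.count >= self.first_threshold:
            self.raise_interrupt(timestamp, dimm)
            self.count = 0  # assumption: counting restarts after each interrupt


bmc = Bmc()
bios = Bios(first_threshold=3, raise_interrupt=bmc.on_ecc_interrupt)
for second in range(7):
    bios.on_correctable_error(float(second), "DIMM_A0")
```

With seven errors and a threshold of three, the BMC receives two interrupts (at the third and sixth error) and the user-visible system log stays empty.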
It is then determined whether the log information in the black-box log reaches the predetermined standard within the continuous second preset time period. In some embodiments, this includes determining whether the number of times error checking and correction information is received within the continuous second preset time period reaches a second threshold. The number of records written to the black-box log within a given period can be counted; for example, when more than 5 errors are reported within half an hour, an ECC storm is considered to exist. In response to the log information in the black-box log reaching the predetermined standard, the log information is recorded in the BMC system log to remind the user to replace the memory.
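The half-hour example can be expressed as a small predicate over the timestamps of black-box records. The sketch below uses the figures from the text (a 1800-second window and "more than 5" records); the function name `is_ecc_storm` and the representation of records as plain timestamps in seconds are assumptions made for illustration.

```python
from typing import Iterable


def is_ecc_storm(
    record_times: Iterable[float],
    now: float,
    window_seconds: float = 1800.0,  # half an hour, as in the example
    max_records: int = 5,            # a storm needs MORE than this many records
) -> bool:
    """Return True when the black-box log holds more than max_records recent entries."""
    recent = [t for t in record_times if 0 <= now - t <= window_seconds]
    return len(recent) > max_records


# Six records inside 30 minutes: treated as a storm.
six_in_window = is_ecc_storm([0, 300, 600, 900, 1200, 1500], now=1500)

# Five records inside 30 minutes: not more than 5, so no storm.
five_in_window = is_ecc_storm([0, 300, 600, 900, 1200], now=1200)

# Six records spread over about two hours: only two fall in the window.
six_spread_out = is_ecc_storm([0, 1500, 3000, 4500, 6000, 7500], now=7500)
```

Note the strict comparison: the text says "greater than 5", so exactly five records in the window does not count as a storm.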
In some embodiments, the above method further includes: determining whether the memory has been replaced; and, in response to the memory having been replaced, resetting to zero the number of times error checking and correction information has been received.
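The patent does not say how a replacement is detected. One plausible approach, assumed here purely for illustration, is to compare the serial number reported for each memory slot against the previously seen value and clear that slot's count when it changes. The class `DimmErrorTracker`, its method names, and the slot and serial strings are hypothetical.

```python
from typing import Dict


class DimmErrorTracker:
    """Keeps a per-slot ECC count and clears it when the module in a slot changes."""

    def __init__(self) -> None:
        self._serial_by_slot: Dict[str, str] = {}
        self._count_by_slot: Dict[str, int] = {}

    def record_error(self, slot: str) -> int:
        """Count one ECC notification for the slot and return the new total."""
        self._count_by_slot[slot] = self._count_by_slot.get(slot, 0) + 1
        return self._count_by_slot[slot]

    def observe_serial(self, slot: str, serial: str) -> bool:
        """Record the serial seen in a slot; return True if a replacement was detected."""
        previous = self._serial_by_slot.get(slot)
        self._serial_by_slot[slot] = serial
        if previous is not None and previous != serial:
            self._count_by_slot[slot] = 0  # memory replaced: zero the count
            return True
        return False


tracker = DimmErrorTracker()
first_seen = tracker.observe_serial("A0", "SN-OLD")   # first sighting, not a replacement
tracker.record_error("A0")
tracker.record_error("A0")
unchanged = tracker.observe_serial("A0", "SN-OLD")    # same module, count is kept
count_before = tracker.record_error("A0")
replaced = tracker.observe_serial("A0", "SN-NEW")     # different module, count cleared
count_after = tracker.record_error("A0")
```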
The invention proposes a method for storm monitoring that uses the BIOS and a black-box log. It optimizes the existing mechanism by monitoring at two levels, the BIOS and the black-box log. Specifically: first, an ECC threshold is set in the BIOS, and when the threshold is reached an interrupt is triggered and sent to the BMC. Second, on receiving the interrupt, the BMC first records it in the black-box log; at this point the user is not aware that ECC errors have occurred. Finally, the number of records in the black-box log within a given period is counted; for example, when more than 5 errors are reported within half an hour, an ECC storm is considered to exist and is recorded in the BMC system log, at which point the end user can see the ECC errors. The decision to replace the memory can then be made by checking whether the BMC system log contains the corresponding error. This avoids the waste and the impact on customer business caused by replacing memory because of a small number of ECC errors, while ensuring that real faults are handled promptly, keeping the system stable and improving customer satisfaction and product competitiveness.
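FIG. 2 shows a flowchart of an embodiment of the method for monitoring server memory provided by the present invention. As shown in FIG. 2, the flow starts at block 101 and proceeds to block 102, where error checking and correction information is received and counted. It then proceeds to block 103, where the system repairs the memory according to the received error checking and correction information, and then to block 104, which determines whether the number of times the information has been received reaches the first threshold. If so, the flow proceeds to block 105 and records log information in the black-box log; if not, it returns to block 102. After the log information has been recorded in the black-box log, the flow may proceed to block 106, which determines whether the log information in the black-box log reaches the predetermined standard. If so, the flow proceeds to block 107, records the log information in the BMC system log, and then ends at block 108; if not, it returns to block 105.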
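The flowchart can be traced as a single loop over incoming ECC notifications, with comments marking the corresponding blocks. This is a sketch under stated assumptions rather than the patented implementation: the function `run_monitor`, its parameters, the log message text, and the choice to restart the count after each black-box record are illustrative. Like the flowchart itself, it omits the first preset time period, and block 103 (repair) is left to the hardware.

```python
from typing import List, Sequence, Tuple


def run_monitor(
    ecc_times: Sequence[float],
    first_threshold: int,
    storm_window: float,
    storm_records: int,
) -> Tuple[List[float], List[str]]:
    """Replay ECC notification timestamps through the two-stage filter."""
    black_box_log: List[float] = []
    system_log: List[str] = []
    count = 0
    for now in ecc_times:                # block 102: receive and count
        count += 1
        # Block 103: the error is corrected by the system; nothing to model here.
        if count < first_threshold:      # block 104: threshold not reached
            continue                     # back to block 102
        count = 0                        # assumption: restart the count
        black_box_log.append(now)        # block 105: black-box record
        recent = [t for t in black_box_log if now - t <= storm_window]
        if len(recent) > storm_records:  # block 106: predetermined standard met
            system_log.append(f"ECC storm at t={now}: replace memory")  # block 107
            break                        # block 108: end
    return black_box_log, system_log


# Twelve errors one second apart, threshold 2: six black-box records, then a storm.
storm_box, storm_sys = run_monitor(
    [float(i) for i in range(12)], first_threshold=2, storm_window=1800.0, storm_records=5
)

# Ten errors give only five black-box records, so nothing reaches the system log.
quiet_box, quiet_sys = run_monitor(
    [float(i) for i in range(10)], first_threshold=2, storm_window=1800.0, storm_records=5
)
```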
It should be particularly noted that the steps in the various embodiments of the above method for monitoring server memory can be interleaved, replaced, added or deleted. These reasonable permutations and combinations of the method therefore also fall within the protection scope of the present invention, and the protection scope of the present invention should not be limited to the embodiments.
To this end, a second aspect of the embodiments of the present invention provides a computer device, comprising: at least one processor; and a memory storing computer instructions executable on the processor, the instructions being executed by the processor to implement the following steps: S1, receiving and counting error checking and correction information; S2, determining whether the number of times error checking and correction information is received within a continuous first preset time period reaches a first threshold; S3, in response to that number reaching the first threshold within the continuous first preset time period, recording log information in a black-box log; S4, determining whether the log information in the black-box log reaches a predetermined standard within a continuous second preset time period; and S5, in response to the log information in the black-box log reaching the predetermined standard within the continuous second preset time period, recording the log information in the BMC system log to remind the user to replace the memory.
In some embodiments, determining whether the log information in the black-box log reaches the predetermined standard within the continuous second preset time period comprises: determining whether the number of times error checking and correction information is received within the continuous second preset time period reaches a second threshold.
In some embodiments, receiving and counting error checking and correction information comprises: the system repairing the memory according to the received error checking and correction information.
In some embodiments, recording log information in the black-box log in response to the number of times error checking and correction information is received within the continuous first preset time period reaching the first threshold further comprises: in response to that number reaching the first threshold within the continuous first preset time period, triggering an interrupt and sending the interrupt to the BMC.
In some embodiments, the steps further comprise: determining whether the memory has been replaced; and, in response to the memory having been replaced, resetting to zero the number of times error checking and correction information has been received.
本发明还提供了一种计算机可读存储介质,计算机可读存储介质存储有被处理器执行时执行如上方法的计算机程序。The present invention also provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program that executes the above method when executed by a processor.
最后需要说明的是,本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,可以通过计算机程序来指令相关硬件来完成,监控服务器内存的方法的程序可存储于一计算机可读取存储介质中,该程序在执行时,可包括如上述各方法的实施例的流程。其中,程序的存储介质可为磁碟、光盘、只读存储记忆体(ROM)或随机存储记忆体(RAM)等。上述计算机程序的实施例,可以达到与之对应的前述任意方法实施例相同或者相类似的效果。Finally, it should be noted that those of ordinary skill in the art can understand that all or part of the process in the method of the above-mentioned embodiments can be implemented by instructing the relevant hardware through a computer program, and the program of the method for monitoring the memory of the server can be stored in a computer readable In the storage medium, when the program is executed, it may include the processes of the foregoing method embodiments. Wherein, the storage medium of the program may be a magnetic disk, an optical disk, a read only memory (ROM) or a random access memory (RAM) or the like. The above computer program embodiments can achieve the same or similar effects as any of the foregoing method embodiments corresponding thereto.
In addition, the method disclosed in the embodiments of the present invention may also be implemented as a computer program executed by a processor, and the computer program may be stored in a computer-readable storage medium. When the computer program is executed by the processor, it performs the functions defined in the method disclosed in the embodiments of the present invention.
In addition, the above method steps and system units may also be implemented with a controller and a computer-readable storage medium storing a computer program that causes the controller to implement the functions of those steps or units.
In addition, it should be understood that the computer-readable storage medium (e.g., memory) herein may be volatile memory or non-volatile memory, or may include both. By way of example and not limitation, non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM), which may act as external cache memory. By way of example and not limitation, RAM is available in many forms, such as synchronous RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to include, without being limited to, these and other suitable types of memory.
Those skilled in the art will also appreciate that the various exemplary logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both. To illustrate this interchangeability of hardware and software clearly, various illustrative components, blocks, modules, circuits, and steps have been described generally in terms of their functionality. Whether such functionality is implemented as software or as hardware depends on the specific application and on the design constraints imposed on the overall system. Those skilled in the art may implement the described functionality in various ways for each specific application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosure of the embodiments of the present invention.
The various exemplary logical blocks, modules, and circuits described in connection with the disclosure herein may be implemented or executed with the following components designed to perform the functions described herein: a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of these components. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, for example, a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors in combination with a DSP, and/or any other such configuration.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM, flash memory, ROM, EPROM, EEPROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. In an alternative, the storage medium may be integrated with the processor. The processor and the storage medium may reside in an ASIC, and the ASIC may reside in a user terminal. In another alternative, the processor and the storage medium may reside in the user terminal as discrete components.
In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer. By way of example and not limitation, computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disc storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store required program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer or by a general-purpose or special-purpose processor. Also, any connection is properly termed a computer-readable medium. For example, if software is sent from a website, server, or other remote source using coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. As used herein, magnetic disks and optical discs include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where magnetic disks usually reproduce data magnetically, while optical discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The above are exemplary embodiments of the present disclosure, but it should be noted that various changes and modifications may be made without departing from the scope of the disclosure of the embodiments of the present invention as defined by the claims. The functions, steps, and/or actions of the method claims according to the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements disclosed in the embodiments of the present invention may be described or claimed in the singular, they may also be understood as plural unless explicitly limited to the singular.
It should be understood that, as used herein, the singular form "a" is intended to include the plural form as well, unless the context clearly supports an exception. It should also be understood that "and/or" as used herein refers to any and all possible combinations of one or more of the associated listed items.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
Those of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be carried out by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is exemplary only and is not intended to imply that the scope of the disclosure of the embodiments of the present invention (including the claims) is limited to these examples. Within the idea of the embodiments of the present invention, the technical features of the above embodiments or of different embodiments may also be combined, and many other variations of the different aspects of the embodiments exist as described above; for brevity, they are not provided in detail. Therefore, any omission, modification, equivalent replacement, improvement, or the like made within the spirit and principles of the embodiments of the present invention shall fall within the protection scope of the embodiments of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910812159.6A CN110674005A (en) | 2019-08-30 | 2019-08-30 | A method, device and readable medium for monitoring server memory |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910812159.6A CN110674005A (en) | 2019-08-30 | 2019-08-30 | A method, device and readable medium for monitoring server memory |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110674005A true CN110674005A (en) | 2020-01-10 |
Family
ID=69075903
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910812159.6A Withdrawn CN110674005A (en) | 2019-08-30 | 2019-08-30 | A method, device and readable medium for monitoring server memory |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110674005A (en) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111290918A (en) * | 2020-02-26 | 2020-06-16 | 苏州浪潮智能科技有限公司 | Server running state monitoring method, device and computer-readable storage medium |
CN111367699A (en) * | 2020-02-23 | 2020-07-03 | 苏州浪潮智能科技有限公司 | Method, system, device and medium for processing error information |
CN111475361A (en) * | 2020-04-17 | 2020-07-31 | 苏州浪潮智能科技有限公司 | Test method, system, device and readable medium for server video card |
CN111930553A (en) * | 2020-07-15 | 2020-11-13 | 烽火通信科技股份有限公司 | Server memory fault early warning method and system |
CN112131039A (en) * | 2020-09-18 | 2020-12-25 | 苏州浪潮智能科技有限公司 | A method, device, device and medium for reporting and controlling memory ECC information |
CN112231130A (en) * | 2020-09-18 | 2021-01-15 | 苏州浪潮智能科技有限公司 | A method, system, device and medium for locating faults according to logs |
CN112306732A (en) * | 2020-11-19 | 2021-02-02 | 山东云海国创云计算装备产业创新中心有限公司 | Automatic error correction control method, device, device and medium in a server |
CN113010338A (en) * | 2021-02-19 | 2021-06-22 | 山东英信计算机技术有限公司 | Error leakage threshold value adjusting method, device, equipment and medium of memory CE |
CN113204763A (en) * | 2021-04-22 | 2021-08-03 | 山东英信计算机技术有限公司 | Method, system, storage medium and device for receiving data by CPLD |
CN113297038A (en) * | 2020-04-07 | 2021-08-24 | 阿里巴巴集团控股有限公司 | Monitoring and data processing method, device and equipment, monitoring system and data center |
CN115543678A (en) * | 2022-11-30 | 2022-12-30 | 苏州浪潮智能科技有限公司 | Method, system, storage medium and equipment for monitoring DDR5 memory granule errors |
CN116841836A (en) * | 2023-09-01 | 2023-10-03 | 四川华鲲振宇智能科技有限责任公司 | One-key log collecting tool |
CN117076212A (en) * | 2023-10-17 | 2023-11-17 | 北京卡普拉科技有限公司 | Consistency check method, device, medium and equipment for MPI communication data content |
WO2024066589A1 (en) * | 2022-09-28 | 2024-04-04 | 超聚变数字技术有限公司 | Processing method for hardware error reporting, and related device |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108958965A (en) * | 2018-06-28 | 2018-12-07 | 郑州云海信息技术有限公司 | A kind of BMC monitoring can restore the method, device and equipment of ECC error |
CN109328340A (en) * | 2017-09-30 | 2019-02-12 | 华为技术有限公司 | Memory failure detection method, device and server |
2019-08-30: CN application CN201910812159.6A (CN110674005A), status: Withdrawn (not active)
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111367699A (en) * | 2020-02-23 | 2020-07-03 | 苏州浪潮智能科技有限公司 | Method, system, device and medium for processing error information |
CN111290918A (en) * | 2020-02-26 | 2020-06-16 | 苏州浪潮智能科技有限公司 | Server running state monitoring method, device and computer-readable storage medium |
CN113297038A (en) * | 2020-04-07 | 2021-08-24 | 阿里巴巴集团控股有限公司 | Monitoring and data processing method, device and equipment, monitoring system and data center |
CN111475361A (en) * | 2020-04-17 | 2020-07-31 | 苏州浪潮智能科技有限公司 | Test method, system, device and readable medium for server video card |
CN111930553A (en) * | 2020-07-15 | 2020-11-13 | 烽火通信科技股份有限公司 | Server memory fault early warning method and system |
CN111930553B (en) * | 2020-07-15 | 2022-05-20 | 烽火通信科技股份有限公司 | Server memory fault early warning method and system |
CN112131039B (en) * | 2020-09-18 | 2023-01-10 | 苏州浪潮智能科技有限公司 | A memory ECC information reporting control method, device, equipment and medium |
CN112231130A (en) * | 2020-09-18 | 2021-01-15 | 苏州浪潮智能科技有限公司 | A method, system, device and medium for locating faults according to logs |
CN112131039A (en) * | 2020-09-18 | 2020-12-25 | 苏州浪潮智能科技有限公司 | A method, device, device and medium for reporting and controlling memory ECC information |
CN112306732A (en) * | 2020-11-19 | 2021-02-02 | 山东云海国创云计算装备产业创新中心有限公司 | Automatic error correction control method, device, device and medium in a server |
CN112306732B (en) * | 2020-11-19 | 2023-02-28 | 山东云海国创云计算装备产业创新中心有限公司 | Automatic error correction control method, device, equipment and medium in a server |
CN113010338A (en) * | 2021-02-19 | 2021-06-22 | 山东英信计算机技术有限公司 | Error leakage threshold value adjusting method, device, equipment and medium of memory CE |
CN113204763A (en) * | 2021-04-22 | 2021-08-03 | 山东英信计算机技术有限公司 | Method, system, storage medium and device for receiving data by CPLD |
WO2024066589A1 (en) * | 2022-09-28 | 2024-04-04 | 超聚变数字技术有限公司 | Processing method for hardware error reporting, and related device |
CN115543678A (en) * | 2022-11-30 | 2022-12-30 | 苏州浪潮智能科技有限公司 | Method, system, storage medium and equipment for monitoring DDR5 memory granule errors |
CN116841836A (en) * | 2023-09-01 | 2023-10-03 | 四川华鲲振宇智能科技有限责任公司 | One-key log collecting tool |
CN116841836B (en) * | 2023-09-01 | 2023-11-07 | 四川华鲲振宇智能科技有限责任公司 | One-key log collecting tool |
CN117076212A (en) * | 2023-10-17 | 2023-11-17 | 北京卡普拉科技有限公司 | Consistency check method, device, medium and equipment for MPI communication data content |
CN117076212B (en) * | 2023-10-17 | 2024-02-23 | 北京卡普拉科技有限公司 | Consistency check method, device, medium and equipment for MPI communication data content |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110674005A (en) | A method, device and readable medium for monitoring server memory | |
CN111104238B (en) | CE-based memory diagnosis method, device and medium | |
CN111008091A (en) | A fault processing method, system and related device for memory CE | |
CN111858240B (en) | A monitoring method, system, device and medium of a distributed storage system | |
CN110908838B (en) | Data processing method and device, electronic equipment and storage medium | |
CN111221775B (en) | Processor, cache processing method and electronic equipment | |
CN111124722B (en) | Method, equipment and medium for isolating fault memory | |
CN110825556B (en) | Method and device for judging data storage state in solid state disk | |
CN111090567A (en) | A link alarm method, device and storage medium | |
CN110752953A (en) | A method, device and storage medium for updating the number of connections based on load balancing | |
CN111258856A (en) | A method, system, device and medium for monitoring the running state of a solid state hard disk | |
CN102135925A (en) | Method and device for detecting error check and correcting memory | |
CN110674003A (en) | Early warning method and device for server hardware and storage medium | |
CN110413492B (en) | Method for health assessment of a storage disk, electronic device and computer program product | |
US9086990B2 (en) | Bitline deletion | |
CN108958965B (en) | Method, device and equipment for monitoring recoverable ECC errors by BMC | |
CN117076186A (en) | Memory fault detection method, system, device, medium and server | |
CN113485872A (en) | Fault processing method and device and distributed storage system | |
CN116775362A (en) | Method and system for processing path blocking of redundant array of independent disks | |
US10268598B2 (en) | Primary memory module with record of usage history | |
CN110659147B (en) | Self-repairing method and system based on module self-checking behavior | |
CN111309651A (en) | A method, device, device and readable medium for outputting serial port debugging information | |
CN103390429B (en) | The online test method of a kind of hard disk and server | |
CN116954986A (en) | A memory correctable error storm fault tolerance method, device and medium | |
CN116028253A (en) | A solid state disk error correction optimization method, system, device and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20200110 |