CN110598802B - Memory detection model training method, memory detection method and device - Google Patents


Info

Publication number
CN110598802B
CN110598802B (application CN201910918511.4A)
Authority
CN
China
Prior art keywords
memory
fault
trained
tag
generating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910918511.4A
Other languages
Chinese (zh)
Other versions
CN110598802A (en)
Inventor
叶茂
李靖
叶铮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201911302875.6A priority Critical patent/CN111078479B/en
Priority to CN201910918511.4A priority patent/CN110598802B/en
Publication of CN110598802A publication Critical patent/CN110598802A/en
Application granted granted Critical
Publication of CN110598802B publication Critical patent/CN110598802B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/22Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2273Test methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers

Abstract

The application discloses a method for training a memory detection model, comprising the following steps: obtaining a memory state historical data set; generating a real fault label set according to the memory state historical data set; generating a feature set to be trained according to the memory state historical data set; training the memory detection model to be trained according to the feature set to be trained to obtain a predicted fault label set; and, if the predicted fault label set and the real fault label set meet a model verification condition, determining the memory detection model to be trained as a qualified memory detection model. The application also discloses a memory detection method and a memory detection device. The memory detection model provided by the application can predict memory fault conditions at the granularity of the memory-module level and fully considers the health condition and risk level of each memory module, thereby improving the fault localization accuracy of memory detection.

Description

Memory detection model training method, memory detection method and device
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method for training a memory detection model, a method and an apparatus for memory detection.
Background
With the development of science and technology, computers have entered countless households. The hardware system of a computer consists of an arithmetic unit, a controller, a memory, input devices, and output devices. Memory in a computer is divided into internal memory and external memory. The internal memory stores the programs and data that are currently in use or may be needed at any time. Once a memory error or fault occurs, a program cannot work normally or crashes. Research into possible memory faults is therefore of great significance.
At present, the industry generally adopts a fault matching model to detect memory. That is, error information such as correctable errors (CE), uncorrectable errors (UE), and events related to memory reliability is extracted from server event logs, and sensor data is collected at the system level, mainly including fan speed, instructions per second, memory and network bandwidth, power consumption, clock frequency, and temperature; the fault matching model is then obtained through training.
However, the above fault matching model can only predict UEs at the granularity of the system level. Because system-level prediction is coarse-grained, it is difficult to locate a specific faulty memory module, resulting in low accuracy of memory detection.
Disclosure of Invention
The embodiment of the application provides a memory detection model training method, a memory detection method and a memory detection device.
In view of the above, a first aspect of the present application provides a method for training a memory detection model, including:
acquiring a memory state historical data set, wherein the memory state historical data set comprises M memory state historical data, each memory state historical data corresponds to a dual in-line memory module (DIMM), and M is an integer greater than or equal to 1;
generating a real fault label set according to the memory state historical data set, wherein the real fault label set comprises M real fault labels, and each real fault label corresponds to one DIMM;
generating a feature set to be trained according to the memory state historical data set, wherein the feature set to be trained comprises M features to be trained, each feature to be trained corresponds to one DIMM, and each feature to be trained comprises at least one parameter corresponding to a feature index;
training a memory detection model to be trained according to the feature set to be trained to obtain a predicted fault label set, wherein the predicted fault label set comprises M predicted fault labels, and each predicted fault label corresponds to one DIMM;
and if the predicted fault label set and the real fault label set meet the model verification condition, determining the memory detection model to be trained as a qualified memory detection model.
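The verification step of this first aspect can be sketched as follows. This is a minimal illustration only: the function names, the 1/0 label encoding, and the 0.9 default threshold are assumptions, not values from the patent.

```python
# Hypothetical sketch of the model verification condition: a trained model is
# accepted only if its predicted fault labels match the real fault labels at
# or above a preset matching threshold.

def tag_match_success_rate(predicted_tags, real_tags):
    """Fraction of DIMMs whose predicted fault tag equals the real fault tag."""
    matches = sum(p == r for p, r in zip(predicted_tags, real_tags))
    return matches / len(real_tags)

def meets_verification_condition(predicted_tags, real_tags, threshold=0.9):
    """Model verification condition: match success rate >= preset threshold."""
    return tag_match_success_rate(predicted_tags, real_tags) >= threshold
```

For example, predicted labels [1, 0, 1, 0] against real labels [1, 0, 0, 0] match on 3 of 4 DIMMs, a success rate of 0.75, which would fail a 0.9 threshold.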
A second aspect of the present application provides a method for memory detection, including:
acquiring log data to be detected, wherein the log data to be detected comprises log data corresponding to a dual in-line memory module (DIMM);
if it is detected from the log data to be detected that no uncorrectable error (UE) has occurred on the DIMM, generating a feature vector to be detected according to the DIMM;
obtaining N fault probability scores corresponding to the feature vectors to be detected through a memory detection model, wherein the memory detection model comprises N memory detection submodels, each memory detection submodel corresponds to one fault probability score, and N is an integer greater than or equal to 1;
and if the N fault probability scores meet the fault detection condition, generating a memory detection result corresponding to the DIMM.
A third aspect of the present application provides a method for memory detection, including:
acquiring log data to be detected, wherein the log data to be detected comprises log data corresponding to K dual in-line memory modules (DIMMs), and K is an integer greater than or equal to 1;
if it is detected from the log data to be detected that an uncorrectable error (UE) has occurred on P of the DIMMs, generating a feature vector set to be detected according to the remaining DIMMs, wherein the remaining DIMMs are the DIMMs left after the P DIMMs are removed from the K DIMMs, the remaining DIMMs comprise (K-P) DIMMs, P is an integer greater than or equal to 0 and less than or equal to K, and the feature vector set to be detected comprises (K-P) feature vectors to be detected;
acquiring (K-P) failure probability score sets corresponding to the feature vector sets to be detected through a memory detection model, wherein each failure probability score set comprises N failure probability scores, the memory detection model comprises N memory detection submodels, each memory detection submodel corresponds to one failure probability score, and N is an integer greater than or equal to 1;
determining T DIMMs from the (K-P) DIMMs based on the set of (K-P) failure probability scores, wherein T is an integer greater than or equal to 0 and less than or equal to the (K-P);
and generating a memory detection result according to the T DIMMs and the P DIMMs.
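The third-aspect flow above can be sketched as a short routine. All names are illustrative assumptions: `has_ue` stands in for the UE check on the log data, each submodel is modeled as a callable returning a failure probability score, and the averaging rule and 0.5 threshold are one possible failure detection condition, not values fixed by the patent.

```python
# Hypothetical sketch of the third-aspect detection flow. DIMMs that have
# already shown a UE (the P DIMMs) are reported directly; each remaining DIMM
# is scored by the N submodels, and a DIMM whose mean score reaches the
# threshold is flagged as at risk (one of the T DIMMs).

def detect_memory(dimms, has_ue, submodels, score_threshold=0.5):
    """Return (flagged, ue_dimms): the T flagged DIMMs and the P DIMMs with UEs."""
    ue_dimms = [d for d in dimms if has_ue(d)]        # the P DIMMs
    remaining = [d for d in dimms if not has_ue(d)]   # the (K - P) DIMMs
    flagged = []
    for dimm in remaining:
        scores = [model(dimm) for model in submodels]  # N failure probability scores
        if sum(scores) / len(scores) >= score_threshold:
            flagged.append(dimm)
    return flagged, ue_dimms
```

The memory detection result then covers both the T DIMMs predicted to be at risk and the P DIMMs on which a UE has already occurred.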
The fourth aspect of the present application provides a memory detection model training device, including:
an acquiring module, configured to acquire a memory state historical data set, wherein the memory state historical data set comprises M memory state historical data, each memory state historical data corresponds to a dual in-line memory module (DIMM), and M is an integer greater than or equal to 1;
a generating module, configured to generate a real fault tag set according to the memory state historical data set acquired by the acquiring module, where the real fault tag set includes M real fault tags, and each real fault tag corresponds to one DIMM;
the generating module is further configured to generate a feature set to be trained according to the memory state historical data set, where the feature set to be trained includes M features to be trained, each feature to be trained corresponds to one DIMM, and each feature to be trained includes at least one parameter corresponding to a feature index;
the obtaining module is further configured to train a to-be-trained memory detection model according to the to-be-trained feature set to obtain a predicted fault tag set, where the predicted fault tag set includes M predicted fault tags, and each predicted fault tag corresponds to one DIMM;
and the training module is used for determining the memory detection model to be trained as a qualified memory detection model if the predicted fault label set and the real fault label set acquired by the acquisition module meet a model verification condition.
In one possible design, in a first implementation of the fourth aspect of the embodiments of the present application,
the acquisition module is specifically configured to acquire log data and a memory fault list, where the log data includes information related to a memory, and the memory fault list includes information related to a fault;
acquiring a first DIMM set according to the log data and the memory fault list, wherein the first DIMM set comprises at least one first DIMM, and the first DIMM is a DIMM with uncorrectable error UE;
generating the memory state history data set from the first DIMM set.
In one possible design, in a second implementation of the fourth aspect of the embodiments of the present application,
the acquiring module is specifically configured to acquire a memory fault list, where the memory fault list includes information related to a fault;
acquiring a second DIMM set according to the memory fault list, wherein the second DIMM set comprises at least one second DIMM, and the second DIMM is a DIMM with an established memory fault list;
generating the memory state history data set from the second set of DIMMs.
In one possible design, in a third implementation of the fourth aspect of the embodiments of the present application,
the acquisition module is specifically configured to acquire log data and a memory fault list, where the log data includes information related to a memory, and the memory fault list includes information related to a fault;
acquiring a third DIMM set according to the log data and the memory fault list, wherein the third DIMM set comprises at least one third DIMM, and the third DIMM is a DIMM with uncorrectable error UE or a DIMM with an established memory fault list;
generating the memory state history data set from the third set of DIMMs.
In one possible design, in a fourth implementation of the fourth aspect of the embodiment of the present application,
the generating module is specifically configured to, if the memory state historical data set is generated according to a first DIMM set, generate a real fault tag corresponding to a first DIMM according to a condition that the first DIMM in the first DIMM set generates the UE within a preset time, where the real fault tag is a first tag or a second tag, the first tag indicates that the first DIMM has the UE within the preset time, and the second tag indicates that the first DIMM has no UE within the preset time;
if the memory state historical data set is generated according to a second DIMM set, generating a real fault tag corresponding to the second DIMM according to the condition that a memory fault list is established in a preset time by a second DIMM in the second DIMM set, wherein the real fault tag is a first tag or a second tag, the first tag represents that the memory fault list is established in the preset time by the second DIMM, and the second tag represents that the memory fault list is not established in the preset time by the second DIMM;
if the memory state historical data set is generated according to a third DIMM set, generating a real fault tag corresponding to the third DIMM according to whether a memory fault list has been established within a preset time, or a UE has occurred within the preset time, for the third DIMM in the third DIMM set, wherein the real fault tag is a first tag or a second tag, and the first tag indicates that the third DIMM has an established memory fault list within the preset time or that a UE occurred within the preset time.
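The real-fault-tag generation rules above share one pattern, sketched below as a minimal illustration (the function name and the 1/0 tag encoding are assumptions; the patent only specifies a "first tag" and a "second tag"):

```python
# Hypothetical sketch of real-fault-tag generation (names are illustrative).
# A DIMM receives the first tag (encoded 1, faulty) if, within the preset
# time window, a UE occurred or a memory fault list was established for it;
# otherwise it receives the second tag (encoded 0).

def real_fault_tag(ue_in_window, fault_list_in_window):
    """Return 1 (first tag) or 0 (second tag) for one DIMM."""
    return 1 if (ue_in_window or fault_list_in_window) else 0
```

Depending on which DIMM set the historical data came from, one or both of the two conditions is consulted; the third-set case above uses the disjunction of both.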
In one possible design, in a fifth implementation form of the fourth aspect of the embodiments of the present application,
the generating module is specifically configured to obtain parameters corresponding to Q feature indicators according to the memory state historical data in the memory state historical data set, where the Q feature indicators include at least one of a correctable error CE number, a cell number, a row number, a column number, and a hard error number, and Q is an integer greater than or equal to 1;
and generating the characteristics to be trained corresponding to the DIMM according to the parameters corresponding to the Q characteristic indexes.
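Feature construction from the Q indicators named above can be sketched as follows. The record field names and the rule for counting a hard error (a cell that errs more than once) are illustrative assumptions; the patent names the indicators but not how they are computed.

```python
from collections import Counter

# Hypothetical feature extraction for one DIMM from its CE history records.
# Each record is assumed to carry the row and column of one correctable error.

def build_feature(ce_records):
    """Feature vector over the indicators named above: CE count, distinct
    cell count, distinct row count, distinct column count, hard-error count."""
    cells = [(rec["row"], rec["col"]) for rec in ce_records]
    rows = {rec["row"] for rec in ce_records}
    cols = {rec["col"] for rec in ce_records}
    # Illustrative rule: a cell seen erring more than once counts as hard.
    hard_errors = sum(1 for n in Counter(cells).values() if n > 1)
    return (len(ce_records), len(set(cells)), len(rows), len(cols), hard_errors)
```

Each DIMM's feature vector built this way would then form one element of the feature set to be trained.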
In one possible design, in a sixth implementation form of the fourth aspect of the embodiment of the present application,
the generating module is specifically configured to generate a plurality of feature subsets to be trained according to the memory state historical data set, where the feature subsets to be trained belong to the feature sets to be trained;
the obtaining module is specifically configured to obtain, through the to-be-trained memory detection submodel, a predicted fault label subset corresponding to a to-be-trained feature subset, where the predicted fault label subset belongs to the predicted fault label set, and the to-be-trained memory detection submodel belongs to one of the to-be-trained memory detection models;
the training module is specifically configured to generate a memory detection submodel according to the memory detection submodel to be trained if the predicted fault tag subset and the real fault tag subset satisfy the model verification condition, where the memory detection submodel belongs to one of the memory detection submodels.
In a possible design, in a seventh implementation manner of the fourth aspect of the embodiment of the present application, the memory detection model training apparatus further includes a determining module;
the determining module is used for training a to-be-trained memory detection model according to the to-be-trained feature set by the obtaining module to obtain a predicted fault label set, and then determining the label matching success rate according to the predicted fault label set and a real fault label set;
the determining module is further configured to determine that the predicted failure tag set and the real failure tag set meet the model verification condition if the tag matching success rate is greater than or equal to a preset matching threshold.
A fifth aspect of the present application provides a memory detection apparatus, including:
the acquisition module is used for acquiring log data to be detected, wherein the log data to be detected comprises log data corresponding to a dual in-line memory module (DIMM);
a generating module, configured to generate a feature vector to be detected according to the DIMM if it is detected, from the log data to be detected acquired by the acquiring module, that no uncorrectable error (UE) has occurred on the DIMM;
the obtaining module is further configured to obtain, through a memory detection model, N failure probability scores corresponding to the to-be-detected feature vector generated by the generating module, where the memory detection model includes N memory detection submodels, each memory detection submodel corresponds to one failure probability score, and N is an integer greater than or equal to 1;
the generating module is further configured to generate a memory detection result corresponding to the DIMM if the N failure probability scores obtained by the obtaining module satisfy a failure detection condition.
In one possible design, in a first implementation manner of the fifth aspect of the embodiment of the present application, the memory detection apparatus further includes a determining module;
the obtaining module is further configured to determine a failure probability average value according to the N failure probability scores after obtaining the N failure probability scores corresponding to the feature vector to be detected through a memory detection model;
the determining module is configured to determine that the N failure probability scores satisfy the failure detection condition if the failure probability average value obtained by the obtaining module is greater than or equal to a failure probability threshold;
the determining module is further configured to determine that the N failure probability scores do not satisfy the failure detection condition if the failure probability average value acquired by the acquiring module is smaller than the failure probability threshold.
In a possible design, in a second implementation manner of the fifth aspect of the embodiment of the present application, the memory detection apparatus further includes a determining module and a processing module;
the determining module is further configured to determine a risk level according to the memory detection result after the generating module generates the memory detection result corresponding to the DIMM;
the processing module is configured to perform replacement processing on a memory if the risk level determined by the determining module is a high risk level, where the memory includes the DIMM;
the processing module is further configured to perform data migration on data in a memory if the risk level determined by the determining module is a medium risk level, where the memory includes the DIMM;
the processing module is further configured to perform data migration on core data in a memory if the risk level determined by the determining module is a low risk level, where the memory includes the DIMM.
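The risk-level handling described in this design reduces to a simple mapping, sketched below; the level names and action identifiers are illustrative assumptions.

```python
# Hypothetical mapping from the determined risk level to the handling
# strategy described above (action names are illustrative).

def handle_risk(level):
    """High risk: replace the memory; medium risk: migrate all data off it;
    low risk: migrate only the core data."""
    actions = {
        "high": "replace_memory",
        "medium": "migrate_all_data",
        "low": "migrate_core_data",
    }
    return actions[level]
```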
In a possible design, in a third implementation manner of the fifth aspect of the embodiment of the present application, the memory detection apparatus further includes a processing module;
the processing module is configured to, after the generating module generates the memory detection result corresponding to the DIMM, perform data migration on data in a memory according to a first processing instruction if the first processing instruction is received, where the memory includes the DIMM;
the processing module is further configured to, if a second processing instruction is received, perform replacement processing on a memory according to the second processing instruction, where the memory includes the DIMM.
A sixth aspect of the present application provides a memory detection apparatus, including:
an acquiring module, configured to acquire log data to be detected, wherein the log data to be detected comprises log data corresponding to K dual in-line memory modules (DIMMs), and K is an integer greater than or equal to 1;
a generating module, configured to generate a feature vector set to be detected according to remaining DIMMs if uncorrectable errors of P DIMMs are detected according to the log data to be detected acquired by the acquiring module, where the remaining DIMMs are the remaining DIMMs excluding the P DIMMs from the K DIMMs, the remaining DIMMs include (K-P) DIMMs, P is an integer greater than or equal to 0 and less than or equal to K, and the feature vector set to be detected includes (K-P) feature vectors to be detected;
the obtaining module is further configured to obtain (K-P) failure probability score sets corresponding to the feature vector set to be detected generated by the generating module through a memory detection model, where each failure probability score set includes N failure probability scores, the memory detection model includes N memory detection submodels, each memory detection submodel corresponds to one failure probability score, and N is an integer greater than or equal to 1;
a determining module, configured to determine T DIMMs from the (K-P) DIMMs according to the (K-P) sets of failure probability scores obtained by the obtaining module, where T is an integer greater than or equal to 0 and less than or equal to (K-P);
the generating module is further configured to generate a memory detection result according to the T DIMMs and the P DIMMs determined by the determining module.
A seventh aspect of the present application provides a node, comprising: a memory, a transceiver, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor is used for executing the program in the memory and comprises the following steps:
acquiring a memory state historical data set, wherein the memory state historical data set comprises M memory state historical data, each memory state historical data corresponds to a dual in-line memory module (DIMM), and M is an integer greater than or equal to 1;
generating a real fault label set according to the memory state historical data set, wherein the real fault label set comprises M real fault labels, and each real fault label corresponds to one DIMM;
generating a feature set to be trained according to the memory state historical data set, wherein the feature set to be trained comprises M features to be trained, each feature to be trained corresponds to one DIMM, and each feature to be trained comprises at least one parameter corresponding to a feature index;
training a memory detection model to be trained according to the feature set to be trained to obtain a predicted fault label set, wherein the predicted fault label set comprises M predicted fault labels, and each predicted fault label corresponds to one DIMM;
if the predicted fault label set and the real fault label set meet the model verification condition, determining the memory detection model to be trained as a qualified memory detection model;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
An eighth aspect of the present application provides a node, comprising: a memory, a transceiver, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor is used for executing the program in the memory and comprises the following steps:
acquiring log data to be detected, wherein the log data to be detected comprises log data corresponding to a dual in-line memory module (DIMM);
if it is detected from the log data to be detected that no uncorrectable error (UE) has occurred on the DIMM, generating a feature vector to be detected according to the DIMM;
obtaining N fault probability scores corresponding to the feature vectors to be detected through a memory detection model, wherein the memory detection model comprises N memory detection submodels, each memory detection submodel corresponds to one fault probability score, and N is an integer greater than or equal to 1;
if the N failure probability scores meet the failure detection condition, generating a memory detection result corresponding to the DIMM;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
A ninth aspect of the present application provides a node, comprising: a memory, a transceiver, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor is used for executing the program in the memory and comprises the following steps:
acquiring log data to be detected, wherein the log data to be detected comprises log data corresponding to K dual in-line memory modules (DIMMs), and K is an integer greater than or equal to 1;
if it is detected from the log data to be detected that an uncorrectable error (UE) has occurred on P of the DIMMs, generating a feature vector set to be detected according to the remaining DIMMs, wherein the remaining DIMMs are the DIMMs left after the P DIMMs are removed from the K DIMMs, the remaining DIMMs comprise (K-P) DIMMs, P is an integer greater than or equal to 0 and less than or equal to K, and the feature vector set to be detected comprises (K-P) feature vectors to be detected;
acquiring (K-P) failure probability score sets corresponding to the feature vector sets to be detected through a memory detection model, wherein each failure probability score set comprises N failure probability scores, the memory detection model comprises N memory detection submodels, each memory detection submodel corresponds to one failure probability score, and N is an integer greater than or equal to 1;
determining T DIMMs from the (K-P) DIMMs based on the set of (K-P) failure probability scores, wherein T is an integer greater than or equal to 0 and less than or equal to the (K-P);
generating memory detection results according to the T DIMMs and the P DIMMs;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
A tenth aspect of the present application provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to perform the method of the above-described aspects.
According to the technical scheme, the embodiment of the application has the following advantages:
In the embodiment of the application, a method for training a memory detection model is provided: first, a memory state historical data set is obtained; then, a real fault label set and a feature set to be trained are generated according to the memory state historical data set; the memory detection model to be trained is trained according to the feature set to be trained to obtain a predicted fault label set; and, if the predicted fault label set and the real fault label set meet the model verification condition, the memory detection model to be trained is determined as a qualified memory detection model. In this way, the memory detection model is trained based on the feature parameters of the DIMM, so that it can predict memory fault conditions at the granularity of the memory-module level and fully consider the health condition and risk level of each DIMM, thereby improving the fault localization accuracy of memory detection.
Drawings
FIG. 1A is a block diagram of an embodiment of a data sharing system;
FIG. 1B is a block chain structure diagram according to an embodiment of the present disclosure;
FIG. 1C is a diagram of an embodiment of a new block generation process in an embodiment of the present application;
FIG. 2 is a block diagram of an embodiment of a memory detection system;
FIG. 3 is a schematic diagram of an embodiment of a method for training a memory detection model in an embodiment of the present application;
fig. 4 is a schematic diagram of an embodiment of selecting memory state history data in the embodiment of the present application;
fig. 5 is a schematic diagram of another embodiment of selecting memory state history data in the embodiment of the present application;
fig. 6 is a schematic diagram of another embodiment of selecting memory state history data in the embodiment of the present application;
FIG. 7 is a schematic diagram of a training process of a memory detection submodel according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a training process of a memory detection model according to an embodiment of the present application;
fig. 9 is a schematic diagram of an embodiment of a method for memory detection in an embodiment of the present application;
fig. 10 is a schematic interface diagram of an alarm prompt based on a memory detection result in the embodiment of the present application;
fig. 11 is another interface diagram for performing alarm prompting based on the memory detection result in the embodiment of the present application;
fig. 12 is another interface diagram for performing alarm prompting based on the memory detection result in the embodiment of the present application;
fig. 13 is a schematic diagram of another embodiment of a memory detection method in the embodiment of the present application;
FIG. 14 is a schematic diagram of a prediction process of the memory detection model in the embodiment of the present application;
fig. 15 is a schematic diagram comparing ROC curves when UE occurrence or memory fault list establishment is used as the tag in the embodiment of the present application;
fig. 16 is a schematic diagram comparing Precision-Recall curves when UE occurrence or memory fault list establishment is used as the tag in the embodiment of the present application;
FIG. 17 is a schematic diagram of an embodiment of a memory detection model training apparatus according to the present disclosure;
FIG. 18 is a schematic diagram of another embodiment of a memory detection model training apparatus in an embodiment of the present application;
fig. 19 is a schematic diagram of an embodiment of a memory detection apparatus according to the present application;
fig. 20 is a schematic diagram of another embodiment of a memory detection apparatus according to an embodiment of the present application;
fig. 21 is a schematic diagram of another embodiment of a memory detection apparatus according to an embodiment of the present application;
fig. 22 is a schematic diagram of another embodiment of a memory detection apparatus according to an embodiment of the present application;
fig. 23 is a schematic diagram of an embodiment of a memory detection apparatus according to the present application;
FIG. 24 is a schematic structural diagram of a node in the embodiment of the present application;
fig. 25 is another schematic structural diagram of a node in the embodiment of the present application.
Detailed Description
The embodiment of the application provides a memory detection model training method, a memory detection method and a memory detection device.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "corresponding" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that the Memory detection method provided by the present application can perform fault diagnosis for Dual-Inline Memory Modules (DIMMs) based on Artificial Intelligence (AI) technology. Artificial intelligence is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
It should be understood that the memory detection model training method provided by the application can perform learning based on a Machine Learning (ML) algorithm. Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory. It specially studies how a computer simulates or realizes human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its performance. Machine learning is the core of artificial intelligence, is the fundamental approach for giving computers intelligence, and is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
It should be understood that the memory detection method provided by the present application may be applied to a block chain scenario, that is, each node in the block chain stores a memory detection model, and each node performs fault detection on a memory (in an embodiment, a DIMM) by using the memory detection model to obtain a memory detection result, and then records the memory detection result into a block, where the memory detection result includes, but is not limited to, fault information of the DIMM, a fault occurrence timestamp, and an identifier of the DIMM. It should be understood by those skilled in the art that although the present description is illustrated with a memory in the form of a DIMM, other types of memory may also use the solution proposed by the present invention.
Referring to fig. 1A, fig. 1A is a schematic structural diagram of a data sharing system in an embodiment of the present application, and as shown in the drawing, the data sharing system 100 refers to a system for performing data sharing between nodes, the data sharing system may include a plurality of nodes K1, and a plurality of nodes K1 may refer to respective clients in the data sharing system. Each node K1 may receive input information during normal operation and maintain shared data within the data sharing system based on the received input information. In order to ensure information intercommunication in the data sharing system, information connection can exist between each node in the data sharing system, and information transmission can be carried out between the nodes through the information connection. For example, when an arbitrary node in the data sharing system receives input information, other nodes in the data sharing system acquire the input information according to a consensus algorithm, and store the input information as data in shared data, so that the data stored on all the nodes in the data sharing system are consistent.
Each node in the data sharing system has a node identifier corresponding thereto, and each node in the data sharing system may store a node identifier of another node in the data sharing system, so that the generated block is broadcast to the other node in the data sharing system according to the node identifier of the other node in the following. Each node may maintain a node identifier list as shown in the following table, and store the node name and the node identifier in the node identifier list correspondingly. The node identifier may be an Internet Protocol (IP) address and any other information that can be used to identify the node, and only the IP address is used as an example in table 1.
TABLE 1

Node name    Node identification
Node 1       117.114.151.174
Node 2       117.116.189.145
Node N       119.123.789.258
Each node in the data sharing system stores one identical block chain. Referring to fig. 1B, fig. 1B is a schematic diagram of a block chain in the embodiment of the present application. The block chain is composed of a plurality of blocks. As shown in the figure, the starting block includes a block header and a block body; the block header stores an input information characteristic value, a version number, a timestamp, and a difficulty value, and the block body stores the input information. The next block takes the starting block as its parent block and likewise includes a block header and a block body; its block header stores the input information characteristic value of the current block, the block header characteristic value of the parent block, the version number, the timestamp, and the difficulty value. In this way, the block data stored in each block in the block chain is associated with the block data stored in its parent block, which ensures the security of the input information in the blocks.
When generating each block in the block chain, refer to fig. 1C, which is a schematic diagram of an embodiment of a new block generation process in the embodiment of the present application. As shown in the figure, when the node where the block chain is located receives input information, it checks the input information, stores it in a memory pool after the check is completed, and updates the hash tree used for recording the input information. Then, the update timestamp is set to the time when the input information was received, and different random numbers are tried to calculate the characteristic value repeatedly, until the calculated characteristic value satisfies the following formula:
SHA256(SHA256(version+prev_hash+merkle_root+ntime+nbits+x))<TARGET;
wherein, SHA256 is a characteristic value algorithm used for calculating a characteristic value; version number (version) is version information of related block protocols in the block chain; prev _ hash is a block head characteristic value of a parent block of the current block; merkle _ root is a characteristic value of the input information; ntime is the update time of the update timestamp; nbits is the current difficulty, is a fixed value within a period of time, and is determined again after exceeding a fixed time period; x is a random number; TARGET is a feature threshold, which can be determined from nbits.
Therefore, when a random number satisfying the above formula is obtained through calculation, the information can be stored correspondingly, and the block header and the block body are generated to obtain the current block. Then, the node where the block chain is located sends the newly generated block to the other nodes in its data sharing system according to their node identifiers; the other nodes verify the newly generated block and, after verification is completed, add it to the block chains they store.
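The mining loop described by the formula above can be sketched as follows. This is a minimal illustration, assuming the six fields are simply concatenated as text before hashing (real block-header serialization is more involved); the function names are hypothetical.

```python
import hashlib

def feature_value(version, prev_hash, merkle_root, ntime, nbits, x):
    """Double SHA256 over the concatenated fields, per the formula above."""
    payload = f"{version}{prev_hash}{merkle_root}{ntime}{nbits}{x}".encode()
    return int(hashlib.sha256(hashlib.sha256(payload).digest()).hexdigest(), 16)

def mine(version, prev_hash, merkle_root, ntime, nbits, target):
    """Try successive random-number candidates x until the value drops below TARGET."""
    x = 0
    while feature_value(version, prev_hash, merkle_root, ntime, nbits, x) >= target:
        x += 1
    return x
```

A smaller `target` (higher difficulty, i.e. a larger `nbits`-derived threshold being tightened) makes the loop run longer, which is exactly how the difficulty value in the block header regulates block generation time.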
It should be understood that the memory detection method provided by the present application may also be applied to a centralized service scenario. For ease of understanding, please refer to fig. 2, which is a schematic diagram of the architecture of a memory detection system in an embodiment of the present application. As shown in the drawing, when performing online memory fault prediction, log data is extracted from a server, and the memory-related log of the Error Detection And Correction (EDAC) module is parsed from the log data; the memory-related log includes, but is not limited to, the error type, the system timestamp, and detailed location information at the cell level. First, DIMMs with Uncorrectable Errors (UE) in the memory-related log are placed into a pre-fault pool; then, the memory detection model is used to score the remaining DIMMs, and DIMMs whose score reaches an operator-set threshold are also placed into the pre-fault pool. Finally, according to the risk level of each DIMM in the pre-fault pool and in combination with the operation strategies set by operators, corresponding measures are taken, for example, data and service migration for DIMMs with higher risk and memory replacement for DIMMs with the highest risk.
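The two-stage triage described above (DIMMs with UEs go straight to the pre-fault pool; the rest are scored against an operator-set threshold) can be sketched as below. The record layout and function names are assumptions for illustration, not the application's actual data schema.

```python
def triage(dimms, score_fn, threshold):
    """Split DIMMs into a pre-fault pool and a healthy set.

    dimms: iterable of dicts with 'id', 'has_ue' (bool) and 'features'.
    score_fn: a trained model's scoring function returning a risk in [0, 1].
    """
    pre_fault_pool, healthy = [], []
    for dimm in dimms:
        if dimm["has_ue"]:
            # DIMMs with uncorrectable errors bypass scoring entirely
            pre_fault_pool.append((dimm["id"], 1.0))
            continue
        risk = score_fn(dimm["features"])
        (pre_fault_pool if risk >= threshold else healthy).append((dimm["id"], risk))
    return pre_fault_pool, healthy
```

The pool then feeds the per-risk-level operation strategy (migration, replacement, and so on) chosen by the operators.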
The server can feed back the risk level and the operation suggestion corresponding to the risk coefficient to the client, and the client displays the risk coefficient and the operation suggestion.
It should be noted that the client is disposed on a node, where the node includes but is not limited to a tablet computer, a notebook computer, a palm computer, a mobile phone, a voice interaction device, and a Personal Computer (PC), and is not limited herein. The voice interaction device includes, but is not limited to, an intelligent sound and an intelligent household appliance. The servers may also be nodes, and fig. 2 shows one server, and in practical applications, the number of servers is not limited.
With reference to the above description, a method for training a memory detection model in the present application will be described below, and referring to fig. 3, an embodiment of the method for training a memory detection model in the embodiment of the present application includes:
101. Acquiring a memory state historical data set, wherein the memory state historical data set comprises M memory state historical data, each memory state historical data corresponds to one memory, and M is an integer greater than or equal to 1;
in this embodiment, the memory detection model training device obtains the memory state historical data set, and it is understood that the memory detection model training device is deployed on a node, where the node may be a terminal device or a server, and this is not limited here. The memory state historical data set can be selected from log data, a memory fault list, or both.
The memory status history data set comprises M memory status history data, each memory status history data corresponds to one DIMM, and the DIMM is a failed DIMM or a DIMM capable of working normally.
DIMMs in the present application include, but are not limited to, Registered DIMMs (RDIMMs), Unbuffered DIMMs (UDIMMs), and Load-Reduced DIMMs (LRDIMMs). The RDIMM adds a register between the Central Processing Unit (CPU) and the memory granules for transmission, which reduces the distance of parallel transmission and guarantees its effectiveness. The LRDIMM does not use a complex register but simply buffers, which reduces the electrical load on the motherboard while having little impact on memory performance.
102. Generating a real fault label set according to the memory state historical data set, wherein the real fault label set comprises M real fault labels, and each real fault label corresponds to one memory;
in this embodiment, the node labels each memory state history data in the memory state history data set to obtain M real failure tags, that is, each DIMM corresponds to one real failure tag. In general, a real fault label is "1" indicating that there is a fault, and a real fault label is "0" indicating that there is no fault.
In practical application, a real fault label can be generated in a manual labeling mode, a label labeling rule can be set in a node, and the node automatically completes labeling according to the label labeling rule.
103. Generating a feature set to be trained according to the memory state historical data set, wherein the feature set to be trained comprises M features to be trained, each feature to be trained corresponds to one memory, and each feature to be trained comprises at least one parameter corresponding to a feature index;
in this embodiment, a feature set to be trained is generated based on the memory state historical data set, that is, a corresponding feature to be trained is generated for each memory state historical data, that is, one feature to be trained corresponds to one DIMM. The feature to be trained may include parameters under at least one feature index, each feature index corresponds to one dimension, for example, the feature to be trained may be represented as a 5-dimensional vector (1,1,0,0,1), and each element corresponds to one feature index.
The characteristic indexes include historical accumulated indexes, recent accumulated indexes, and Correctable Error (CE) burst indexes. The historical accumulated indexes refer to the corresponding indexes over the entire time range from history to the present; the recent accumulated indexes refer to the corresponding indexes within a time window pushed back from the present; and the CE burst index refers to cells in which CEs occur more than W times per minute, where W is a preset parameter.
The historical accumulated indexes and the recent accumulated indexes include hard error (Hard Error) indexes and soft error indexes. The hard error indexes include Correctable Errors (CE), UEs, and combinations of CE and UE, and the CE statistics can include the errors (Errors), cells (Cells), row numbers (Rows), and column numbers (Columns) generated by memory scrubbing, as well as data not generated by memory scrubbing.
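As a concrete illustration of the three kinds of indicators, a per-DIMM feature vector might be assembled as below. The event format (minute timestamp, cell identifier), the window sizes, and the function name are assumptions for the sketch, not the application's actual log schema.

```python
from collections import Counter

def build_features(ce_events, now_min, window_min=1440, burst_w=5):
    """Sketch of a per-DIMM feature vector from (minute_timestamp, cell_id)
    correctable-error records (hypothetical format)."""
    hist_ce = len(ce_events)                           # historical cumulative CE count
    hist_cells = len({cell for _, cell in ce_events})  # distinct cells hit so far
    # recent cumulative: CEs inside the trailing window
    recent_ce = sum(1 for t, _ in ce_events if now_min - t <= window_min)
    # CE burst: cells exceeding W CEs within any single minute
    per_minute = Counter((t, cell) for t, cell in ce_events)
    burst_cells = len({cell for (t, cell), n in per_minute.items() if n > burst_w})
    return [hist_ce, hist_cells, recent_ce, burst_cells]
```

Each element of the returned list corresponds to one feature index, i.e. one dimension of the vector described above.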
104. Training a memory detection model to be trained according to a feature set to be trained to obtain a predicted fault label set, wherein the predicted fault label set comprises M predicted fault labels, and each predicted fault label corresponds to one memory;
in this embodiment, the node inputs each feature to be trained in the feature set to be trained to the memory detection model to be trained, the memory detection model to be trained outputs a corresponding predicted fault tag, each DIMM corresponds to one output predicted fault tag, and the predicted fault tag may be a probability value between 0 and 1.
The memory detection model to be trained may be a random forest regression model. The random forest regression model is described below by way of example; in actual training, its parameters may be adjusted, so this is only an illustration and should not be construed as limiting the present application. For example, the random forest regression model can use 10 decision trees, a maximum depth (max_depth) of 10, a maximum branch number of 32, variance as the impurity index, a leaf-node minimum sample number of 1, and a sample collection ratio of 1.0. If the maximum depth of the decision tree is set to 0, the depth of the subtrees is not limited when constructing the optimal model. If the sample size and the number of features are both large, it is recommended to limit the maximum depth; if the sample size or the number of features is small, the maximum depth need not be limited. The maximum branch number refers to the maximum number of branches allowed when training the decision tree. The impurity index refers to the feature selection criterion of the decision tree. The leaf-node minimum sample number (min_samples_leaf) represents the minimum number of samples a leaf node must contain; if a leaf node contains fewer samples than min_samples_leaf, the leaf node and its sibling leaf nodes are pruned, leaving only their parent node. An integer value is used directly as that number, while a floating-point value means taking the smallest integer greater than or equal to (number of samples × min_samples_leaf); the default value of min_samples_leaf is 1. The sample collection ratio, that is, the percentage of the data used for training each tree, defaults to 1.0; reducing it can speed up the training process.
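For reference, the example hyperparameters above could be expressed with scikit-learn's random forest regressor as below. Mapping the patent's terms onto scikit-learn parameter names is our assumption (in particular, reading "maximum branch number" as a cap on leaf nodes), not a statement of the application's actual implementation.

```python
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(
    n_estimators=10,      # 10 decision trees
    max_depth=10,         # maximum depth of each tree
    max_leaf_nodes=32,    # "maximum branch number" read as a leaf-node cap
    min_samples_leaf=1,   # leaf-node minimum sample number
    max_samples=None,     # sample collection ratio of 1.0 (use all samples)
    random_state=0,
)
# The default split criterion minimizes squared error, i.e. variance impurity.
```

After fitting on the feature set to be trained, `model.predict` yields a value per DIMM that can serve as the predicted fault tag between 0 and 1.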
105. And if the predicted fault label set and the real fault label set meet the model verification condition, determining the memory detection model to be trained as a qualified memory detection model.
In this embodiment, the node compares the M predicted fault tags in the predicted fault tag set and the M real fault tags in the real fault tag set, determines whether a model verification condition is satisfied according to a comparison result obtained finally, and if the model verification condition is satisfied, may use a model parameter corresponding to the memory detection model to be trained as a model parameter of the memory detection model, thereby obtaining the memory detection model.
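The text leaves the exact model verification condition open. One common criterion for comparing probability-valued predicted tags with 0/1 real tags is the area under the ROC curve (consistent with the ROC comparison in fig. 15); the sketch below uses an AUC threshold as an assumed stand-in for the verification condition.

```python
def auc_score(real_tags, predicted_tags):
    """AUC via the rank-sum formulation: the probability that a randomly
    chosen faulty DIMM is scored higher than a randomly chosen healthy one."""
    pos = [s for t, s in zip(real_tags, predicted_tags) if t == 1]
    neg = [s for t, s in zip(real_tags, predicted_tags) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def meets_verification(real_tags, predicted_tags, min_auc=0.9):
    """Assumed verification condition: accept the model once AUC >= min_auc."""
    return auc_score(real_tags, predicted_tags) >= min_auc
```

If the condition is not met, training would continue with adjusted parameters rather than freezing the current model parameters.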
In the embodiment of the application, a method for training a memory detection model is provided: a memory state historical data set is first obtained; a real fault label set and a feature set to be trained are then generated from it; the memory detection model to be trained is trained on the feature set to be trained to obtain a predicted fault label set; and if the predicted fault label set and the real fault label set satisfy the model verification condition, the memory detection model to be trained is determined to be the qualified memory detection model. In this way, the memory detection model is trained based on the characteristic parameters of the DIMM, so that it can predict memory fault conditions at the granularity of the memory module level; the health condition and risk level of the DIMM are fully considered, and the fault localization accuracy of memory detection is improved.
Optionally, on the basis of each embodiment corresponding to fig. 3, in a first optional embodiment of the method for training a memory detection model provided in the embodiment of the present application, acquiring a memory state historical data set may include:
acquiring log data and a memory fault list, wherein the log data comprises information related to a memory, and the memory fault list comprises information related to a fault;
acquiring a first DIMM set according to the log data and the memory fault list, wherein the first DIMM set comprises at least one first DIMM, and the first DIMM is a DIMM with uncorrectable error UE;
a memory state history data set is generated from the first set of DIMMs.
In this embodiment, a memory detection model designed based on such a failure mode of the UE is introduced, and a memory state history data set needs to be extracted from log data and a memory failure list for the memory detection model designed based on such a failure mode. For convenience of understanding, referring to fig. 4, fig. 4 is a schematic diagram of an embodiment of selecting memory state history data in the embodiment of the present application, as shown in the figure, EDAC-related log data is selected from a kernel log Dmesg of a server system, where Dmesg is a program for detecting and controlling kernel ring buffering, and is used to help a user know startup information of the system. And extracting a memory fault list from the current network, wherein the memory fault list includes, but is not limited to, machine configuration information, component configuration information, prediction information and fault information, the machine configuration information includes, but is not limited to, a machine Serial Number (SN), an Internet Protocol (IP), a machine type and a service, the component configuration information includes, but is not limited to, an EDAC slot, a physical screen printing slot, a component SN and a component model, the prediction information includes, but is not limited to, prediction time, a risk coefficient and a risk level (operation setting), and the fault information includes, but is not limited to, a fault list Number, a fault type, list establishing time, list ending time and fault description.
Specifically, suppose 6810 DIMMs are involved in the log data and the memory fault list, of which 507 DIMMs have experienced a UE and the remaining 6303 have not. A first DIMM set is then generated based on those 507 DIMMs, that is, each first DIMM is a DIMM in which a UE has occurred. A memory state history data set is generated from the first DIMM set, so the memory state history data set includes at least those 507 memory state history data; in addition to the 507 records with UEs, it may also include at least one memory state history record without a UE.
Secondly, in the embodiment of the present application, a method for generating a memory status history data set is provided, and in the above manner, a modeling can be performed on a DIMM where a UE has occurred, so as to obtain a memory detection model for detecting the UE, thereby improving reliability of UE detection.
Optionally, on the basis of each embodiment corresponding to fig. 3, in a second optional embodiment of the method for training a memory detection model provided in the embodiment of the present application, acquiring a memory state historical data set may include:
acquiring a memory fault list, wherein the memory fault list comprises information related to faults;
acquiring a second DIMM set according to the memory fault list, wherein the second DIMM set comprises at least one second DIMM, and the second DIMM is a DIMM with the established memory fault list;
a memory state history data set is generated from the second set of DIMMs.
In this embodiment, another memory detection model designed based on such a failure mode that a memory failure list is created is introduced, and a memory state historical data set needs to be extracted from the memory failure list for the memory detection model designed based on such a failure mode. For convenience of understanding, please refer to fig. 5, where fig. 5 is a schematic view of another embodiment of selecting memory state history data in the embodiment of the present application, and as shown in the figure, a memory fault list is extracted from a current network, and details of the memory fault list will not be described in detail in this embodiment.
Specifically, suppose 5000 DIMMs are involved in the memory fault list, of which 1200 DIMMs have had a memory fault list created and the remaining 3800 have not. A second DIMM set is then generated based on those 1200 DIMMs, that is, each second DIMM is a DIMM for which a memory fault list has been established. A memory state history data set is generated from the second DIMM set, so the memory state history data set includes at least those 1200 memory state history data; in addition to the 1200 records with established fault lists, it may also include at least one memory state history record for which no memory fault list was established.
Secondly, in the embodiment of the application, another method for generating a memory state historical data set is provided, and through the method, modeling can be performed on the DIMM with the built memory fault list, so that a memory detection model for detecting fault information is obtained, and the reliability of fault information detection is improved.
Optionally, on the basis of each embodiment corresponding to fig. 3, in a third optional embodiment of the method for training a memory detection model provided in the embodiment of the present application, acquiring a memory state historical data set may include:
acquiring log data and a memory fault list, wherein the log data comprises information related to a memory, and the memory fault list comprises information related to a fault;
acquiring a third DIMM set according to the log data and the memory fault list, wherein the third DIMM set comprises at least one third DIMM, and the third DIMM is a DIMM with uncorrectable error UE or a DIMM with an established memory fault list;
a memory state history data set is generated from the third set of DIMMs.
In this embodiment, another memory detection model designed based on the failure mode in which the UE occurs and the failure mode in which the memory failure list is created is introduced, and a memory state history data set needs to be extracted from log data and the memory failure list for the memory detection model designed based on the failure mode. For convenience of understanding, please refer to fig. 6, where fig. 6 is a schematic view of another embodiment of selecting memory state history data in the embodiment of the present application, and as shown in the figure, log data related to EDAC is analyzed within a period of time and a memory fault list is extracted from an existing network, which is not described in detail in this embodiment.
Specifically, the node selects Dmesg logs from a mass production environment, extracts EDAC memory-fault-related log data from them, parses and distinguishes three different types of CE and UE, namely memory scrubbing errors, memory read errors, and memory write errors, and acquires location information and timestamps refined to the cell level, which improves data accuracy. Suppose 7000 DIMMs are involved in the log data and the memory fault list: 1000 DIMMs have experienced a UE, 1500 DIMMs have had a memory fault list created, and the remaining 4500 have neither. A third DIMM set is then generated based on those 2500 DIMMs, that is, each third DIMM is a DIMM for which a memory fault list has been established or in which a UE has occurred. A memory state history data set is generated from the third DIMM set, so it includes at least those 2500 memory state history data; in addition, it may include at least one memory state history record for a DIMM with neither an established fault list nor a UE.
Secondly, in the embodiment of the application, another method for generating a memory state historical data set is provided. In this manner, modeling is performed on DIMMs in which a UE has occurred or for which a memory fault list has been established, and the two different fault modes are combined for training, which yields a better effect than a model trained on a single fault mode alone: the predicted result reflects a comprehensive risk, and the faults covered by the training data are broader. The UEs from EDAC are considered, and so is the memory fault list created in actual operation of the existing network, which covers cases where a CE exceeds a threshold and scenarios such as manual ticket creation after a memory fault directly causes downtime. This prevents situations where using the server event log as the sole source of memory faults misses faults, causes disastrous impacts such as downtime, or leaves no log available. The comprehensiveness of the memory fault definition is therefore improved, and the accuracy and reliability of prediction are improved.
Optionally, on the basis of each embodiment corresponding to fig. 3, in a fourth optional embodiment of the method for training a memory detection model provided in the embodiment of the present application, generating a real fault tag set according to a memory state historical data set may include:
if the memory state historical data set is generated according to the first DIMM set, generating a real fault label corresponding to the first DIMM according to the condition that the UE appears in the first DIMM set within the preset time, wherein the real fault label is a first label or a second label, the first label represents that the UE appears in the first DIMM within the preset time, and the second label represents that the UE does not appear in the first DIMM within the preset time;
if the memory state historical data set is generated according to the second DIMM set, generating a real fault label corresponding to the second DIMM according to the condition that the second DIMM in the second DIMM set establishes the memory fault list within the preset time, wherein the real fault label is a first label or a second label, the first label represents that the second DIMM establishes the memory fault list within the preset time, and the second label represents that the second DIMM does not establish the memory fault list within the preset time;
if the memory state historical data set is generated according to the third DIMM set, generating a real fault label corresponding to the third DIMM according to whether the third DIMM in the third DIMM set establishes a memory fault list within the preset time or a UE appears within the preset time, wherein the real fault label is a first label or a second label, the first label represents that the third DIMM establishes the memory fault list within the preset time or a UE appears within the preset time, and the second label represents that the third DIMM neither establishes a memory fault list nor experiences a UE within the preset time.
In this embodiment, a method for generating real fault labels for different fault modes is introduced. It is assumed here that the feature parameters of the memory state historical data are derived from statistics taken once per hour, that is, only a DIMM in which a CE occurs within a given hour window produces a feature parameter, and the real fault label of each feature parameter is 1 or 0. The real fault labels are divided into three categories; specifically, the following three situations can be included:
in case one, the memory status history data set only includes data related to the DIMM where the UE has appeared;
the node judges whether a UE occurs in the first DIMM within a preset time, where the preset time may be 30 days; that is, it judges whether a UE error reported in the EDAC log data occurs within 30 days after a target hour (a certain statistical hour window) of the first DIMM. If so, the real fault label is the first label (namely, the label is 1), and if not, the real fault label is the second label (namely, the label is 0).
In case two, the memory state history data set only includes data related to the DIMMs with the established memory fault list;
the node judges whether the second DIMM has a memory fault list created within a preset time, where the preset time may be 30 days; that is, it judges whether a memory fault list is created within 30 days after a target hour (a certain statistical hour window) of the second DIMM. If so, the real fault label is the first label (namely, the label is 1), and if not, the real fault label is the second label (namely, the label is 0).
In case three, the memory state history data set comprises data related to the DIMM with the UE and data related to the DIMM with the established memory fault list;
the node judges whether the third DIMM has a UE occurring within a preset time, or a memory fault list created within the preset time, where the preset time may be 30 days; that is, it judges whether a UE error reported in the EDAC log data occurs, or a memory fault list is created, within 30 days after a target hour (a certain statistical hour window) of the third DIMM. If either condition is met, the real fault label is the first label (namely, the label is 1), and if neither condition occurs, the real fault label is the second label (namely, the label is 0).
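The three labelling rules above can be sketched as follows; the dict keys and mode names are illustrative assumptions, not terminology from the embodiment:

```python
def true_fault_label(sample, mode):
    """Return 1 (first label) or 0 (second label) for one DIMM-hour sample.

    sample: dict with two boolean flags describing what happened in the
    30 days after the sample's statistical hour window (illustrative names).
    mode: which of the three fault definitions is in use.
    """
    if mode == "ue_only":            # case one: UE-based fault definition
        return 1 if sample["ue_within_30d"] else 0
    if mode == "fault_list_only":    # case two: fault-list-based definition
        return 1 if sample["fault_list_within_30d"] else 0
    if mode == "combined":           # case three: either condition counts
        return 1 if (sample["ue_within_30d"]
                     or sample["fault_list_within_30d"]) else 0
    raise ValueError(f"unknown mode: {mode}")
```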
In the embodiment of the application, a mode for generating real fault labels for different fault modes is provided. Through this mode, a label setting mechanism is established, so that the labeling operation can be executed automatically, which provides better feasibility and operability for automatically generating real fault labels.
Optionally, on the basis of each embodiment corresponding to fig. 3, in a fifth optional embodiment of the method for training a memory detection model provided in the embodiment of the present application, generating a feature set to be trained according to a memory state historical data set may include:
obtaining parameters corresponding to Q characteristic indexes according to the memory state historical data in the memory state historical data set, wherein the Q characteristic indexes comprise at least one of the number of correctable errors (CEs), the number of cells, the number of rows, the number of columns and the number of hard errors, and Q is an integer greater than or equal to 1;
and generating the characteristics to be trained corresponding to the DIMM according to the parameters corresponding to the Q characteristic indexes.
In this embodiment, a method for generating a feature to be trained is introduced. Considering that CEs and UEs have a certain degree of correlation in time and space, multiple feature indexes are designed to reflect the health condition of a DIMM indirectly, and the correlation between the CEs and the actual memory fault list is compared. Here, a CE refers to a type of error that the error detection mechanism of the system can detect and automatically correct, and a UE refers to a type of error that the system cannot automatically correct because the number of errors exceeds the error correction limit of the system's error detection mechanism.
Specifically, 422 statistical features are designed at the DIMM level according to the distribution of different types of CEs over historical time and recent time, and the 12 features with the highest fault relevance are selected; that is, Q may specifically be 12, and the 12 features cover at least one of the CE count, cell count, row count, column count and hard error count. The node can then train a memory detection model using a random forest regression algorithm according to the change trend and position distribution of the CEs given by the features, periodically predict the fault probability of each DIMM, and achieve a high level of DIMM-level prediction performance on large data sets.
For ease of understanding, please refer to table 2, where table 2 is an illustration of 12 characteristic indices.
TABLE 2
[Table 2: the 12 characteristic indices — reproduced as an image in the original document]
As can be seen from table 2, each piece of memory state historical data can yield corresponding parameters for the 12 characteristic indexes, and the 12 characteristic indexes perform well in terms of the distribution of the Pearson correlation coefficient, information gain, information gain rate and Gini coefficient.
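As an illustration of how such per-DIMM statistical indexes could be derived from CE records, the following sketch computes the five index families named above for one statistical window. The record layout (row, column, hard-error flag) is an assumption; the embodiment only states which quantities the 12 selected features cover:

```python
def dimm_features(ce_records):
    """Compute per-DIMM statistical indexes from CE records in one window.

    ce_records: list of (row, col, is_hard_error) tuples for a single DIMM
    (illustrative layout). Returns a dict of the five index families the
    12 selected features are stated to cover.
    """
    rows = {r for r, _, _ in ce_records}            # distinct rows hit
    cols = {c for _, c, _ in ce_records}            # distinct columns hit
    cells = {(r, c) for r, c, _ in ce_records}      # distinct cells hit
    return {
        "ce_count": len(ce_records),
        "cell_count": len(cells),
        "row_count": len(rows),
        "col_count": len(cols),
        "hard_error_count": sum(1 for _, _, hard in ce_records if hard),
    }
```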
Secondly, in the embodiment of the application, a method for generating a feature to be trained is provided; that is, one feature to be trained at least comprises a parameter corresponding to one of the CE count, cell count, row count, column count and hard error count. In this manner, considering that CEs and UEs have a certain degree of correlation in time and space, various feature indexes are designed to reflect the health condition of the DIMM indirectly; by comparing the correlation between the CEs and the actual memory fault list, as well as the distribution of the Pearson correlation coefficient, information gain, information gain rate and Gini coefficient over the indexes, 12 indexes can be selected as the features to be trained for modeling, thereby improving the reliability of training.
Optionally, on the basis of each embodiment corresponding to fig. 3, in a sixth optional embodiment of the method for training a memory detection model provided in the embodiment of the present application, generating a feature set to be trained according to a memory state historical data set may include:
generating a plurality of feature subsets to be trained according to the memory state historical data set, wherein the plurality of feature subsets to be trained belong to the feature sets to be trained;
training a memory detection model to be trained according to a feature set to be trained to obtain a predicted fault label set, comprising:
acquiring a predicted fault label subset corresponding to the feature subset to be trained through a memory detection submodel to be trained, wherein the predicted fault label subset belongs to the predicted fault label set, and the memory detection submodel to be trained belongs to one submodel of the memory detection models to be trained;
if the predicted fault label set and the real fault label set meet the model verification condition, training according to the memory detection model to be trained to obtain a memory detection model, wherein the method comprises the following steps:
and if the predicted fault label subset and the real fault label subset meet the model verification condition, generating a memory detection submodel according to the memory detection submodel to be trained, wherein the memory detection submodel belongs to one of the memory detection models.
In this embodiment, a method for establishing multiple memory detection submodels is introduced. Different task streams are constructed through visual dragging based on a one-stop task management and distributed scheduling platform and a one-stop machine learning platform, so as to train the memory detection submodels; it can be understood that in practical applications other platforms may be used for the training tasks, which is not limited here. One memory detection model can comprise a plurality of memory detection submodels, and the training mode of each memory detection submodel is similar: first, the memory state historical data set is split into a plurality of feature subsets to be trained, and then the corresponding memory detection submodel is trained based on each feature subset to be trained.
For ease of understanding, please refer to fig. 7, which is a schematic diagram illustrating a training process of the memory detection submodel in the embodiment of the present application. As shown in the figure, the model training stage is divided into three steps, namely preparation of historical data, model training and tuning, and model verification and updating. Specifically:
in the step A1, a node selects a large number of Dmesg logs, and extracts historical logs related to EDAC memory faults from the Dmesg logs;
in step A2, the node analyzes the historical logs to obtain CEs and UEs, distinguishing the three different error types of memory scrubbing error, memory read error and memory write error, and obtains location information and timestamps refined to the Cell level;
step A3, preparing CE statistical index data, namely acquiring a memory state historical data set;
in step A4, acquiring a historical memory fault list from the current network;
in step A5, according to the three fault definitions, respectively marking real fault labels on historical data of the memory state;
in step A6, after the real fault labels are obtained, the memory state historical data set is split into N memory state historical data subsets, and each subset is repeatedly and randomly split into a training set, a cross validation set and a test set at a ratio of 3:1:1, preparing data for the next model training step;
in step A7, each memory state historical data subset is input into a corresponding random forest regression model, where a random forest is an ensemble algorithm over decision trees that can be used for both classification and regression, and the regression form is selected here so as to output a probability value between 0 and 1; the number of decision trees is 10, the maximum depth is 10, the maximum number of branches is 32, the impurity measure is variance, the minimum number of samples per leaf node is 1, and the sampling ratio is 1.0; random forest regression models are trained on the N training sets respectively and tuned on their respective cross validation sets, obtaining N tuned memory detection submodels;
in step A8, the N tuned memory detection submodels are applied to their respective test sets to verify the effect of the models; if the performance meets the model verification condition, the qualifying memory detection submodels can be used to replace the currently deployed models, otherwise the models are not updated. If modeling is performed for the first time, training needs to be repeated until the model effect reaches the standard.
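The repeated 3:1:1 split in step A6 can be sketched with the standard library as follows; the function name and return shape are illustrative, not prescribed by the embodiment:

```python
import random

def split_3_1_1(samples, seed=0):
    """Randomly split one memory-state history subset into training /
    cross-validation / test sets at a 3:1:1 ratio (step A6).

    In practice this would be repeated several times with different seeds
    to produce multiple splits per subset.
    """
    rng = random.Random(seed)
    shuffled = samples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 3 // 5           # 3 parts of 5
    n_cv = n // 5                  # 1 part of 5
    train = shuffled[:n_train]
    cv = shuffled[n_train:n_train + n_cv]
    test = shuffled[n_train + n_cv:]
    return train, cv, test
```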
For ease of understanding, please refer to fig. 8, which is another schematic diagram illustrating a training process of the memory detection submodel according to the embodiment of the present application. As shown in the figure, specifically:
in step B1, the node selects a large number of Dmesg logs, and extracts the historical logs related to the EDAC memory failure from the Dmesg logs;
in step B2, the node analyzes the historical logs to obtain CEs and UEs, distinguishing the three different error types of memory scrubbing error, memory read error and memory write error, and obtains position information and timestamps refined to the Cell level;
in step B3, CE statistical index data is prepared, that is, a memory state historical data set is obtained;
in step B4, acquiring a historical memory fault list from the current network;
in step B5, according to the three fault definitions, respectively marking real fault labels on the historical data of the memory state;
in step B6, after the real fault labels are obtained, a memory state historical data subset is extracted from the memory state historical data set and repeatedly and randomly split into a training set, a cross validation set and a test set at a ratio of 3:1:1, preparing data for the next model training step;
in step B7, the memory state historical data subset is input into the corresponding random forest regression model, which predicts a fault label; the random forest regression model is trained on the training set and tuned on the cross validation set, obtaining a tuned memory detection submodel;
in step B8, the tuned memory detection submodel is applied to the test set to verify the effect of the model; if the performance meets the model verification condition, the qualifying memory detection submodel can be used to replace the currently deployed model, otherwise the model is not updated. If modeling is performed for the first time, training needs to be repeated until the model effect reaches the standard.
It can be understood that, since the historical data will not change much in a short time, the update period of the model need not be too short, for example, the model can be processed off-line, and a model update is triggered manually after a certain period of time.
Secondly, in the embodiment of the application, a method for establishing a plurality of memory detection submodels is provided. Through this method, the plurality of trained memory detection submodels can reduce the interference of abnormal online data on the model, which would otherwise cause abnormally high or abnormally low risk levels to be output, thereby improving the accuracy and reliability of model prediction.
Optionally, on the basis of each embodiment corresponding to fig. 3, in a seventh optional embodiment of the method for training a memory detection model provided in the embodiment of the present application, after training the memory detection model to be trained according to the feature set to be trained to obtain the predicted failure tag set, the method may further include:
determining the success rate of label matching according to the predicted fault label set and the real fault label set;
and if the success rate of the label matching is greater than or equal to a preset matching threshold, determining that the predicted fault label set and the real fault label set meet the model verification condition.
In this embodiment, a specific situation that satisfies the model verification condition is introduced. Specifically, after the node obtains the predicted fault label set, each predicted fault label is matched against the corresponding real fault label. Taking 1000 predicted fault labels as an example, assume that 900 of them successfully match their corresponding real fault labels, that is, the label matching success rate is 90%. The node then judges whether the label matching success rate is greater than or equal to a preset matching threshold; if the preset matching threshold is 90%, the label matching success rate reaches the preset matching threshold, and the node determines that the model verification condition is satisfied.
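The verification check above can be sketched as follows; it assumes the predicted probability scores have already been thresholded to 0/1 labels, and the function names are illustrative:

```python
def label_match_rate(predicted, actual):
    """Fraction of predicted fault labels that equal the real fault labels.

    predicted, actual: equal-length sequences of 0/1 labels (assumes the
    sub-model probability outputs were already thresholded to labels).
    """
    if len(predicted) != len(actual):
        raise ValueError("label sets must have the same length")
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(predicted)

def meets_verification(predicted, actual, threshold=0.9):
    """Model verification condition: match rate >= preset matching threshold."""
    return label_match_rate(predicted, actual) >= threshold
```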
Secondly, in the embodiment of the application, a specific condition meeting the model verification condition is provided, and by the above mode, the tag matching success rate can be used as a reference basis for meeting the model verification condition, and the preset matching threshold can be flexibly adjusted according to the requirement, so that the flexibility of model training is improved.
With reference to the above description, the method for detecting a memory in the present application will be described below, and referring to fig. 9, an embodiment of the method for detecting a memory in the embodiment of the present application includes:
201. acquiring log data to be detected, wherein the log data to be detected comprises log data corresponding to a memory;
in this embodiment, the memory detection device extracts log data to be detected from the Dmesg log collected on-line, where the log data to be detected includes log data of the DIMM to be detected. It can be understood that the memory detection apparatus is disposed on a node, and the node may be a terminal device or a server, which is not limited herein.
202. If the UE with the uncorrectable error does not appear in the memory according to the log data to be detected, generating a feature vector to be detected according to the memory;
in this embodiment, the node may perform UE detection on the DIMM according to the log data to be detected; that is, the node determines whether the DIMM has had a UE according to the DIMM position information, time information, error type information and the like in the log data to be detected, and once a UE has occurred, the DIMM is added to a prediction pool, where the prediction pool is used for storing risky DIMMs. If no UE has appeared, a feature vector to be detected is generated according to the DIMM; the feature vector to be detected can be represented as a 12-dimensional vector, for example (1,1,0,0,1,1,0,1,1,0,0,1), where each element corresponds to one feature index.
203. Obtaining N fault probability scores corresponding to the feature vectors to be detected through a memory detection model, wherein the memory detection model comprises N memory detection submodels, each memory detection submodel corresponds to one fault probability score, and N is an integer greater than or equal to 1;
in this embodiment, a node inputs a feature vector to be detected to a memory detection model, where one memory detection model may include N memory detection submodels, and each memory detection submodel outputs a corresponding failure probability score, so that the N failure probability scores may be obtained, and whether a failure detection condition is satisfied is determined based on the N failure probability scores, and if so, the DIMM is added to a prediction pool, and if not, it is determined that the DIMM is healthy.
204. And if the N fault probability scores meet the fault detection condition, generating a memory detection result corresponding to the memory.
In this embodiment, if it is determined at the node that the N failure probability scores satisfy the failure detection condition, a memory detection result corresponding to the DIMM may be generated, and the memory detection result may be a risk level and a suggested operation of the DIMM.
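Steps 201 to 204 above can be sketched as the following detection flow; the dict keys, return strings and plain-mean aggregation are illustrative assumptions (the embodiment leaves the exact fault detection condition open):

```python
def detect(dimm_log, submodels, threshold=0.5):
    """Sketch of steps 201-204 for a single DIMM.

    dimm_log: dict with "has_ue" (bool) and, when no UE occurred,
    "feature_vector" (the 12-dimensional vector). submodels: callables
    mapping a feature vector to a failure probability score in [0, 1].
    """
    # Step 202: a DIMM that already showed a UE goes straight to the pool.
    if dimm_log["has_ue"]:
        return "add_to_prediction_pool"
    # Step 203: obtain N failure probability scores from the N submodels.
    scores = [m(dimm_log["feature_vector"]) for m in submodels]
    # Step 204: here the fault detection condition is taken to be a plain
    # mean compared against a threshold (an assumption for illustration).
    mean_score = sum(scores) / len(scores)
    return "add_to_prediction_pool" if mean_score >= threshold else "healthy"
```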
The embodiment of the application provides a memory detection method: to-be-detected log data is obtained first; if it is detected according to the log data that no uncorrectable error has occurred on a DIMM, a feature vector to be detected is generated according to the DIMM; N fault probability scores corresponding to the feature vector are obtained through a memory detection model; and if the N fault probability scores meet the fault detection condition, a memory detection result corresponding to the DIMM is generated. In this manner, since the memory detection model is trained based on feature parameters of the DIMM, it can predict memory fault conditions at the granularity of the memory module level, fully considering the health condition and risk level of the DIMM, reducing the influence of system-level interference data, and improving the fault localization accuracy of memory detection.
Optionally, on the basis of each embodiment corresponding to fig. 9, in a first optional embodiment of the memory detection method provided in this embodiment of the present application, after obtaining N failure probability scores corresponding to the feature vectors to be detected through the memory detection model, the method may further include:
determining a fault probability average value according to the N fault probability scores;
if the mean value of the fault probabilities is larger than or equal to the threshold value of the fault probabilities, determining that the N fault probability scores meet the fault detection condition;
and if the fault probability average value is smaller than the fault probability threshold value, determining that the N fault probability scores do not meet the fault detection condition.
In this embodiment, a method for determining the fault detection condition based on the N fault probability scores is introduced. To improve the robustness of the model and prevent abnormal data, which may be introduced when the online model is automatically updated, from interfering with the model, a voting mechanism is added.
Specifically, it is assumed that N is 5, that is, one memory detection model includes 5 memory detection submodels which respectively output corresponding failure probability scores: memory detection submodel A outputs a failure probability score of 0.2, memory detection submodel B outputs 0.4, memory detection submodel C outputs 0.6, memory detection submodel D outputs 0.8, and memory detection submodel E outputs 1, so the failure probability average value obtained is (0.2+0.4+0.6+0.8+1)/5 = 0.6. Assuming that the failure probability threshold is 0.5, it is determined that the failure detection condition is satisfied.
Optionally, if N is 5, that is, one memory detection model includes 5 memory detection submodels which respectively output corresponding failure probability scores, and memory detection submodel A outputs a failure probability score of 0.9, memory detection submodel B outputs 0.4, memory detection submodel C outputs 0.6, memory detection submodel D outputs 0.8, and memory detection submodel E outputs 0.1, then the abnormally high score of 0.9 and the abnormally low score of 0.1 may be removed first, and the remaining 0.4, 0.6 and 0.8 are averaged to obtain a failure probability average value of (0.4+0.6+0.8)/3 = 0.6. Assuming that the failure probability threshold is 0.5, it is determined that the failure detection condition is satisfied.
It should be understood that, in practical applications, the method is not limited to the case of removing only one highest score and one lowest score, and different mechanisms may be further provided to select or process each failure probability score, which is only an illustration here and should not be construed as a limitation to the present application.
Note that if one memory detection model comprises 3 memory detection submodels, then after removing the highest score and the lowest score, the failure probability average value is simply the median of the 3 failure probability scores.
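The trimmed-mean variant described above can be sketched as follows; the function names are illustrative, and the rule of dropping exactly one highest and one lowest score follows the example in this embodiment:

```python
def fault_probability(scores):
    """Aggregate N sub-model failure probability scores.

    When N > 2, drop one highest and one lowest score (the voting
    mechanism's outlier trim), then average the rest; for N == 3 this
    reduces to the median, as noted above. Otherwise use a plain mean.
    """
    trimmed = sorted(scores)[1:-1] if len(scores) > 2 else list(scores)
    return sum(trimmed) / len(trimmed)

def meets_fault_condition(scores, threshold=0.5):
    """Fault detection condition: aggregated score >= probability threshold."""
    return fault_probability(scores) >= threshold
```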
Secondly, in the embodiment of the application, a mode for determining the fault detection condition based on N fault probability scores is provided. Through this mode, inaccurate final judgments caused by abnormally high or abnormally low risk levels can be effectively prevented, so adopting a minority-obeys-majority judgment mechanism can improve the accuracy and reliability of prediction.
Optionally, on the basis of the various embodiments corresponding to fig. 9, in a second optional embodiment of the method for memory detection provided in the embodiment of the present application, after the generating the memory detection result corresponding to the DIMM, the method may further include:
determining a risk level according to a memory detection result;
if the risk level is a high risk level, performing replacement processing on the memory, wherein the memory comprises DIMMs;
if the risk level is a medium risk level, data migration is carried out on data in a memory, wherein the memory comprises DIMMs;
and if the risk level is a low risk level, performing data migration on core data in a memory, wherein the memory comprises DIMMs.
In this embodiment, a method for performing automatic operation based on a memory detection result is introduced, and after a memory detection result corresponding to a DIMM is generated, a node may automatically process the DIMM in a prediction pool.
Specifically, the node may determine the risk level according to the memory detection result, and assume that the currently set risk level includes a high risk level, a medium risk level, and a low risk level, so the node may automatically initiate processing based on different risk levels according to an operation policy preset by an operation expert.
Referring to fig. 10, fig. 10 is a schematic interface diagram of an alarm prompt based on the memory detection result in the embodiment of the present application. As shown in the figure, if the risk level is a high risk level, the node performs replacement processing on the memory, that is, the alarm prompt on the interface is: risk level 'high', recommended operation 'memory replacement'. For the high risk level case, the node backs up the data and replaces the failing DIMM after service migration.

Referring to fig. 11, fig. 11 is another schematic interface diagram of an alarm prompt based on the memory detection result in the embodiment of the present application. As shown in the figure, if the risk level is a medium risk level, the node performs data migration processing, that is, the alarm prompt on the interface is: risk level 'medium', recommended operation 'data migration'. For the medium risk level case, the node backs up the data and migrates the services.

Referring to fig. 12, fig. 12 is another schematic interface diagram of an alarm prompt based on the memory detection result in the embodiment of the present application. As shown in the figure, if the risk level is a low risk level, the node performs core data migration processing, that is, the alarm prompt on the interface is: risk level 'low', recommended operation 'core data migration'. For the low risk level case, the node backs up the data and migrates the core and sensitive services.
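The three operation policies described above amount to a simple dispatch from risk level to suggested operation; the level names and action identifiers below are illustrative assumptions:

```python
# Operation policy table (hypothetical identifiers), per the three risk levels.
ACTIONS = {
    "high": "memory_replacement",   # back up data, migrate services, replace DIMM
    "medium": "data_migration",     # back up data and migrate services
    "low": "core_data_migration",   # migrate only core and sensitive services
}

def suggested_operation(risk_level):
    """Map a risk level from the memory detection result to an action."""
    return ACTIONS[risk_level]
```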
It is understood that, in practical applications, the memory may also be processed in other manners, such as only triggering an alarm prompt, which is not limited herein.
In the embodiment of the application, a mode of automatic operation based on the predicted memory detection result is provided. Through this mode, whether to initiate processing for the pre-faults in the prediction pool is determined according to the operation strategy preset by an operation expert, and the corresponding fault handling flow is automatically initiated for pre-fault DIMMs that meet the different operation strategies, so there is no need to manually trigger alarms and execute related processing, which improves the convenience of the scheme.
Optionally, on the basis of the foregoing embodiments corresponding to fig. 9, in a third optional embodiment of the method for memory detection provided in the embodiment of the present application, after generating the memory detection result corresponding to the DIMM, the method may further include:
if receiving a first processing instruction, performing data migration on data in a memory according to the first processing instruction, wherein the memory comprises a DIMM;
and if the second processing instruction is received, performing replacement processing on the memory according to the second processing instruction, wherein the memory comprises the DIMM.
In this embodiment, a method for performing manual operation based on the memory detection result is introduced. After the memory detection result corresponding to the DIMM is generated, the DIMM may be placed in the prediction pool without automatic processing; at the same time, a fault list corresponding to the DIMM may be activated, so that the user may choose, as required, whether to process the data in the memory or to replace the memory.
Specifically, if the user considers that the current risk level of the DIMM failure needs to perform data migration on data in the memory, a first processing instruction may be triggered to the node, and if the user considers that the current risk level of the DIMM failure needs to perform replacement processing on the memory, that is, replace the entire memory, a second processing instruction may be triggered to the node. Alternatively, if the user considers that the current risk level of DIMM failure requires data migration of the core data in the memory, a third processing instruction may be triggered to the node. In practical application, the memory may also be processed in other manners, such as only triggering an alarm prompt, which is not limited herein.
In the embodiment of the application, a manual operation mode based on the memory detection result obtained through prediction is provided, and through the mode, as the operation strategy is artificially defined and the importance degree of the service is relatively subjective, a channel for manually triggering an alarm in operation is reserved, so that the feasibility and the flexibility of the scheme are improved.
With reference to the above description, the method for detecting a memory in the present application will be described below, and referring to fig. 13, an embodiment of the method for detecting a memory in the embodiment of the present application includes:
301. acquiring log data to be detected, wherein the log data to be detected comprises log data corresponding to K memories, and K is an integer greater than or equal to 1;
in this embodiment, the memory detection device extracts log data to be detected from the Dmesg log collected on line, where the log data to be detected includes log data of K DIMMs to be detected, and K is an integer greater than or equal to 1. It can be understood that the memory detection apparatus is disposed on a node, and the node may be a terminal device or a server, which is not limited herein.
302. If the uncorrectable error UE of P memories is detected according to the log data to be detected, generating a feature vector set to be detected according to the residual memories, wherein the residual memories are the memories left after P memories are removed from K memories, the residual memories comprise (K-P) memories, P is an integer which is greater than or equal to 0 and less than or equal to K, and the feature vector set to be detected comprises (K-P) feature vectors to be detected;
in this embodiment, the node first performs UE detection on the K DIMMs according to the log data to be detected; that is, the node determines, according to the position information, time information, error type information and the like of each DIMM in the log data to be detected, whether each DIMM has had a UE within a preset time, where the preset time may be 1 day. The node thus judges whether a UE error is reported in the log data within 1 day after a target hour (a certain statistical hour window) of the DIMM, and if so, adds the DIMM to the prediction pool, where the prediction pool is used to store risky DIMMs. If no UE has appeared, a feature vector to be detected is generated according to the DIMM; the feature vector to be detected can be represented as a 12-dimensional vector, for example (1,1,0,0,1,1,0,1,1,0,0,1), where each element corresponds to one feature index.
Assuming that P DIMMs out of the K DIMMs have UE present, the P DIMMs are added to the prediction pool, and the remaining (K-P) DIMMs generate corresponding feature vectors to be detected respectively.
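The screening in steps 301-302 can be sketched as follows. This is a minimal illustration only: the log format, function names, and feature indices are assumptions, not the patent's actual data layout.

```python
# Hypothetical sketch of steps 301-302: DIMMs whose log data shows an
# uncorrectable error (UE) go straight into the prediction pool; each of
# the remaining (K - P) DIMMs gets a 0/1 feature vector to be detected,
# one element per feature index.

def screen_dimms(log_records, feature_indices):
    """log_records: {dimm_id: set of error/indicator strings}."""
    prediction_pool = []   # the P DIMMs already known to be dangerous
    feature_vectors = {}   # the remaining (K - P) DIMMs
    for dimm_id, errors in log_records.items():
        if "UE" in errors:
            prediction_pool.append(dimm_id)
        else:
            feature_vectors[dimm_id] = [
                1 if idx in errors else 0 for idx in feature_indices
            ]
    return prediction_pool, feature_vectors
```

With K = 2 DIMMs of which one shows a UE, P = 1 lands in the pool and the other yields a feature vector to be detected.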
303. Acquiring (K-P) failure probability score sets corresponding to the feature vector sets to be detected through a memory detection model, wherein each failure probability score set comprises N failure probability scores, the memory detection model comprises N memory detection submodels, each memory detection submodel corresponds to one failure probability score, and N is an integer greater than or equal to 1;
in this embodiment, a node inputs the (K-P) feature vectors to be detected into the memory detection model, which includes N memory detection submodels, and each memory detection submodel outputs a corresponding failure probability score, so that each DIMM to be detected corresponds to N failure probability scores. For each DIMM to be detected, the node judges whether a failure detection condition is satisfied based on the corresponding N failure probability scores; if so, the DIMM is added to the prediction pool, and if not, the DIMM is considered to be at a higher health level.
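The fan-out from feature vectors to score sets in step 303 can be sketched as follows; the submodels here are stand-in callables rather than the trained random forest regression submodels, and all names are illustrative assumptions.

```python
# Sketch: each feature vector to be detected is run through the N memory
# detection submodels, giving one set of N failure probability scores
# per DIMM.

def score_all(feature_vectors, submodels):
    """feature_vectors: {dimm_id: vector} -> {dimm_id: [N scores]}."""
    return {
        dimm_id: [model(vec) for model in submodels]
        for dimm_id, vec in feature_vectors.items()
    }
```

For (K-P) DIMMs and N submodels this produces the (K-P) failure probability score sets of N scores each that step 304 consumes.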
304. Determining T memories from the (K-P) memories according to the (K-P) failure probability score sets, wherein T is an integer which is greater than or equal to 0 and less than or equal to (K-P);
in this embodiment, the node determines T DIMMs from the (K-P) DIMMs according to the determination in step 303, and adds the T DIMMs to the prediction pool if the failure detection condition is satisfied. Wherein T is an integer greater than or equal to 0 and less than or equal to (K-P).
305. And generating a memory detection result according to the T memories and the P memories.
In this embodiment, the node generates corresponding memory detection results according to the T DIMMs and the P DIMMs, where the memory detection results may be risk levels and suggested operations of the DIMMs.
For convenience of understanding, please refer to fig. 14, where fig. 14 is a schematic diagram of the prediction process of the memory detection model in the embodiment of the present application. As shown in the figure, the model prediction stage is divided into three steps, namely online data preparation, model prediction scoring, and the fault initiation process. Specifically:
in step C1, the node collects Dmesg logs online, extracts historical logs related to EDAC memory faults from them, and acquires the corresponding DIMM position, time, and error type information;
in step C2, the node parses the historical logs to obtain CE and UE errors, distinguishing three different types (memory scrubbing errors, memory read errors, and memory write errors), and obtains location information and timestamps down to the Cell level;
in step C3, the node first checks the DIMMs for UE based on the history log, and if the EDAC scan shows that a certain DIMM has had a UE, that DIMM is directly thrown into the prediction pool;
in step C4, the node then cleans and preprocesses the CE statistical index data for the remaining DIMMs, i.e., prepares the data to be detected as the input of the random forest regression models;
in step C5, the node uses the N random forest regression models respectively to calculate the failure probability corresponding to the CE statistical index data;
in step C6, the N random forest regression models output N DIMM-level failure probability scores;
in step C7, according to the preset threshold, the node counts how many scores are above the threshold and how many are below it and adopts a minority-obeys-majority mechanism: if the median of the N DIMM-level failure probability scores exceeds the threshold, the DIMM is thrown into the prediction pool as a pre-fault;
in step C8, cleaning the detailed configuration information of the real-time memory, and providing data support for creating a fault list for the subsequent pre-fault memory;
in step C9, the node decides whether to initiate a pre-fault in the prediction pool according to an operation strategy preset by an operation expert, and automatically initiates the corresponding fault processing flow for a faulty DIMM matching each operation strategy: for lower-risk DIMMs, data backup and service migration are performed only for core and sensitive services; for higher-risk DIMMs, data backup and service migration are performed for all services, and the DIMM is replaced after the data backup and service migration are completed;
in step C10, since the operation policy is manually defined and the importance of a service is relatively subjective, a channel for manually triggering an alarm is reserved for operations personnel.
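The decision rule of step C7 can be sketched as follows, a minimal sketch under the assumptions stated there (N scores per DIMM, one preset threshold).

```python
# Count how many of the N DIMM-level failure probability scores exceed
# the preset threshold and apply the minority-obeys-majority mechanism;
# for odd N this is equivalent to testing whether the median score
# exceeds the threshold.

def is_pre_fault(scores, threshold):
    above = sum(1 for s in scores if s > threshold)
    return above > len(scores) - above
```

A DIMM scored (0.9, 0.8, 0.2) against a threshold of 0.5 is thrown into the prediction pool; one scored (0.1, 0.2, 0.9) is not.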
The embodiment of the application provides a memory detection method, namely, log data to be detected are obtained firstly, if uncorrectable errors of P DIMMs are detected according to the log data to be detected, a characteristic vector set to be detected is generated according to the rest DIMMs, then (K-P) failure probability score sets corresponding to the characteristic vector set to be detected are obtained through a memory detection model, T DIMMs are determined from the (K-P) DIMMs according to the (K-P) failure probability score sets, and finally, a memory detection result is generated according to the T DIMMs and the P DIMMs. By the mode, the memory detection model is trained based on the characteristic parameters of the DIMM, so that the memory detection model can predict the memory fault condition aiming at the granularity of the memory module level, the health condition and the risk level of the DIMM are fully considered, the influence of interference data at the system level is reduced, and the fault positioning accuracy of the memory detection is improved.
Based on the scheme provided by the present application, the method realizes, in the industry, the crossing from diagnosis and repair of memory faults to early warning of faults, and can ensure relatively high accuracy. Referring to fig. 15, fig. 15 is a schematic diagram comparing the effect of ROC curves when a UE or a memory failure list is used as the label in the embodiment of the present application. As shown in the figure, the Area Under Curve (AUC) enclosed by the Receiver Operating Characteristic (ROC) curve of the Random Forest and the coordinate axis is larger. Referring to fig. 16, fig. 16 is a schematic diagram comparing the effects of Precision-Recall curves when a tag is created from a UE or a memory failure in the embodiment of the present application; random forest also has higher Precision and Recall. Referring to table 3, table 3 shows an index comparison between different types of model algorithms.
TABLE 3
Model       LR      SVM     Decision Tree   Random Forest
AUC         0.742   0.676   0.910           0.932
Precision   0.381   0.462   0.738           0.758
Recall      0.505   0.388   0.589           0.606
F-score     0.434   0.422   0.655           0.674
Therefore, in terms of ROC/AUC, Precision-Recall (PR), and F-score, random forest regression achieves better performance than the Support Vector Machine (SVM) algorithm, the Logistic Regression (LR) algorithm, and the Decision Tree algorithm: the AUC reaches 0.932 and the F-score reaches 0.674.
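The F-scores in Table 3 are consistent with the standard F1 formula, which can be checked directly:

```python
# F1 = 2 * P * R / (P + R); e.g. the random forest column (precision
# 0.758, recall 0.606) gives approximately 0.674, matching Table 3.

def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.758, 0.606), 3))  # -> 0.674
```

The decision tree column checks out the same way: f1(0.738, 0.589) rounds to 0.655.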
Since the prediction objects of the existing schemes differ (International Business Machines (IBM) performs system-level prediction for UE, while Intel Corporation performs Cell-level prediction for CE), a direct comparison is not possible. Experimental data of the prior-art schemes are listed here for reference. Referring to table 4, table 4 shows the prediction effect of IBM on system-level UE.
TABLE 4
(Table 4 appears as an image in the original publication; its contents are not reproduced here.)
Referring to table 5, table 5 shows the predicted effect of Intel on CE at Cell level.
TABLE 5
Method                   Precision   Recall   F-score
Historical data (1,1)    0.359       0.351    0.349
Historical data (1,1)    0.678       0.252    0.362
Historical data (1,1)    0.109       0.522    0.174
Historical data (1,1)    0.329       0.390    0.332
Historical data (1,1)    0.261       0.415    0.316
Study only               0.592       0.320    0.409
Learning + propagation   0.493       0.436    0.456
As can be seen from tables 4 and 5, the predicted AUC of IBM for system UE in 2 weeks can reach 0.83, and F-score of Intel based on Cell level prediction model can reach 0.456.
Referring to fig. 17, fig. 17 is a schematic diagram of an embodiment of a memory detection model training apparatus in an embodiment of the present application, and the memory detection model training apparatus 40 includes:
an obtaining module 401, configured to obtain a memory state historical data set, where the memory state historical data set includes M memory state historical data, each memory state historical data corresponds to one memory, and M is an integer greater than or equal to 1;
a generating module 402, configured to generate a real fault tag set according to the memory state historical data set acquired by the acquiring module 401, where the real fault tag set includes M real fault tags, and each real fault tag corresponds to one memory;
the generating module 402 is further configured to generate a feature set to be trained according to the memory state historical data set acquired by the acquiring module 401, where the feature set to be trained includes M features to be trained, each feature to be trained corresponds to one memory, and each feature to be trained includes at least one parameter corresponding to a feature index;
the obtaining module 401 is further configured to train a to-be-trained memory detection model according to the to-be-trained feature set to obtain a predicted fault tag set, where the predicted fault tag set includes M predicted fault tags, and each predicted fault tag corresponds to one memory;
a training module 403, configured to determine that the memory detection model to be trained is a qualified memory detection model if the predicted failure tag set and the real failure tag set obtained by the obtaining module 401 meet a model verification condition.
Optionally, on the basis of the embodiment corresponding to fig. 17, in another embodiment of the memory detection model training apparatus 40 provided in the embodiment of the present application,
the obtaining module 401 is specifically configured to obtain log data and a memory fault list, where the log data includes information related to a memory, and the memory fault list includes information related to a fault;
acquiring a first memory set according to the log data and the memory fault list, wherein the first memory set comprises at least one first memory, and the first memory is a memory of the UE with uncorrectable errors;
and generating the memory state historical data set according to the first memory set.
Optionally, on the basis of the embodiment corresponding to fig. 17, in another embodiment of the memory detection model training apparatus 40 provided in the embodiment of the present application,
the obtaining module 401 is specifically configured to obtain a memory fault list, where the memory fault list includes information related to a fault;
acquiring a second memory set according to the memory fault list, wherein the second memory set comprises at least one second memory, and the second memory is a memory with a built memory fault list;
and generating the memory state historical data set according to the second memory set.
Optionally, on the basis of the embodiment corresponding to fig. 17, in another embodiment of the memory detection model training apparatus 40 provided in the embodiment of the present application,
the obtaining module 401 is specifically configured to obtain log data and a memory fault list, where the log data includes information related to a memory, and the memory fault list includes information related to a fault;
acquiring a third memory set according to the log data and the memory fault list, wherein the third memory set comprises at least one third memory, and the third memory is a memory of the UE with the uncorrectable error or a memory of the established memory fault list;
and generating the memory state historical data set according to the third memory set.
Optionally, on the basis of the embodiment corresponding to fig. 17, in another embodiment of the memory detection model training apparatus 40 provided in the embodiment of the present application,
the generating module 402 is specifically configured to generate a real fault tag corresponding to a first memory according to a situation that a UE occurs in a first memory set within a preset time if the memory state historical data set is generated according to the first memory set, where the real fault tag is a first tag or a second tag, the first tag indicates that the UE occurs in the first memory within the preset time, and the second tag indicates that the UE does not occur in the first memory within the preset time;
if the memory state historical data set is generated according to a second memory set, generating a real fault tag corresponding to the second memory according to the condition that a memory fault list is established in the second memory set within a preset time, wherein the real fault tag is a first tag or a second tag, the first tag represents that the memory fault list is established in the second memory within the preset time, and the second tag represents that the memory fault list is not established in the second memory within the preset time;
if the memory state historical data set is generated according to a third memory set, generating a real fault tag corresponding to the third memory according to whether a memory fault list is established for the third memory set within a preset time or a UE occurs within the preset time, wherein the real fault tag is a first tag or a second tag, the first tag represents that a memory fault list is established for the third memory within the preset time or a UE occurs within the preset time, and the second tag represents that no memory fault list is established for the third memory and no UE occurs within the preset time.
Optionally, on the basis of the embodiment corresponding to fig. 17, in another embodiment of the memory detection model training apparatus 40 provided in the embodiment of the present application,
the generating module 402 is specifically configured to obtain parameters corresponding to Q feature indicators according to the memory state historical data in the memory state historical data set, where the Q feature indicators include at least one of a correctable error CE number, a cell number, a row number, a column number, and a hard error number, and Q is an integer greater than or equal to 1;
and generating the characteristics to be trained corresponding to the memory according to the parameters corresponding to the Q characteristic indexes.
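Assembling a feature to be trained from the Q feature indicators can be sketched as follows; the record field names are illustrative assumptions, not the patent's actual schema.

```python
# Sketch: a memory's feature to be trained is built from the parameters
# of the Q feature indicators named above (CE number, cell number, row
# number, column number, hard error number).

def build_feature_to_train(record):
    indicators = ("ce_number", "cell_number", "row_number",
                  "column_number", "hard_error_number")
    return [record[name] for name in indicators]
```

Each memory in the memory state historical data set yields one such parameter vector, giving the M features to be trained.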
Optionally, on the basis of the embodiment corresponding to fig. 17, in another embodiment of the memory detection model training apparatus 40 provided in the embodiment of the present application,
the generating module 402 is specifically configured to generate a plurality of feature subsets to be trained according to the memory state historical data set, where the feature subsets to be trained belong to the feature set to be trained;
the obtaining module 401 is specifically configured to obtain, through the to-be-trained memory detection submodel, a predicted fault label subset corresponding to a to-be-trained feature subset, where the predicted fault label subset belongs to the predicted fault label set, and the to-be-trained memory detection submodel belongs to one of the to-be-trained memory detection models;
the training module 403 is specifically configured to generate a memory detection submodel according to the memory detection submodel to be trained if the predicted fault tag subset and the real fault tag subset satisfy the model verification condition, where the memory detection submodel belongs to one of the memory detection submodels.
Optionally, on the basis of the embodiment corresponding to fig. 17, please refer to fig. 18, in another embodiment of the memory detection model training device 40 provided in the embodiment of the present application, the memory detection model training device 40 further includes a determining module 404;
the determining module 404 is configured to train, by the obtaining module 401, a to-be-trained memory detection model according to the to-be-trained feature set to obtain a predicted fault tag set, and then determine a tag matching success rate according to the predicted fault tag set and a real fault tag set;
the determining module 404 is further configured to determine that the predicted failure tag set and the real failure tag set meet the model verification condition if the tag matching success rate is greater than or equal to a preset matching threshold.
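The model verification condition above can be sketched as follows; tag values and the threshold are illustrative assumptions.

```python
# Compute the tag matching success rate between the predicted fault tag
# set and the real fault tag set, and compare it with the preset
# matching threshold.

def meets_verification_condition(predicted_tags, real_tags, match_threshold):
    matches = sum(p == r for p, r in zip(predicted_tags, real_tags))
    return matches / len(real_tags) >= match_threshold
```

With M = 4 tags, three matches against a 0.75 threshold passes verification; one match does not.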
Referring to fig. 19, fig. 19 is a schematic diagram of an embodiment of a memory detection apparatus in an embodiment of the present application, in which the memory detection apparatus 50 includes:
an obtaining module 501, configured to obtain log data to be detected, where the log data to be detected includes log data corresponding to a memory;
a generating module 502, configured to generate a to-be-detected feature vector according to the memory if it is detected that the uncorrectable error UE does not occur in the memory according to the to-be-detected log data acquired by the acquiring module 501;
the obtaining module 501 is further configured to obtain, through a memory detection model, N failure probability scores corresponding to the to-be-detected feature vector generated by the generating module 502, where the memory detection model includes N memory detection submodels, each memory detection submodel corresponds to one failure probability score, and N is an integer greater than or equal to 1;
the generating module 502 is further configured to generate a memory detection result corresponding to the memory if the N failure probability scores obtained by the obtaining module 501 satisfy a failure detection condition.
Optionally, on the basis of the embodiment corresponding to fig. 19, please refer to fig. 20, in another embodiment of the memory detection apparatus 50 provided in the embodiment of the present application, the memory detection apparatus 50 further includes a determining module 503;
the obtaining module 501 is further configured to determine a failure probability average value according to N failure probability scores after obtaining the N failure probability scores corresponding to the feature vector to be detected through a memory detection model;
the determining module 503 is configured to determine that the N failure probability scores satisfy the failure detection condition if the failure probability average value obtained by the obtaining module 501 is greater than or equal to a failure probability threshold;
the determining module 503 is further configured to determine that the N failure probability scores do not satisfy the failure detection condition if the failure probability average value obtained by the obtaining module 501 is smaller than the failure probability threshold.
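The mean-based failure detection condition described in this embodiment can be sketched as follows; the threshold value is an assumption for illustration.

```python
# The N failure probability scores satisfy the failure detection
# condition exactly when their average is greater than or equal to the
# failure probability threshold.

def satisfies_failure_condition(scores, failure_probability_threshold):
    return sum(scores) / len(scores) >= failure_probability_threshold
```

Scores (0.7, 0.9, 0.8) against a threshold of 0.6 satisfy the condition; (0.1, 0.2) do not.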
Optionally, on the basis of the embodiment corresponding to fig. 19, please refer to fig. 21, in another embodiment of the memory detection apparatus 50 provided in the embodiment of the present application, the memory detection apparatus 50 further includes a determining module 503 and a processing module 504;
the determining module 503 is further configured to determine a risk level according to the memory detection result after the generating module 502 generates the memory detection result corresponding to the memory;
the processing module 504 is configured to perform replacement processing on a memory module if the risk level determined by the determining module 503 is a high risk level, where the memory module includes the memory;
the processing module 504 is further configured to perform data migration on data in a memory module if the risk level determined by the determining module 503 is a medium risk level, where the memory module includes the memory;
the processing module 504 is further configured to perform data migration on core data in a memory module if the risk level determined by the determining module 503 is a low risk level, where the memory module includes the memory.
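The tiered handling policy above can be sketched as a simple dispatch; level names and action strings are illustrative assumptions.

```python
# Sketch: the action taken depends on the risk level carried by the
# memory detection result (high -> replace; medium -> migrate all data;
# low -> migrate core data only).

def handle_memory(risk_level):
    if risk_level == "high":
        return "replace the memory"
    if risk_level == "medium":
        return "migrate all data out of the memory"
    if risk_level == "low":
        return "migrate core data only"
    return "no action"
```

This mirrors the three processing branches of the determining and processing modules described above.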
Optionally, on the basis of the embodiment corresponding to fig. 19, please refer to fig. 22, in another embodiment of the memory detection apparatus 50 provided in the embodiment of the present application, the memory detection apparatus 50 further includes a processing module 504;
the processing module 504 is configured to, after the generating module 502 generates the memory detection result corresponding to the memory, perform data migration on data in a memory module according to a first processing instruction if the first processing instruction is received, where the memory module includes the memory;
the processing module 504 is further configured to, if a second processing instruction is received, perform replacement processing on a memory module according to the second processing instruction, where the memory module includes the memory.
Referring to fig. 23, please refer to fig. 23, in which fig. 23 is a schematic diagram of an embodiment of a memory detection apparatus in an embodiment of the present application, a memory detection apparatus 60 includes:
the acquiring module 601 is configured to acquire log data to be detected, where the log data to be detected includes log data corresponding to K memories, and K is an integer greater than or equal to 1;
a generating module 602, configured to generate a set of feature vectors to be detected according to remaining memories if it is detected that uncorrectable errors have occurred in P memories according to the log data to be detected acquired by the acquiring module 601, where the remaining memories are memories remaining after the P memories are removed from the K memories, the remaining memories include (K-P) memories, P is an integer greater than or equal to 0 and less than or equal to K, and the set of feature vectors to be detected includes (K-P) feature vectors to be detected;
the obtaining module 601 is further configured to obtain (K-P) failure probability score sets corresponding to the feature vector set to be detected generated by the generating module 602 through a memory detection model, where each failure probability score set includes N failure probability scores, the memory detection model includes N memory detection submodels, each memory detection submodel corresponds to one failure probability score, and N is an integer greater than or equal to 1;
a determining module 603, configured to determine T memories from the (K-P) memories according to the (K-P) failure probability score sets obtained by the obtaining module 601, where T is an integer greater than or equal to 0 and less than or equal to the (K-P);
the generating module 602 is further configured to generate a memory detection result according to the T memories and the P memories determined by the determining module 603.
The embodiment of the present application further provides another memory detection apparatus, where the memory detection apparatus may be deployed in a node, and the node may specifically be a terminal device, as shown in fig. 24, for convenience of description, only a part related to the embodiment of the present application is shown, and details of the specific technology are not disclosed, please refer to the method part of the embodiment of the present application. The terminal device may be any terminal device including a mobile phone, a tablet computer, a Personal Digital Assistant (PDA), a Point of Sales (POS), a vehicle-mounted computer, and the like, taking the terminal device as the mobile phone as an example:
fig. 24 is a block diagram illustrating a partial structure of a mobile phone related to a terminal device provided in an embodiment of the present application. Referring to fig. 24, the handset includes: radio Frequency (RF) circuit 710, memory 720, input unit 730, display unit 740, sensor 750, audio circuit 760, wireless fidelity (WiFi) module 770, processor 780, and power supply 790. Those skilled in the art will appreciate that the handset configuration shown in fig. 24 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the mobile phone in detail with reference to fig. 24:
the RF circuit 710 may be used for receiving and transmitting signals during information transmission and reception or during a call; in particular, it receives downlink information from a base station and forwards it to the processor 780 for processing, and transmits uplink data to the base station. In general, the RF circuit 710 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 710 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), and the like.
The memory 720 may be used to store software programs and modules, and the processor 780 may execute various functional applications and data processing of the cellular phone by operating the software programs and modules stored in the memory 720. The memory 720 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 720 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The input unit 730 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the cellular phone. Specifically, the input unit 730 may include a touch panel 731 and other input devices 732. The touch panel 731, also referred to as a touch screen, can collect touch operations of a user (e.g. operations of the user on or near the touch panel 731 by using any suitable object or accessory such as a finger, a stylus, etc.) and drive the corresponding connection device according to a preset program. Alternatively, the touch panel 731 may include two portions of a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts it to touch point coordinates, and sends the touch point coordinates to the processor 780, and can receive and execute commands from the processor 780. In addition, the touch panel 731 may be implemented by various types, such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. The input unit 730 may include other input devices 732 in addition to the touch panel 731. In particular, other input devices 732 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 740 may be used to display information input by the user or information provided to the user and various menus of the mobile phone. The Display unit 740 may include a Display panel 741, and optionally, the Display panel 741 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch panel 731 can cover the display panel 741, and when the touch panel 731 detects a touch operation on or near the touch panel 731, the touch operation is transmitted to the processor 780 to determine the type of the touch event, and then the processor 780 provides a corresponding visual output on the display panel 741 according to the type of the touch event. Although the touch panel 731 and the display panel 741 are shown as two separate components in fig. 24 to implement the input and output functions of the mobile phone, in some embodiments, the touch panel 731 and the display panel 741 may be integrated to implement the input and output functions of the mobile phone.
The handset may also include at least one sensor 750, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that adjusts the brightness of the display panel 741 according to the brightness of ambient light, and a proximity sensor that turns off the display panel 741 and/or a backlight when the mobile phone is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.
Audio circuitry 760, speaker 761, and microphone 762 may provide an audio interface between a user and the mobile phone. The audio circuit 760 can transmit the electrical signal converted from the received audio data to the speaker 761, where it is converted into a sound signal and output; on the other hand, the microphone 762 converts a collected sound signal into an electrical signal, which is received by the audio circuit 760 and converted into audio data; the audio data is then output to the processor 780 for processing and subsequently transmitted, for example, to another mobile phone via the RF circuit 710, or output to the memory 720 for further processing.
WiFi belongs to short-distance wireless transmission technology, and the mobile phone can help a user to receive and send e-mails, browse webpages, access streaming media and the like through the WiFi module 770, and provides wireless broadband Internet access for the user. Although fig. 24 shows the WiFi module 770, it is understood that it does not belong to the essential constitution of the handset, and can be omitted entirely as needed within the scope not changing the essence of the invention.
The processor 780 is a control center of the mobile phone, connects various parts of the entire mobile phone by using various interfaces and lines, and performs various functions of the mobile phone and processes data by operating or executing software programs and/or modules stored in the memory 720 and calling data stored in the memory 720, thereby integrally monitoring the mobile phone. Optionally, processor 780 may include one or more processing units; optionally, processor 780 may integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 780.
The handset also includes a power supply 790 (e.g., a battery) for powering the various components. Optionally, the power supply is logically connected to the processor 780 via a power management system, which manages functions such as charging, discharging, and power consumption.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In this embodiment, the processor 780 included in the terminal device further has the following functions:
acquiring a memory state historical data set, wherein the memory state historical data set comprises M memory state historical data, each memory state historical data corresponds to one memory, and M is an integer greater than or equal to 1;
generating a real fault label set according to the memory state historical data set, wherein the real fault label set comprises M real fault labels, and each real fault label corresponds to one memory;
generating a feature set to be trained according to the memory state historical data set, wherein the feature set to be trained comprises M features to be trained, each feature to be trained corresponds to one memory, and each feature to be trained comprises at least one parameter corresponding to a feature index;
training a to-be-trained memory detection model according to the to-be-trained feature set to obtain a predicted fault label set, wherein the predicted fault label set comprises M predicted fault labels, and each predicted fault label corresponds to one memory;
and if the predicted fault label set and the real fault label set meet the model verification condition, determining the memory detection model to be trained as a qualified memory detection model.
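The training flow above (memory state history → real fault tags → features to be trained → trained model → verification against the real tags) can be sketched minimally. The patent does not prescribe a concrete model or data format; the CE-count threshold classifier, the field names, and the 0.9 verification threshold below are all invented placeholders standing in for the unspecified memory detection model:

```python
# Hypothetical sketch of the training flow; field names and the toy
# threshold model are assumptions, not the patent's actual method.
from dataclasses import dataclass

@dataclass
class MemoryHistory:
    ce_count: int           # correctable errors observed for this memory
    row_errors: int         # erroneous rows
    col_errors: int         # erroneous columns
    ue_within_window: bool  # did a UE occur within the preset time?

def true_label(h: MemoryHistory) -> int:
    # First tag (1): a UE occurred within the preset time; second tag (0): it did not.
    return 1 if h.ue_within_window else 0

def train_threshold_model(histories):
    """Grid-search a CE-count threshold that best reproduces the real fault
    tags; stands in for training the (unspecified) memory detection model."""
    labels = [true_label(h) for h in histories]
    best_t, best_acc = 0, -1.0
    for t in range(0, max(h.ce_count for h in histories) + 1):
        preds = [1 if h.ce_count >= t else 0 for h in histories]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

histories = [
    MemoryHistory(120, 4, 2, True),
    MemoryHistory(3, 0, 0, False),
    MemoryHistory(95, 2, 1, True),
    MemoryHistory(1, 0, 0, False),
]
threshold, match_rate = train_threshold_model(histories)
# Model verification condition: label-matching success rate >= preset threshold.
qualified = match_rate >= 0.9
print(threshold, match_rate, qualified)  # → 4 1.0 True
```

The label-matching success rate computed here mirrors the verification condition of claim 7: the predicted tags are compared element-wise against the real tags, and the model is deemed qualified when the match rate reaches a preset threshold.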
In this embodiment, the processor 780 included in the terminal device further has the following functions:
acquiring log data to be detected, wherein the log data to be detected comprises log data corresponding to a memory;
if it is detected, according to the log data to be detected, that no uncorrectable error (UE) has occurred in the memory, generating a feature vector to be detected according to the memory;
obtaining N fault probability scores corresponding to the feature vectors to be detected through a memory detection model, wherein the memory detection model comprises N memory detection submodels, each memory detection submodel corresponds to one fault probability score, and N is an integer greater than or equal to 1;
and if the N fault probability scores meet the fault detection condition, generating a memory detection result corresponding to the memory.
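The single-memory detection step can be illustrated with stub submodels. The averaging rule follows the condition elaborated in claim 9 (mean of the N scores compared with a threshold); the three scoring functions themselves are invented placeholders, since the patent leaves the submodels' internals open:

```python
# Hedged sketch: N submodels each yield a fault probability score for the
# feature vector; the average is compared with a preset fault threshold.
def detect(feature_vector, submodels, fault_threshold=0.5):
    scores = [m(feature_vector) for m in submodels]   # N fault probability scores
    mean_score = sum(scores) / len(scores)            # fault probability average
    meets_condition = mean_score >= fault_threshold   # fault detection condition
    return scores, mean_score, meets_condition

# Three stub submodels scoring on CE count, row errors, and column errors
# (purely illustrative; real submodels would be trained as described above).
submodels = [
    lambda v: min(1.0, v[0] / 100.0),  # CE-count-based score
    lambda v: min(1.0, v[1] / 10.0),   # row-error-based score
    lambda v: min(1.0, v[2] / 10.0),   # column-error-based score
]
scores, mean_score, faulty = detect([120, 4, 2], submodels)
print(mean_score, faulty)
```

With the example vector, the mean score exceeds the 0.5 threshold, so a memory detection result indicating a fault would be generated for the memory.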
In this embodiment, the processor 780 included in the terminal device further has the following functions:
acquiring log data to be detected, wherein the log data to be detected comprises log data corresponding to K memories, and K is an integer greater than or equal to 1;
if it is detected that uncorrectable errors occur in P memories according to the log data to be detected, generating a feature vector set to be detected according to remaining memories, wherein the remaining memories are memories remaining after the P memories are removed from the K memories, the remaining memories include (K-P) memories, P is an integer which is greater than or equal to 0 and less than or equal to K, and the feature vector set to be detected includes (K-P) feature vectors to be detected;
acquiring (K-P) failure probability score sets corresponding to the feature vector sets to be detected through a memory detection model, wherein each failure probability score set comprises N failure probability scores, the memory detection model comprises N memory detection submodels, each memory detection submodel corresponds to one failure probability score, and N is an integer greater than or equal to 1;
determining T memories from the (K-P) memories according to the (K-P) failure probability score sets, wherein T is an integer which is greater than or equal to 0 and less than or equal to the (K-P);
and generating a memory detection result according to the T memories and the P memories.
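The batch flow over K memories can be sketched as follows: the P memories where a UE already appeared are flagged directly, the remaining (K-P) memories are scored by the model, and T high-risk memories are selected from them. The log-entry fields and the scoring stub are illustrative assumptions:

```python
# Hypothetical sketch of the K/P/T batch detection flow; "id", "ue", and
# "features" are invented field names, and score_fn stands in for the model.
def batch_detect(log_entries, score_fn, fault_threshold=0.5):
    # Step 1: the P memories where a UE occurred go straight into the result.
    ue_mems = [e["id"] for e in log_entries if e["ue"]]
    remaining = [e for e in log_entries if not e["ue"]]      # (K - P) memories
    # Step 2: score each remaining memory's feature vector; the T memories
    # whose score meets the fault threshold are predicted faulty.
    risky = [e["id"] for e in remaining
             if score_fn(e["features"]) >= fault_threshold]
    # Step 3: the memory detection result covers both the T and the P memories.
    return {"ue_faulty": ue_mems, "predicted_faulty": risky}

entries = [
    {"id": "dimm0", "ue": True,  "features": [0, 0]},
    {"id": "dimm1", "ue": False, "features": [150, 6]},
    {"id": "dimm2", "ue": False, "features": [2, 0]},
]
result = batch_detect(entries, lambda f: min(1.0, f[0] / 100.0))
print(result)  # → {'ue_faulty': ['dimm0'], 'predicted_faulty': ['dimm1']}
```

Here K = 3, P = 1 (dimm0, which already exhibited a UE), and T = 1 (dimm1, selected by score); the detection result reports both.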
The present embodiment further provides another memory detection apparatus, which may be deployed in a node; the node may specifically be a server. Fig. 25 is a schematic structural diagram of a server provided in this embodiment. The server 800 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 822 (e.g., one or more processors), a memory 832, and one or more storage media 830 (e.g., one or more mass storage devices) storing an application 842 or data 844. The memory 832 and the storage medium 830 may provide transient or persistent storage. The program stored in the storage medium 830 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Further, the CPU 822 may be configured to communicate with the storage medium 830 and execute, on the server 800, the series of instruction operations in the storage medium 830.
The server 800 may also include one or more power supplies 826, one or more wired or wireless network interfaces 850, one or more input/output interfaces 858, and/or one or more operating systems 841, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps performed by the server in the above embodiment may be based on the server configuration shown in fig. 25.
In the embodiment of the present application, the CPU 822 included in the server also has the following functions:
acquiring a memory state historical data set, wherein the memory state historical data set comprises M memory state historical data, each memory state historical data corresponds to one memory, and M is an integer greater than or equal to 1;
generating a real fault label set according to the memory state historical data set, wherein the real fault label set comprises M real fault labels, and each real fault label corresponds to one memory;
generating a feature set to be trained according to the memory state historical data set, wherein the feature set to be trained comprises M features to be trained, each feature to be trained corresponds to one memory, and each feature to be trained comprises at least one parameter corresponding to a feature index;
training a to-be-trained memory detection model according to the to-be-trained feature set to obtain a predicted fault label set, wherein the predicted fault label set comprises M predicted fault labels, and each predicted fault label corresponds to one memory;
and if the predicted fault label set and the real fault label set meet the model verification condition, determining the memory detection model to be trained as a qualified memory detection model.
In the embodiment of the present application, the CPU 822 included in the server also has the following functions:
acquiring log data to be detected, wherein the log data to be detected comprises log data corresponding to a memory;
if it is detected, according to the log data to be detected, that no uncorrectable error (UE) has occurred in the memory, generating a feature vector to be detected according to the memory;
obtaining N fault probability scores corresponding to the feature vectors to be detected through a memory detection model, wherein the memory detection model comprises N memory detection submodels, each memory detection submodel corresponds to one fault probability score, and N is an integer greater than or equal to 1;
and if the N fault probability scores meet the fault detection condition, generating a memory detection result corresponding to the memory.
In the embodiment of the present application, the CPU 822 included in the server also has the following functions:
acquiring log data to be detected, wherein the log data to be detected comprises log data corresponding to K memories, and K is an integer greater than or equal to 1;
if it is detected that uncorrectable errors occur in P memories according to the log data to be detected, generating a feature vector set to be detected according to remaining memories, wherein the remaining memories are memories remaining after the P memories are removed from the K memories, the remaining memories include (K-P) memories, P is an integer which is greater than or equal to 0 and less than or equal to K, and the feature vector set to be detected includes (K-P) feature vectors to be detected;
acquiring (K-P) failure probability score sets corresponding to the feature vector sets to be detected through a memory detection model, wherein each failure probability score set comprises N failure probability scores, the memory detection model comprises N memory detection submodels, each memory detection submodel corresponds to one failure probability score, and N is an integer greater than or equal to 1;
determining T memories from the (K-P) memories according to the (K-P) failure probability score sets, wherein T is an integer which is greater than or equal to 0 and less than or equal to the (K-P);
and generating a memory detection result according to the T memories and the P memories.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (16)

1. A method for training a memory detection model is characterized by comprising the following steps:
acquiring a memory state historical data set from log data and/or a memory fault list, wherein the memory state historical data set comprises M memory state historical data, each memory state historical data corresponds to one memory, M is an integer greater than or equal to 1, and the memory state historical data set comprises: generating a set according to a first memory set, wherein the first memory set comprises memories of UE with uncorrectable errors; or, a set is generated according to a second memory set, wherein the second memory set comprises memories with built memory fault lists; or, generating a set according to a third memory set, where the third memory set includes a memory in which an uncorrectable error of the UE has occurred, or a memory in which a memory fault list has been established;
if the memory state historical data set is generated according to a first memory set, generating a real fault tag corresponding to the first memory according to whether a UE occurs in the first memory within a preset time, wherein the real fault tag is a first tag or a second tag, the first tag indicates that a UE occurs in the first memory within the preset time, and the second tag indicates that no UE occurs in the first memory within the preset time;
if the memory state historical data set is generated according to a second memory set, generating a real fault tag corresponding to the second memory according to whether a memory fault list is established for the second memory within a preset time, wherein the real fault tag is a first tag or a second tag, the first tag indicates that a memory fault list is established for the second memory within the preset time, and the second tag indicates that no memory fault list is established for the second memory within the preset time;
if the memory state historical data set is generated according to a third memory set, generating a real fault tag corresponding to the third memory according to whether a memory fault list is established for the third memory within a preset time or a UE occurs within the preset time, wherein the real fault tag is a first tag or a second tag, and the first tag indicates that a memory fault list is established for the third memory within the preset time or a UE occurs within the preset time; the second tag indicates that neither a memory fault list is established for the third memory within the preset time nor a UE occurs within the preset time; the real fault tag set comprises M real fault tags, each real fault tag corresponds to one memory, and each real fault tag is used for identifying whether the corresponding memory has a fault;
generating a feature set to be trained according to the memory state historical data set, wherein the feature set to be trained comprises M features to be trained, each feature to be trained corresponds to one memory, and each feature to be trained comprises at least one parameter corresponding to a feature index;
training a to-be-trained memory detection model according to the to-be-trained feature set to obtain a predicted fault label set, wherein the predicted fault label set comprises M predicted fault labels, and each predicted fault label corresponds to one memory;
and if the predicted fault label set and the real fault label set meet the model verification condition, determining the memory detection model to be trained as a qualified memory detection model.
2. The method of claim 1, wherein obtaining the memory state history data set comprises:
acquiring log data and a memory fault list, wherein the log data comprises information related to a memory, and the memory fault list comprises information related to a fault;
acquiring a first memory set according to the log data and the memory fault list, wherein the first memory set comprises at least one first memory, and the first memory is a memory in which an uncorrectable error (UE) has occurred;
and generating the memory state historical data set according to the first memory set.
3. The method of claim 1, wherein obtaining the memory state history data set comprises:
acquiring a memory fault list, wherein the memory fault list comprises information related to faults;
acquiring a second memory set according to the memory fault list, wherein the second memory set comprises at least one second memory, and the second memory is a memory for which a memory fault list has been established;
and generating the memory state historical data set according to the second memory set.
4. The method of claim 1, wherein obtaining the memory state history data set comprises:
acquiring log data and a memory fault list, wherein the log data comprises information related to a memory, and the memory fault list comprises information related to a fault;
acquiring a third memory set according to the log data and the memory fault list, wherein the third memory set comprises at least one third memory, and the third memory is a memory in which an uncorrectable error (UE) has occurred or a memory for which a memory fault list has been established;
and generating the memory state historical data set according to the third memory set.
5. The method of claim 1, wherein generating a feature set to be trained according to the memory state historical data set comprises:
acquiring parameters corresponding to Q characteristic indexes according to the memory state historical data in the memory state historical data set, wherein the Q characteristic indexes comprise at least one of the number of correctable errors (CE), the number of cells, the number of rows, the number of columns, and the number of hard errors, and Q is an integer greater than or equal to 1;
and generating the characteristics to be trained corresponding to the memory according to the parameters corresponding to the Q characteristic indexes.
6. The method of claim 1, wherein generating a feature set to be trained according to the memory state historical data set comprises:
generating a plurality of feature subsets to be trained according to the memory state historical data set, wherein the feature subsets to be trained belong to the feature sets to be trained;
training a memory detection model to be trained according to the feature set to be trained to obtain a predicted fault label set, including:
acquiring a predicted fault label subset corresponding to a feature subset to be trained through a memory detection submodel to be trained, wherein the predicted fault label subset belongs to the predicted fault label set, and the memory detection submodel to be trained belongs to one submodel of the memory detection models to be trained;
if the predicted fault label set and the real fault label set meet the model verification condition, determining that the memory detection model to be trained is a qualified memory detection model, including:
and if the predicted fault label subset and the real fault label subset meet the model verification condition, generating a memory detection submodel according to the memory detection submodel to be trained, wherein the memory detection submodel belongs to one of the memory detection submodels.
7. The method according to claim 1, wherein after training the memory detection model to be trained according to the feature set to be trained to obtain the predicted failure tag set, the method further comprises:
determining the success rate of label matching according to the predicted fault label set and the real fault label set;
and if the label matching success rate is greater than or equal to a preset matching threshold, determining that the predicted fault label set and the real fault label set meet the model verification condition.
8. A method for memory detection, which is applied to a memory detection model trained by the method for training a memory detection model according to any one of claims 1 to 7, the method for memory detection comprising:
acquiring log data to be detected, wherein the log data to be detected comprises log data corresponding to a memory;
if it is detected, according to the log data to be detected, that no uncorrectable error (UE) has occurred in the memory, generating a feature vector to be detected according to the memory;
obtaining N fault probability scores corresponding to the feature vectors to be detected through a memory detection model, wherein the memory detection model comprises N memory detection submodels, each memory detection submodel corresponds to one fault probability score, and N is an integer greater than or equal to 1;
and if the N fault probability scores meet the fault detection condition, generating a memory detection result corresponding to the memory.
9. The method according to claim 8, wherein after obtaining the N failure probability scores corresponding to the feature vectors to be detected by the memory detection model, the method further comprises:
determining a fault probability average value according to the N fault probability scores;
if the mean fault probability value is greater than or equal to a threshold fault probability value, determining that the N fault probability values meet the fault detection condition;
and if the fault probability average value is smaller than the fault probability threshold value, determining that the N fault probability scores do not meet the fault detection condition.
10. The method according to claim 8 or 9, wherein after the generating the memory detection result corresponding to the memory, the method further comprises:
determining a risk level according to the memory detection result;
if the risk level is a high risk level, performing replacement processing on a memory, wherein the memory comprises a DIMM;
if the risk level is a medium risk level, performing data migration on data in a memory, wherein the memory comprises the DIMM;
and if the risk level is a low risk level, performing data migration on core data in a memory, wherein the memory comprises the DIMM.
11. The method according to claim 8 or 9, wherein after the generating the memory detection result corresponding to the memory, the method further comprises:
if a first processing instruction is received, performing data migration on data in a memory according to the first processing instruction, wherein the memory comprises a DIMM;
and if a second processing instruction is received, performing replacement processing on a memory according to the second processing instruction, wherein the memory comprises the DIMM.
12. A method for memory detection, which is applied to a memory detection model trained by the method for training a memory detection model according to any one of claims 1 to 7, the method for memory detection comprising:
acquiring log data to be detected, wherein the log data to be detected comprises log data corresponding to K memories, and K is an integer greater than or equal to 1;
if it is detected that uncorrectable errors occur in P memories according to the log data to be detected, generating a feature vector set to be detected according to remaining memories, wherein the remaining memories are memories remaining after the P memories are removed from the K memories, the remaining memories include (K-P) memories, P is an integer which is greater than or equal to 0 and less than or equal to K, and the feature vector set to be detected includes (K-P) feature vectors to be detected;
acquiring (K-P) failure probability score sets corresponding to the feature vector sets to be detected through a memory detection model, wherein each failure probability score set comprises N failure probability scores, the memory detection model comprises N memory detection submodels, each memory detection submodel corresponds to one failure probability score, and N is an integer greater than or equal to 1;
determining T memories from the (K-P) memories according to the (K-P) failure probability score sets, wherein T is an integer which is greater than or equal to 0 and less than or equal to the (K-P);
and generating a memory detection result according to the T memories and the P memories.
13. A memory test model training device, comprising:
an obtaining module, configured to obtain a memory state history data set from log data and/or a memory fault list, wherein the memory state history data set includes M memory state history data, each memory state history data corresponds to one memory, M is an integer greater than or equal to 1, and the memory state history data set is one of: a set generated according to a first memory set, wherein the first memory set includes memories in which an uncorrectable error (UE) has occurred; a set generated according to a second memory set, wherein the second memory set includes memories for which a memory fault list has been established; or a set generated according to a third memory set, wherein the third memory set includes memories in which a UE has occurred or for which a memory fault list has been established;
a generating module, configured to: if the memory state historical data set is generated according to a first memory set, generate a real fault tag corresponding to the first memory according to whether a UE occurs in the first memory within a preset time, wherein the real fault tag is a first tag or a second tag, the first tag indicates that a UE occurs in the first memory within the preset time, and the second tag indicates that no UE occurs in the first memory within the preset time;
if the memory state historical data set is generated according to a second memory set, generate a real fault tag corresponding to the second memory according to whether a memory fault list is established for the second memory within a preset time, wherein the real fault tag is a first tag or a second tag, the first tag indicates that a memory fault list is established for the second memory within the preset time, and the second tag indicates that no memory fault list is established for the second memory within the preset time;
if the memory state historical data set is generated according to a third memory set, generate a real fault tag corresponding to the third memory according to whether a memory fault list is established for the third memory within a preset time or a UE occurs within the preset time, wherein the real fault tag is a first tag or a second tag, and the first tag indicates that a memory fault list is established for the third memory within the preset time or a UE occurs within the preset time; the second tag indicates that neither a memory fault list is established for the third memory within the preset time nor a UE occurs within the preset time; the real fault tag set comprises M real fault tags, each real fault tag corresponds to one memory, and each real fault tag is used for identifying whether the corresponding memory has a fault;
the generating module is further configured to generate a feature set to be trained according to the memory state historical data set, where the feature set to be trained includes M features to be trained, each feature to be trained corresponds to one memory, and each feature to be trained includes at least one parameter corresponding to a feature index;
the obtaining module is further configured to train a to-be-trained memory detection model according to the to-be-trained feature set to obtain a predicted fault tag set, where the predicted fault tag set includes M predicted fault tags, and each predicted fault tag corresponds to one memory;
and the training module is used for determining the memory detection model to be trained as a qualified memory detection model if the predicted fault label set and the real fault label set acquired by the acquisition module meet a model verification condition.
14. A memory test apparatus applied to a memory test model trained by the memory test model training apparatus according to claim 13, the memory test apparatus comprising:
the acquisition module is used for acquiring log data to be detected, wherein the log data to be detected comprises log data corresponding to a memory;
a generating module, configured to generate a feature vector to be detected according to the memory if it is detected, according to the log data to be detected acquired by the acquiring module, that no uncorrectable error (UE) has occurred in the memory;
the obtaining module is further configured to obtain, through a memory detection model, N failure probability scores corresponding to the to-be-detected feature vector generated by the generating module, where the memory detection model includes N memory detection submodels, each memory detection submodel corresponds to one failure probability score, and N is an integer greater than or equal to 1;
the generating module is further configured to generate a memory detection result corresponding to the memory if the N failure probability scores obtained by the obtaining module satisfy a failure detection condition.
15. A terminal device, comprising: a memory and a processor;
the memory is used for storing a computer program;
the processor is configured to execute a computer program stored in the memory;
the computer program is configured to perform a method of memory test model training according to any one of claims 1 to 7, and/or a method of memory test according to any one of claims 8 to 12.
16. A computer-readable storage medium, comprising:
the storage medium has stored therein a computer program for performing the method of memory test model training of any one of claims 1-7 and/or the method of memory test of any one of claims 8-12.
CN201910918511.4A 2019-09-26 2019-09-26 Memory detection model training method, memory detection method and device Active CN110598802B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911302875.6A CN111078479B (en) 2019-09-26 2019-09-26 Memory detection model training method, memory detection method and device
CN201910918511.4A CN110598802B (en) 2019-09-26 2019-09-26 Memory detection model training method, memory detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910918511.4A CN110598802B (en) 2019-09-26 2019-09-26 Memory detection model training method, memory detection method and device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201911302875.6A Division CN111078479B (en) 2019-09-26 2019-09-26 Memory detection model training method, memory detection method and device

Publications (2)

Publication Number Publication Date
CN110598802A CN110598802A (en) 2019-12-20
CN110598802B true CN110598802B (en) 2021-07-27

Family

ID=68863808

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201911302875.6A Active CN111078479B (en) 2019-09-26 2019-09-26 Memory detection model training method, memory detection method and device
CN201910918511.4A Active CN110598802B (en) 2019-09-26 2019-09-26 Memory detection model training method, memory detection method and device

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201911302875.6A Active CN111078479B (en) 2019-09-26 2019-09-26 Memory detection model training method, memory detection method and device

Country Status (1)

Country Link
CN (2) CN111078479B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11829233B2 (en) 2022-01-14 2023-11-28 Servicenow, Inc. Failure prediction in a computing system based on machine learning applied to alert data

Families Citing this family (19)

Publication number Priority date Publication date Assignee Title
CN111367732B (en) * 2020-02-23 2022-10-18 苏州浪潮智能科技有限公司 Memory application grade prediction method, system, terminal and storage medium
CN111382029B * 2020-03-05 2021-09-03 清华大学 Mainboard anomaly diagnosis method and device based on PCA and multidimensional monitoring data
CN111353890A (en) * 2020-03-30 2020-06-30 中国工商银行股份有限公司 Application log-based application anomaly detection method and device
CN111459141A (en) * 2020-04-21 2020-07-28 深圳市智物联网络有限公司 Industrial equipment fault online diagnosis method and related device
CN113821364A (en) * 2020-06-20 2021-12-21 华为技术有限公司 Memory fault processing method, device, equipment and storage medium
CN114002949A (en) * 2020-07-28 2022-02-01 华为技术有限公司 Control method and control device based on artificial intelligence
KR20230041103A (en) * 2020-08-05 2023-03-23 후아웨이 테크놀러지 컴퍼니 리미티드 Memory failure handling method and device
CN112306808B (en) * 2020-11-03 2022-08-16 平安科技(深圳)有限公司 Performance monitoring and evaluating method and device, computer equipment and readable storage medium
CN112580092B (en) * 2020-12-07 2023-03-24 北京明朝万达科技股份有限公司 Sensitive file identification method and device
CN112704874B (en) * 2020-12-21 2023-09-22 北京信息科技大学 Method and device for automatically generating Gotty scene in 3D game
CN112529104A (en) * 2020-12-23 2021-03-19 东软睿驰汽车技术(沈阳)有限公司 Vehicle fault prediction model generation method, fault prediction method and device
CN112613601B (en) * 2020-12-24 2024-01-16 暨南大学 Neural network model updating method, equipment and computer storage medium
CN113159090A (en) * 2021-01-20 2021-07-23 渭南双盈未来科技有限公司 Model self-adaptive information processing system and processing method thereof
CN113094200B (en) * 2021-06-07 2021-08-24 腾讯科技(深圳)有限公司 Application program fault prediction method and device
CN115981911A (en) * 2021-10-12 2023-04-18 中兴智能科技南京有限公司 Memory failure prediction method, electronic device and computer-readable storage medium
US20230161655A1 (en) * 2021-11-19 2023-05-25 Microsoft Technology Licensing, Llc Training and using a memory failure prediction model
CN115543665A (en) * 2022-09-23 2022-12-30 超聚变数字技术有限公司 Memory reliability evaluation method and device and storage medium
CN116052789B (en) * 2023-03-29 2023-09-15 河北大景大搪化工设备有限公司 Toluene chlorination parameter automatic optimization system based on deep learning
CN116381542B (en) * 2023-06-05 2023-07-25 深圳和润达科技有限公司 Health diagnosis method and device of power supply equipment based on artificial intelligence

Citations (7)

Publication number Priority date Publication date Assignee Title
CN106294065A (en) * 2016-07-28 2017-01-04 联想(北京)有限公司 Hard disk failure monitoring method, apparatus and system
CN106961249A (en) * 2017-03-17 2017-07-18 广西大学 Photovoltaic array fault diagnosis and early-warning method
CN107817787A (en) * 2017-11-29 2018-03-20 华南理工大学 Machine-learning-based fault diagnosis method for intelligent production line robots
CN109344017A (en) * 2018-09-06 2019-02-15 浪潮电子信息产业股份有限公司 Method, device and readable storage medium for predicting memory failure based on machine learning
CN109445422A (en) * 2018-12-19 2019-03-08 佛山科学技术学院 Chemical production equipment failure prediction method
CN109492698A (en) * 2018-11-20 2019-03-19 腾讯科技(深圳)有限公司 Model training method, object detection method and related apparatus
CN109710505A (en) * 2019-01-02 2019-05-03 郑州云海信息技术有限公司 Disk failure prediction method, device, terminal and storage medium

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
CN105511944B * 2016-01-07 2018-09-28 上海海事大学 Anomaly detection method for virtual machines inside a cloud system
US10579494B2 (en) * 2018-01-05 2020-03-03 Nec Corporation Methods and systems for machine-learning-based resource prediction for resource allocation and anomaly detection

Also Published As

Publication number Publication date
CN110598802A (en) 2019-12-20
CN111078479A (en) 2020-04-28
CN111078479B (en) 2021-09-07

Similar Documents

Publication Publication Date Title
CN110598802B (en) Memory detection model training method, memory detection method and device
CN109241431B (en) Resource recommendation method and device
CN109412900B (en) Network state recognition method, model training method and model training device
CN110362494B (en) Method for displaying microservice state information, model training method and related device
JP2018536920A (en) Text information processing method and device
CN112364439A (en) Simulation test method and device for automatic driving system and storage medium
CN110634474B (en) Speech recognition method and device based on artificial intelligence
CN111222563B (en) Model training method, data acquisition method and related device
CN110995810B (en) Object identification method based on artificial intelligence and related device
CN108268366A Test case execution method and device
CN114595124B (en) Time sequence abnormity detection model evaluation method, related device and storage medium
CN111913848A (en) Data monitoring and analyzing method and related equipment
CN112862021B (en) Content labeling method and related device
CN113704008A (en) Anomaly detection method, problem diagnosis method and related products
CN110929882A (en) Feature vector calculation method based on artificial intelligence and related device
CN116450384A (en) Information processing method and related device
CN112948763B (en) Piece quantity prediction method and device, electronic equipment and storage medium
CN112764957A Application fault demarcation method and device
CN115080840A (en) Content pushing method and device and storage medium
CN117555815B (en) Parameter prediction method, model training method and related device
CN116450808B (en) Data processing method and device and storage medium
CN113190646B (en) User name sample labeling method and device, electronic equipment and storage medium
US20230088429A1 (en) Processing device, processing method, and program
CN116308759A (en) Recommendation method and device for credit agency
CN117009171A (en) Method and device for processing pre-lost object and storage medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40015685

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant