CN117668706A - Method and device for isolating memory faults of server, storage medium and electronic equipment

Publication number
CN117668706A
Authority
CN
China
Prior art keywords
memory
data
fault
prediction model
server
Prior art date
Legal status
Pending
Application number
CN202311709577.5A
Other languages
Chinese (zh)
Inventor
张晓斌
刘畅
曹阳
Current Assignee
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date
Filing date
Publication date
Application filed by China Construction Bank Corp
Priority to CN202311709577.5A
Publication of CN117668706A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

A method and device for isolating server memory faults, a storage medium, and electronic equipment. Target memory characteristic data of a target server is obtained based on memory key characteristic indexes. The target memory characteristic data is input into a pre-trained memory failure prediction model to obtain a memory failure type prediction result output by the model, where the memory failure prediction model is a classification model based on a random forest algorithm. When the memory failure type prediction result is an uncorrectable error, fault isolation is performed on the memory of the target server. Because the memory failure prediction model is built with a random forest algorithm, it can accurately predict the type of fault the memory is likely to experience, so that memory predicted to suffer an uncorrectable error can be isolated in time. This supports predictive maintenance of server memory and helps guarantee the service continuity and data security of the server.

Description

Method and device for isolating memory faults of server, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular to a method and device for isolating server memory faults, a storage medium, and electronic equipment.
Background
Data centers are the core of modern information technology. They rely on large-scale server clusters to process massive amounts of data and support diverse business applications, and server memory faults are a major problem that data center operation and maintenance must address.
Currently, the memory ECC (Error Correcting Code) mechanism gives a server and its memory a degree of fault tolerance, but it can only correct correctable errors (CE) occurring in memory. For uncorrectable errors (UCE), operation and maintenance personnel must still intervene promptly, for example by replacing the failed memory. When an uncorrectable error occurs in memory, the server is at risk of going down, which interrupts service, harms service continuity, and can easily cause data loss.
Therefore, how to guard against memory failures in advance is a technical problem that those skilled in the art need to solve.
Disclosure of Invention
In view of the above problems, the present disclosure provides a method and device for isolating server memory faults, a storage medium, and electronic equipment that overcome or at least partially solve the above problems. The technical solution is as follows:
A server memory fault isolation method, comprising:
obtaining target memory characteristic data of a target server based on memory key characteristic indexes;
inputting the target memory characteristic data into a pre-trained memory failure prediction model to obtain a memory failure type prediction result output by the memory failure prediction model, wherein the memory failure prediction model is a classification model based on a random forest algorithm;
and performing fault isolation on the memory of the target server when the memory failure type prediction result is an uncorrectable error.
Optionally, obtaining the target memory characteristic data of the target server based on the memory key characteristic indexes includes:
obtaining memory key index data of the target server based on the memory key characteristic indexes;
converting the memory key index data into memory feature vector data;
and performing data standardization processing on the memory feature vector data to obtain the target memory characteristic data.
Optionally, the training process of the memory failure prediction model includes:
obtaining historical operation data of each server in a target server cluster;
performing feature engineering processing on the historical operation data to determine the memory key characteristic indexes;
determining, in the historical operation data, historical memory characteristic data corresponding to the memory key characteristic indexes;
and training the memory failure prediction model with the historical memory characteristic data to obtain the trained memory failure prediction model.
Optionally, before performing feature engineering processing on the historical operation data to determine the memory key characteristic indexes, the method further includes:
performing data cleaning on the historical operation data.
Optionally, training the memory failure prediction model with the historical memory characteristic data to obtain the trained memory failure prediction model includes:
adding a corresponding fault type label to each item of historical memory characteristic data;
dividing the labeled historical memory characteristic data into a model training set and a model test set;
inputting the model training set into an initial random forest model for machine learning to obtain the memory failure prediction model;
inputting the model test set into the memory failure prediction model for performance evaluation to obtain a performance evaluation result;
and obtaining the memory failure prediction model whose performance evaluation result is a pass.
Optionally, the fault type labels include a no-fault label, a correctable error label, and an uncorrectable error label.
Optionally, the memory key characteristic indexes include a memory CE value, a temperature, and a voltage.
A server memory fault isolation device, comprising a target memory characteristic data obtaining unit, a memory failure type prediction result obtaining unit, and a fault isolation unit, wherein:
the target memory characteristic data obtaining unit is configured to obtain target memory characteristic data of a target server based on memory key characteristic indexes;
the memory failure type prediction result obtaining unit is configured to input the target memory characteristic data into a pre-trained memory failure prediction model to obtain a memory failure type prediction result output by the memory failure prediction model, wherein the memory failure prediction model is a classification model based on a random forest algorithm;
the fault isolation unit is configured to perform fault isolation on the memory of the target server when the memory failure type prediction result is an uncorrectable error.
A computer-readable storage medium having a program stored thereon, wherein the program, when executed by a processor, implements any of the server memory fault isolation methods described above.
An electronic device comprising at least one processor, and at least one memory and a bus connected to the processor, wherein the processor and the memory communicate with each other through the bus, and the processor is configured to invoke program instructions in the memory to execute any of the server memory fault isolation methods described above.
With the above technical solution, the method and device for isolating server memory faults, the storage medium, and the electronic equipment provided by the present disclosure obtain target memory characteristic data of a target server based on memory key characteristic indexes; input the target memory characteristic data into a pre-trained memory failure prediction model to obtain a memory failure type prediction result output by the model, where the memory failure prediction model is a classification model based on a random forest algorithm; and perform fault isolation on the memory of the target server when the memory failure type prediction result is an uncorrectable error. Because the memory failure prediction model is built with a random forest algorithm, the present disclosure can accurately predict the type of fault the memory is likely to experience and isolate, in time, memory predicted to suffer an uncorrectable error. This supports predictive maintenance of server memory and helps guarantee the service continuity and data security of the server.
The foregoing is merely an overview of the technical solutions of the present disclosure. So that the technical means of the present disclosure can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the present disclosure are more readily apparent, specific embodiments of the present disclosure are described below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the disclosure. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a schematic flowchart of one implementation of the server memory fault isolation method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of another implementation of the server memory fault isolation method provided by an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of the training process of the memory failure prediction model provided by an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of the server memory fault isolation device provided by an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of the electronic device provided by an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
At present, server memory failure is a significant challenge for data center operation and maintenance. The traditional way of handling a memory failure is to shut down the whole server and repair or replace the failed memory after the failure has already interrupted service. This seriously affects service continuity, and during the repair or replacement, work in progress very likely cannot be saved in time, so data is lost.
For large data centers in the financial industry, memory failures can pose serious risks, for example business interruption, threats to data integrity, and security vulnerabilities. They may even lead to financial market instability and legal compliance problems, which not only increases operation and maintenance costs but also damages customer trust and reputation and reduces competitiveness. Therefore, taking effective measures to improve the availability and efficiency of memory fault handling and to avoid potential risks is key to maintaining high-quality data center operation.
FIG. 1 is a flowchart of one implementation of the server memory fault isolation method provided by an embodiment of the present disclosure. As shown in FIG. 1, the method may include:
S100, obtaining target memory characteristic data of a target server based on the memory key characteristic indexes.
A memory key characteristic index is a key index that reflects the memory fault characteristics of a server. Optionally, the memory key characteristic indexes include a memory CE value, a temperature, and a voltage.
The target memory characteristic data is data that the memory failure prediction model can recognize and process.
According to the embodiment of the disclosure, the data corresponding to the memory key characteristic indexes of the target server can be collected in real time and converted into target memory characteristic data that the memory failure prediction model can recognize and process.
S110, inputting the target memory characteristic data into a pre-trained memory failure prediction model to obtain a memory failure type prediction result output by the memory failure prediction model, wherein the memory failure prediction model is a classification model based on a random forest algorithm.
The random forest algorithm is a machine learning algorithm that applies ensemble learning, combining a plurality of decision trees to handle classification problems. According to the embodiment of the disclosure, the memory failure prediction model can be trained in advance with historical memory characteristic data of servers, so that a memory failure prediction model capable of predicting memory faults is obtained.
The memory failure type prediction result may be any one of the following fault types: no fault, correctable error (CE), and uncorrectable error (UCE).
Correctable errors can be corrected by the memory controller through the memory ECC (Error Correcting Code) mechanism and therefore do not cause server downtime. Accordingly, embodiments of the present disclosure focus on handling uncorrectable errors, which the memory ECC mechanism cannot correct.
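To make the ensemble idea concrete, the following is a minimal pure-Python sketch of bootstrap sampling plus majority voting. It assumes depth-1 trees (stumps), one randomly chosen feature per tree, and the label strings "none", "CE", and "UCE"; none of these choices comes from the disclosure, and the function names `fit_stump`, `fit_forest`, and `predict` are hypothetical. A production model would more likely rely on an existing library implementation of random forests.

```python
import random
from collections import Counter


def majority(labels):
    """Return the most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels, feature):
    """Fit a depth-1 tree on one feature by minimizing misclassifications."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        left_label = majority(left)
        right_label = majority(right) if right else left_label
        errors = sum(y != left_label for y in left)
        errors += sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, feature, threshold, left_label, right_label)
    return best[1:]


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feature = rng.randrange(len(rows[0]))         # random feature choice
        forest.append(fit_stump([rows[i] for i in picks],
                                [labels[i] for i in picks], feature))
    return forest


def predict(forest, row):
    """Classify one feature vector by majority vote over all trees."""
    votes = [left if row[feature] <= threshold else right
             for feature, threshold, left, right in forest]
    return majority(votes)
```

Each tree sees a different resample of the data, so individual trees disagree on borderline inputs; the vote is what makes the combined prediction more stable than any single tree.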
S120, performing fault isolation on the memory of the target server when the memory failure type prediction result is an uncorrectable error.
Fault isolation means separating a fault from other normal modules through specific technical measures, so that when one module fails the other modules are not affected. The fault isolation measures provided by embodiments of the present disclosure may include shutting down dependencies, stopping synchronous calls, and automatically switching to a backup.
According to the embodiment of the disclosure, when the memory failure type prediction result is an uncorrectable error, fault isolation measures can be taken for the memory of the target server and an alarm can be raised to notify operation and maintenance personnel to repair or replace the memory that is about to fail as soon as possible, so that the service continuity and high availability of the target server are maintained. The automatic fault isolation response also helps to deal with memory faults quickly, improving response time and efficiency.
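The decision in step S120 can be sketched as a small dispatch function. This is only an illustration under stated assumptions: the `FaultType` enum, the `handle_prediction` function, and the `isolate` and `alert` callbacks are hypothetical names, and the callbacks stand in for whatever isolation measure and alarm channel a deployment actually uses.

```python
from enum import Enum


class FaultType(Enum):
    """The three prediction outcomes named in the disclosure."""
    NONE = "no fault"
    CE = "correctable error"
    UCE = "uncorrectable error"


def handle_prediction(server_id, prediction, isolate, alert):
    """Isolate and alert only when an uncorrectable error is predicted.

    `isolate` and `alert` are caller-supplied callbacks (hypothetical hooks).
    Returns True when isolation was triggered, otherwise False.
    """
    if prediction is not FaultType.UCE:
        return False  # no fault or a correctable error: ECC handles it
    isolate(server_id)
    alert(f"{server_id}: predicted uncorrectable memory error, memory isolated")
    return True
```

Passing the actions in as callbacks keeps the decision logic independent of how isolation is carried out, so the same function works whether isolation means switching to a backup or stopping synchronous calls.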
According to the embodiment of the disclosure, in the context of operating and maintaining a very large number of servers, the fault condition of memory can be predicted in advance and the faulty memory isolated. The data center can therefore act ahead of time and optimize server performance promptly, which prevents servers from going down or becoming unavailable because of memory problems, ensures server availability and stability, and in turn ensures efficient and stable operation of large server clusters.
The server memory fault isolation method provided by the present disclosure obtains target memory characteristic data of a target server based on memory key characteristic indexes; inputs the target memory characteristic data into a pre-trained memory failure prediction model to obtain a memory failure type prediction result output by the model, where the memory failure prediction model is a classification model based on a random forest algorithm; and performs fault isolation on the memory of the target server when the memory failure type prediction result is an uncorrectable error. Because the memory failure prediction model is built with a random forest algorithm, the present disclosure can accurately predict the type of fault the memory is likely to experience and isolate, in time, memory predicted to suffer an uncorrectable error. This supports predictive maintenance of server memory and helps guarantee the service continuity and data security of the server.
Optionally, based on the method shown in FIG. 1, FIG. 2 is a flowchart of another implementation of the server memory fault isolation method provided by an embodiment of the present disclosure. As shown in FIG. 2, step S100 may include:
S101, obtaining memory key index data of the target server based on the memory key characteristic indexes.
The memory key index data is data corresponding to the memory key characteristic index of the target server in the production environment.
The embodiment of the disclosure can monitor the data of the target server under each memory key characteristic index in real time, so as to obtain the memory key index data of the target server under each memory key characteristic index.
S102, converting the memory key index data into memory feature vector data.
For ease of understanding, an example is given here: assuming that the memory key index data obtained in the embodiment of the present disclosure includes a memory CE value of 3, a temperature of 30 degrees Celsius, and a voltage of 12 volts, the converted memory feature vector data is [3, 30, 12].
S103, performing data standardization processing on the memory feature vector data to obtain target memory feature data.
Data standardization scales data whose original distribution ranges differ into a common range. For example, the embodiment of the disclosure can convert each parameter in the memory feature vector data into a value between 0 and 1.
According to the embodiment of the disclosure, through data vectorization and data standardization, the memory key index data can be converted into target memory characteristic data that the memory failure prediction model can recognize. Fault prediction can then be performed accurately on the memory of the target server to obtain the type of fault that the memory is about to experience, which makes it possible to isolate, in time, memory that is about to suffer an uncorrectable error and to guarantee the service continuity and data security of the server.
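Steps S102 and S103 can be sketched as two small functions. This is a minimal illustration, assuming min-max scaling with per-feature bounds saved from training; the dictionary keys in `FEATURES`, the function names, and the bound values in the usage note are hypothetical and do not come from the disclosure.

```python
# Hypothetical metric names, in the fixed order the model expects.
FEATURES = ("ce_count", "temperature_c", "voltage_v")


def to_vector(metrics):
    """S102: turn a dict of monitored metrics into an ordered feature vector."""
    return [float(metrics[name]) for name in FEATURES]


def min_max_scale(vector, lows, highs):
    """S103: scale each feature into [0, 1] using bounds saved from training."""
    scaled = []
    for value, low, high in zip(vector, lows, highs):
        if high == low:
            scaled.append(0.0)  # constant feature: no spread to scale by
        else:
            ratio = (value - low) / (high - low)
            scaled.append(min(1.0, max(0.0, ratio)))  # clamp unseen extremes
    return scaled
```

For the sample in the text, `to_vector({"ce_count": 3, "temperature_c": 30, "voltage_v": 12})` gives `[3.0, 30.0, 12.0]`. With assumed bounds of `[0, 20, 11]` and `[10, 40, 13]`, scaling gives `[0.3, 0.5, 0.5]`. Reusing the training-time bounds at prediction time matters, because a value scaled against different bounds would not mean the same thing to the model.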
Optionally, FIG. 3 is a flowchart of the training process of the memory failure prediction model provided by an embodiment of the present disclosure. As shown in FIG. 3, the training process may include:
S200, obtaining historical operation data of each server in the target server cluster.
The target server cluster may be a server cluster deployed in a data center. The embodiment of the disclosure can collect the full historical operation data of all servers in the target server cluster, so as to obtain high-quality, complete data with correct timestamps and thereby ensure the accuracy of subsequent model training.
S210, performing feature engineering processing on the historical operation data to determine the memory key characteristic indexes.
Feature engineering is the process of converting raw data into features that better express the nature of the problem, so that applying these features to a prediction model improves the model's prediction accuracy on unseen data. According to the embodiment of the disclosure, the memory key characteristic indexes that are strongly correlated with memory faults can be screened out through feature engineering based on the historical operation data.
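The disclosure does not say which screening technique is used. As one possible illustration, the sketch below ranks candidate indexes by the absolute Pearson correlation between each index and a 0/1 fault indicator and keeps those above a cutoff. The function names, the metric names, and the 0.5 cutoff are all assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_key_indicators(columns, failed, min_abs_corr=0.5):
    """Keep indexes whose |correlation| with the fault flag meets the cutoff.

    `columns` maps an index name to its values, one per observation;
    `failed` holds 1 where a memory fault followed and 0 otherwise.
    The result is ordered from strongest to weakest correlation.
    """
    scores = {name: abs(pearson(values, failed))
              for name, values in columns.items()}
    kept = [name for name, score in scores.items() if score >= min_abs_corr]
    return sorted(kept, key=lambda name: -scores[name])
```

Correlation only captures linear, one-at-a-time relationships, so in practice it would likely be combined with other signals, such as the feature importances a trained tree ensemble reports.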
S220, determining historical memory characteristic data corresponding to the memory key characteristic indexes in the historical operation data.
According to the embodiment of the disclosure, the historical memory key index data corresponding to the memory key characteristic indexes can be screened out of the historical operation data and converted into historical memory feature vector data, and data standardization processing is then performed on the historical memory feature vector data to obtain the historical memory characteristic data. For example, assume that the obtained historical memory key index data includes memory CE values [2, 0, 1], temperatures (degrees Celsius) [30, 35, 33], and voltages (volts) [12, 12.5, 12.3]. Converting these into feature vectors may give observation 1: [2, 30, 12], observation 2: [0, 35, 12.5], and observation 3: [1, 33, 12.3]. Scaling each feature to the range 0 to 1 may then give observation 1: [1, 0, 0], observation 2: [0, 1, 1], and observation 3: [0.5, 0.6, 0.6]. In this example, the memory CE value, temperature, and voltage are all converted into feature vectors, and data standardization keeps the influence of each feature on the model balanced.
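The numbers in this example are consistent with column-wise min-max scaling, which the short check below applies to the three observations. The function name `normalize_columns` is hypothetical, and the rounding to two decimals is added only to hide floating-point noise.

```python
def normalize_columns(observations, digits=2):
    """Min-max scale every feature column of a list of observations to [0, 1]."""
    columns = list(zip(*observations))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [round((value - low) / (high - low), digits) if high != low else 0.0
         for value, low, high in zip(row, lows, highs)]
        for row in observations
    ]


history = [[2, 30, 12], [0, 35, 12.5], [1, 33, 12.3]]
normalized = normalize_columns(history)
# normalized == [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [0.5, 0.6, 0.6]]
```

For instance, the temperature of observation 3 becomes (33 - 30) / (35 - 30) = 0.6, and its voltage becomes (12.3 - 12) / (12.5 - 12) = 0.6.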
S230, training the memory failure prediction model with the historical memory characteristic data to obtain the trained memory failure prediction model.
According to the embodiment of the disclosure, the historical operation data of all servers is collected, the memory key characteristic indexes are determined through feature engineering, and the memory failure prediction model is trained on the historical memory characteristic data corresponding to those indexes, which ensures the effectiveness of the memory failure prediction model.
Optionally, prior to step S210, the embodiment of the present disclosure may further perform data cleansing on the historical operating data.
Specifically, the embodiment of the disclosure can process the missing value and the abnormal value in the historical operation data to ensure the consistency of the data.
According to the embodiment of the disclosure, the data cleaning is performed on the historical operation data, so that the validity of the memory key characteristic index determined through the characteristic engineering can be ensured, high-quality training data is provided for the memory failure prediction model, and the memory failure prediction model with high prediction accuracy is obtained.
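The text says only that missing and abnormal values are processed. A minimal sketch of one way to do this, assuming that such records are simply dropped and that plausible value ranges are known in advance, is shown below; the function name, field names, and bounds are hypothetical, and imputing values instead of dropping records would be an equally valid choice.

```python
def clean_records(records, bounds):
    """Drop records with a missing value or a value outside plausible bounds.

    `records` is a list of dicts; `bounds` maps a field name to (low, high).
    """
    cleaned = []
    for record in records:
        if any(record.get(name) is None for name in bounds):
            continue  # missing value
        if any(not (low <= record[name] <= high)
               for name, (low, high) in bounds.items()):
            continue  # abnormal value, for example a sensor glitch
        cleaned.append(record)
    return cleaned
```

For example, with bounds of `{"temperature_c": (0, 120), "voltage_v": (0, 20)}`, a record reporting a temperature of 999 or lacking a temperature entirely would be excluded before feature engineering.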
Optionally, the embodiment of the disclosure may add a corresponding fault type label to each item of historical memory characteristic data; divide the labeled historical memory characteristic data into a model training set and a model test set; input the model training set into an initial random forest model for machine learning to obtain the memory failure prediction model; input the model test set into the memory failure prediction model for performance evaluation to obtain a performance evaluation result; and obtain the memory failure prediction model whose performance evaluation result is a pass.
Specifically, in the embodiment of the disclosure, the historical memory characteristic data may be divided into a model training set and a model test set at a ratio of 7:3. The memory failure prediction model is trained on the model training set, and the trained model is evaluated on the model test set, where the evaluation result may include the prediction accuracy of the memory failure prediction model on the model test set. The embodiment of the disclosure can also evaluate the performance of the memory failure prediction model under different thresholds by means such as ROC curves and confusion matrices. Through multiple rounds of model iteration and parameter tuning, a memory failure prediction model with high prediction accuracy can be obtained.
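The 7:3 split and two of the evaluation measures mentioned above can be sketched in a few lines. This is a minimal illustration with hypothetical function names; the shuffle seed is an assumption, and ROC curves are left out because they need per-class scores rather than hard labels.

```python
import random
from collections import Counter


def split_7_3(samples, seed=0):
    """Shuffle a copy of the samples and split it 70% / 30%."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * 0.7)
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Fraction of test samples whose predicted label matches the actual one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_matrix(y_true, y_pred):
    """Count each (actual label, predicted label) pair."""
    return Counter(zip(y_true, y_pred))
```

Accuracy alone can be misleading here, because uncorrectable errors are likely to be rare compared with fault-free samples. The confusion matrix entry for actual "UCE" predicted as anything else shows how many dangerous faults the model would miss.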
The embodiment of the disclosure can periodically update and retrain the memory failure prediction model to adapt to the environmental change of the server cluster of the data center.
Optionally, the fault type tags include no fault tags, correctable error tags, and uncorrectable error tags.
According to the embodiment of the disclosure, the fault type labels can be set according to the common fault types of the memory of the target server, so that the trained memory failure prediction model can effectively identify the type of fault occurring in the memory.
Although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
Corresponding to the above method embodiment, an embodiment of the present disclosure further provides a server memory fault isolation device. The structure of the device is shown in FIG. 4 and may include: a target memory characteristic data obtaining unit 100, a memory failure type prediction result obtaining unit 200, and a fault isolation unit 300.
The target memory feature data obtaining unit 100 is configured to obtain target memory feature data of the target server based on the memory key feature index.
The memory failure type prediction result obtaining unit 200 is configured to input the target memory feature data into a pre-trained memory failure prediction model, and obtain a memory failure type prediction result output by the memory failure prediction model, where the memory failure prediction model is a classification model based on a random forest algorithm.
The fault isolation unit 300 is configured to perform fault isolation on the memory of the target server when the memory fault prediction result is an uncorrectable error.
Optionally, the target memory feature data obtaining unit 100 may be specifically configured to obtain memory key indicator data of the target server based on the memory key feature indicator; converting the memory key index data into memory feature vector data; and carrying out data standardization processing on the memory characteristic vector data to obtain target memory characteristic data.
Optionally, the server memory fault isolation device may further include a model training unit. The model training unit may include a historical operation data subunit, a memory key characteristic index determining subunit, a historical memory characteristic data determining subunit, and a memory failure prediction model obtaining subunit.
The historical operation data subunit is configured to obtain historical operation data of each server in the target server cluster.
The memory key characteristic index determining subunit is configured to perform feature engineering processing on the historical operation data to determine the memory key characteristic indexes.
The historical memory characteristic data determining subunit is configured to determine, in the historical operation data, the historical memory characteristic data corresponding to the memory key characteristic indexes.
The memory failure prediction model obtaining subunit is configured to train the memory failure prediction model with the historical memory characteristic data to obtain the trained memory failure prediction model.
Optionally, the model training unit may further include: a data cleansing subunit.
The data cleaning subunit is configured to perform data cleaning on the historical operation data before the memory key characteristic index determining subunit performs feature engineering processing on the historical operation data to determine the memory key characteristic indexes.
Optionally, the memory failure prediction model obtaining subunit may be specifically configured to add a corresponding fault type label to each item of historical memory characteristic data; divide the labeled historical memory characteristic data into a model training set and a model test set; input the model training set into an initial random forest model for machine learning to obtain the memory failure prediction model; input the model test set into the memory failure prediction model for performance evaluation to obtain a performance evaluation result; and obtain the memory failure prediction model whose performance evaluation result is a pass.
Optionally, the fault type tags include no fault tags, correctable error tags, and uncorrectable error tags.
Optionally, the memory key characteristic indicators include a memory CE value, a temperature, and a voltage.
The server memory fault isolation device provided by the present disclosure obtains target memory characteristic data of a target server based on memory key characteristic indexes; inputs the target memory characteristic data into a pre-trained memory failure prediction model to obtain a memory failure type prediction result output by the model, where the memory failure prediction model is a classification model based on a random forest algorithm; and performs fault isolation on the memory of the target server when the memory failure type prediction result is an uncorrectable error. Because the memory failure prediction model is built with a random forest algorithm, the present disclosure can accurately predict the type of fault the memory is likely to experience and isolate, in time, memory predicted to suffer an uncorrectable error. This supports predictive maintenance of server memory and helps guarantee the service continuity and data security of the server.
The specific manner in which the individual units perform the operations in relation to the apparatus of the above embodiments has been described in detail in relation to the embodiments of the method and will not be described in detail here.
The server memory fault isolation apparatus includes a processor and a memory, where the target memory feature data obtaining unit 100, the memory fault type prediction result obtaining unit 200, the fault isolation unit 300, and the like are stored as program units in the memory, and the processor executes the program units stored in the memory to implement corresponding functions.
The processor includes a kernel, and the kernel fetches the corresponding program unit from the memory. One or more kernels may be provided. By adjusting kernel parameters, the memory fault prediction model built with the random forest algorithm is used to accurately predict the fault types that the memory may experience, and memory whose predicted fault type is an uncorrectable error is isolated in time, which facilitates predictive maintenance of server memory and helps ensure service continuity and data security of the server.
Embodiments of the present disclosure provide a computer-readable storage medium having a program stored thereon, which when executed by a processor, implements the server memory fault isolation method.
The embodiment of the disclosure provides a processor for running a program, wherein the program runs to execute the server memory fault isolation method.
As shown in Fig. 5, an embodiment of the present disclosure provides an electronic device 1000. The electronic device 1000 includes at least one processor 1001, and at least one memory 1002 and a bus 1003 connected to the processor 1001, wherein the processor 1001 and the memory 1002 communicate with each other through the bus 1003, and the processor 1001 is configured to invoke the program instructions in the memory 1002 to perform the server memory fault isolation method described above. The electronic device herein may be a server, a PC, a PAD, a mobile phone, etc.
The present disclosure also provides a computer program product which, when executed on an electronic device, is adapted to execute a program initialized with the steps of the server memory fault isolation method.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus, electronic devices (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, the electronic device includes one or more processors (CPUs), memory, and a bus. The electronic device may also include input/output interfaces, network interfaces, and the like.
The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM), and/or non-volatile memory, for example read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip. Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
In the description of the present disclosure, it should be understood that orientations or positional relationships indicated by terms such as "upper", "lower", "front", "rear", "left", and "right" are based on the orientations or positional relationships shown in the drawings, are used merely for convenience and simplicity of description, and do not indicate or imply that the positions or elements referred to must have a specific orientation or be constructed and operated in a specific orientation; they therefore should not be construed as limitations of the present disclosure.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual relationship or order between such entities or actions. It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The foregoing is merely exemplary of the present disclosure and is not intended to limit the present disclosure. Various modifications and variations of this disclosure will be apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present disclosure, are intended to be included within the scope of the claims of the present disclosure.

Claims (10)

1. A server memory fault isolation method, characterized by comprising:
acquiring target memory characteristic data of a target server based on memory key characteristic indexes;
inputting the target memory characteristic data into a pre-trained memory fault prediction model to obtain a memory fault type prediction result output by the memory fault prediction model, wherein the memory fault prediction model is a classification model based on a random forest algorithm;
and performing fault isolation on the memory of the target server when the memory fault type prediction result is an uncorrectable error.
2. The method of claim 1, wherein the obtaining the target memory feature data of the target server based on the memory key feature indicator comprises:
acquiring memory key index data of a target server based on the memory key characteristic index;
converting the memory key index data into memory feature vector data;
and carrying out data standardization processing on the memory characteristic vector data to obtain target memory characteristic data.
3. The method of claim 1, wherein the training process of the memory fault prediction model comprises:
acquiring historical operation data of each server in a target server cluster;
performing feature engineering processing on the historical operation data to determine the memory key feature indexes;
in the historical operation data, determining historical memory characteristic data corresponding to the memory key characteristic index;
and training the memory fault prediction model by utilizing the historical memory characteristic data to obtain the trained memory fault prediction model.
4. The method of claim 3, wherein prior to said feature engineering the historical operating data to determine the memory key feature indicator, the method further comprises:
and cleaning the historical operation data.
5. The method of claim 3, wherein training the memory fault prediction model using the historical memory characteristic data to obtain the trained memory fault prediction model comprises:
adding corresponding fault type labels to each historical memory characteristic data;
dividing each historical memory characteristic data added with the fault type label into a model training set and a model testing set;
inputting the model training set into an initial random forest model for machine learning to obtain the memory fault prediction model;
inputting the model test set into the memory fault prediction model to perform performance evaluation, and obtaining a performance evaluation result;
and obtaining the memory fault prediction model with the passed performance evaluation result.
6. The method of claim 5, wherein the fault type tags include a no fault tag, a correctable error tag, and an uncorrectable error tag.
7. The method of any one of claims 1 to 6, wherein the memory key characteristic indexes include a memory CE value, a temperature, and a voltage.
8. A server memory fault isolation apparatus, comprising: a target memory characteristic data obtaining unit, a memory fault type prediction result obtaining unit and a fault isolation unit,
the target memory characteristic data obtaining unit is used for obtaining target memory characteristic data of the target server based on the memory key characteristic indexes;
the memory fault type prediction result obtaining unit is used for inputting the target memory characteristic data into a pre-trained memory fault prediction model to obtain a memory fault type prediction result output by the memory fault prediction model, wherein the memory fault prediction model is a classification model based on a random forest algorithm;
the fault isolation unit is used for performing fault isolation on the memory of the target server when the memory fault type prediction result is an uncorrectable error.
9. A computer-readable storage medium having a program stored thereon, wherein the program, when executed by a processor, implements the server memory fault isolation method of any of claims 1 to 7.
10. An electronic device comprising at least one processor, and at least one memory and a bus connected to the processor; wherein the processor and the memory communicate with each other through the bus; and the processor is configured to invoke program instructions in the memory to perform the server memory fault isolation method of any of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311709577.5A CN117668706A (en) 2023-12-13 2023-12-13 Method and device for isolating memory faults of server, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311709577.5A CN117668706A (en) 2023-12-13 2023-12-13 Method and device for isolating memory faults of server, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN117668706A true CN117668706A (en) 2024-03-08

Family

ID=90082390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311709577.5A Pending CN117668706A (en) 2023-12-13 2023-12-13 Method and device for isolating memory faults of server, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117668706A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117971547A (en) * 2024-03-29 2024-05-03 苏州元脑智能科技有限公司 Memory fault prediction method, device, equipment, storage medium and program product
CN117971547B (en) * 2024-03-29 2024-06-21 苏州元脑智能科技有限公司 Memory fault prediction method, device, equipment, storage medium and program product
CN118132350A (en) * 2024-04-29 2024-06-04 苏州元脑智能科技有限公司 CXL memory fault tolerance method, server system, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination