CN118055015A - Method and device for predicting node fault abnormality of super computing system - Google Patents

Method and device for predicting node fault abnormality of super computing system

Info

Publication number
CN118055015A
CN118055015A (application CN202410273909.8A)
Authority
CN
China
Prior art keywords
node
data
determining
monitoring data
monitoring
Prior art date
Legal status
Pending
Application number
CN202410273909.8A
Other languages
Chinese (zh)
Inventor
赵一宁
王小宁
肖海力
Current Assignee
Computer Network Information Center of CAS
Original Assignee
Computer Network Information Center of CAS
Priority date
Filing date
Publication date
Application filed by Computer Network Information Center of CAS filed Critical Computer Network Information Center of CAS
Priority to CN202410273909.8A priority Critical patent/CN118055015A/en
Publication of CN118055015A publication Critical patent/CN118055015A/en
Pending legal-status Critical Current

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

A super computing system node fault exception prediction method is applied to a super computing system cluster, wherein the super computing system cluster comprises a plurality of nodes, and the method comprises the following steps: acquiring operation logs and system monitoring data of each node, and performing preliminary processing on the operation logs and the monitoring data; extracting and marking the abnormal information recorded in each primarily processed running log; classifying each of the primarily processed monitoring data; determining node operation characteristics before abnormality based on the classified monitoring data and the extracted abnormality information, and determining feature vectors of the node operation characteristics; determining a prediction model based on the feature vector; and inputting the real-time operation data of each node into a prediction model, and predicting the system operation state of each node. The method can find potential abnormal precursors of each node in the super-computing system and infer the possibility of node faults in the super-computing system.

Description

Method and device for predicting node fault abnormality of super computing system
Technical Field
The invention relates to the technical field of information technology (information technology, IT), in particular to a method and a device for predicting node fault exception of a super computing system.
Background
Supercomputing is an important means of accelerating large-scale computing tasks through high-performance computer systems. It significantly advances computing tasks across scientific research and industrial fields and can greatly shorten the time consumed by complex computing tasks, thereby improving working efficiency; supercomputing is therefore of great significance to national technological development, industrial and economic construction, and the protection of people's livelihood and health. Modern supercomputing systems continue to scale up, and the number of compute nodes is also growing rapidly. In actual operation, compute nodes inevitably develop problems of different severities, such as anomalies, errors, and faults, which affect the stable operation of the system and the smooth execution of supercomputing applications. A failed node is typically restarted, either manually or automatically, which forces the computing tasks scheduled to that node to stop; all intermediate computing progress is lost, and the scheduling system has to assign other computing nodes to the task and restart the computation. Interrupting a computing task wastes computing time and, correspondingly, computing resources, and it negatively affects the user who submitted the job; when the computing task is large, the waste caused by its forced interruption can lead to heavy cost losses. In response to this problem, researchers have proposed solutions that preserve the state of jobs running on potentially faulty nodes and migrate them to other computing nodes to continue running. To achieve this, however, it is necessary to predict in advance whether a node will fail, so node fault prediction has become an important research direction. How to predict node fault anomalies is therefore a technical problem to be solved.
Disclosure of Invention
In order to solve the problems in the prior art, the embodiment of the application provides a method, a device, a computing device, a computer storage medium and a product containing a computer program for predicting node faults of a super computing system, which can discover potential abnormal precursors and infer the possibility of occurrence of the node faults in the super computing system.
In a first aspect, an embodiment of the present application provides a method for predicting failure exception of a node in a supercomputing system, which is applied to a supercomputing system cluster, where the supercomputing system cluster includes a plurality of nodes, and the method includes: acquiring operation logs and system monitoring data of each node, and performing preliminary processing on the operation logs and the monitoring data; extracting and marking the abnormal information recorded in each primarily processed running log; classifying each of the primarily processed monitoring data; determining node operation characteristics before abnormality based on the classified monitoring data and the extracted abnormality information, and determining feature vectors of the node operation characteristics; determining a prediction model based on the feature vector; and inputting the real-time operation data of each node into a prediction model, and predicting the system operation state of each node.
In some possible implementations, the preliminary processing of the running log and the monitoring data includes: and acquiring templates contained in the log information through a log classification method.
In some possible implementations, classifying each of the preliminary processed monitoring data includes: clustering the monitoring data to obtain a normal value of the monitoring data; determining an offset value of each monitoring data; the offset value is used for representing the offset degree of the monitoring data compared with the normal value of the monitoring data; based on the offset value, abnormal data included in the monitor data is determined.
In some possible implementations, determining the anomaly data included in the monitored data includes: the monitoring data conforming to the following formula is determined as the abnormal data, the formula is expressed as:
ab_N > pab_N × t_a
Where ab_N represents the average offset value of node N among the plurality of nodes, pab_N represents the average offset value within a fixed period of time before the current time, and t_a represents the coefficient of a preset first threshold, where the coefficient of the first threshold is 1.5.
In some possible implementations, determining the node operational characteristics before the anomaly occurs includes: and determining the node operation characteristics of the key monitoring indexes before the occurrence of the abnormality.
In some possible implementations, the selecting of the key monitoring indicator includes: determining an associated span window, wherein the associated span window is used for indicating a fixed time period before the occurrence time of the fault abnormality; counting the occurrence times of fault abnormality in the associated span window; determining the failure precursor probability of the index m based on the occurrence times of failure abnormality; and determining the index m as a key monitoring index in the case that the failure precursor probability is larger than a second threshold value.
In some possible implementations, the second threshold is 0.1.
In some possible implementations, determining a feature vector of a node operational feature includes: determining a vector schema according to the monitoring index data and the operation load of the node, wherein the vector schema is used for representing the node operation characteristics; the node operational characteristics are converted into operational characteristic vectors based on the vector schema.
In some possible implementations, the system monitor data is monitor data for system monitor metrics including processor, disk capacity, memory occupancy, disk read-write, network transport, and InfiniBand conditions.
In a second aspect, an embodiment of the present application provides an anomaly prediction apparatus deployed in a super computing system cluster, where the super computing system cluster includes a plurality of nodes, the apparatus includes: the acquisition module is used for acquiring the operation log and the system monitoring data of each node and carrying out preliminary processing on the operation log and the monitoring data; extracting and marking the abnormal information recorded in each primarily processed running log; classifying each of the primarily processed monitoring data; the processing module is used for determining node operation characteristics before abnormality based on the classified monitoring data and the extracted abnormality information and determining characteristic vectors of the node operation characteristics; determining a prediction model based on the feature vector; and the processing module is also used for inputting the real-time operation data of each node into the prediction model and predicting the system operation state of each node.
In a third aspect, embodiments of the present application provide a computer-readable storage medium comprising computer-readable instructions which, when read and executed by a computer, cause the computer to perform the method of any of the first aspects.
In a fourth aspect, an embodiment of the present application provides a computing device comprising a processor and a memory, wherein the memory has stored therein computer program instructions which, when executed by the processor, perform the method according to any of the first aspects.
In a fifth aspect, embodiments of the present application provide a product comprising a computer program which, when run on a processor, causes the processor to perform the method according to any of the first aspects.
The method provided by the application can find potential abnormal precursors of each node in the super computing system cluster and infer the possibility of node faults in the super computing system.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of an anomaly prediction method according to an embodiment of the present application;
FIG. 2 is a line graph of monitoring index data in a time dimension according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an abnormality prediction apparatus according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a computing device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The term "and/or" herein is an association relationship describing an associated object, and means that there may be three relationships, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. The symbol "/" herein indicates that the associated object is or is a relationship, e.g., A/B indicates A or B.
The terms "first" and "second" and the like in the description and in the claims are used for distinguishing between different objects and not for describing a particular sequential order of objects. For example, the first response message and the second response message, etc. are used to distinguish between different response messages, and are not used to describe a particular order of response messages.
In embodiments of the application, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described as "exemplary" or "such as" in the embodiments herein should not be construed as more preferred or advantageous than other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
In the description of the embodiments of the present application, unless otherwise specified, "index" means a monitoring index, and "a plurality" means two or more; for example, a plurality of processing units means two or more processing units, and a plurality of elements means two or more elements.
For the purpose of facilitating an understanding of the embodiments of the present application, reference will now be made to the following description of specific embodiments, taken in conjunction with the accompanying drawings, which are not intended to limit the embodiments of the application.
A supercomputing system typically includes a plurality of computing nodes. During system operation, a computing node will inevitably develop anomalies, including system errors and/or hangs caused by non-human factors that lead to the node going offline and/or restarting, as well as chronic and/or sudden changes in the node's state. Such anomalies generally span several levels, such as anomalies and faults. In the embodiments of the present application they are collectively referred to as fault anomalies, covering both faults and anomalies; unless otherwise noted, "anomaly" hereinafter refers to a fault anomaly. A fault refers to a system error and/or hang caused by a non-human factor that leads to the computing node going offline and/or restarting; at that point all unsaved jobs on the node are lost, causing a large waste of resources. An anomaly refers to a chronic and/or sudden state change of a computing node; such a change can cause obvious performance fluctuations, so the node can no longer work normally and efficiently, reducing the utilization efficiency of the supercomputing system's computing resources. In research on anomaly prediction based on real supercomputing system operation data, the raw data carry no explicit node-fault marks, so it is not known which computing nodes experienced fault anomalies at which time points, and a supervised classification method therefore cannot be applied directly. At the same time, the raw data are of many kinds, with diverse characteristics and idiosyncratic content, so the data features cannot be used for prediction as they are.
In view of the above, an embodiment of the present application provides an anomaly prediction method based on machine learning: early signs related to computing-node anomalies are processed through data analysis, node fault anomaly information is marked and the operation features related to it are found, and early warnings are issued for precursor events during actual operation, thereby implementing anomaly prediction.
Exemplarily, fig. 1 shows a flow chart of the method for predicting node fault anomalies of a supercomputing system provided by an embodiment of the present application; the method may be used to predict when an anomaly will occur on a node in a supercomputing system cluster. As shown in fig. 1, the anomaly prediction method may include the following steps:
S11: and acquiring the operation log and system monitoring data of each node in the super computing system cluster, and performing preliminary processing on the operation log and the monitoring data.
In this embodiment, to explore possible precursor events of node fault anomalies, the running logs of each node in the supercomputing system cluster and the real-time system monitoring data may first be acquired, so that potential correlations between the occurrence of anomalies and certain parameters of system operation can be found through data analysis, where the parameters may include the system monitoring data. The system monitoring data are produced by the system monitoring tool on each node of the target supercomputing system, which scans the system indicator status in real time and is configured to take a sampling record at a fixed interval (for example, every half minute or every minute); the system monitoring data may therefore also be called system monitoring indicator data. The system monitoring data are numeric data whose content can be read directly by a computing device, but the data need to be indexed and read along different dimensions such as time and indicator.
In some possible embodiments, the metrics (i.e., monitored objects) monitored by the system may include 59 metrics from six classes, including processor, disk capacity, memory usage, disk read and write, network transport, and InfiniBand conditions.
In the embodiment of the present application, the anomaly prediction method is illustrated using a large-scale advanced supercomputing system of the Chinese Academy of Sciences as an example. In an implementation, logs of the actual operating data of a portion of the nodes in the cluster may be obtained. The system employs Slurm as its cluster management and task scheduling system. The data mainly comprise the following categories: the Syslog system log of the Linux operating system of each node, the master service work log of the cluster management and scheduling system Slurm, the work log of OpenSM, the network manager of the cluster's InfiniBand, and the system operation monitoring indicator data of each node. The Syslog log mainly contains important information such as system behaviour and network access; the Slurm log mainly contains information such as the communication between the scheduling master node and each computing node, node availability, and the allocation and completion of supercomputing jobs; the OpenSM log contains information such as network communication conditions, route changes, and node description and state changes; and the system monitoring indicators refer to data obtained by periodically monitoring and recording indicators such as processor and memory occupancy, disk reads and writes, and network communication in the system.
After the logs are obtained, they require preliminary processing. Logs are text data: their content can be read by a person but is not convenient for a machine (computing device) to process, so the log information must be converted into machine-processable information. The huge volume of data makes timely manual processing impossible, so the processing flow needs to be streamlined by machine. In the embodiment of the present application, a log classification method may be used to obtain the templates contained in the log information, and further statistical analysis may then be performed on the templates, such as extracting the variable information in log events and counting the number of occurrences of each log event within a certain period of time.
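As an illustration of this preliminary processing step, the following is a minimal sketch of template extraction by masking variable fields in log lines. The masking rules, function names, and sample lines are illustrative assumptions and do not reproduce the specific log classification method of this embodiment.

```python
import re
from collections import Counter

def to_template(line: str) -> str:
    """Reduce a raw log line to a template by masking variable fields (illustrative rules)."""
    line = re.sub(r"\b[0-9a-fA-F]{8,}\b", "<HEX>", line)   # long hexadecimal identifiers
    line = re.sub(r"\b[a-z]+\d+\b", "<NODE>", line)        # node names such as cn0123 (assumed pattern)
    line = re.sub(r"\b\d+(\.\d+)*\b", "<NUM>", line)       # plain numbers and version strings
    return line.strip()

def count_templates(lines):
    """Count how many times each template occurs in a batch of log lines."""
    return Counter(to_template(l) for l in lines)

# Tiny usage example with made-up log lines
sample = [
    "error: Nodes cn0123 not responding",
    "error: Nodes cn0456 not responding",
    "Node cn0123 now responding",
]
print(count_templates(sample).most_common())
```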
For example, referring to table 1, table 1 shows an example of log template statistics after sorting logs.
TABLE 1
Log source    Template number
Syslog        812
Slurm         197
OpenSM        30
S12: and extracting and marking the abnormal information recorded in each preprocessed running log, and classifying the monitoring data.
In this embodiment, in order to obtain the premonitory event associated with the fault exception, it is necessary to find the representation of the fault exception itself, and extract and mark the fault exception information, so that the fault exception information can be found conveniently later.
In determining the anomaly information in the processed running logs, note that fault information is usually recorded or reflected in the log data; for example, a fault usually causes a node to be shut down, restarted, and so on. All extracted log templates are examined for information such as node shutdown and node restart. In the Slurm log, node-not-responding information can be seen, with a corresponding template of the form "error: Nodes <nodeID> not responding". For some nodes a recovery event follows the non-response event, with a corresponding template of the form "Node <nodeID> now responding", indicating that the unresponsive node only failed to respond temporarily or had a communication failure and has returned to normal. The remaining nodes have a non-response event but no subsequent recovery event, which is also reflected in the system Syslog and represents that the node failed once (i.e., an anomaly occurred). When a node-unresponsive event occurs in the Slurm log and the node never responds again, the node's Syslog mostly also contains a node restart event at an adjacent time (a series of system start-up operations, including service initialization and the launch of various basic processes), or no new log entries appear from that moment for a long time until a restart event occurs (representing that the node entered a down state but was not restarted immediately). In such cases the event can be determined to be a node fault, and the time point at which the fault occurred is recorded.
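To make the marking rule above concrete, here is a minimal sketch of how fault time points could be extracted from already-parsed Slurm events and Syslog restart times. The event representation, field names, and the follow-up window are assumptions for illustration, and only the restart branch of the rule is sketched (the long-silence-then-restart case is omitted).

```python
def mark_fault_times(slurm_events, restart_times, follow_window):
    """Mark node fault time points from parsed events (sketch of the rule described above).

    slurm_events: list of (time, node, kind), kind in {"not_responding", "responding_again"}.
    restart_times: {node: list of Syslog restart times}.
    follow_window: how long after a non-response event to look for recovery or restart.
    """
    faults = []
    for t, node, kind in slurm_events:
        if kind != "not_responding":
            continue
        recovered = any(
            n2 == node and k2 == "responding_again" and t < t2 <= t + follow_window
            for t2, n2, k2 in slurm_events
        )
        restarted = any(t <= r <= t + follow_window for r in restart_times.get(node, []))
        if not recovered and restarted:
            faults.append((node, t))  # record the time point of this node fault
    return faults
```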
For determining the anomaly information in the monitoring data, the anomaly information is typically reflected in the monitoring values themselves. A monitoring indicator may be in a normal state or an abnormal state (i.e., an anomaly has occurred), so the monitored value of each indicator needs to be classified in order to determine the anomaly information.
In some possible embodiments, classifying the monitored values of the metrics may employ a method that is similar to a density-based clustering algorithm. Specifically, determining the anomaly information may include the steps of:
s121: and clustering the monitoring values, wherein the categories comprise a normal category, an abnormal category and other categories.
In this embodiment, the set of monitoring values of monitoring indicator m on the nth node may be obtained; a monitoring value is denoted v, giving the monitoring value set V_m = {v_1, v_2, ..., v_n}. The maximum and minimum values are found, denoted v_max and v_min respectively. A cluster density radius is set and denoted r. An empty cluster set CS is established; each element of CS is a subset C_j of monitoring values, so that CS = {C_1, C_2, ...}. Each element v_i of the monitoring value set V_m is scanned and compared with the maximum value max(C_j) and minimum value min(C_j) of each element C_j in the cluster set CS; when the monitoring value v_i lies within the radius r of the value range of C_j, the monitoring value v_i is added to the element C_j. If the monitoring value v_i cannot join any element C_j, a new cluster C_n is established for the monitoring value v_i and C_n is added to the cluster set CS. After clustering is completed for all monitoring values v_i, the number of monitoring values |C_j| and the average avg(C_j) of each cluster are calculated.
After the monitoring values have been clustered, consider all clusters in the cluster set CS. Since the normal operating state of a node is by far the most common, the class containing the largest number of monitoring values is selected and defined as the normal class, denoted C_norm, and the average value avg(C_norm) of the normal class is defined as the normal value of the monitoring indicator m of this node (the nth node), denoted v_norm. Any cluster C_j other than the normal class whose number of monitoring values |C_j| satisfies a preset size condition is defined as an other class, and the clusters outside the normal class and the other classes are defined as abnormal classes. In the data set collected in the embodiment of the present application, the number of monitoring values in the normal class exceeds 50% of all values of the corresponding indicator for the monitoring indicators of every node, and for the indicator with the highest proportion the normal class exceeds 99% of all values of that indicator.
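The clustering procedure of S121 can be illustrated with a short sketch. The membership test follows the description above (a value joins a cluster when it lies within the density radius of that cluster's existing value range); the concrete radius used in the example is an assumption, since the embodiment's exact radius formula is not reproduced here.

```python
def cluster_monitoring_values(values, radius):
    """One-dimensional density-style clustering as described in S121 (sketch).

    A value joins an existing cluster if it lies within `radius` of that cluster's
    value range; otherwise a new cluster is opened for it.
    """
    clusters = []  # each cluster is a list of monitoring values
    for v in values:
        for c in clusters:
            if min(c) - radius <= v <= max(c) + radius:
                c.append(v)
                break
        else:
            clusters.append([v])
    return clusters

def normal_value(values, radius):
    """Take the largest cluster as the normal class and return its average as v_norm."""
    clusters = cluster_monitoring_values(values, radius)
    c_norm = max(clusters, key=len)
    return sum(c_norm) / len(c_norm)

# Example with an assumed radius (a fraction of the observed value range)
vals = [0.30, 0.31, 0.29, 0.33, 0.95, 0.32, 0.30]
r = (max(vals) - min(vals)) * 0.1
print(normal_value(vals, r))
```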
S122: an offset value for each of the monitored data is determined.
In this embodiment, although the clustering method can tell whether an indicator of a node is abnormal at a certain moment, a single abnormal indicator cannot represent whether the overall state of the node is abnormal; an overall judgment over all indicators is needed. To measure the abnormality of all indicators, an offset value, denoted b, is introduced in the embodiment of the present application. The offset value is the degree to which each indicator's monitoring value deviates from its normal value. For the monitoring indicator m of node N with normal value v_norm, the maximum difference between the monitoring values and the normal value v_norm is found, denoted d_max = max(abs(v_i − v_norm)); then, for any time point x in the monitoring value set, the offset value of the monitoring value v_x is expressed as b_x = abs(v_x − v_norm) / d_max.
S123: based on the offset value, anomaly data is determined.
In this embodiment, after the offset values of the monitoring values have been calculated, the average offset value of the key monitoring indicators can be used to represent the state and stability of the supercomputing system at the monitoring level. The average offset value of a node N (the nth node) in the supercomputing system at time point x is ab_N = (1/n) × Σ_{i=1..n} b_{m_i}(v_x), where n represents the number of key monitoring indicators, b_{m_i}(v_x) is the offset value of monitoring indicator m_i, and v_x is the value of monitoring indicator m_i at time point x. Based on the preset coefficient t_a of the first threshold, the average offset value pab_N over a period of time before the current time (for example, 3 h or 6 h) is determined; when the current average offset value of node N satisfies ab_N > pab_N × t_a, the data at this time are marked as abnormal data. A large number of experiments and calculations show that the method of this embodiment performs better when the coefficient t_a of the first threshold is 1.5, so t_a is taken as 1.5.
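The offset-value and anomaly-marking steps (S122 and S123) translate directly into a few lines of code. The sketch below follows the formulas above (b_x = abs(v_x − v_norm) / d_max and ab_N > pab_N × t_a); the function names are illustrative.

```python
def offset_values(values, v_norm):
    """Offset value b_x = |v_x - v_norm| / d_max for every monitoring value (S122)."""
    d_max = max(abs(v - v_norm) for v in values)
    if d_max == 0:
        return [0.0] * len(values)
    return [abs(v - v_norm) / d_max for v in values]

def is_abnormal_point(current_offsets, past_avg_offset, t_a=1.5):
    """Mark the current time point of a node as abnormal when ab_N > pab_N * t_a (S123).

    current_offsets: offset value of each key indicator at the current time point.
    past_avg_offset: average offset value pab_N over the preceding fixed period.
    """
    ab_n = sum(current_offsets) / len(current_offsets)
    return ab_n > past_avg_offset * t_a
```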
S13: and determining the node operation characteristics before the abnormality occurs.
In this embodiment, because the original operating data of the supercomputing system are of many kinds and enormous in volume, the part of the data that is associated with fault anomalies needs to be selected to represent the operation features. In the embodiment of the present application, the relevant operation features are extracted from the obtained supercomputing system operation data. The Syslog system log contains node restart information, which proves that node errors did occur, but no anomaly information is found before the faults occur; the OpenSM network log likewise shows no direct association with node faults; node error events can be found in the Slurm scheduling log, but it also lacks information foreshadowing the errors. The node monitoring indicator data contain a large amount of anomaly information, part of which shows an obvious correlation with node errors and needs further specific analysis. Referring to fig. 2, fig. 2 is an example diagram in which a monitoring indicator value is plotted as a line graph along the time dimension; as shown in fig. 2, the horizontal axis represents time and the vertical axis represents the monitoring value. The figure also marks the time point (upper horizontal axis) at which the node failed. In the graph, the disk average service time indicator rises sharply and becomes abnormal before the node fails. It is evident from the graph that the monitoring indicator data fluctuate significantly before the node fails. Therefore, monitoring indicators whose data fluctuate significantly in this way may be strongly correlated with the occurrence of node fault anomalies. Indicators of this type are taken as key monitoring indicators, and the key monitoring indicators are taken as the objects selected for the operation features.
In some possible embodiments, the selection object of the operation feature may further include a node workload in addition to the monitoring index. The node workload characterizes the number of jobs running simultaneously on a node in the supercomputer system at a time.
In this embodiment, job information may be obtained from the Slurm scheduling log. Specifically, the log templates related to job scheduling are acquired, and the position of the variable "job number" (i.e., the ID of the job) contained in them is obtained. After the job numbers are obtained, the Slurm scheduling log is scanned and the life cycle of each job is counted by job number. For the life cycle, every log entry containing the job number is counted; the time of the first log entry (job start time) and of the last log entry (job end time) are recorded, together with the node allocation of the job. The workload at a queried time point is then counted according to whether the life cycles of all jobs on the target node cover that time point. For example, for a job with job number "1", each line of the Slurm log is scanned and the lines in which job number "1" appears are recorded; if there are 4 such results, they are arranged in chronological order, the line in which job number "1" first appears is taken as the start time of the job, and the line in which it last appears is taken as the end time. The life cycle of the job is the period from its start time to its end time.
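A sketch of the job life-cycle counting could look as follows. The "JobId=" pattern and the (timestamp, text) record format are assumptions about how the Slurm log has been parsed, not formats specified in this embodiment.

```python
import re

JOB_RE = re.compile(r"JobId=(\d+)")  # assumed form of the "job number" variable in the log

def job_lifecycles(slurm_lines):
    """First/last time at which each job number appears = job start/end times (sketch).

    slurm_lines: iterable of (timestamp, text) pairs parsed from the Slurm log.
    """
    spans = {}
    for ts, text in slurm_lines:
        m = JOB_RE.search(text)
        if not m:
            continue
        job = m.group(1)
        start, end = spans.get(job, (ts, ts))
        spans[job] = (min(start, ts), max(end, ts))
    return spans

def workload_at(spans, node_jobs, node, t):
    """Number of jobs allocated to `node` whose life cycle covers time point t."""
    return sum(1 for job in node_jobs.get(node, []) if spans[job][0] <= t <= spans[job][1])
```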
In some possible embodiments, the selecting of the key monitoring indicator may include:
s131: and determining an associated span window, wherein the associated span window is used for indicating a fixed time before the occurrence time of the fault abnormality.
In this embodiment, in order to determine which monitoring indicators can be used to implement the association prediction of node failures, the duration of the association span window is first set, i.e., how long before the failure an indicator anomaly can be considered to be associated with the failure anomaly. In the embodiment of the present application, the association span window may be set to 90min.
S132: and counting the abnormal times of faults in the associated span window.
In this embodiment, the number of times that each index has been abnormal in the association window before all nodes fail is counted. With the fault anomaly noted as F, the fault anomaly set may be noted as f= { F 1,f2,...,fn }. The occurrence time of the fault abnormality is marked as t, and the duration of the associated span window is marked as deltat. Then, when the index m fails to be abnormal within the associated span window Δt (i.e., within the period of t- Δt to t), the anomaly count value C m is incremented by 1.
S133: based on the number of fault anomalies, a fault precursor probability is determined.
In this embodiment, after the counts have been accumulated over all fault anomalies in the fault anomaly set, the fault precursor probability of indicator m is obtained as P_m = C_m / |F|, where |F| is the number of fault anomalies in the set F.
S134: and under the condition that the failure precursor probability is larger than a set threshold value, determining the index m as a key monitoring index.
In the present embodiment, a second threshold t f is set, and if the failure pre-symptom probability P m is greater than the set second threshold t f, the monitor index m is determined as a key monitor index.
With this key-indicator selection method, 27 of the 59 indicators in the six categories can be determined as key monitoring indicators. For example, referring to Table 2, Table 2 shows the specific indicator names included in the different indicator categories, where the disk category covers both the disk capacity and the disk read/write categories.
TABLE 2
In some possible embodiments, the default value of the second threshold is t_f = 0.1.
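The key-indicator selection of S131 to S134 can be summarised in a short sketch. The data structures and names are illustrative assumptions, while the counting rule and thresholds (an association span window of 90 min and a default second threshold t_f = 0.1) follow the description above.

```python
def select_key_indicators(fault_times, anomaly_times_by_indicator, window, t_f=0.1):
    """Select key monitoring indicators by fault precursor probability (S131-S134, sketch).

    fault_times: time points of all fault anomalies (the set F).
    anomaly_times_by_indicator: {indicator m: times at which m was abnormal}.
    window: association span window delta-t (e.g. 90 minutes, in the same time unit).
    """
    if not fault_times:
        return []
    selected = []
    for m, anomaly_times in anomaly_times_by_indicator.items():
        c_m = sum(
            1 for t in fault_times
            if any(t - window <= a <= t for a in anomaly_times)
        )
        p_m = c_m / len(fault_times)  # fault precursor probability P_m = C_m / |F|
        if p_m > t_f:                 # second threshold, default 0.1
            selected.append(m)
    return selected
```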
S14: based on the node operational characteristics, a feature vector of the node operational characteristics is determined.
In this embodiment, the operation feature data are converted into a multidimensional vector representation, where each vector may represent the node state at one monitoring record time point. Specifically, a vector schema is established from the key monitoring indicator data and the workload contained in the operation feature selection result. The vector schema is used to characterize the node operation features; it is the schema of the operation feature vectors, i.e., it fixes in a unified format the meaning of the value at each position. By way of example, Table 3 shows a specific form of the schema.
TABLE 3
In the embodiment of the present application, an instantaneous anomaly is a common situation in a supercomputing system: the computing device can quickly repair itself and return to a normal working state. If an anomaly lasts longer, however, the likelihood that it affects node operation increases significantly. Therefore, for each piece of monitoring indicator data, the operation state of the indicator over a certain period of time needs to be expressed in addition to the current monitoring value. Accordingly, four features are added for each key monitoring indicator: the offset value of the current monitoring value; the proportion of abnormal time points among the previous 120 time points (i.e., 1 hour), computed as the number of abnormal occurrences divided by 120; the average offset value over those abnormal time points; and the duration of the current anomaly (cleared to zero if the current value is not abnormal). In addition, the overall anomaly situation of the monitoring values is added to the schema, comprising two features: the proportion of currently abnormal indicators among the 27 key monitoring indicators, and the average offset value of the abnormal indicators. For node workload, two features are added: the current node workload and the average load over the past 120 time points.
After the vector schema is determined, the operation data can be converted into operation feature vectors based on the schema, with each operation feature vector corresponding to an anomaly marking result. For anomaly marking with respect to pre-fault precursor information, a time point is marked 1 if it lies within the pre-fault association window duration (Δt) and an abnormal indicator exists at that time point. In addition, any time point whose average offset value exceeds the product of the past average offset value and the threshold coefficient (i.e., abnormal data) is also marked 1. All other cases are marked 0. The data corresponding to time points marked 1 constitute the anomaly marking results, and the operation feature vectors can be put in one-to-one correspondence with the marking results through their time points.
In some possible embodiments, converting the operational data into operational feature vectors may include the steps of:
S141: for the operation data of any node N in the super computing system, a past offset value record table is established, and a past node operation load record table is established.
In this embodiment, for any node N in the supercomputing system, a past offset value record table with a capacity of 120×27 may be established for recording offset values of the nodes. Meanwhile, a past node workload record table with the capacity of 120 is established and is used for recording the node workload.
S142: and sequentially scanning key index monitoring values of each monitoring data generation time point, and calculating all required operation characteristic values in the schema.
In this embodiment, the past data may be obtained according to statistics of the data in the record table in S142. And simultaneously, inputting the ID of the time point and the ID of the node into the row vector to finish the conversion of the operation feature vector. For each key monitoring index, if the current offset value is an abnormal value, adding the current offset value to the tail of the corresponding row of the past offset value record table, and if the current offset value does not belong to the abnormal value, adding 0 (representing normal) to the tail of the record table. If the number of record tables exceeds the capacity, the data in the earliest joined table is removed.
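Putting S141 and S142 together, a rolling implementation of the record tables and of the per-time-point feature vector might look like the sketch below. The class name and exact field order are assumptions and only approximate the schema of Table 3 (for brevity the current-anomaly-duration feature is omitted).

```python
from collections import deque

WINDOW = 120  # number of past time points kept per indicator (one hour of records, per the schema)

class NodeFeatureBuilder:
    """Rolling record tables for one node: offset history per key indicator plus workload history."""

    def __init__(self, n_indicators):
        self.offset_hist = [deque(maxlen=WINDOW) for _ in range(n_indicators)]
        self.load_hist = deque(maxlen=WINDOW)

    def build_vector(self, node_id, t, offsets, abnormal_flags, workload):
        """Convert one monitoring time point into an operation feature vector."""
        vec = [t, node_id]
        for i, (b, is_ab) in enumerate(zip(offsets, abnormal_flags)):
            hist = self.offset_hist[i]
            abnormal = [x for x in hist if x > 0]
            vec += [
                b,                                                   # current offset value
                len(abnormal) / WINDOW,                              # abnormal-point proportion in the window
                sum(abnormal) / len(abnormal) if abnormal else 0.0,  # average offset of abnormal points
            ]
            hist.append(b if is_ab else 0.0)                         # 0 represents "normal" in the record table
        avg_load = sum(self.load_hist) / len(self.load_hist) if self.load_hist else 0.0
        vec += [workload, avg_load]                                  # current and average node workload
        self.load_hist.append(workload)
        return vec
```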
S15: based on the feature vectors, a predictive model is trained.
In this embodiment, several machine learning methods are used to build fault and anomaly prediction models that classify normal and abnormal states. The operation feature vectors are used as the model input for training and the anomaly marks as the model output, and the results are compared so that the model with the best effect among the different models is selected as the prediction model.
S16: and inputting real-time operation data of each node operation of the super computing system into a prediction model to predict the operation state of the system.
In this embodiment, after the training of the prediction model is completed, the obtained real-time operation data of the operation of the supercomputer system may be input into the prediction model, so that the situation that the supercomputer system will have fault abnormality may be predicted.
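Since the embodiment does not name the specific machine learning methods it compares, the following sketch uses two common scikit-learn classifiers as stand-ins for S15 and S16: several candidate models are trained on (operation feature vector, anomaly mark) pairs, the one with the best cross-validated score is kept as the prediction model, and real-time feature vectors are then fed to it. All names here are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def select_prediction_model(X, y):
    """Train several classifiers on feature vectors X and anomaly marks y, keep the best (S15, sketch)."""
    candidates = {
        "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    scores = {name: cross_val_score(model, X, y, cv=5, scoring="f1").mean()
              for name, model in candidates.items()}
    best = max(scores, key=scores.get)
    return candidates[best].fit(X, y), scores

# S16: feed real-time operation feature vectors into the selected model
# model, _ = select_prediction_model(X_train, y_train)
# risk = model.predict(realtime_vectors)  # 1 = fault anomaly expected, 0 = normal
```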
The above describes the anomaly prediction method provided by the embodiment of the present application: the logs and monitoring data generated by the operation of each node in the system are processed to obtain the anomaly data therein, the node operation features before an anomaly occurs are determined, the feature vectors of those operation features are determined, and a prediction model is trained on the feature vectors to obtain a trained prediction model. Feeding real-time node operation data into the prediction model then allows the system operation state to be predicted and the likelihood of node failure in the supercomputing system to be inferred.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not mean the order of execution, and the execution order of the processes should be determined by the functions and the internal logic, and should not limit the implementation process of the embodiments of the present application. In addition, in some possible implementations, each step in the foregoing embodiments may be selectively performed according to practical situations, and may be partially performed or may be performed entirely, which is not limited herein. All or part of any features of any embodiment of the application may be freely combined without contradiction. The combined technical scheme is also within the scope of the application.
Based on the method in the above embodiment, the embodiment of the present application further provides an anomaly prediction device.
Illustratively, FIG. 3 shows an apparatus for exception prediction. As shown in fig. 3, the apparatus 300 for exception prediction may include: an acquisition module 301 and a processing module 302. The acquiring module 301 is configured to acquire an operation log and system monitoring data of each node, and perform preliminary processing on the operation log and the monitoring data; extracting and marking the abnormal information recorded in each primarily processed running log; and classifying each piece of primarily processed monitoring data. The processing module 302 is configured to determine node operation features before an abnormality occurs based on the classified monitoring data and the extracted abnormality information, and determine feature vectors of the node operation features; based on the feature vectors, a predictive model is determined. The processing module 302 is further configured to input real-time operation data of each node into a prediction model, and predict a system operation state of each node.
It should be understood that, the foregoing apparatus is used to perform the method in the foregoing embodiment, and corresponding program modules in the apparatus implement principles and technical effects similar to those described in the foregoing method, and reference may be made to corresponding processes in the foregoing method for the working process of the apparatus, which are not repeated herein.
The present application also provides a computing device 400. As shown in fig. 4, the computing device 400 includes: bus 402, processor 404, memory 406, and communication interface 408. Communication between processor 404, memory 406, and communication interface 408 is via bus 402. Computing device 400 may be a server or a terminal device. It should be understood that the present application is not limited to the number of processors, memories in computing device 400.
Bus 402 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Buses may be divided into address buses, data buses, control buses, and so on. For ease of illustration, only one line is shown in fig. 4, but this does not mean there is only one bus or only one type of bus. Bus 402 may include a path for transferring information between the various components of computing device 400 (e.g., memory 406, processor 404, communication interface 408).
The processor 404 may include any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
Memory 406 may include volatile memory, such as random access memory (RAM). Memory 406 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a mechanical hard disk drive (HDD), or a solid state disk (SSD).
The memory 406 stores executable program code that the processor 404 executes to implement the functions of the aforementioned acquisition module 301 and processing module 302, respectively, to implement all or part of the steps of the methods in the aforementioned embodiments. That is, the memory 406 has instructions stored thereon for performing all or part of the steps of the methods of the embodiments described above.
Or the memory 406 may store executable code that is executed by the processor 404 to implement the functions of the apparatus 300 for anomaly prediction as described above, respectively, to implement all or a portion of the steps in the methods of the embodiments described above. That is, the memory 406 has instructions stored thereon for performing all or part of the steps of the methods of the embodiments described above.
Communication interface 408 enables communication between computing device 400 and other devices or communication networks using a transceiver module such as, but not limited to, a network interface card, transceiver, or the like.
Based on the methods in the above embodiments, the embodiments of the present application provide a computer-readable storage medium storing a computer program that, when executed on a processor, causes the processor to perform the methods in the above embodiments.
Based on the methods in the above embodiments, embodiments of the present application provide a computer program product that, when run on a processor, causes the processor to perform the methods in the above embodiments.
It is to be appreciated that the processor in embodiments of the application may be a central processing unit (CPU), but may also be another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), or another programmable logic device, transistor logic device, hardware component, or any combination thereof. The general purpose processor may be a microprocessor, but in the alternative, it may be any conventional processor.
The method steps in the embodiments of the present application may be implemented by hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present application, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in or transmitted across a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital Subscriber Line (DSL)), or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid State Drive (SSD)), etc.
It will be appreciated that the various numerical numbers referred to in the embodiments of the present application are merely for ease of description and are not intended to limit the scope of the embodiments of the present application.

Claims (10)

1. A method for predicting failure anomalies of nodes of a super computing system, which is applied to a super computing system cluster, wherein the super computing system cluster comprises a plurality of nodes, the method comprising:
Acquiring operation logs and system monitoring data of each node, and performing preliminary processing on the operation logs and the monitoring data; extracting and marking the abnormal information recorded in each primarily processed running log; classifying each of the primarily processed monitoring data;
determining node operation characteristics before abnormality based on the classified monitoring data and the extracted abnormality information, and determining feature vectors of the node operation characteristics; determining a predictive model based on the feature vector;
And inputting the real-time operation data of each node into the prediction model, and predicting the system operation state of each node.
2. The method of claim 1, wherein the preliminary processing of the running log and the monitoring data comprises:
And acquiring templates contained in the log information through a log classification method.
3. The method of claim 1, wherein classifying each of the preliminary processed monitoring data comprises:
Clustering the monitoring data to obtain a normal value of the monitoring data;
determining an offset value of each monitoring data; the offset value is used for representing the offset degree of the monitoring data compared with the normal value of the monitoring data;
based on the offset value, abnormal data included in the monitor data is determined.
4. A method according to claim 3, wherein said determining anomaly data included in the monitored data comprises:
determining the monitoring data conforming to the following formula as abnormal data, wherein the formula is expressed as:
ab_N > pab_N × t_a
Where ab_N represents the average offset value of node N among the plurality of nodes, pab_N represents the average offset value within a fixed period of time before the current time, and t_a represents the coefficient of a preset first threshold, where the coefficient of the first threshold is 1.5.
5. The method of claim 1, wherein determining the node operational characteristics before the anomaly occurred comprises:
and determining the node operation characteristics of the key monitoring indexes before the occurrence of the abnormality.
6. The method of claim 5, wherein the selecting of the key monitoring indicator comprises:
determining an associated span window, wherein the associated span window is used for indicating a fixed time period before the occurrence time of the fault abnormality;
counting the occurrence times of fault anomalies in the association span window;
Determining the failure precursor probability of the index m based on the occurrence times of failure abnormality;
And determining the index m as a key monitoring index under the condition that the failure precursor probability is larger than a second threshold value.
7. The method of claim 6, wherein the second threshold is 0.1.
8. The method of claim 1, wherein said determining a feature vector for the node operational feature comprises:
determining a vector schema according to the monitoring index data and the operation load of the node, wherein the vector schema is used for representing the node operation characteristics;
the node operational characteristics are converted into operational characteristic vectors based on the vector schema.
9. The method of claim 1, wherein the system monitor data is monitor data for system monitor metrics including processor, disk capacity, memory usage, disk read and write, network transmission, and InfiniBand conditions.
10. A device for predicting failure anomalies of nodes of a super computing system, the device being deployed in a super computing system cluster, the super computing system cluster comprising a plurality of nodes, the device comprising:
The acquisition module is used for acquiring the operation log and the system monitoring data of each node and carrying out preliminary processing on the operation log and the monitoring data; extracting and marking the abnormal information recorded in each primarily processed running log; classifying each of the primarily processed monitoring data;
The processing module is used for determining node operation characteristics before abnormality based on the classified monitoring data and the extracted abnormality information and determining characteristic vectors of the node operation characteristics; determining a predictive model based on the feature vector;
The processing module is also used for inputting the real-time operation data of each node into the prediction model and predicting the system operation state of each node.
CN202410273909.8A 2024-03-11 2024-03-11 Method and device for predicting node fault abnormality of super computing system Pending CN118055015A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410273909.8A CN118055015A (en) 2024-03-11 2024-03-11 Method and device for predicting node fault abnormality of super computing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410273909.8A CN118055015A (en) 2024-03-11 2024-03-11 Method and device for predicting node fault abnormality of super computing system

Publications (1)

Publication Number Publication Date
CN118055015A true CN118055015A (en) 2024-05-17

Family

ID=91044705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410273909.8A Pending CN118055015A (en) 2024-03-11 2024-03-11 Method and device for predicting node fault abnormality of super computing system

Country Status (1)

Country Link
CN (1) CN118055015A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination