CN109101395A

CN109101395A - LSTM-based method and system for monitoring high-performance computing cluster applications

Info

Publication number: CN109101395A
Application number: CN201810841868.2A
Authority: CN
Inventors: 胡辰
Original assignee: Dawning Information Industry Beijing Co Ltd
Current assignee: Dawning Information Industry Beijing Co Ltd
Priority date: 2018-07-27
Filing date: 2018-07-27
Publication date: 2018-12-28

Abstract

The present invention provides an LSTM-based method and system for monitoring high-performance computing cluster applications. The method comprises: acquiring data from each compute node; applying threshold preprocessing and normalization to the data; training a deep LSTM network on the threshold-preprocessed and normalized data; and inputting threshold-preprocessed and normalized data into the deep LSTM network to predict anomalies in the high-performance computing cluster application. The invention provides auxiliary monitoring of running high-performance computing cluster applications, improves the accuracy of judging their running state, and markedly improves the controllability and stability of application runs.

Description

An LSTM-based method and system for monitoring high-performance computing cluster applications

Technical field

The present invention relates to the field of high-performance computing, and in particular to an LSTM-based method and system for monitoring high-performance computing cluster applications.

Background art

High-performance computing is a branch of computer science concerned with research on parallel algorithms, the development of related software, and the development of high-performance computers. As science and technology advance, high-performance computing has reached into many areas of scientific research, the national economy, and everyday life, and its role and importance have become increasingly evident. A high-performance computing cluster links multiple computer systems together through various interconnect technologies to raise the computing speed of the whole system, and can reach a floating-point capability of trillions of operations per second or higher. Such clusters place very high demands on the system's processors, memory bandwidth, storage, and system I/O, and are widely used in fields such as weather forecasting, molecular simulation, fluid simulation, gene sequencing, biopharmaceuticals, and deep learning.

There are many high-performance computing applications, but most share two characteristics: they place high demands on the performance of the computing system, and they run for a long time. Numerical weather prediction, for example, still relies mainly on high-performance computers; it needs substantial computing power and runs for tens of minutes at the short end and several hours or longer at the long end. Molecular dynamics simulation is another example: besides placing high demands on the high-performance network, a single job can run for several days. During such long runs, a high-performance computing application sometimes behaves abnormally for one reason or another, for example by exiting abnormally or by slowing down. Early in a run, users can easily judge from their own experience whether the program is running normally. Once the program has started normally, however, users usually do not keep watching its state, so a problem that arises mid-run is hard to notice in time. It is typically discovered only when the user happens to check progress, or comes back for the results at the time a run has usually finished in the past. By then a long time may have passed since the problem occurred, which severely delays the work. Effective monitoring of problems during high-performance computing application runs, and timely detection of abnormal conditions, can therefore avoid the impact of problems being discovered late.

In the prior art, abnormal exits are relatively easy to monitor: an abnormal exit can be detected by receiving the program's abnormal exit code. When the program exits abnormally, the job that carries it also exits, so the user can obtain the job log promptly and find the error information. The exit of the job also releases the computing resources it was using, so no computing resources are wasted.

Abnormal slowdown, by contrast, can be caused by many different factors, such as increased network latency due to network congestion, reduced memory bandwidth available to the application because other programs compete for it, or reduced computing capability due to excessive CPU temperature. The usual way to handle these common factors is to set thresholds: the program is considered to be running normally while a metric stays within its threshold and abnormally once the metric falls outside it. Thresholds can usually be set for several different types of metric.

However, judging whether a program is abnormal by means of thresholds has the following problems:

(1) Setting thresholds requires considerable domain expertise. If the threshold range is too wide, abnormally running programs are easily missed; if it is too narrow, normally running programs may be misjudged as abnormal.

(2) An application's metrics may take different ranges of values in different time periods, whereas thresholds are generally fixed, so they are hard to fit to the whole course of a program run.

(3) An application has many metric parameters, so setting a threshold for each one is a great deal of work, and the internal relationships and mutual influence among multiple metrics cannot be captured by thresholds alone.

(4) Whether a program is running abnormally depends not only on information at the current moment; information from the preceding period may also affect the run, and a threshold check on the current moment cannot account for this.

Summary of the invention

The LSTM-based high-performance computing cluster application monitoring method and system provided by the present invention offer auxiliary monitoring of running high-performance computing cluster applications, improve the accuracy of judging the running state of such applications, and markedly improve the controllability and stability of application runs.

In a first aspect, the present invention provides an LSTM-based high-performance computing cluster application monitoring method, comprising:

acquiring data from each compute node;

applying threshold preprocessing and normalization to the data;

training a deep LSTM network on the data after threshold preprocessing and normalization;

inputting data that has undergone threshold preprocessing and normalization into the deep LSTM network to predict anomalies in the high-performance computing cluster application.

Optionally, acquiring the data from each compute node comprises:

acquiring the data from each compute node;

aggregating the acquired data to a management node over a socket, the management node storing the data.

Optionally, applying threshold preprocessing and normalization to the data comprises:

determining whether data acquired and processed as a time series is within a first threshold range, and normalizing the data that is within the first threshold range, or treating data that exceeds the first threshold range as abnormal data;

determining whether a single acquired data item is within a second threshold range, and normalizing the data that is within the second threshold range, or raising an application alarm for data that exceeds the first threshold range.

Optionally, training the deep LSTM network on the data after threshold preprocessing and normalization comprises:

taking the data that has undergone threshold preprocessing and normalization and is within the first threshold range as positive samples;

adding negative samples produced by abnormal program running conditions created artificially in advance, thereby building a set of positive and negative samples;

training the deep LSTM network on the set of positive and negative samples.

Optionally, after the single data item that has undergone threshold preprocessing and normalization is input into the deep LSTM network to predict anomalies in the high-performance computing cluster application, the method further comprises:

raising an application alarm when the high-performance computing cluster application is predicted to be abnormal.

In a second aspect, the present invention provides an LSTM-based high-performance computing cluster application monitoring system, comprising:

an acquisition module, configured to acquire data from each compute node;

a data processing module, configured to apply threshold preprocessing and normalization to the data;

a training module, configured to train a deep LSTM network on the data after threshold preprocessing and normalization;

an anomaly prediction module, configured to input a single data item that has undergone threshold preprocessing and normalization into the deep LSTM network to predict anomalies in the high-performance computing cluster application.

Optionally, the acquisition module comprises:

a data acquisition unit, configured to acquire the data from each compute node;

a data storage unit, configured to aggregate the acquired data to a management node over a socket, the management node storing the data.

Optionally, the data processing module comprises:

a threshold preprocessing unit, configured to determine whether data acquired and processed as a time series is within a first threshold range, and to determine whether a single acquired data item is within a second threshold range;

a normalization unit, configured to normalize the data that is within the first threshold range and to normalize the data that is within the second threshold range;

a data anomaly unit, configured to treat data that exceeds the first threshold range as abnormal data.

Optionally, the training module comprises:

a positive sample forming unit, configured to organize and label the data that has undergone threshold preprocessing and normalization and is within the first threshold range, forming positive samples;

an adding unit, configured to add negative samples produced by abnormal program running conditions created artificially in advance, building a set of positive and negative samples;

a training unit, configured to train the deep LSTM network on the set of positive and negative samples.

Optionally, the system further comprises:

an application alarm module, configured to raise an application alarm.

In the LSTM-based high-performance computing cluster application monitoring method and system provided by the embodiments of the present invention, the data of each compute node is acquired and aggregated. Threshold preprocessing finds the more obvious outliers, and the data is also normalized. An LSTM is then trained on the threshold-preprocessed and normalized data to form a deep LSTM network that fully mines the relationship between the collected data and the application's running state, and in turn the sample features over a time sequence. This provides auxiliary monitoring of running high-performance computing cluster applications, improves the accuracy of judging their running state, and markedly improves the controllability and stability of application runs.

Brief description of the drawings

Fig. 1 is a flowchart of an LSTM-based high-performance computing cluster application monitoring method according to one embodiment of the present invention;

Fig. 2 is an architecture diagram of data acquisition according to another embodiment of the present invention;

Fig. 3 is a flowchart of deep LSTM network data training according to another embodiment of the present invention;

Fig. 4 is a flowchart of an LSTM-based high-performance computing cluster application monitoring method according to yet another embodiment of the present invention;

Fig. 5 is a schematic structural diagram of an LSTM-based high-performance computing cluster application monitoring system according to one embodiment of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.

An embodiment of the present invention provides a high-performance computing cluster application monitoring method based on LSTM (Long Short-Term Memory network). As shown in Fig. 1, the method comprises:

S11: acquiring data from each compute node;

S12: applying threshold preprocessing and normalization to the data;

Optionally, the data is first threshold-preprocessed and then normalized; alternatively, the data is first normalized and then threshold-preprocessed.

S13: training a deep LSTM network on the data after threshold preprocessing and normalization;

S14: inputting data that has undergone threshold preprocessing and normalization into the deep LSTM network to predict anomalies in the high-performance computing cluster application.

The LSTM-based high-performance computing cluster application monitoring method provided by this embodiment acquires and aggregates the data of each compute node. Threshold preprocessing finds the more obvious outliers, and the data is also normalized. An LSTM is then trained on the threshold-preprocessed and normalized data to form a deep LSTM network that fully mines the relationship between the collected data and the application's running state, and in turn the sample features over a time sequence. This provides auxiliary monitoring of running high-performance computing cluster applications, improves the accuracy of judging their running state, and markedly improves the controllability and stability of application runs.
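For orientation, the update rule of a standard LSTM cell can be written as follows. This is the common textbook formulation, not something the source specifies: the source does not state the network's layer count, hidden size, or output layer. Here $x_t$ stands for the normalized metric vector at time step $t$, $h_t$ for the hidden state, $c_t$ for the cell state, $\sigma$ for the logistic sigmoid, and $\odot$ for element-wise multiplication; the weight names $W$, $U$, $b$ are notation introduced only for this illustration.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```

Because $c_t$ carries information forward across time steps, the state after the last step of a window depends on the whole preceding sequence, which is the property the method relies on when it uses data from the preceding period rather than a single moment.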

Moreover, experimental results show that combining deep learning in this way markedly improves the accuracy of anomaly monitoring for the WRF (Weather Research and Forecasting Model) application. Because high-performance applications are numerous and varied, the experiments focused on anomaly monitoring of WRF, a high-performance application that is relatively common in meteorology.

Optionally, as shown in Fig. 2, acquiring the data from each compute node comprises:

acquiring the data from each compute node;

Specifically, a high-performance computing application usually involves multiple compute nodes. An abnormal condition on any one of them may affect the application, and whether the network between compute nodes meets the application's needs must also be taken into account. The method of this embodiment therefore uses data collection tools (for example, commands such as iostat, sar, top, and nvidia-smi) to acquire data from each compute node, for example once every 10 s. The collected data is then aggregated to the management node over a socket, and the management node stores the data and displays it on a terminal. This provides unified acquisition, collection, and storage of the cluster's parameters, makes the collected data easier to process and evaluate, and helps improve the accuracy of cluster application monitoring.
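The collection path described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the newline-delimited JSON wire format, the function names `read_metrics`, `send_sample`, `receive_samples`, and `demo`, and the fixed metric values are all hypothetical, and a real collector would parse the output of tools such as iostat or sar instead of returning constants. A local socket pair stands in for the network link between a compute node and the management node.

```python
import json
import socket
import threading
import time


def read_metrics(node_id):
    """Hypothetical stand-in for parsing iostat/sar/top/nvidia-smi output."""
    return {"node": node_id, "ts": time.time(), "cpu_util": 0.42, "mem_util": 0.31}


def send_sample(sock, sample):
    """Send one sample as a newline-terminated JSON record (assumed wire format)."""
    sock.sendall((json.dumps(sample) + "\n").encode("utf-8"))


def receive_samples(conn, store):
    """Management-node side: read JSON lines until the peer closes, storing each one."""
    with conn, conn.makefile("r", encoding="utf-8") as stream:
        for line in stream:
            store.append(json.loads(line))


def demo(num_samples=3, interval_s=0.0):
    """Run one compute node and one management node in-process over a socket pair."""
    node_sock, mgmt_sock = socket.socketpair()
    store = []
    receiver = threading.Thread(target=receive_samples, args=(mgmt_sock, store))
    receiver.start()
    for _ in range(num_samples):
        send_sample(node_sock, read_metrics("node01"))
        time.sleep(interval_s)  # the text above samples, for example, every 10 s
    node_sock.close()  # closing the socket signals end of stream to the receiver
    receiver.join()
    return store
```

In a deployment the management node would listen on a TCP port and accept one connection per compute node; the in-process socket pair is used here only so the sketch runs on a single machine.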

Optionally, the data of each compute node includes one or any combination of CPU information, memory information, local disk information, network information, shared storage information, and GPU information.

Optionally, as shown in Fig. 2, the CPU information includes one or any combination of vectorization utilization, core utilization, core temperature, and cache hit rate.

Alternatively, the memory information includes one or any combination of memory utilization, memory bandwidth, and swap partition utilization.

Alternatively, the local disk information includes one or any combination of disk read bandwidth, disk write bandwidth, disk read wait, and disk write wait.

Alternatively, the network information includes one or any combination of Ethernet receive rate, Ethernet transmit rate, InfiniBand receive rate, and InfiniBand transmit rate.

Alternatively, the shared storage information includes one or any combination of data write rate, data read rate, packet send rate, and packet read rate.

Alternatively, the GPU information includes one or any combination of GPU utilization, GPU memory utilization, and GPU temperature.
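One way to turn the metric categories listed above into a fixed-order feature vector is sketched below. The field names and the `group.name` key scheme are hypothetical and chosen only for illustration; the source names the metrics but not any data layout. Because any combination of metrics may be collected, missing entries are filled with a default value here, which is likewise an assumption.

```python
# Hypothetical field names; the six groups follow the categories listed in the text.
FEATURE_GROUPS = {
    "cpu": ["vector_util", "core_util", "core_temp", "cache_hit_rate"],
    "memory": ["mem_util", "mem_bandwidth", "swap_util"],
    "disk": ["read_bandwidth", "write_bandwidth", "read_wait", "write_wait"],
    "network": ["eth_rx", "eth_tx", "ib_rx", "ib_tx"],
    "shared_storage": ["write_rate", "read_rate", "pkt_send_rate", "pkt_read_rate"],
    "gpu": ["gpu_util", "gpu_mem_util", "gpu_temp"],
}

# Flatten the groups into one stable ordering so every sample has the same layout.
FEATURE_ORDER = [
    f"{group}.{name}" for group, names in FEATURE_GROUPS.items() for name in names
]


def to_vector(sample, missing=0.0):
    """Map a {"group.name": value} dict to a list ordered by FEATURE_ORDER."""
    return [float(sample.get(key, missing)) for key in FEATURE_ORDER]
```

A fixed ordering matters because the network sees only positions in the vector, so each position must always carry the same metric.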

Optionally, as shown in Fig. 3 and Fig. 4, applying threshold preprocessing and normalization to the data comprises:

Optionally, as shown in Fig. 3, the data acquired and processed as a time series comprises acquired data that is divided by time step or arranged into time-ordered sequences; the data acquired and processed as a time series is to be used as training samples for LSTM training.

Optionally, the data within the first threshold range undergoes the following processing:

obtaining the rated values of the hardware parameters;

handling special parameters (for example, CPU overclocking);

normalization.

Determining whether a single acquired data item is within a second threshold range, and normalizing the data that is within the second threshold range; or raising an application alarm for data that exceeds the first threshold range;

wherein the single acquired data item is the data to be checked for anomalies.

Optionally, the first threshold range is set to differ from the second threshold range;

or the first threshold range is set to be the same as the second threshold range.

Specifically, the data collected from each compute node may contain obvious outliers that would strongly affect later processing. The method of this embodiment therefore performs basic anomaly handling through threshold preprocessing. The threshold range is set relatively wide so that normally running programs are not misjudged. Sample data within the threshold range is normalized, and sample data beyond the threshold range is judged to indicate abnormal program operation.

Further, an application may run on different compute nodes, and the hardware resources each node requires may differ. To handle this difference, the method of this embodiment unifies the data for each category of parameter, including the maximum, minimum, and mean values. Different parameters also have different orders of magnitude: CPU temperature is typically in the tens, whereas memory utilization is less than 1. To reduce the effect of these differences in magnitude, the data for all parameters is normalized.
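The threshold check followed by normalization can be sketched as below. The source does not give the normalization formula; min-max scaling against the threshold bounds is assumed here purely for illustration, and the names `within_range`, `min_max_normalize`, `preprocess`, and the `limits` mapping are hypothetical.

```python
def within_range(value, low, high):
    """Return True when the value lies inside the closed threshold range."""
    return low <= value <= high


def min_max_normalize(value, low, high):
    """Scale a value from [low, high] to [0, 1] (assumed normalization scheme)."""
    if high <= low:
        raise ValueError("threshold range must have high > low")
    return (value - low) / (high - low)


def preprocess(sample, limits):
    """Split one sample into normalized in-range metrics and out-of-range metric names.

    sample: {metric_name: raw_value}
    limits: {metric_name: (low, high)} threshold range per metric
    """
    normalized = {}
    anomalies = []
    for name, value in sample.items():
        low, high = limits[name]
        if within_range(value, low, high):
            normalized[name] = min_max_normalize(value, low, high)
        else:
            anomalies.append(name)  # beyond the threshold range: treated as abnormal
    return normalized, anomalies
```

For example, with a CPU temperature range of 20 to 100 and a memory utilization range of 0 to 1, a reading of 60 degrees maps to 0.5, so a value in the tens and a value below 1 end up on the same scale.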

The data that has undergone threshold preprocessing and normalization and is within the first threshold range is organized and labeled to form positive samples;

Specifically, to fully mine the information in the data, the method of this embodiment takes the preceding 10 minutes of data as the features of each training sample. During the first 10 minutes of a run the program is assumed to be initializing, so that period is not used as training samples. Each sample is labeled with the program's running state at the current moment as its class, forming positive samples.
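The windowing described above can be sketched as follows, assuming the 10 s sampling interval mentioned earlier, so that 10 minutes corresponds to 60 time steps. The function name `build_windows` and the use of one label per time step are assumptions made for this illustration.

```python
def build_windows(series, labels, window=60):
    """Pair each time step with the feature vectors of the preceding `window` steps.

    series: list of feature vectors, one per sampling interval
    labels: running state per time step (for example 1 = normal, 0 = abnormal)
    The first `window` steps yield no sample, which matches skipping the
    initialization period at the start of a run.
    """
    if len(series) != len(labels):
        raise ValueError("series and labels must have the same length")
    samples = []
    for t in range(window, len(series)):
        features = series[t - window:t]  # the preceding `window` readings
        samples.append((features, labels[t]))
    return samples
```

With 65 readings and a window of 60, this yields 5 samples, the first of which covers readings 0 through 59 and carries the label of reading 60.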

Specifically, abnormal running conditions are created deliberately while the program runs, for example by forcibly occupying memory bandwidth, network bandwidth, or CPU, in order to increase the number of negative samples; these are finally combined with the positive samples to build the set of positive and negative samples.

Specifically, to fully mine the relationships among parameters and the influence of parameters at different moments on the program's running state, the method of this embodiment uses 70% of the positive and negative sample set as training data and 30% as test data. The deep LSTM network is trained on the training data, and the final accuracy is obtained by evaluating it on the test data.
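The 70/30 split and the accuracy measurement can be sketched as below. The source does not say how samples are assigned to the two sets; a seeded random shuffle is assumed here, and `split_samples`, `accuracy`, and the `predict` callable (standing in for the trained network) are hypothetical names.

```python
import random


def split_samples(samples, train_fraction=0.7, seed=0):
    """Shuffle (features, label) pairs and split them into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict, test_set):
    """Fraction of test samples whose predicted label equals the true label."""
    if not test_set:
        raise ValueError("test set is empty")
    correct = sum(1 for features, label in test_set if predict(features) == label)
    return correct / len(test_set)
```

Note that windows cut from the same run overlap heavily, so a random split can place near-identical windows in both sets; splitting by run or by time is a common alternative when a stricter estimate is wanted.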

Optionally, as shown in Fig. 4, after the single data item that has undergone threshold preprocessing and normalization is input into the deep LSTM network to predict anomalies in the high-performance computing cluster application, the method further comprises:

In summary, the method of this embodiment mines the information collected from the high-performance computing cluster more fully. It performs unified acquisition, collection, and storage of cluster data; screens the data by threshold in advance to find obvious abnormal program runs; preprocesses and organizes the data within the threshold range; and finally uses the deep LSTM network to mine sample features over a time sequence, thereby providing auxiliary monitoring of the running program and improving the accuracy of judging whether a high-performance computing cluster application is running normally.
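Putting the steps together, one monitoring cycle for a newly acquired reading might look like the sketch below. It is an illustration under assumptions, not the patented implementation: `monitor_step` is a hypothetical name, `model` is any callable that returns True when it predicts an abnormal run (standing in for the trained deep LSTM network), `alarm` is any callable that delivers a message, and min-max scaling against the threshold bounds is assumed for normalization.

```python
def monitor_step(latest, history, limits, model, alarm):
    """Process one new reading and return "threshold_alarm", "model_alarm", or "normal".

    latest:  {metric_name: raw_value} for the newest sampling interval
    history: list of earlier normalized readings, extended in place
    limits:  {metric_name: (low, high)} threshold range per metric
    """
    # Step 1: threshold check on the single reading; obvious outliers alarm at once.
    out_of_range = [
        name for name, value in latest.items()
        if not (limits[name][0] <= value <= limits[name][1])
    ]
    if out_of_range:
        alarm("threshold exceeded: " + ", ".join(sorted(out_of_range)))
        return "threshold_alarm"

    # Step 2: normalize the in-range reading and append it to the sequence.
    normalized = {
        name: (value - limits[name][0]) / (limits[name][1] - limits[name][0])
        for name, value in latest.items()
    }
    history.append(normalized)

    # Step 3: let the sequence model judge the recent history.
    if model(history):
        alarm("model predicts abnormal application run")
        return "model_alarm"
    return "normal"
```

The threshold check stays in front of the model so that gross outliers are reported even when the model has not yet seen enough history to make a prediction.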

An embodiment of the present invention also provides an LSTM-based high-performance computing cluster application monitoring system. As shown in Fig. 5, the system comprises:

an acquisition module 11, configured to acquire data from each compute node;

a data processing module 12, configured to apply threshold preprocessing and normalization to the data;

a training module 13, configured to train a deep LSTM network on the data after threshold preprocessing and normalization;

an anomaly prediction module 14, configured to input a single data item that has undergone threshold preprocessing and normalization into the deep LSTM network to predict anomalies in the high-performance computing cluster application.

In the LSTM-based high-performance computing cluster application monitoring system provided by this embodiment, the acquisition module acquires and aggregates the data of each compute node. Threshold preprocessing finds the more obvious outliers, and the data is also normalized. The training module then trains an LSTM on the threshold-preprocessed and normalized data to form a deep LSTM network that fully mines the relationship between the collected data and the application's running state, and in turn the sample features over a time sequence. The anomaly prediction module provides auxiliary monitoring of running high-performance computing cluster applications, which improves the accuracy of judging their running state and markedly improves the controllability and stability of application runs.

Optionally, the acquisition module 11 comprises:

a data acquisition unit, configured to acquire the data from each compute node;

Optionally, the data processing module 12 comprises:

Optionally, the training module 13 comprises:

Optionally, the system further comprises:

an application alarm module, configured to raise an application alarm.

The system of this embodiment can be used to carry out the technical solution of the method embodiment above; its implementation principle and technical effect are similar and are not repeated here.

A person of ordinary skill in the art will understand that all or part of the processes in the method embodiments above can be carried out by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the method embodiments above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.

The above is merely a specific embodiment of the present invention, but the protection scope of the present invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. The protection scope of the present invention shall therefore be determined by the protection scope of the claims.

Claims

1. An LSTM-based high-performance computing cluster application monitoring method, characterized by comprising:

acquiring data from each compute node;

applying threshold preprocessing and normalization to the data;

inputting data that has undergone threshold preprocessing and normalization into a deep LSTM network to predict anomalies in the high-performance computing cluster application.

2. The method according to claim 1, characterized in that acquiring the data from each compute node comprises:

acquiring the data from each compute node;

3. The method according to claim 1 or 2, characterized in that applying threshold preprocessing and normalization to the data comprises:

determining whether data acquired and processed as a time series is within a first threshold range, and normalizing the data that is within the first threshold range, or treating data that exceeds the first threshold range as abnormal data;

determining whether a single acquired data item is within a second threshold range, and normalizing the data that is within the second threshold range, or raising an application alarm for data that exceeds the first threshold range.

4. The method according to claim 1 or 2, characterized in that training the deep LSTM network on the data after threshold preprocessing and normalization comprises:

organizing and labeling the data that has undergone threshold preprocessing and normalization and is within the first threshold range, forming positive samples;

5. The method according to claim 1 or 2, characterized in that, after the single data item that has undergone threshold preprocessing and normalization is input into the deep LSTM network to predict anomalies in the high-performance computing cluster application, the method further comprises:

6. An LSTM-based high-performance computing cluster application monitoring system, characterized by comprising:

an acquisition module, configured to acquire data from each compute node;

an anomaly prediction module, configured to input a single data item that has undergone threshold preprocessing and normalization into a deep LSTM network to predict anomalies in the high-performance computing cluster application.

7. The system according to claim 6, characterized in that the acquisition module comprises:

a data acquisition unit, configured to acquire the data from each compute node;

a data storage unit, configured to aggregate the acquired data to a management node over a socket, the management node storing the data.

8. The system according to claim 6 or 7, characterized in that the data processing module comprises:

a threshold preprocessing unit, configured to determine whether data acquired and processed as a time series is within a first threshold range, and to determine whether a single acquired data item is within a second threshold range;

a normalization unit, configured to normalize the data that is within the first threshold range and to normalize the data that is within the second threshold range;

9. The system according to claim 6 or 7, characterized in that the training module comprises:

a positive sample forming unit, configured to organize and label the data that has undergone threshold preprocessing and normalization and is within the first threshold range, forming positive samples;

an adding unit, configured to add negative samples produced by abnormal program running conditions created artificially in advance, building a set of positive and negative samples;

10. The system according to claim 6 or 7, characterized in that the system further comprises:

an application alarm module, configured to raise an application alarm.