CN111352820A - Method, equipment and device for predicting and monitoring running state of high-performance application - Google Patents

Method, equipment and device for predicting and monitoring running state of high-performance application

Info

Publication number
CN111352820A
CN111352820A (application number CN202010154757.1A)
Authority
CN
China
Prior art keywords
key information
running state
data file
intermediate data
application
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010154757.1A
Other languages
Chinese (zh)
Inventor
李龙翔
刘羽
杨振宇
于占乐
王倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202010154757.1A
Publication of CN111352820A
Legal status: Withdrawn (current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 - Error detection; Error correction; Monitoring
    • G06F 11/30 - Monitoring
    • G06F 11/34 - Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F 11/3466 - Performance evaluation by tracing or monitoring
    • G06F 11/3476 - Data logging
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/48 - Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 - Task transfer initiation or dispatching
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a method, equipment and device for predicting and monitoring the running state of a high-performance application, wherein the method comprises the following steps: collecting the system logs and job logs generated while a target platform is running, sorting the messages in the system logs and job logs by time, matching entries that share the same timestamp, and storing them as an intermediate data file; extracting key information from the intermediate data file with a natural language processing tool from data mining, and representing the text in the extracted key information as corresponding numerical feature vectors; and analyzing the numbers, timestamps, and the text represented by the numerical feature vectors in the intermediate data file with a model trained by a machine learning algorithm, and judging the running state of the application from the analysis results. The invention can report the application running state in real time, improve the utilization of platform computing resources, and reduce the time users' computing tasks spend waiting in the queue.

Description

Method, equipment and device for predicting and monitoring running state of high-performance application
Technical Field
The present invention relates to the field of computers, and more particularly, to a method, device and apparatus for predicting and monitoring the running state of a high-performance application.
Background
A high-performance computing (HPC) or supercomputing cluster is a computer system of very large computational performance and scale; programs running on such a cluster typically use parallel algorithms that divide a complex computational task into many small problems. As the computing demands of different applications have grown, more and more computing applications are being solved on high-performance computers. Accurately judging the running state of an application and predicting its running time play an important role in maintaining a high-performance cluster: they can effectively improve platform operating efficiency, reduce the time users spend queuing, and improve the user experience. However, as the scale of the high-performance computers used in the daily operation of cloud computing or supercomputing platforms increases, keeping them running normally becomes more challenging. The difficulty lies not only in the sheer volume of data the system produces at every moment, but also in analyzing that data to obtain useful information about the operating condition of the system. In addition, because different applications may generate large amounts of information while running, such as various job logs and application logs, the traditional manual approach to determining the running state of an application requires staff with basic knowledge of both the computation and the application. A manual approach cannot analyze the massive data generated by the platform in time, so the operating condition of the applications on different nodes of the platform cannot be judged promptly.
A number of automated system operation and maintenance tools already exist; the more mature schemes are based on statistical methods, machine learning methods, and the like. In the statistics-based approach, test data are given an anomaly score, and a point whose score exceeds a threshold is treated as anomalous. With a suitable threshold and well-tuned parameters this approach can give fairly accurate predictions, but choosing the threshold and tuning the parameters is very difficult. In addition, each variable is assumed to follow a statistical distribution, and most training schemes likewise rely on such assumptions, which often do not hold in practical applications.
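As an illustration only (not part of the patent), the following sketch shows the kind of statistics-based detection described above: test points are scored by a simple z-score against a baseline assumed to be Gaussian, and anything above a hand-chosen threshold is flagged as anomalous. Both the distributional assumption and the threshold are exactly the parameters noted above as hard to tune.

```python
import numpy as np

def anomaly_scores(baseline_values, test_values):
    """Score test points by their z-score relative to a healthy baseline."""
    mu, sigma = np.mean(baseline_values), np.std(baseline_values)
    return np.abs(np.asarray(test_values) - mu) / sigma

# Toy example: per-minute log-message counts from a healthy node vs. a test window.
baseline = np.random.normal(loc=200, scale=10, size=1000)  # assumed Gaussian baseline
test = [195, 210, 310, 205]                                # 310 should stand out

threshold = 3.0  # the hard-to-tune knob criticized above
flags = anomaly_scores(baseline, test) > threshold
print(flags)  # roughly [False False  True False]
```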
The second category is machine learning-based methods, mainly classification algorithms and clustering algorithms. Classification is supervised machine learning and requires that the class of every sample in the training set be known. Clustering is unsupervised machine learning and is usually used to group sample data by distance in order to identify outliers, but it cannot give early warning of faults that never appeared in the training samples. Machine learning methods are currently used to assist system anomaly detection, but most of them analyze only a single log file. During the running of a high-performance application, normal execution depends on the platform operating system, the job scheduling system, and the application itself all working correctly; a single log file is therefore not enough to comprehensively judge the application's running state.
Disclosure of Invention
In view of this, an object of the embodiments of the present invention is to provide a method, a device, and an apparatus for predicting and monitoring the running state of a high-performance application, which combine data mining and machine learning to analyze log files at different levels in real time during the running of the high-performance application, thereby improving task scheduling and the utilization of the high-performance platform.
Based on the above object, an aspect of the embodiments of the present invention provides a method for predicting and monitoring the running state of a high-performance application, including the following steps:
collecting the system logs and job logs generated while the target platform is running, sorting the messages in the system logs and job logs by time, matching entries that share the same timestamp, and storing them as an intermediate data file;
extracting key information from the intermediate data file with a natural language processing tool from data mining, and representing the text in the extracted key information as corresponding numerical feature vectors; and
analyzing the numbers, timestamps, and the text represented by the numerical feature vectors in the intermediate data file with a model trained by a machine learning algorithm, and judging the running state of the application from the analysis results.
In some embodiments, the application running state includes: normal operation, user termination, node error, and run timeout.
In some embodiments, extracting key information from the intermediate data file with a natural language processing tool from data mining and representing the text in the extracted key information as corresponding numerical feature vectors includes:
extracting the key information in the intermediate data file with the topic model LDA (Latent Dirichlet Allocation) method from text modeling, and using the topic probability distribution of the extracted key information as its feature vector.
In some embodiments, analyzing the numbers, timestamps, and the text represented by the numerical feature vectors in the intermediate data file with a model trained by a machine learning algorithm and judging the running state of the application from the analysis results includes:
receiving, as training data, log files from previous application runs that have been processed by the preprocessing module and the data analysis module, together with their corresponding running states, and training the model with a machine learning algorithm.
Another aspect of the embodiments of the present invention provides a high-performance application running state prediction and monitoring device, including:
a preprocessing module configured to collect the system logs and job logs generated while the target platform is running, sort the messages in the system logs and job logs by time, match entries that share the same timestamp, and store them as an intermediate data file;
a data analysis module configured to extract key information from the intermediate data file with a natural language processing tool from data mining, and represent the text in the extracted key information as corresponding numerical feature vectors; and
an automatic monitoring module configured to analyze the numbers, timestamps, and the text represented by the numerical feature vectors in the intermediate data file with a model trained by a machine learning algorithm, and judge the running state of the application from the analysis results.
In some embodiments, the automatic monitoring module is configured to:
receive, as training data, log files from previous application runs that have been processed by the preprocessing module and the data analysis module, together with their corresponding running states, and train the model with a machine learning algorithm.
In some embodiments, the machine learning algorithm comprises: decision trees, random forests, artificial neural networks, and Bayesian learning.
In some embodiments, the running state includes: normal operation, user termination, node error, and run timeout.
In some embodiments, the data analysis module is further configured to:
extract the key information in the intermediate data file with the topic model LDA (Latent Dirichlet Allocation) method from text modeling, and use the topic probability distribution of the extracted key information as its feature vector.
Another aspect of the embodiments of the present invention provides a high performance application running state predicting and monitoring apparatus, including:
at least one processor; and
a memory storing program code executable by the processor, the program code implementing the method of any of the above when executed by the processor.
The invention has the following beneficial technical effects: with the method, device, and apparatus for predicting and monitoring the running state of a high-performance application provided by the embodiments of the present invention, a computer on a large-scale high-performance computing platform can, with the help of data mining and a machine learning model, automatically judge the running state of high-performance applications, reducing the burden on operations staff; by using the data mining method, the application running state can be reported in real time, the utilization of platform computing resources can be improved, and the time users' computing tasks spend waiting in the queue can be reduced.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is apparent that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can derive other embodiments from them without creative effort.
FIG. 1 is a flow chart of the high-performance application running state prediction and monitoring method according to the present invention;
FIG. 2 is a flow chart of automated monitoring and prediction by the high-performance application running state prediction and monitoring device of the present invention;
FIG. 3 is a schematic diagram of the hardware configuration of the high-performance application running state prediction and monitoring apparatus according to the present invention.
Detailed Description
Embodiments of the present invention are described below. However, it is to be understood that the disclosed embodiments are merely examples and that other embodiments may take various and alternative forms. The figures are not necessarily to scale; certain features may be exaggerated or minimized to show details of particular components. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ the present invention. As one of ordinary skill in the art will appreciate, various features illustrated and described with reference to any one of the figures may be combined with features illustrated in one or more other figures to produce embodiments that are not explicitly illustrated or described. The combination of features shown provides a representative embodiment for a typical application. However, various combinations and modifications of the features consistent with the teachings of the present invention may be desired for certain specific applications or implementations.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
A high-performance application depends during its run not only on the normal operation of the target platform's operating system but also on the normal functioning of the job scheduling system (such as Slurm, Moab, and the like). During the operation and maintenance of a high-performance cluster, administrators must monitor the state of the cluster system at all times and also pay attention to the state of the job scheduling system while high-performance applications are running, so as to avoid errors during application execution. When maintaining large-scale clusters, analyzing the sheer volume of data the system produces at every moment becomes difficult, because the running state of the applications must be monitored around the clock to obtain useful information about the state of the system. By using big data and artificial intelligence techniques, however, the system and job logs can be analyzed automatically, the running states of the system and of the job scheduling system can be analyzed and predicted in real time, and the normal operation of the cluster system can be safeguarded.
In view of the above, an aspect of the embodiments of the present invention provides a method for predicting and monitoring the running state of a high-performance application, as shown in FIG. 1, including the following steps:
Step S101: collecting the system logs and job logs generated while the target platform is running, sorting the messages in the system logs and job logs by time, matching entries that share the same timestamp, and storing them as an intermediate data file;
Step S102: extracting key information from the intermediate data file with a natural language processing tool from data mining, and representing the text in the extracted key information as corresponding numerical feature vectors; and
Step S103: analyzing the numbers, timestamps, and the text represented by the numerical feature vectors in the intermediate data file with a model trained by a machine learning algorithm, and judging the running state of the application from the analysis results.
In some embodiments, the application running state comprises: normal operation, user termination, node error, and run timeout.
In some embodiments, extracting key information from the intermediate data file with a natural language processing tool from data mining and representing the text in the extracted key information as corresponding numerical feature vectors includes: extracting the key information in the intermediate data file with the topic model LDA (Latent Dirichlet Allocation) method from text modeling, and using the topic probability distribution of the extracted key information as its feature vector.
In some embodiments, analyzing the numbers, timestamps, and the text represented by the numerical feature vectors in the intermediate data file with a model trained by a machine learning algorithm and judging the running state of the application from the analysis results includes: receiving, as training data, log files from previous application runs that have been processed by the preprocessing module and the data analysis module, together with their corresponding running states, and training the model with a machine learning algorithm.
In some embodiments, during the training phase, the system logs and job logs generated while high-performance applications run on the target platform are collected, together with the running states of those applications, as a training set. The application logs in this set are analyzed by the preprocessing module and the data analysis module to obtain the corresponding numerical feature vectors. The processed data are then fed into a deep-learning-based model for training to obtain a trained model. In the system deployment stage, the trained model is deployed on the target platform to configure the monitoring system. While an application runs, the preprocessing module automatically reads the system and job logs and processes them into an intermediate data file. The data analysis module automatically analyzes the intermediate data file, generates the numerical feature vectors, and feeds them into the monitoring system, which outputs the running state of the high-performance application. When the monitoring system returns an error state, such as 'node error' or 'run timeout', the current log information and the corresponding error are saved so that users and operations staff can conveniently inspect the errors that occurred in the system or during job submission. A minimal sketch of this prediction-time flow is shown below.
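The sketch below makes several assumptions the patent does not fix: joblib/scikit-learn is assumed as the model format, the file names are placeholders, and the `preprocess` and `extract_features` callables stand in for the preprocessing and data analysis modules.

```python
import joblib  # assumed model serialization; the patent does not name one

STATES = ["normal operation", "user termination", "node error", "run timeout"]

def monitor_once(system_log, job_log, preprocess, extract_features,
                 model_path="rf_state_model.joblib"):
    """One monitoring pass: logs -> intermediate data -> feature vector -> predicted state."""
    model = joblib.load(model_path)                 # trained model deployed on the platform
    intermediate = preprocess(system_log, job_log)  # time-sorted, timestamp-matched entries
    features = extract_features(intermediate)       # e.g. LDA topic distribution + numbers/times
    state = STATES[int(model.predict([features])[0])]
    if state in ("node error", "run timeout"):
        # keep the offending logs so users and operators can inspect the error later
        with open("error_report.log", "a", encoding="utf-8") as f:
            f.write(f"{state}: system={system_log} job={job_log}\n")
    return state
```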
Where technically feasible, the technical features listed above for the different embodiments may be combined with each other or changed, added, omitted, etc. to form further embodiments within the scope of the invention.
It can be seen from the above embodiments that, with the high-performance application running state prediction and monitoring method provided by the embodiments of the present invention, a computer on a large-scale high-performance computing platform can, with the help of data mining and a machine learning model, automatically judge the running state of high-performance applications, reducing the burden on operations staff; by using the data mining method, the application running state can be reported in real time, the utilization of platform computing resources can be improved, and the time users' computing tasks spend waiting in the queue can be reduced.
In view of the above object, another aspect of the embodiments of the present invention provides a high-performance application running state prediction and monitoring device, as shown in FIG. 2, including:
a preprocessing module configured to collect the system logs and job logs generated while the target platform is running, sort the messages in the system logs and job logs by time, match entries that share the same timestamp, and store them as an intermediate data file;
a data analysis module configured to extract key information from the intermediate data file with a natural language processing tool from data mining, and represent the text in the extracted key information as corresponding numerical feature vectors; and
an automatic monitoring module configured to analyze the numbers, timestamps, and the text represented by the numerical feature vectors in the intermediate data file with a model trained by a machine learning algorithm, and judge the running state of the application from the analysis results.
In some embodiments, the automatic monitoring module is configured to: receive, as training data, log files from previous application runs that have been processed by the preprocessing module and the data analysis module, together with their corresponding running states, and train the model with a machine learning algorithm.
In some embodiments, the machine learning algorithm comprises: decision trees, random forests, artificial neural networks, and Bayesian learning. For example, in an embodiment of the invention that trains the automatic detection module with a random forest, the main advantage of the random forest is that it can be evaluated internally on its out-of-bag samples, yielding an unbiased estimate of the error without cross-validation or a separate test set.
In some embodiments, the running state includes: normal operation, user termination, node error, and run timeout. When the random forest method is used to train the model, existing application run logs and run results must be collected as training data, and the states are divided into the four categories 'normal operation', 'user termination', 'node error', and 'run timeout' according to the results of previous application runs. Of course, it should be understood that users may classify the running states as needed, or refine the classification further.
In some embodiments, the data analysis module is further configured to: extract the key information in the intermediate data file with the topic model LDA (Latent Dirichlet Allocation) method from text modeling, and use the topic probability distribution of the extracted key information as its feature vector.
In some embodiments, for the large number of log records generated by different components, a preprocessing tool maps the job log to the entries in the system log that share the same timestamps, generating an intermediate data file. The intermediate data file stores the time-varying content of the logs and contains the corresponding system and job log information at each time. Key information is then extracted from the intermediate data file with a natural language processing tool from data mining, the information it contains is expressed as a series of feature vectors, and these are finally converted into numerical vectors that serve as the input of the machine learning model. The method also divides application run results into four categories, namely normal operation, user termination, node error, and run timeout, according to previous job logs, so that the mined data and the corresponding run results can be used to train a model with the random forest method in machine learning. Once the trained model is obtained, the numerical vectors produced from a running application by the preprocessing tool and the data mining tool are fed into the model, so that the running state of the application on the high-performance platform can be predicted in real time and judged accordingly.
In some embodiments, as shown in FIG. 2, the invention comprises three parts: a preprocessing module, a data analysis module, and an automatic detection module. The preprocessing module creates the intermediate data file by parsing text files: after it detects that the application process has started, it extracts all system log and job log messages generated after the start of the job, sorts the messages in all the logs by time, and stores the system log and job log entries that share the same timestamp, matched to each other, as an intermediate data file. A minimal sketch of such a preprocessing step is shown below.
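The sketch below is one minimal way to implement this step; the per-line 'YYYY-MM-DD HH:MM:SS message' log format and the CSV intermediate file are assumptions made for illustration, since the patent does not specify concrete formats.

```python
import csv
from collections import defaultdict

def parse_log(path):
    """Group log messages by timestamp, assuming lines start with 'YYYY-MM-DD HH:MM:SS '."""
    entries = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if len(line) > 20:
                timestamp, message = line[:19], line[20:]
                entries[timestamp].append(message)
    return entries

def build_intermediate(system_log_path, job_log_path, out_path="intermediate.csv"):
    """Sort all messages by time and pair system/job entries that share a timestamp."""
    sys_entries = parse_log(system_log_path)
    job_entries = parse_log(job_log_path)
    timestamps = sorted(set(sys_entries) | set(job_entries))
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["time", "system_messages", "job_messages"])
        for ts in timestamps:
            writer.writerow([ts,
                             " | ".join(sys_entries.get(ts, [])),
                             " | ".join(job_entries.get(ts, []))])
    return out_path
```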
The data mining tool processes the collected log files with data analysis techniques; the ultimate goal is to describe the text content of the intermediate data as numerical vectors. The intermediate data contain text, numbers, timestamps, and other kinds of data, so processing them directly gives poor results. The strategy adopted here is therefore to separate the textual, numerical, and temporal content and analyze each separately, with the final goal of describing the text content of a set of system logs as numerical feature vectors. During data processing, the topic model LDA (Latent Dirichlet Allocation) method from text modeling is used to extract the topics (key information) contained in the intermediate data text, and the probability distribution over the extracted topics is used as the feature vector. A minimal sketch of this step is shown below.
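One possible realization of this step, sketched below, uses scikit-learn's CountVectorizer and LatentDirichletAllocation; the number of topics and the toy log lines are illustrative assumptions, not values fixed by the patent.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Each document is the concatenated text of one intermediate-file entry (toy examples).
docs = [
    "job 42 started on node c03 slurm allocation granted",
    "mpi rank 7 segmentation fault node c05 kernel oops",
    "job 42 completed successfully walltime 02:13:55",
]

vectorizer = CountVectorizer()                # bag-of-words counts for the log text
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=4, random_state=0)  # 4 topics, chosen arbitrarily
lda.fit(counts)

# The per-document topic probability distribution is used as the text feature vector.
topic_features = lda.transform(counts)        # shape: (n_documents, 4); rows sum to 1
print(topic_features.round(3))
```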
Finally, the automatic detection module is trained with the random forest method in machine learning. The main advantage of random forests is that they can be evaluated internally on their out-of-bag samples, yielding an unbiased estimate of the error without cross-validation or a separate test set. When the random forest method is used to train the model, existing application run logs and run results are collected as training data, and the states are divided into the four categories of normal operation, user termination, node error, and run timeout according to the results of previous application runs. The trained model can then be deployed on the monitoring platform; with the log analysis data provided by the preprocessing module and the data analysis module, the running state of the application can be judged in real time. A minimal sketch of this training step follows.
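Under the same assumptions, the sketch below trains the detection module with scikit-learn's RandomForestClassifier, enabling oob_score so the out-of-bag samples supply the internal error estimate mentioned above; the synthetic feature matrix and labels stand in for the real log-derived training data.

```python
import numpy as np
import joblib
from sklearn.ensemble import RandomForestClassifier

STATES = ["normal operation", "user termination", "node error", "run timeout"]

# Synthetic stand-in for the topic-distribution + numeric/time features of past runs.
rng = np.random.default_rng(0)
X_train = rng.random((400, 6))                  # 400 historical runs, 6 features each
y_train = rng.integers(0, len(STATES), 400)     # their labeled run results (0..3)

model = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,      # out-of-bag estimate replaces cross-validation / a held-out test set
    random_state=0,
)
model.fit(X_train, y_train)
print(f"out-of-bag accuracy estimate: {model.oob_score_:.3f}")

joblib.dump(model, "rf_state_model.joblib")     # deploy this file on the monitoring platform
```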
The core of the invention is to provide an artificial-intelligence-based detection platform for the running of high-performance applications, so that an artificial intelligence system can automatically detect and predict the application running state. By monitoring the system and job logs generated while the high-performance application runs, extracting characteristic keywords for identification, and classifying them with a machine learning method, prediction of the running state of the high-performance application is finally realized.
It can be seen from the foregoing embodiments that the high-performance application running state prediction and monitoring device provided by the embodiments of the present invention introduces data mining and machine learning methods into the real-time monitoring of the running state of high-performance computing applications, establishes an automated operation and maintenance platform for high-performance clusters, maximizes the utilization of computing resources, and allows the computer to judge the running state of the high-performance application automatically, thereby reducing the burden on operations staff.
In view of the above, in another aspect, an embodiment of the present invention provides a high performance application running state predicting and monitoring apparatus, including:
at least one processor; and
a memory storing program code executable by the processor, the program code implementing the method of any of the above when executed by the processor.
FIG. 3 is a schematic diagram of the hardware structure of an embodiment of the high-performance application running state prediction and monitoring apparatus provided by the present invention.
Taking the computer apparatus shown in FIG. 3 as an example, the computer apparatus includes a processor 301 and a memory 302, and may further include an input device 303 and an output device 304.
The processor 301, the memory 302, the input device 303, and the output device 304 may be connected by a bus or in other ways; FIG. 3 takes connection by a bus as an example.
The memory 302 is a non-volatile computer-readable storage medium, and can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the high-performance application running state prediction and monitoring method in the embodiment of the present application. The processor 301 executes various functional applications of the server and data processing by running nonvolatile software programs, instructions and modules stored in the memory 302, that is, implements the high performance application running state prediction and monitoring method of the above-described method embodiment.
The memory 302 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the high-performance application operation state prediction and monitoring method, and the like. Further, the memory 302 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 302 optionally includes memory located remotely from processor 301, which may be connected to a local module via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 303 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer apparatus for the high performance application operation state prediction and monitoring method. The output means 304 may comprise a display device such as a display screen.
Program instructions/modules corresponding to the one or more high-performance application running state prediction and monitoring methods are stored in the memory 302, and when executed by the processor 301, the high-performance application running state prediction and monitoring methods in any of the above-described method embodiments are executed.
Any embodiment of the computer device executing the method for predicting and monitoring the running state of the high-performance application can achieve the same or similar effects as any corresponding embodiment of the method.
Finally, it should be noted that, as will be understood by those skilled in the art, all or part of the processes in the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), or the like.
In addition, the apparatuses, devices and the like disclosed in the embodiments of the present invention may be various electronic terminal devices, such as a mobile phone, a Personal Digital Assistant (PDA), a tablet computer (PAD), a smart television and the like, or may be a large terminal device, such as a server and the like, and therefore the scope of protection disclosed in the embodiments of the present invention should not be limited to a specific type of apparatus, device. The client disclosed in the embodiment of the present invention may be applied to any one of the above electronic terminal devices in the form of electronic hardware, computer software, or a combination of both.
Furthermore, the method disclosed according to an embodiment of the present invention may also be implemented as a computer program executed by a CPU, and the computer program may be stored in a computer-readable storage medium. The computer program, when executed by the CPU, performs the above-described functions defined in the method disclosed in the embodiments of the present invention.
Further, the above method steps and system elements may also be implemented using a controller and a computer readable storage medium for storing a computer program for causing the controller to implement the functions of the above steps or elements.
Further, it should be appreciated that the computer-readable storage media (e.g., memory) described herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. By way of example, and not limitation, nonvolatile memory can include Read Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM), which can act as external cache memory. By way of example and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to comprise, without being limited to, these and other suitable types of memory.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.
The various illustrative logical blocks, modules, and circuits described in connection with the disclosure herein may be implemented or performed with the following components designed to perform the functions described herein: a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of these components. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP, and/or any other such configuration.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above-described embodiments are possible examples of implementations and are presented merely for a clear understanding of the principles of the invention. Those of ordinary skill in the art will understand that the discussion of any embodiment above is meant to be exemplary only and is not intended to suggest that the scope of the disclosure, including the claims, of the embodiments of the invention is limited to these examples. Within the spirit of the embodiments of the invention, technical features of the above embodiment or of different embodiments may also be combined, and many other variations of the different aspects of the embodiments of the invention exist as described above; they are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (10)

1. A high-performance application running state prediction and monitoring method is characterized by comprising the following steps:
collecting the system logs and job logs generated while the target platform is running, sorting the messages in the system logs and job logs by time, matching entries that share the same timestamp, and storing them as an intermediate data file;
extracting key information from the intermediate data file with a natural language processing tool from data mining, and representing the text in the extracted key information as corresponding numerical feature vectors; and
analyzing the numbers, timestamps, and the text represented by the numerical feature vectors in the intermediate data file with a model trained by a machine learning algorithm, and judging the running state of the application from the analysis results.
2. The method of claim 1, wherein the application run state comprises: normal operation, user termination, node error, and run timeout.
3. The method of claim 1, wherein extracting key information from the intermediate data file with a natural language processing tool from data mining and representing the text in the extracted key information as corresponding numerical feature vectors comprises:
extracting the key information in the intermediate data file with the topic model LDA (Latent Dirichlet Allocation) method from text modeling, and using the topic probability distribution of the extracted key information as its feature vector.
4. The method of claim 1, wherein analyzing the numbers, timestamps, and the text represented by the numerical feature vectors in the intermediate data file with a model trained by a machine learning algorithm and judging the running state of the application from the analysis results comprises:
receiving, as training data, log files from previous application runs that have been processed by the preprocessing module and the data analysis module, together with their corresponding running states, and training the model with a machine learning algorithm.
5. A high performance application run state prediction and monitoring device, comprising:
a preprocessing module configured to collect the system logs and job logs generated while the target platform is running, sort the messages in the system logs and job logs by time, match entries that share the same timestamp, and store them as an intermediate data file;
a data analysis module configured to extract key information from the intermediate data file with a natural language processing tool from data mining, and represent the text in the extracted key information as corresponding numerical feature vectors; and
an automatic monitoring module configured to analyze the numbers, timestamps, and the text represented by the numerical feature vectors in the intermediate data file with a model trained by a machine learning algorithm, and judge the running state of the application from the analysis results.
6. The device of claim 5, wherein the automatic monitoring module is configured to:
receive, as training data, log files from previous application runs that have been processed by the preprocessing module and the data analysis module, together with their corresponding running states, and train the model with a machine learning algorithm.
7. The device of claim 5, wherein the machine learning algorithm comprises: decision trees, random forests, artificial neural networks, and Bayesian learning.
8. The device of claim 5, wherein the running state comprises: normal operation, user termination, node error, and run timeout.
9. The device of claim 5, wherein the data analysis module is further configured to:
extract the key information in the intermediate data file with the topic model LDA (Latent Dirichlet Allocation) method from text modeling, and use the topic probability distribution of the extracted key information as its feature vector.
10. A high performance application run state prediction and monitoring apparatus, comprising:
at least one processor; and
a memory storing program code executable by the processor, the program code implementing the method of any one of claims 1-4 when executed by the processor.
CN202010154757.1A 2020-03-08 2020-03-08 Method, equipment and device for predicting and monitoring running state of high-performance application Withdrawn CN111352820A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010154757.1A CN111352820A (en) 2020-03-08 2020-03-08 Method, equipment and device for predicting and monitoring running state of high-performance application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010154757.1A CN111352820A (en) 2020-03-08 2020-03-08 Method, equipment and device for predicting and monitoring running state of high-performance application

Publications (1)

Publication Number Publication Date
CN111352820A (en) 2020-06-30

Family

ID=71192576

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010154757.1A Withdrawn CN111352820A (en) 2020-03-08 2020-03-08 Method, equipment and device for predicting and monitoring running state of high-performance application

Country Status (1)

Country Link
CN (1) CN111352820A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112612664A (en) * 2020-12-24 2021-04-06 百度在线网络技术(北京)有限公司 Electronic equipment testing method and device, electronic equipment and storage medium
CN112612664B (en) * 2020-12-24 2024-04-02 百度在线网络技术(北京)有限公司 Electronic equipment testing method and device, electronic equipment and storage medium
CN113778790A (en) * 2021-08-19 2021-12-10 北京仿真中心 Method and system for monitoring state of computing system based on Zabbix


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200630