CN115442270A

CN115442270A - Full-stack high-performance computing cluster monitoring system

Info

Publication number: CN115442270A
Application number: CN202211073239.2A
Authority: CN
Inventors: 王玲
Original assignee: Nanjing Xinyida Computing Technology Co ltd
Current assignee: Nanjing Xinyida Computing Technology Co ltd
Priority date: 2022-09-02
Filing date: 2022-09-02
Publication date: 2022-12-06

Abstract

The invention discloses a full-stack high-performance computing cluster monitoring system comprising a monitoring module, a performance testing module, a cryptomining program cleaning module and a data information security defense module. The monitoring module collects and aggregates data from each computing node, normalizes it, and assists in monitoring the current high-performance computing cluster application, improving the accuracy with which the running state of the application is judged. The performance testing module determines a test platform, deploys the system, tests system performance, deploys the application, tests the application and analyzes the data. By providing the monitoring module to collect, aggregate and normalize the data of each computing node for auxiliary monitoring of the current high-performance computing cluster application, the invention improves the accuracy of judging the application's running state and markedly improves the operational controllability and stability of the high-performance cluster application.

Description

Full-stack high-performance computing cluster monitoring system

Technical Field

The invention belongs to the technical field of monitoring systems, and particularly relates to a full-stack high-performance computing cluster monitoring system.

Background

Many modern development projects require mastery of multiple technologies in order to reduce communication costs and to address resource shortages and closed-loop delivery. Full-stack capability is of great value to a business: it strongly influences the overall planning of the business, the evaluation and selection of technical solutions, and the locating and resolving of problems. In addition, for start-ups that lack a full complement of specialists, full-stack capability lets one person handle many kinds of problems independently, which saves cost and helps the business develop rapidly in its early stages.

Traditional high-performance clusters have high procurement costs and long delivery cycles. The advantages of full-stack high-performance computing are: HPC resources are available instantly; multiple billing modes are supported, such as by machine-hour, month, quarter or year, which saves customers cost; elastic computing and storage at massive scale accommodate business peaks and troughs, so that computing tasks complete quickly; a variety of computing resources, such as the latest Intel and AMD platform CPUs, the latest V100/P100 GPUs, and FPGAs, readily meet the newest application requirements; industry solutions provide convenient SaaS application integration; and workflows are completed through graphical interaction, so that users can concentrate on application innovation.

Existing full-stack high-performance computing cluster monitoring systems still have some problems: the running state of a traditional high-performance computing cluster application is inconvenient to judge, which lowers the accuracy of the judgment and reduces the operational controllability and stability of the high-performance cluster application; and how to discover, delete and clean up cryptomining programs hidden in the system, so as to safeguard the high-performance computing cluster, is a problem that existing monitoring systems need to solve.

Disclosure of Invention

The purpose of the present invention is to provide a full-stack high-performance computing cluster monitoring system that solves the problems set forth in the Background.

In order to achieve the above purpose, the invention provides the following technical solution: a full-stack high-performance computing cluster monitoring system comprising a monitoring module, a performance testing module, a cryptomining program cleaning module and a data information security defense module;

the monitoring module is used for acquiring and summarizing data of each computing node, performing normalization processing, performing auxiliary monitoring on the current high-performance computing cluster application program, and improving the accuracy of judging the running state of the high-performance computing cluster application;

the performance testing module is used for testing the performance state of the system and rapidly acquiring the characteristics of application software by determining a testing platform, deploying the system, testing the performance of the system, deploying the application, testing the application and analyzing the data;

the cryptomining program cleaning module is used for improving the conventional cryptomining program cleaning process: by using an open-source tool and writing its own monitoring script, it can quickly find and clean up hidden cryptomining programs, and their network forwarding modes, in a high-performance computing cluster system;

the data information security defense module is used for securing communication between the internal network and the external network through a firewall, performing content filtering and intrusion protection, and actively defending information by applying the principles of defense in depth, layering and active security defense; while defending, it can also monitor each node, so that intrusions are detected, the spread of viruses is prevented, and information security is improved.

Preferably, the monitoring system further comprises a base layer, an intermediate layer and an application layer, wherein the base layer comprises the monitoring host and underlying resources, the underlying resources comprise CPU, memory, network throughput, hard disk I/O and hard disk usage, the intermediate layer comprises Nginx, Redis, MQ, MySQL and Tomcat, and the application layer comprises HTTP access throughput, response time, return codes, call-chain analysis, performance bottlenecks and user-side monitoring.

Preferably, the monitoring system further comprises a log system, the log system is used for storing data of the base layer, the middle layer and the application layer, and the log system is used for formatting log data, standardizing the format of the monitoring data and performing unified log analysis.
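The unified log format described above can be illustrated with a minimal sketch. The field names (`timestamp`, `layer`, `source`, `metric`, `value`), the three layer labels and the choice of JSON lines are assumptions made for illustration; the text does not specify a schema.

```python
import json
from datetime import datetime, timezone

# Hypothetical layer labels for the base, intermediate and application layers.
LAYERS = ("base", "intermediate", "application")


def format_record(layer, source, metric, value, ts=None):
    """Return one monitoring sample as a JSON line in a single shared schema."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    if ts is None:
        ts = datetime.now(timezone.utc)
    record = {
        "timestamp": ts.isoformat(),  # ISO 8601, so all layers sort the same way
        "layer": layer,
        "source": source,             # e.g. a node name or a middleware instance
        "metric": metric,
        "value": float(value),
    }
    return json.dumps(record, sort_keys=True)


# Samples from different layers end up in the same format.
print(format_record("base", "node01", "cpu_usage", 0.42))
print(format_record("application", "gateway", "http_response_ms", 128))
```

Because every record shares one schema, a single parser can serve the unified log analysis regardless of which layer produced the data.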

Preferably, the monitoring module includes an acquisition unit, a data processing unit, a training unit and an anomaly prediction unit, wherein the acquisition unit is configured to acquire data of each computing node, the data processing unit is configured to perform threshold preprocessing and normalization on the data, the training unit is configured to train a deep LSTM network on the data after threshold preprocessing and normalization, and the anomaly prediction unit is configured to input a single piece of data, after threshold preprocessing and normalization, into the deep LSTM network to predict anomalies of the high-performance computing cluster application.
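The data processing step that precedes LSTM training can be sketched as follows. The text does not define "threshold preprocessing", so this sketch assumes it means clipping samples to a valid range; min-max scaling is likewise an assumed choice of normalization, and the helper `make_windows` is a hypothetical way of turning a series into training pairs for a sequence model. The LSTM itself is omitted.

```python
def threshold_clip(samples, lower, upper):
    """Assumed threshold preprocessing: clamp each sample into [lower, upper]."""
    return [min(max(x, lower), upper) for x in samples]


def min_max_normalize(samples):
    """Scale samples linearly into [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(samples), max(samples)
    if hi == lo:
        return [0.0 for _ in samples]
    return [(x - lo) / (hi - lo) for x in samples]


def make_windows(series, window):
    """Slice a series into (input window, next value) pairs for sequence training."""
    return [(series[i:i + window], series[i + window])
            for i in range(len(series) - window)]


# Example: CPU usage in percent with one out-of-range reading.
cpu = [10.0, 50.0, 250.0, 90.0]
clean = min_max_normalize(threshold_clip(cpu, 0.0, 100.0))
pairs = make_windows(clean, window=2)
```

Each pair in `pairs` holds a short history and the value that followed it, which is the usual input shape for training a sequence predictor and for comparing a predicted value against the observed one.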

Preferably, the method for cleaning up a cryptomining program in the cryptomining program cleaning module specifically comprises the following steps:

S1, determining whether a cryptomining program exists in the computing node cluster;

S2, acquiring the process number of the cryptomining program: determining whether the cryptomining program hides its process number; if not, acquiring the process number directly; if so, searching for the hidden process number using an open-source tool;

S3, querying, according to the process number, the communication nodes that can connect to the Internet, checking those communication nodes, and shutting off the data flow of the cryptomining program.
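Steps S2 and S3 can be sketched as follows, under stated assumptions. The text does not name its open-source tool or describe its monitoring script, so this sketch assumes a Linux node and a common detection idea: a process that answers a direct probe but is absent from the normal process listing is treated as hidden. The connection records and the internal address prefixes are hypothetical inputs, for example parsed from the output of a tool such as `ss -tnp`.

```python
import os


def pids_from_proc(proc_root="/proc"):
    """PIDs visible as numeric directories under /proc (Linux only)."""
    return {int(name) for name in os.listdir(proc_root) if name.isdigit()}


def pid_exists(pid):
    """Probe a PID directly with signal 0, which checks existence without signaling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def find_hidden_pids(listed_pids, probed_pids):
    """S2: PIDs that respond to probing but are missing from the normal listing."""
    return set(probed_pids) - set(listed_pids)


def external_connections(connections, pid, internal_prefixes=("10.", "192.168.")):
    """S3: remote endpoints of `pid` that lie outside the assumed internal network."""
    return [c["remote"] for c in connections
            if c["pid"] == pid and not c["remote"].startswith(internal_prefixes)]


# Example with hypothetical data: PID 4242 answers probes but is not listed.
hidden = find_hidden_pids(listed_pids={1, 2, 3}, probed_pids={1, 2, 3, 4242})
conns = [{"pid": 4242, "remote": "203.0.113.7:3333"},
         {"pid": 4242, "remote": "10.0.0.5:22"}]
suspects = external_connections(conns, 4242)
```

On a real node, `probed_pids` could be built by calling `pid_exists` over the PID range and `listed_pids` by calling `pids_from_proc`; the endpoints in `suspects` are the candidates to check and block before the process is terminated.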

Preferably, the data information security defense module includes an active defense unit, a protocol analysis unit, a firewall unit and a monitoring unit, wherein the active defense unit applies the principles of defense in depth, layering and active security defense to actively defend information, the protocol analysis unit supports the monitoring unit and the firewall unit, the firewall unit secures communication between the internal network and the external network, and the monitoring unit performs content filtering and intrusion protection.

Preferably, the active defense unit comprises a security early warning unit, a security protection unit, a security monitoring unit, a security response unit, a system recovery unit and a security counterattack unit. The security early warning unit comprises a vulnerability early warning unit, a behavior early warning unit and an attack trend early warning unit; the vulnerability early warning unit gives users the opportunity to apply patches, and the behavior early warning unit and the attack trend early warning unit predict attack behavior in the network by observing abnormal network traffic. The security protection unit provides network virus protection and Trojan detection and removal, preventing network Trojans and viruses from spreading; the security monitoring unit performs mining using software- or hardware-based association rule analysis; the security response unit blocks security threats to the defense system; the system recovery unit backs up resource information by online incremental backup; and the security counterattack unit inflicts damage on the attack source.

Preferably, the monitoring system calculates the throughput of the HTTP access according to real-time data of the decision factor, and the formula is as follows:

wherein P is the decision coefficient, e_i is the real-time data of decision factor i, e_imax is the current upper limit of decision factor i, and W_i is the weight of decision factor i.
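The formula itself is not reproduced in this text. One form consistent with the symbol definitions, offered only as an assumption, is a weighted sum in which each factor's real-time value is divided by its current upper limit and multiplied by its weight. The function and parameter names below are hypothetical.

```python
def decision_coefficient(values, upper_limits, weights):
    """Assumed form: P = sum over i of W_i * (e_i / e_imax).

    values       -- real-time data e_i of each decision factor
    upper_limits -- current upper limit e_imax of each decision factor
    weights      -- weight W_i of each decision factor
    """
    if not (len(values) == len(upper_limits) == len(weights)):
        raise ValueError("all three sequences must have the same length")
    if any(limit <= 0 for limit in upper_limits):
        raise ValueError("upper limits must be positive")
    return sum(w * e / e_max
               for e, e_max, w in zip(values, upper_limits, weights))


# Two hypothetical factors, each at half of its upper limit.
p = decision_coefficient(values=[50, 200],
                         upper_limits=[100, 400],
                         weights=[0.6, 0.4])
```

If the weights sum to 1, a coefficient built this way stays between 0 and 1 as long as no factor exceeds its upper limit, which makes it easy to compare across nodes.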

Preferably, the parallel number of threads is determined in the performance test module according to the following formula:

wherein, P is the parallel number of the threads; x is the input data volume; and S is the preset data processing speed of the single thread.
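The formula is not reproduced in this text. A natural reading of the definitions, stated here as an assumption, is that the thread count is the input data volume divided by the per-thread processing speed. The sketch below rounds that quotient up to a whole thread; the lower bound of one thread and the optional `max_threads` cap are additions for illustration, not part of the source.

```python
import math


def thread_parallel_number(data_volume, per_thread_speed, max_threads=None):
    """Assumed form: P = ceil(X / S), with at least one thread.

    data_volume      -- input data volume X
    per_thread_speed -- preset data processing speed S of a single thread
    max_threads      -- hypothetical cap, e.g. the number of available cores
    """
    if data_volume < 0 or per_thread_speed <= 0:
        raise ValueError("data volume must be >= 0 and speed must be > 0")
    p = max(1, math.ceil(data_volume / per_thread_speed))
    if max_threads is not None:
        p = min(p, max_threads)
    return p


# 1000 units of input at 300 units per thread needs four threads.
threads = thread_parallel_number(1000, 300)
```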

Preferably, the data information security defense module is further configured to evaluate and score network security from the collected information, based on an existing data set and an established risk evaluation model, using a Bayesian network machine learning algorithm, specifically including the following steps:

step one, definition of classification levels: there are five grades, A, B, C, D and E, where grade A represents the highest degree of security protection and grade E the lowest; according to Bayes' theorem, the probability that the collected data belongs to a given grade is:

$$P(C = c \mid X = x) = \frac{P(C = c)\,P(X = x \mid C = c)}{\sum_{k} P(C = k)\,P(X = x \mid C = k)}$$

wherein the vector X is the set of collected events, and c and k each denote a specific risk grade; specifically, P(C = c | X = x) is the conditional probability of risk grade c given the collected event set, P(C = c) is the prior probability of the risk grade, P(X = x | C = c) is the probability of the collected events under that grade, and the denominator is the probability of the collected events themselves, P(X = x);

step two, following the naive Bayes approach, an assumption is made about the feature vector X: the features in each dimension of X are assumed to be mutually independent, with no relation between features, which gives the formula:

$$P(X = x \mid C = c) = \prod_{k=1}^{N} P(X_k = x_k \mid C = c)$$

where the vector X is the set of all collected events, X_k is a specific event element, and N is the total number of elements;

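step three, substituting the formula of step two into the formula of step one gives the class probability of an unknown sample with feature vector X, expressed as:

$$P(C = c \mid X = x) = \frac{P(C = c)\,\prod_{k=1}^{N} P(X_k = x_k \mid C = c)}{P(X = x)}$$

where the denominator P(X = x) is the probability of the collected events themselves and is the same for every grade; the grade with the highest class probability is taken as the grade of the unknown sample with feature vector X, which is the risk grade of the system's network security at that moment.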
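The three steps above can be sketched as a small naive Bayes classifier over the five grades. This is a minimal illustration, not the patent's implementation: the training samples and feature values are invented, probabilities are estimated by counting, and Laplace (add-one) smoothing is an added assumption so that unseen feature values do not zero out a grade.

```python
import math
from collections import Counter, defaultdict

GRADES = ("A", "B", "C", "D", "E")


def train(samples):
    """samples: list of (feature tuple, grade). Returns the counts the model needs."""
    grade_counts = Counter(grade for _, grade in samples)
    feature_counts = defaultdict(Counter)  # (grade, feature index) -> value counts
    values = defaultdict(set)              # feature index -> distinct values seen
    for features, grade in samples:
        for k, value in enumerate(features):
            feature_counts[(grade, k)][value] += 1
            values[k].add(value)
    return grade_counts, feature_counts, values


def posterior(model, features):
    """P(C = c | X = x) for each grade seen in training, via the naive Bayes product."""
    grade_counts, feature_counts, values = model
    total = sum(grade_counts.values())
    log_scores = {}
    for grade in GRADES:
        n_c = grade_counts.get(grade, 0)
        if n_c == 0:
            continue  # no training data for this grade
        log_p = math.log(n_c / total)  # prior P(C = c)
        for k, value in enumerate(features):
            # Smoothed estimate of P(X_k = x_k | C = c); smoothing is an assumption.
            num = feature_counts[(grade, k)][value] + 1
            den = n_c + len(values[k])
            log_p += math.log(num / den)
        log_scores[grade] = log_p
    # Dividing by the sum over grades plays the role of the shared denominator P(X = x).
    top = max(log_scores.values())
    unnorm = {g: math.exp(s - top) for g, s in log_scores.items()}
    z = sum(unnorm.values())
    return {g: v / z for g, v in unnorm.items()}


def classify(model, features):
    """Return the grade with the highest posterior probability."""
    post = posterior(model, features)
    return max(post, key=post.get)


# Invented training events: (port state, patch state) -> grade.
samples = [
    (("open_port", "no_patch"), "E"),
    (("open_port", "no_patch"), "E"),
    (("closed_port", "patched"), "A"),
    (("closed_port", "patched"), "A"),
    (("closed_port", "no_patch"), "C"),
]
model = train(samples)
grade = classify(model, ("open_port", "no_patch"))
```

Working in log space avoids numerical underflow when the number of event elements N is large, and normalizing at the end means the shared denominator never has to be computed separately.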

Compared with the prior art, the invention has the beneficial effects that:

(1) By providing the monitoring module, the invention collects and aggregates the data of each computing node and normalizes it to assist in monitoring the current high-performance computing cluster application, which improves the accuracy of judging the application's running state and markedly improves the operational controllability and stability of the high-performance cluster application.

(2) The performance testing module is arranged, and the performance state of the system can be tested and the characteristics of the application software can be rapidly obtained by determining the testing platform, deploying the system, testing the performance of the system, deploying the application, testing the application and analyzing the data.

(3) By providing the cryptomining program cleaning module, the invention improves the conventional cryptomining program cleaning process; using an open-source tool and its own monitoring script, it can quickly find and clean up hidden cryptomining programs, and their network forwarding modes, in a high-performance computing cluster system, which safeguards the system security of the high-performance computing cluster and improves the stability of its operation while reducing wasted resources.

(4) By providing the data information security defense module, the invention achieves active defense of information security and can also defend the interior of the system against attack, with high defense performance and high equipment stability; while defending, it can also monitor each node, so that intrusions are detected, the spread of viruses is prevented, and information security is improved.

Drawings

FIG. 1 is a block diagram of the present invention;

FIG. 2 is a block diagram of a monitoring module according to the present invention;

FIG. 3 is a block diagram of the data information security defense module of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIGS. 1-3, the present invention provides a technical solution: a full-stack high-performance computing cluster monitoring system comprising a monitoring module, a performance testing module, a cryptomining program cleaning module and a data information security defense module;

the performance testing module is used for testing the performance state of the system and quickly acquiring the characteristics of application software by determining a testing platform, deploying the system, testing the performance of the system, deploying the application, testing the application and analyzing the data;

the cryptomining program cleaning module is used for improving the conventional cryptomining program cleaning process: by using an open-source tool and writing its own monitoring script, it can quickly find and clean up hidden cryptomining programs, and their network forwarding modes, in a high-performance computing cluster system;

In this embodiment, preferably, the monitoring system further includes a base layer, an intermediate layer and an application layer, where the base layer includes the monitoring host and underlying resources, the underlying resources include CPU, memory, network throughput, hard disk I/O and hard disk usage, the intermediate layer includes Nginx, Redis, MQ, MySQL and Tomcat, and the application layer includes HTTP access throughput, response time, return codes, call-chain analysis, performance bottlenecks and user-side monitoring.

In this embodiment, preferably, the monitoring system further includes a log system, the log system is configured to store data of the base layer, the intermediate layer, and the application layer, and the log system is configured to format log data, standardize a format of monitoring data, and perform unified log analysis.

In this embodiment, preferably, the monitoring module includes an acquisition unit, a data processing unit, a training unit and an anomaly prediction unit, where the acquisition unit is configured to acquire data of each computing node, the data processing unit is configured to perform threshold preprocessing and normalization on the data, the training unit is configured to train a deep LSTM network on the data after threshold preprocessing and normalization, and the anomaly prediction unit is configured to input a single piece of data, after threshold preprocessing and normalization, into the deep LSTM network to predict anomalies of the high-performance computing cluster application.

In this embodiment, preferably, the method for cleaning up a cryptomining program in the cryptomining program cleaning module specifically includes:

S1, determining whether a cryptomining program exists in the computing node cluster;

S2, acquiring the process number of the cryptomining program: determining whether the cryptomining program hides its process number; if not, acquiring the process number directly; if so, searching for the hidden process number using an open-source tool;

In this embodiment, preferably, the data information security defense module includes an active defense unit, a protocol analysis unit, a firewall unit and a monitoring unit, where the active defense unit applies the principles of defense in depth, layering and active security defense to actively defend information, the protocol analysis unit supports the monitoring unit and the firewall unit, the firewall unit secures communication between the internal network and the external network, and the monitoring unit performs content filtering and intrusion protection.

In this embodiment, preferably, the active defense unit includes a security early warning unit, a security protection unit, a security monitoring unit, a security response unit, a system recovery unit and a security counterattack unit. The security early warning unit includes a vulnerability early warning unit, a behavior early warning unit and an attack trend early warning unit; the vulnerability early warning unit gives users the opportunity to apply patches, and the behavior early warning unit and the attack trend early warning unit predict attack behavior in the network by observing abnormal network traffic. The security protection unit provides network virus protection and Trojan detection and removal, preventing network Trojans and viruses from spreading; the security monitoring unit performs mining using software- or hardware-based association rule analysis; the security response unit blocks security threats to the defense system; the system recovery unit backs up resource information by online incremental backup; and the security counterattack unit inflicts damage on the attack source.

In this embodiment, preferably, the throughput of HTTP access is calculated in the monitoring system through real-time data of the decision factor, and the formula is calculated as follows:

In this embodiment, preferably, the parallel number of threads is determined in the performance test module according to the following formula:

wherein, P is the thread parallel number; x is input data volume; and S is the preset data processing speed of the single thread.

In this embodiment, preferably, the data information security defense module is further configured to evaluate and score network security from the collected information, based on an existing data set and an established risk evaluation model, using a Bayesian network machine learning algorithm, specifically including the following steps:

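step two, following the naive Bayes approach, an assumption is made about the feature vector X: the features in each dimension of X are assumed to be mutually independent, with no relation between features, which gives the formula:

$$P(X = x \mid C = c) = \prod_{k=1}^{N} P(X_k = x_k \mid C = c)$$

where X_k is a specific event element of X, N is the total number of elements, and c is a specific risk grade;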

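step three, substituting the formula of step two into the formula of step one gives the class probability of an unknown sample with feature vector X, expressed as:

$$P(C = c \mid X = x) = \frac{P(C = c)\,\prod_{k=1}^{N} P(X_k = x_k \mid C = c)}{P(X = x)}$$

where the denominator P(X = x) is the probability of the collected events themselves and is the same for every grade; the grade with the highest class probability is taken as the grade of the unknown sample with feature vector X, which is the risk grade of the system's network security at that moment.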

The principle and advantages of the invention are as follows. By providing the monitoring module, the data of each computing node is collected, aggregated and normalized to assist in monitoring the current high-performance computing cluster application, which improves the accuracy of judging the application's running state and markedly improves the operational controllability and stability of the high-performance cluster application. By providing the performance testing module, the performance state of the system can be tested and the characteristics of the application software quickly obtained by determining the test platform, deploying the system, testing system performance, deploying the application, testing the application and analyzing the data. By providing the cryptomining program cleaning module, the conventional cryptomining program cleaning process is improved; using an open-source tool and its own monitoring script, hidden cryptomining programs, and their network forwarding modes, in a high-performance computing cluster system can be quickly found and cleaned up, which safeguards the system security of the high-performance computing cluster and improves the stability of its operation while reducing wasted resources. By providing the data information security defense module, active defense of information security is achieved and the interior of the system can also be defended against attack, with high defense performance and high equipment stability; while defending, each node can also be monitored, so that intrusions are detected, the spread of viruses is prevented, and information security is improved.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A full-stack high-performance computing cluster monitoring system, characterized in that: the system comprises a monitoring module, a performance testing module, a cryptomining program cleaning module and a data information security defense module;

2. The full-stack high-performance computing cluster monitoring system of claim 1, wherein: the monitoring system further comprises a base layer, an intermediate layer and an application layer, wherein the base layer comprises the monitoring host and underlying resources, the underlying resources comprise CPU, memory, network throughput, hard disk I/O and hard disk usage, the intermediate layer comprises Nginx, Redis, MQ, MySQL and Tomcat, and the application layer comprises HTTP access throughput, response time, return codes, call-chain analysis, performance bottlenecks and user-side monitoring.

3. The full-stack high-performance computing cluster monitoring system of claim 2, wherein: the monitoring system also comprises a log system, wherein the log system is used for storing the data of the basic layer, the middle layer and the application layer, and the log system is used for formatting log data, standardizing the format of monitoring data and carrying out unified log analysis.

4. The full-stack high-performance computing cluster monitoring system of claim 1, wherein: the monitoring module comprises an acquisition unit, a data processing unit, a training unit and an anomaly prediction unit, wherein the acquisition unit is used for acquiring data of each computing node, the data processing unit is used for performing threshold preprocessing and normalization on the data, the training unit is used for training a deep LSTM network on the data after threshold preprocessing and normalization, and the anomaly prediction unit is used for inputting a single piece of data, after threshold preprocessing and normalization, into the deep LSTM network to predict anomalies of the high-performance computing cluster application.

5. The full-stack high-performance computing cluster monitoring system of claim 1, wherein: the method for cleaning up a cryptomining program in the cryptomining program cleaning module specifically comprises the following steps:

S1, determining whether a cryptomining program exists in the computing node cluster;

6. The full-stack high-performance computing cluster monitoring system of claim 1, wherein: the data information security defense module comprises an active defense unit, a protocol analysis unit, a firewall unit and a monitoring unit, wherein the active defense unit applies the principles of defense in depth, layering and active security defense to actively defend information, the protocol analysis unit supports the monitoring unit and the firewall unit, the firewall unit secures communication between the internal network and the external network, and the monitoring unit performs content filtering and intrusion protection.

7. The full-stack high-performance computing cluster monitoring system of claim 6, wherein: the active defense unit comprises a security early warning unit, a security protection unit, a security monitoring unit, a security response unit, a system recovery unit and a security counterattack unit; the security early warning unit comprises a vulnerability early warning unit, a behavior early warning unit and an attack trend early warning unit; the vulnerability early warning unit is used for giving users the opportunity to apply patches; the behavior early warning unit and the attack trend early warning unit are used for predicting attack behavior in the network by observing abnormal network traffic; the security protection unit is used for network virus protection and Trojan detection and removal, preventing network Trojans and viruses from spreading; the security monitoring unit is used for mining using software- or hardware-based association rule analysis; the security response unit is used for blocking security threats to the defense system; the system recovery unit backs up resource information by online incremental backup; and the security counterattack unit is used for inflicting damage on the attack source.

8. The full-stack high-performance computing cluster monitoring system of claim 1, wherein: the monitoring system calculates the throughput of HTTP access through real-time data of decision factors, and the formula is as follows:

9. The full-stack high-performance computing cluster monitoring system of claim 1, wherein: the performance testing module determines the parallel number of threads according to the following formula:

wherein, P is the parallel number of the threads; x is input data volume; and S is the preset data processing speed of the single thread.

10. The full-stack high-performance computing cluster monitoring system of claim 1, wherein: the data information security defense module is also used for evaluating and scoring the network security based on the existing data set and the established risk evaluation model through a Bayesian network machine learning algorithm according to the acquired information, and specifically comprises the following steps:

step one, defining classification levels: the method comprises five grades of A, B, C, D and E, wherein the safety protection degree represented by the grade A is the highest, the safety protection degree represented by the grade E is the lowest, and the probability that the collected data information belongs to a certain grade is known according to Bayesian theorem as follows: