CN115543671A - Data analysis method, device, equipment, storage medium and program product - Google Patents

Info

Publication number
CN115543671A
Authority
CN
China
Prior art keywords
node
data
indexes
cluster
slow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211182174.5A
Other languages
Chinese (zh)
Inventor
黄湘平
李申浩
崔波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Agricultural Bank of China
Original Assignee
Agricultural Bank of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agricultural Bank of China
Priority to CN202211182174.5A
Publication of CN115543671A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079Root cause analysis, i.e. error or fault diagnosis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiments of the present application provide a data analysis method, apparatus, device, storage medium, and program product, and relate to the technical field of data analysis. The method includes: collecting data indicators of each node in a target cluster, where the data indicators include at least one of a node task state indicator, network traffic, and an operating system indicator; predicting the fault probability of each node with a machine learning model according to the data indicators of each node; and determining the slow nodes in the target cluster according to the predicted fault probability of each node. The method addresses the problem in the prior art that slow nodes in a cluster cannot be located quickly without affecting production operations.

Description

Data analysis method, device, equipment, storage medium and program product
Technical Field
The present application relates to the field of data analysis technologies, and in particular, to a data analysis method, apparatus, device, storage medium, and program product.
Background
When a large commercial bank builds a big data center, it adopts a Massively Parallel Processing (MPP) database cluster (a cluster, for short). With the development of information technology, the volume of information data and business data has surged, and the scale of the data centers built by large commercial banks has grown. To carry more data, MPP database clusters are becoming larger, and the number of machines in a single cluster keeps increasing. For example, depending on the business data volume or the demand for computing resources, a cluster usually has from dozens to hundreds of nodes. Because nodes differ in usage frequency and hardware quality, they wear unevenly over long-term use, and their performance gradually diverges. When cluster performance degrades, the overall batch processing time grows; the lower-performing nodes in the cluster then need to be located quickly so that they can be isolated and repaired.
At present, the existing solution is to stop the service of the whole cluster, manually perform performance tests on each node, perform manual comparison on the results of the performance tests, locate nodes with lower performance, perform troubleshooting on the nodes, and further determine slow nodes. The whole process is operated manually, the efficiency is low, the operation of the production system is required to be stopped, namely, the external service is stopped, and the capability of continuous operation of the whole system is greatly reduced.
Therefore, the prior art cannot rapidly locate the slow nodes in the cluster on the basis of not influencing production and operation.
Disclosure of Invention
Embodiments of the present application provide a data analysis method, apparatus, device, storage medium, and program product, so as to overcome a problem that a slow node in a cluster cannot be quickly located on the basis of not affecting production and operation in the prior art.
In a first aspect, an embodiment of the present application provides a data analysis method, where the method includes:
collecting data indexes of each node in a target cluster; the data indicators include at least one of: node task state index, network flow and operating system index;
predicting the fault probability of each node through a machine learning model according to the data index of each node;
and determining the slow nodes in the target cluster according to the predicted failure probability of each node.
In one possible design, the node task state indicators include: the number of tasks, task occupied resource information, task execution time, propagation delay, transmission delay, and queuing delay; the network traffic is collected by bypass network monitoring equipment; the operating system indicators include: instruction response time, Network Time Protocol error time, transactions per second, CPU load, memory utilization, disk capacity ratio, disk load, and number of files on the node;
the predicting and obtaining the fault probability of each node through a machine learning model according to the data indexes of each node comprises the following steps:
detecting outliers and invalid values in data indexes of the nodes by an iterative self-organizing data analysis algorithm aiming at each node, and eliminating the outliers and the invalid values to obtain target data;
performing correlation analysis on the features in the data indexes, determining the correlation among the features in the data indexes, and taking the correlation as a feature correlation index;
and inputting the target data and the characteristic correlation indexes into the machine learning model, and predicting to obtain the fault probability of the node.
In one possible design, the method further includes:
periodically collecting data indexes of each node in each cluster in a plurality of clusters, and taking the data indexes of each node in each cluster as historical indexes;
preprocessing the historical indexes aiming at each node in each cluster to obtain target historical data, and generating historical characteristic association indexes according to the historical indexes;
acquiring a historical analysis result of whether each node in each cluster is a slow node;
and taking the historical indexes of the nodes in each cluster, the corresponding historical characteristic correlation indexes and historical analysis results as samples, and training a machine learning model through the samples.
In one possible design, the target cluster is at least one cluster; the method further comprises the following steps:
storing periodically acquired data indexes of each node in a target cluster, data index acquisition time, feature association indexes, metadata of the node and corresponding fault probability;
and generating a cluster slow node monitoring database according to the data indexes of all nodes in the target cluster, the data index acquisition time, the characteristic correlation index, the metadata of the nodes and the corresponding fault probability, so as to support the unified management of the node data and the retrospective investigation of the analysis result containing the fault probability.
In one possible design, the determining slow nodes in the target cluster according to the predicted failure probability of each node comprises:
for each node, if the predicted failure probability of the node is greater than or equal to a preset probability threshold, determining that the node is a slow node in the target cluster;
correspondingly, the method further comprises the following steps:
triggering alarm operation aiming at the slow node, and counting alarm times;
and generating a slow node analysis result report according to the data index acquisition time, the alarming times and the data index corresponding to the slow node, so as to support the judgment of whether node isolation and repair are carried out.
In one possible design, the method further includes:
according to the cluster slow node monitoring database, searching historical related data corresponding to the slow nodes through the identification of the slow nodes;
and determining the reason for forming the slow node according to the historical related data and the slow node analysis result report.
In a second aspect, an embodiment of the present application provides a data analysis apparatus, including:
the acquisition module is used for acquiring data indexes of each node in the target cluster; the data indicator includes at least one of: node task state index, network flow and operating system index;
the prediction module is used for predicting the fault probability of each node through a machine learning model according to the data indexes of each node;
and the data analysis module is used for determining the slow nodes in the target cluster according to the predicted fault probability of each node.
In a third aspect, an embodiment of the present application provides an electronic device, including: at least one processor and memory;
the memory stores computer execution instructions;
the at least one processor executes computer-executable instructions stored by the memory to cause the at least one processor to perform the data analysis method as described above in the first aspect and various possible designs of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, where computer-executable instructions are stored, and when a processor executes the computer-executable instructions, the data analysis method according to the first aspect and various possible designs of the first aspect is implemented.
In a fifth aspect, embodiments of the present application provide a computer program product, which includes a computer program that, when executed by a processor, implements the data analysis method as described in the first aspect and various possible designs of the first aspect.
In the data analysis method, the apparatus, the device, the storage medium, and the program product provided in this embodiment, first, data indexes of each node in a target cluster are collected; the data indicators include at least one of: node task state index, network flow and operating system index; then, according to the data indexes of the nodes, predicting the fault probability of the nodes through a machine learning model; and then determining the slow nodes in the target cluster according to the predicted failure probability of each node. Therefore, the fault probability of the nodes is predicted by collecting various data closely related to the performance of each node of the cluster and based on the established machine learning model multidimensional analysis, external service does not need to be stopped, less downtime is achieved, the fault probability is predicted through the machine learning model, slow nodes can be accurately and quickly positioned, and further, whether node isolation and repair are carried out or not can be supported and judged, and the continuous and stable operation of a service system is ensured.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.
Fig. 1 is a schematic scene diagram of a data analysis method according to an embodiment of the present application;
Fig. 2 is a schematic view of a scenario of a data analysis method according to yet another embodiment of the present application;
Fig. 3 is a schematic flowchart of a data analysis method according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a data analysis apparatus according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and claims of this application and in the preceding drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
At present, the existing scheme is to stop the service of the whole cluster, manually perform performance tests on each node, perform manual comparison on the results of the performance tests, locate nodes with lower performance, perform troubleshooting on the nodes, and further determine slow nodes. The whole process is operated manually, the efficiency is low, the operation of the production system is required to be stopped, namely, the external service is stopped, and the capability of continuous operation of the whole system is greatly reduced. Therefore, the prior art cannot rapidly locate the slow nodes in the cluster on the basis of not influencing production and operation.
In view of these problems in the prior art, the technical idea of the present application is as follows: collect multiple kinds of data closely related to the performance of each node of the cluster, and predict the fault probability of each node through multidimensional analysis with an established machine learning model. External service does not need to be stopped, so downtime is reduced. Because the fault probability is predicted by the machine learning model, slow nodes can be located accurately and quickly, which in turn supports the decision on whether to isolate and repair a node and helps ensure the continuous and stable operation of the business system.
Interpretation of terms:
MPP database cluster: a database cluster optimized for analytical workloads, composed of multiple servers, that can aggregate and process large data sets; referred to below simply as the cluster.
Big data center: a collection of multiple database clusters and auxiliary systems built to process large-scale data sets.
Slow node: a node in the MPP cluster whose speed in processing external requests falls below a threshold because of hardware aging, insufficient resources, or similar causes. Because of the barrel effect (the slowest node bounds the whole), slow nodes can degrade the performance of the entire cluster.
Number of tasks: the number of tasks on each node in the cluster at a given moment or over a given period.
Task occupied resource information (i.e., task resource usage): the resources occupied by the tasks on each node in the cluster.
Task execution time: the execution time of each historical task and each current task on the node.
Propagation delay: the time required for an electrical or optical signal to propagate from one node to another.
Transmission delay: the time required to transmit a certain amount of data at a given transmission rate.
Queuing delay: the delay that arises when multiple servers try to send data over the network at the same time.
Data traffic statistics (i.e., network traffic): bypass network monitoring equipment is introduced to collect the network traffic of each node without disturbing it, which adds key system performance indicators and improves the accuracy of slow node detection.
Instruction response time: the time from the execution of an instruction to the return of its result, for example the time required to execute an SQL statement.
Network Time Protocol (NTP) error: the NTP error time.
Performance test indicator (TPS): transactions per second.
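As a rough illustration of the three delay terms defined above, the following sketch computes propagation and transmission delay from their textbook definitions and adds a queuing component. The function names, the assumed signal speed of 2e8 m/s, and all numeric inputs are hypothetical and are not taken from the application.

```python
def propagation_delay(distance_m: float, signal_speed_mps: float = 2.0e8) -> float:
    """Time for a signal to travel distance_m metres (assumed speed: 2e8 m/s)."""
    return distance_m / signal_speed_mps


def transmission_delay(payload_bits: float, link_rate_bps: float) -> float:
    """Time to push payload_bits onto a link running at link_rate_bps."""
    return payload_bits / link_rate_bps


def node_to_node_delay(propagation_s: float, transmission_s: float, queuing_s: float) -> float:
    """Sum of the three delay components named in the term list."""
    return propagation_s + transmission_s + queuing_s


# Hypothetical inputs: 100 m of cable, a 1 MB payload, a 10 Gbit/s link,
# and 50 microseconds spent waiting in a queue.
prop = propagation_delay(100.0)
trans = transmission_delay(8_000_000, 10e9)
total = node_to_node_delay(prop, trans, 50e-6)
```

Queuing delay has no closed form here; in practice it would be measured rather than computed, which is why the sketch takes it as an input.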
In practical applications, referring to fig. 1, fig. 1 is a schematic view of a scene of a data analysis method provided in an embodiment of the present application. The execution subject of the present application may be a data analysis apparatus, which may be deployed in an electronic device, such as a terminal device or a server.
Illustratively, take one cluster as an example. The cluster is composed of a plurality of servers, each of which is a node, such as node 1, node 2, …, node n. The servers are connected through a switch, and bypass network monitoring equipment can be connected to the switch. The bypass network monitoring equipment counts the network traffic of each node in a non-intrusive manner, adding to the performance indicators collected.
Specifically, a data acquisition module is arranged at each node of the cluster to collect key indicator data (namely, the data indicators) of each node as performance analysis data. By introducing the bypass network monitoring equipment, the network traffic of each node is collected without disturbance, which adds key system performance indicators and improves the accuracy of slow node detection. Then, an iterative self-organizing data analysis algorithm is used to detect outliers in the collected raw data (namely, the data indicators), and the outliers and some invalid values are removed to reduce the influence of noisy data on the analysis model. Feature analysis and correlation analysis are performed on the collected raw indicators to generate new feature indicators, which are stored in a feature indicator database for optimizing the data analysis model in subsequent periods. The preprocessed data indicators and the new feature indicators are input into the established machine learning model, which outputs the prediction result, namely the fault probability, of each node. Which node or nodes are slow nodes is then determined from the fault probability and a predefined probability threshold, so that poorly performing slow nodes can be located quickly and accurately.
The data analysis device herein may be a cluster performance monitoring system, and fig. 2 is a schematic view of a scenario of a data analysis method according to another embodiment of the present application, with reference to fig. 2. The cluster performance monitoring system may include a data acquisition subsystem, a data processing and index analysis subsystem, a data management subsystem, and a query and alarm subsystem.
Specifically, the data collection subsystem is configured to collect relevant indexes of the cluster periodically, where the relevant indexes may include, but are not limited to, a task state of the cluster (i.e., a task state of a node), a network level index (i.e., network traffic), and a system level index (i.e., an operating system index).
The data processing and indicator analysis subsystem is configured to preprocess the raw data, build a machine learning algorithm model to analyze the preprocessed indicators, and calculate the slow node probability. This includes establishing an independent node or cluster database and storing the statistical indicator data uniformly in a specified format for later use by the algorithm model.
The data management subsystem is configured to store all indicator information, cluster node metadata, slow node probability values, and the like in the feature indicator database. It stores the data analyzed and processed by the data processing and indicator analysis subsystem and generates a cluster slow node monitoring database that includes the information acquisition time (here, the data indicator acquisition time), the raw indicator data (here, the data indicators), the new feature indicator data generated by correlation analysis (here, the feature correlation indicators), the cluster node metadata, and the algorithm model analysis results (here, the predicted fault probability), which facilitates unified management of the data and retrospective examination of the analysis results.
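One way to picture the cluster slow node monitoring database is a single table keyed by node and acquisition time. The sketch below uses SQLite from the Python standard library; the table name, column names, JSON encoding, and helper functions are all assumptions made for illustration, since the application does not specify a storage engine or schema.

```python
import json
import sqlite3

# Hypothetical schema: one row per node per acquisition time.
SCHEMA = """
CREATE TABLE IF NOT EXISTS node_observation (
    node_id              TEXT NOT NULL,
    collected_at         TEXT NOT NULL,  -- data indicator acquisition time
    raw_indicators       TEXT NOT NULL,  -- JSON: the collected data indicators
    correlation_features TEXT NOT NULL,  -- JSON: feature correlation indicators
    node_metadata        TEXT NOT NULL,  -- JSON: metadata of the node
    fault_probability    REAL NOT NULL,  -- model output for this sample
    PRIMARY KEY (node_id, collected_at)
)
"""


def open_monitoring_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store_observation(conn, node_id, collected_at, raw, features, metadata, probability):
    conn.execute(
        "INSERT INTO node_observation VALUES (?, ?, ?, ?, ?, ?)",
        (node_id, collected_at, json.dumps(raw), json.dumps(features),
         json.dumps(metadata), probability),
    )
    conn.commit()


def history_for_node(conn, node_id):
    """Return (collected_at, fault_probability) pairs for retrospective review."""
    return conn.execute(
        "SELECT collected_at, fault_probability FROM node_observation "
        "WHERE node_id = ? ORDER BY collected_at",
        (node_id,),
    ).fetchall()
```

Keeping the raw indicators next to the predicted probability is what makes the retrospective examination described above possible: a reviewer can see both what the model saw and what it concluded.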
The query and alarm subsystem: operation and maintenance personnel can log in to the query and alarm subsystem to query the operating indicators and slow node risk of each node of the current cluster, and to generate a slow node analysis result report as needed. They can set the slow node alarm threshold as required. An alarm is triggered for any node whose slow node probability is at or above the threshold (namely, a node whose predicted fault probability is greater than or equal to the preset probability threshold), and system personnel are notified in time so that they can decide whether to isolate and repair the node. Node isolation is carried out based on the screening result, and the screening result is recorded in the database. The subsystem also supports optimizing the historical data in the database and tuning the model parameters, which further improves the accuracy of the analysis model.
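The alarm behaviour of the query and alarm subsystem can be sketched as a small in-memory tracker that triggers on the threshold, counts alarms per node, and assembles a report. The class name, field names, and report layout are hypothetical; the application states only that the report draws on the acquisition time, the number of alarms, and the data indicators of the slow node.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SlowNodeAlarms:
    threshold: float                                   # preset probability threshold
    counts: Counter = field(default_factory=Counter)   # alarms raised per node
    events: List[Dict[str, Any]] = field(default_factory=list)

    def observe(self, node_id: str, collected_at: str,
                fault_probability: float, indicators: Dict[str, float]) -> bool:
        """Record one prediction; return True if it triggered an alarm."""
        if fault_probability < self.threshold:
            return False
        self.counts[node_id] += 1
        self.events.append({
            "node_id": node_id,
            "collected_at": collected_at,
            "fault_probability": fault_probability,
            "indicators": indicators,
        })
        return True

    def report(self, node_id: str) -> Dict[str, Any]:
        """Slow node analysis report: alarm count plus the triggering samples."""
        return {
            "node_id": node_id,
            "alarm_count": self.counts[node_id],
            "events": [e for e in self.events if e["node_id"] == node_id],
        }
```

A report with a high alarm count over a long window is a stronger case for isolation than a single spike, which is presumably why the number of alarms is tracked alongside the raw indicators.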
Therefore, the fault probability of the nodes is predicted by collecting various data closely related to the performance of each node of the cluster and based on the established machine learning model multi-dimensional analysis, external service does not need to be stopped, the down time is short, the fault probability is predicted through the machine learning model, the slow nodes can be accurately and quickly positioned, and further, whether node isolation and repair are carried out or not can be supported and judged, so that the continuous and stable operation of a service system is ensured.
The technical solution of the present application will be described in detail below with specific examples. These several specific embodiments may be combined with each other below, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 3 is a schematic flowchart of a data analysis method provided in an embodiment of the present application, where the method may include:
S301, collecting data indexes of all nodes in the target cluster.
Wherein the data indicators include at least one of: node task state index, network flow, operating system index.
In this embodiment, the execution subject of the data analysis method may be a data analysis apparatus, and the data analysis apparatus may be deployed in an electronic device, such as a terminal device, a server, and the like. Specifically, a data collection module or a data collection subsystem may be installed at each node (i.e., server) in the cluster to periodically collect data indexes, such as node task state indexes, network traffic, operating system indexes, and the like.
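A per-node collection module of the kind described here might look like the following minimal sketch. It gathers only two of the operating system indicators (CPU load and disk capacity ratio) using the Python standard library; the record layout, function names, and sampling loop are assumptions, and a real collector would also cover task state and the other listed indicators.

```python
import os
import shutil
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class OsIndicators:
    collected_at: float           # Unix timestamp of the sample
    cpu_load_1m: Optional[float]  # 1-minute load average, None if unavailable
    disk_capacity_ratio: float    # used bytes / total bytes


def collect_os_indicators(path: str = ".") -> OsIndicators:
    """Take one sample of the indicators this sketch covers."""
    usage = shutil.disk_usage(path)
    try:
        load = os.getloadavg()[0]      # not available on every platform
    except (AttributeError, OSError):
        load = None
    ratio = usage.used / usage.total if usage.total else 0.0
    return OsIndicators(time.time(), load, ratio)


def collect_periodically(rounds: int, interval_s: float,
                         collect: Callable[[], OsIndicators] = collect_os_indicators
                         ) -> List[OsIndicators]:
    """Collect a fixed number of samples, sleeping interval_s between them."""
    samples = []
    for i in range(rounds):
        samples.append(collect())
        if i < rounds - 1:
            time.sleep(interval_s)
    return samples
```

For example, `collect_periodically(3, 60.0)` would take three samples one minute apart; a production collector would more likely run as a long-lived agent and ship each sample to the data processing subsystem.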
S302, according to the data indexes of the nodes, the fault probability of the nodes is obtained through prediction of a machine learning model.
In this embodiment, a machine learning model, for example, a neural network model, may be trained to support multiple input and multiple output, that is, multidimensional feature data of each node is input into the trained neural network model, and a prediction result used for determining whether each node is a slow node is output, where the prediction result may be a failure probability. Therefore, the fault probability for determining whether the node is the slow node can be predicted through the trained machine learning model, and the slow node positioning efficiency and accuracy are improved.
In one possible design, how to train the machine learning model can be achieved by:
a1, periodically collecting data indexes of each node in each cluster in a plurality of clusters, and taking the data indexes of each node in each cluster as historical indexes;
a2, preprocessing the historical indexes aiming at each node in each cluster to obtain target historical data, and generating historical characteristic association indexes according to the historical indexes;
a3, acquiring a historical analysis result of whether each node in each cluster is a slow node;
and a4, taking the historical indexes of each node in each cluster, the corresponding historical characteristic correlation indexes and historical analysis results as samples, and training a machine learning model through each sample.
In this embodiment, taking one cluster as an example, the data indicators of each node in the cluster are collected periodically and used as historical indicators. Before the machine learning model is trained, the data indicators may be preprocessed to reduce the effect of noisy data on the model. The preprocessing may use an iterative self-organizing data analysis algorithm to detect outliers in the raw data and remove the outliers and some invalid values.
In addition, to increase the dimensionality of the input and further improve the accuracy of the machine learning model, feature analysis and correlation analysis may be performed on the data indicators to generate new feature indicators, namely the feature correlation indicators.
Then, results are analyzed based on the historical indicators and the historical feature correlation indicators. In the early stage, slow nodes are checked and selected based on a mathematical model and operation and maintenance experience; this accumulates empirical methods for locating slow node problems and also provides training data for the machine learning algorithm model. The analysis result (for example, a slow node is marked 0 and a non-slow node is marked 1) is used as the label of the training sample. Model training is carried out continuously on the historical analysis data (including the historical indicators, the historical feature correlation indicators, and the analysis results) to optimize the model parameters. Specifically, an artificial neural network algorithm is constructed for the statistical indicators, and a functional expression relating each statistical indicator to the slow node probability is fitted.
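The training step can be illustrated with a deliberately small stand-in. The application describes an artificial neural network; the sketch below swaps in a single sigmoid unit (logistic regression) trained by batch gradient descent, which is the simplest model that maps indicators to a probability. The feature choice, sample values, and hyperparameters are hypothetical. Because the text labels slow nodes 0 and non-slow nodes 1, the sketch flips the labels so that the model output reads directly as a fault probability; that flip is an assumption of this example.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    # Numerically stable form for both signs of z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_fault_probability(weights: Sequence[float], bias: float,
                              features: Sequence[float]) -> float:
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train(samples: List[Sequence[float]], source_labels: List[int],
          epochs: int = 2000, learning_rate: float = 0.5) -> Tuple[List[float], float]:
    # Source convention: slow = 0, non-slow = 1. Flip so that target 1 means slow.
    targets = [1 - label for label in source_labels]
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, targets):
            error = predict_fault_probability(weights, bias, x) - y
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - learning_rate * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
    return weights, bias


# Hypothetical history: [normalized CPU load, normalized instruction response time].
history = [[0.9, 0.8], [0.8, 0.9], [0.95, 0.7], [0.2, 0.1], [0.3, 0.2], [0.1, 0.3]]
labels = [0, 0, 0, 1, 1, 1]   # source convention: 0 = slow, 1 = non-slow
weights, bias = train(history, labels)
```

A multi-layer network would replace the single unit when the relation between indicators and slowness is not linearly separable, but the data flow (labelled history in, fitted parameters out) stays the same.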
Therefore, the fault probability for determining whether the node is the slow node can be predicted through the trained machine learning model, and the slow node positioning efficiency and accuracy are improved.
S303, according to the predicted fault probability of each node, determining the slow nodes in the target cluster.
In one possible design, the determining the slow node in the target cluster according to the predicted failure probability of each node may be implemented by:
and for each node, if the predicted failure probability of the node is greater than or equal to a preset probability threshold, determining the node as a slow node in the target cluster.
In this embodiment, whether a node meets the slow node condition can be determined by comparing the fault probability with the set threshold. An alarm is triggered for any node whose slow node probability reaches the threshold, and system personnel are notified in time so that they can decide whether to isolate and repair the node.
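The threshold comparison itself is a one-line filter. In the sketch below the function name, node identifiers, and probabilities are hypothetical; the only rule taken from the text is that a node counts as slow when its predicted fault probability is greater than or equal to the preset threshold.

```python
from typing import Dict, List


def find_slow_nodes(fault_probabilities: Dict[str, float], threshold: float) -> List[str]:
    """Return the nodes whose predicted fault probability is >= threshold."""
    return sorted(node for node, p in fault_probabilities.items() if p >= threshold)


predicted = {"node-1": 0.10, "node-2": 0.80, "node-3": 0.92}
slow = find_slow_nodes(predicted, threshold=0.8)
```

Note that the comparison is inclusive, so a node sitting exactly at the threshold (node-2 here) is reported as slow.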
The data analysis method provided by the application comprises the steps of collecting data indexes of all nodes in a target cluster; the data indicator includes at least one of: node task state index, network flow and operating system index; then, according to the data indexes of the nodes, predicting the fault probability of the nodes through a machine learning model; and then determining the slow nodes in the target cluster according to the predicted fault probability of each node. Therefore, the fault probability of the nodes is predicted by collecting various data closely related to the performance of each node of the cluster and based on the established machine learning model multi-dimensional analysis, external service does not need to be stopped, the down time is short, the fault probability is predicted through the machine learning model, the slow nodes can be accurately and quickly positioned, and further, whether node isolation and repair are carried out or not can be supported and judged, so that the continuous and stable operation of a service system is ensured.
In a possible design, predicting the fault probability of each node through the machine learning model according to the data indexes of each node may be implemented by the following steps:
step b1, for each node, detecting outliers and invalid values in the data indexes of the node through an iterative self-organizing data analysis (ISODATA) algorithm, and removing the outliers and invalid values to obtain target data;
step b2, performing correlation analysis on the features in the data indexes, determining the correlation among the features, and taking the correlation as a feature correlation index;
step b3, inputting the target data and the feature correlation index into the machine learning model, and predicting the fault probability of the node.
Wherein the node task state indexes include: the number of tasks, the resources occupied by tasks, task execution time, propagation delay, transmission delay, and queuing delay; the network traffic is collected by bypass network monitoring equipment; and the operating system indexes include: instruction response time, Network Time Protocol (NTP) time error, transactions per second, CPU load, memory utilization, disk capacity usage ratio, disk load, and the number of files on the node.
In this embodiment, to reduce the influence of noisy data on the model, the data indexes may first be preprocessed. The preprocessed indexes, together with the new indexes generated from the data indexes, are then input into the trained machine learning model to predict the fault probability of each node.
Specifically, for each node, outliers and invalid values in the data indexes are detected through the iterative self-organizing data analysis algorithm and removed to obtain the target data corresponding to the node. To further improve the accuracy of the machine learning model, a new index can be added on top of the multi-dimensional target data. The new index is obtained by performing correlation analysis on the features in the data indexes to determine the correlation among them, and this correlation serves as the feature correlation index. The target data and the feature correlation index of each node are then input into the machine learning model to predict the fault probability of each node.
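The preprocessing and correlation steps can be illustrated with a short sketch. Note the substitution: the application names an iterative self-organizing data analysis algorithm for outlier detection, whereas this sketch uses a much simpler median/MAD rule as a stand-in. The metric names, the cutoff `k`, and the helper names are all hypothetical, and Pearson correlation is assumed as the correlation measure.

```python
import math
from itertools import combinations
from statistics import median
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]  # one acquisition period for one node


def is_valid(v: Optional[float]) -> bool:
    return v is not None and not math.isnan(v)


def clean_rows(rows: List[Row], k: float = 3.5) -> List[Row]:
    """Drop rows that hold an invalid value or an outlier (median/MAD rule,
    used here in place of the clustering-based detection in the text)."""
    names = list(rows[0].keys())
    stats = {}
    for name in names:
        vals = [r[name] for r in rows if is_valid(r[name])]
        med = median(vals)
        stats[name] = (med, median([abs(v - med) for v in vals]))

    def keep(row: Row) -> bool:
        for name in names:
            v = row[name]
            if not is_valid(v):
                return False
            med, mad = stats[name]
            if mad > 0 and 0.6745 * abs(v - med) / mad > k:
                return False
        return True

    return [r for r in rows if keep(r)]


def pearson(xs: List[float], ys: List[float]) -> float:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_indexes(rows: List[Row]) -> Dict[str, float]:
    """Pairwise correlation between metrics, used as the new feature index."""
    names = list(rows[0].keys())
    return {f"{a}~{b}": pearson([r[a] for r in rows], [r[b] for r in rows])
            for a, b in combinations(names, 2)}


def build_model_input(rows: List[Row]) -> List[float]:
    """Latest cleaned sample plus correlation indexes, as one feature vector."""
    cleaned = clean_rows(rows)
    return list(cleaned[-1].values()) + list(correlation_indexes(cleaned).values())


rows = [
    {"cpu_load": 0.2, "task_time": 10.0},
    {"cpu_load": 0.4, "task_time": 20.0},
    {"cpu_load": 0.6, "task_time": 30.0},
    {"cpu_load": 0.8, "task_time": 40.0},
    {"cpu_load": None, "task_time": 25.0},   # invalid value
    {"cpu_load": 0.5, "task_time": 900.0},   # outlier
]
cleaned = clean_rows(rows)
corr = correlation_indexes(cleaned)
print(len(cleaned), corr)
```

In this sample the row with the missing value and the row with the extreme task time are both dropped, and the remaining four rows are perfectly linear, so the correlation index is close to 1. The resulting feature vector is what would be handed to the model in step b3.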
In a possible design, this embodiment describes the data analysis method in further detail on the basis of the above embodiments. The method may further include the following steps:
step c1, storing the periodically acquired data indexes of each node in the target cluster, the data index acquisition time, the feature correlation indexes, the node metadata, and the corresponding fault probability;
step c2, generating a cluster slow node monitoring database from the data indexes of each node in the target cluster, the data index acquisition time, the feature correlation indexes, the node metadata, and the corresponding fault probability, so as to support unified management of the node data and retrospective investigation of the analysis results containing the fault probability.
Wherein the target cluster is at least one cluster.
In this embodiment, a cluster slow node monitoring database is generated by periodically collecting data indexes, so as to support unified management of node data and retrospective investigation of analysis results including failure probability.
Specifically, all index information in the feature index library (such as the data indexes and the new indexes), the cluster node metadata, the slow-node probability value (that is, the fault probability of the node), and the like are stored. Both the original index data and the new index data produced during processing are stored, including the information acquisition time, the original index data, the new feature index data generated by correlation analysis, the cluster node metadata, and the analysis results of the algorithm model. From these a cluster slow node monitoring database is generated, which facilitates unified management of the data and retrospective investigation of the analysis results, for example checking at which stage a slow node began to have problems.
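One way to picture the monitoring database is a single table keyed by cluster, node, and acquisition time. The sketch below uses SQLite from the Python standard library. The table name, column names, and JSON encoding of the metrics are assumptions made for illustration, since the application does not specify a storage engine or schema.

```python
import json
import sqlite3
from typing import Dict, List, Tuple


def open_monitor_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create (if needed) a hypothetical slow-node monitoring table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS slow_node_monitor (
               cluster_id          TEXT NOT NULL,
               node_id             TEXT NOT NULL,
               collected_at        TEXT NOT NULL,  -- ISO 8601 acquisition time
               metrics_json        TEXT NOT NULL,  -- original data indexes
               correlation_json    TEXT NOT NULL,  -- feature correlation indexes
               node_metadata_json  TEXT NOT NULL,
               fault_probability   REAL NOT NULL,
               PRIMARY KEY (cluster_id, node_id, collected_at)
           )"""
    )
    return conn


def store_sample(conn: sqlite3.Connection, cluster_id: str, node_id: str,
                 collected_at: str, metrics: Dict[str, float],
                 correlations: Dict[str, float], metadata: Dict[str, str],
                 fault_probability: float) -> None:
    """Store one periodic sample together with its analysis result."""
    conn.execute(
        "INSERT INTO slow_node_monitor VALUES (?, ?, ?, ?, ?, ?, ?)",
        (cluster_id, node_id, collected_at, json.dumps(metrics),
         json.dumps(correlations), json.dumps(metadata), fault_probability),
    )
    conn.commit()


def history_for_node(conn: sqlite3.Connection, node_id: str) -> List[Tuple[str, float]]:
    """Return (acquisition time, fault probability) pairs in time order."""
    return conn.execute(
        "SELECT collected_at, fault_probability FROM slow_node_monitor "
        "WHERE node_id = ? ORDER BY collected_at",
        (node_id,),
    ).fetchall()


conn = open_monitor_db()
store_sample(conn, "cluster-a", "node-1", "2022-09-27T10:05:00",
             {"cpu_load": 0.9}, {"cpu_load~task_time": 0.8}, {"rack": "r1"}, 0.85)
store_sample(conn, "cluster-a", "node-1", "2022-09-27T10:00:00",
             {"cpu_load": 0.4}, {"cpu_load~task_time": 0.3}, {"rack": "r1"}, 0.40)
history = history_for_node(conn, "node-1")
print(history)
```

Keeping the acquisition time in every row is what makes retrospective investigation possible: the history of one node can be read back in order to see when its fault probability started to rise.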
In one possible design, the data analysis method may further include the following steps:
step d1, triggering an alarm operation for the slow node, and counting the number of alarms;
step d2, generating a slow node analysis result report from the data index acquisition time, the number of alarms, and the data indexes corresponding to the slow node, so as to support the decision on whether to isolate and repair the node.
In this embodiment, because acquisition is periodic, each acquisition period can determine whether a slow node exists, trigger an alarm for it, and count the alarms. For example, if three periods have been acquired in the current stage and slow node 1 is present in each of them, an alarm is triggered for slow node 1 in every period, and the counted number of alarms is 3.
Specifically, operation and maintenance personnel can log in to the query and alarm subsystem to query the operating indexes and slow-node risk of each node in the current cluster and to generate a slow node analysis result report. An alarm is triggered for any node whose slow-node probability reaches the threshold, so that system personnel are promptly notified and can decide whether to isolate and repair the node.
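The per-period alarm counting and the report of step d2 might look like the following sketch. The class `AlarmTracker`, its fields, and the report layout are hypothetical, and the sample data simply replays the three-period example from the text.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AlarmTracker:
    threshold: float
    alarm_counts: Counter = field(default_factory=Counter)
    events: List[dict] = field(default_factory=list)

    def process_period(self, collected_at: str,
                       probabilities: Dict[str, float],
                       metrics: Dict[str, Dict[str, float]]) -> List[str]:
        """Raise one alarm per slow node in this period and count it."""
        alarmed = []
        for node_id, p in probabilities.items():
            if p >= self.threshold:
                self.alarm_counts[node_id] += 1
                self.events.append({"node_id": node_id,
                                    "collected_at": collected_at,
                                    "metrics": metrics.get(node_id, {})})
                alarmed.append(node_id)
        return alarmed

    def report(self, node_id: str) -> dict:
        """Assemble acquisition times, alarm count, and metrics for one node."""
        own = [e for e in self.events if e["node_id"] == node_id]
        return {"node_id": node_id,
                "alarm_count": self.alarm_counts[node_id],
                "collected_at": [e["collected_at"] for e in own],
                "metrics": [e["metrics"] for e in own]}


tracker = AlarmTracker(threshold=0.8)
for t in ("10:00", "10:05", "10:10"):
    # Slow node 1 exceeds the threshold in each of the three periods.
    tracker.process_period(t, {"node-1": 0.9, "node-2": 0.2},
                           {"node-1": {"cpu_load": 0.95}})
node1_report = tracker.report("node-1")
print(node1_report["alarm_count"])
```

For the three periods above, the alarm count for node-1 is 3, matching the example in the description, while node-2 never triggers an alarm.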
In one possible design, the data analysis method may further include the following steps:
step e1, searching the cluster slow node monitoring database for the historical related data corresponding to the slow node by the identifier of the slow node;
step e2, determining the cause of the slow node according to the historical related data and the slow node analysis result report.
In this embodiment, the historical related data corresponding to the slow node, such as its data indexes, data index acquisition time, feature correlation indexes, node metadata, and corresponding fault probability, can be looked up in the cluster slow node monitoring database by the identifier of the slow node. The corresponding number of alarms and the specific indexes are then looked up in the slow node analysis result report by the same identifier, and it is determined whether to isolate and repair the node. If the screening result is node isolation, the result is entered into the database, the historical data in the database is refined, and the model parameters are optimized, which further improves the accuracy of the analysis model.
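The application does not state how the cause is derived from the historical data and the report. Purely as an illustrative assumption, one simple heuristic is to compare the metrics in the alarming sample with the node's own historical baseline and rank the metrics by relative deviation. The function name and metric names below are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple


def rank_suspect_metrics(history: List[Dict[str, float]],
                         latest: Dict[str, float]) -> List[Tuple[str, float]]:
    """Rank metrics of the alarming sample by relative deviation from the
    node's historical mean (an assumed heuristic, not the patented method)."""
    scores = {}
    for name, value in latest.items():
        past = [h[name] for h in history if name in h]
        if not past:
            continue  # no baseline for this metric
        base = mean(past)
        if base == 0:
            continue  # relative deviation is undefined for a zero baseline
        scores[name] = abs(value - base) / abs(base)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Historical samples for one slow node, e.g. read back from the monitoring database.
history = [
    {"cpu_load": 0.30, "disk_load": 0.40},
    {"cpu_load": 0.35, "disk_load": 0.42},
    {"cpu_load": 0.32, "disk_load": 0.41},
]
latest = {"cpu_load": 0.33, "disk_load": 0.95}  # sample that triggered the alarm
ranking = rank_suspect_metrics(history, latest)
print(ranking[0][0])
```

Here the disk load has moved far from its baseline while the CPU load has not, so disk load is ranked first as the most likely contributor. An operator would still confirm the cause against the analysis result report.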
In order to implement the data analysis method, the present embodiment provides a data analysis apparatus. Referring to fig. 4, fig. 4 is a schematic structural diagram of a data analysis apparatus provided in the embodiment of the present application; a data analysis device 40 comprising:
the acquisition module 401 is configured to acquire data indexes of each node in the target cluster; the data indicator includes at least one of: node task state index, network flow and operating system index;
a prediction module 402, configured to predict, according to the data index of each node, a fault probability of each node through a machine learning model;
and a data analysis module 403, configured to determine slow nodes in the target cluster according to the predicted failure probability of each node.
In this embodiment, the acquisition module 401, the prediction module 402, and the data analysis module 403 are configured to: collect data indexes of each node in the target cluster, where the data indexes include at least one of node task state indexes, network traffic, and operating system indexes; predict the fault probability of each node through a machine learning model according to the data indexes of the node; and determine the slow nodes in the target cluster according to the predicted fault probability of each node. Because several kinds of data closely related to the performance of each cluster node are collected and analyzed across multiple dimensions with the established machine learning model, external services do not need to be stopped and downtime is kept short. Predicting the fault probability with the model allows slow nodes to be located accurately and quickly, which in turn supports the decision on whether to isolate and repair a node and helps keep the service system running continuously and stably.
The apparatus provided in this embodiment may be configured to implement the technical solutions of the method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.
In one possible design, the node task state indexes include: the number of tasks, the resources occupied by tasks, task execution time, propagation delay, transmission delay, and queuing delay; the network traffic is collected by bypass network monitoring equipment; and the operating system indexes include: instruction response time, Network Time Protocol (NTP) time error, transactions per second, CPU load, memory utilization, disk capacity usage ratio, disk load, and the number of files on the node;
the prediction module is specifically configured to:
detecting an outlier and an invalid value in a data index of each node through an iterative self-organizing data analysis algorithm, and eliminating the outlier and the invalid value to obtain target data;
performing correlation analysis on the features in the data indexes, determining the correlation among the features in the data indexes, and taking the correlation as a feature correlation index;
and inputting the target data and the characteristic correlation indexes into the machine learning model, and predicting to obtain the fault probability of the node.
In one possible design, the data analysis device further includes a model training module, which is configured to perform the following:
periodically collecting data indexes of each node in each cluster in a plurality of clusters, and taking the data indexes of each node in each cluster as historical indexes;
preprocessing the historical indexes aiming at each node in each cluster to obtain target historical data, and generating historical characteristic association indexes according to the historical indexes;
obtaining a historical analysis result of whether each node in each cluster is a slow node;
and taking the historical indexes of the nodes in each cluster, the corresponding historical characteristic correlation indexes and historical analysis results as samples, and training a machine learning model through the samples.
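The training step can be sketched with a small self-contained classifier. This section does not fix the model type, so logistic regression trained by batch gradient descent is an assumption here, as are the feature layout (two normalized metrics plus one correlation index), the learning rate, and the sample values.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    # Numerically stable form for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(samples: Sequence[Sequence[float]], labels: Sequence[int],
          lr: float = 0.5, epochs: int = 2000) -> Tuple[List[float], float]:
    """Fit logistic regression weights and bias by batch gradient descent."""
    n, d = len(samples), len(samples[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(samples, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_fault_probability(w: List[float], b: float, x: Sequence[float]) -> float:
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical samples: [cpu_load, normalized task time, correlation index].
# Label 1 means the historical analysis result marked the node as slow.
X = [[0.20, 0.10, 0.1], [0.30, 0.20, 0.2], [0.25, 0.15, 0.1],
     [0.90, 0.80, 0.9], [0.85, 0.90, 0.8], [0.95, 0.85, 0.9]]
y = [0, 0, 0, 1, 1, 1]
weights, bias = train(X, y)
p_slow = predict_fault_probability(weights, bias, [0.9, 0.9, 0.9])
p_normal = predict_fault_probability(weights, bias, [0.2, 0.2, 0.1])
print(p_slow > 0.5, p_normal < 0.5)
```

The output of `predict_fault_probability` plays the role of the fault probability that is later compared with the preset threshold. Any classifier that yields a probability could take its place.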
In one possible design, the target cluster is at least one cluster; the data analysis apparatus further includes: the device comprises a storage module and a generation module;
the storage module is used for storing the periodically acquired data indexes of each node in the target cluster, the data index acquisition time, the characteristic association index, the metadata of the node and the corresponding fault probability;
and the generating module is used for generating a cluster slow node monitoring database according to the data indexes of all nodes in the target cluster, the data index acquisition time, the characteristic association indexes, the metadata of the nodes and the corresponding fault probability, so as to support the unified management of the node data and the retrospective investigation of the analysis result containing the fault probability.
In one possible design, the data analysis module is specifically configured to:
for each node, if the predicted failure probability of the node is greater than or equal to a preset probability threshold, determining that the node is a slow node in the target cluster;
correspondingly, the data analysis device further comprises: an alarm module;
the alarm module is used for triggering an alarm operation for the slow node and counting the number of alarms;
wherein the generation module is further configured to:
and generating a slow node analysis result report according to the data index acquisition time, the alarming times and the data index corresponding to the slow node, so as to support the judgment of whether node isolation and repair are carried out.
In one possible design, the data analysis device further includes a processing module, which is configured to perform the following:
according to the cluster slow node monitoring database, searching historical related data corresponding to the slow nodes through the identification of the slow nodes;
and determining the reason for forming the slow node according to the historical related data and the slow node analysis result report.
In order to implement the data analysis method, this embodiment provides an electronic device. Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 5, the electronic device 50 of this embodiment includes at least one processor 501 and a memory 502. The memory 502 is used for storing computer-executable instructions, and the at least one processor 501 is used for executing the computer-executable instructions stored in the memory to implement the steps performed in the above-described embodiments. For details, reference may be made to the description of the method embodiments above.
An embodiment of the present application further provides a computer-readable storage medium, where computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the data analysis method as described above is implemented.
Embodiments of the present application further provide a computer program product, which includes a computer program, and when executed by a processor, the computer program implements the data analysis method as described above.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into modules is only a division by logical function, and other divisions are possible in practice: a plurality of modules may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or modules, and may be electrical, mechanical, or of another form. Functional modules in the embodiments of the present application may be integrated into one processing unit, each module may exist alone physically, or two or more modules may be integrated into one unit. The unit formed by the modules may be implemented in hardware, or in hardware combined with software functional units.
An integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods described in the embodiments of the present application. It should be understood that the processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), or the like. A general-purpose processor may be a microprocessor or any conventional processor. The steps of a method disclosed in connection with the present application may be embodied directly in a hardware processor, or in a combination of hardware and software modules in a processor.
The memory may comprise high-speed RAM, and may further comprise non-volatile memory (NVM), such as at least one disk memory; it may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, an optical disk, or the like. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus. The storage medium may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the storage medium may also reside as discrete components in an electronic device or host device.
Those of ordinary skill in the art will understand that all or a portion of the steps of the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments described above. The aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A method of data analysis, the method comprising:
collecting data indexes of each node in a target cluster; the data indicators include at least one of: node task state indexes, network flow and operating system indexes;
predicting the fault probability of each node through a machine learning model according to the data indexes of each node;
and determining the slow nodes in the target cluster according to the predicted fault probability of each node.
2. The method of claim 1, wherein the node task state metrics comprise: the task quantity, the resource information occupied by the task, the task execution time, the propagation delay, the transmission delay and the queuing delay; the network flow is collected by bypass network monitoring equipment; the operating system metrics include: instruction response time, network time protocol error time, transaction amount per second, CPU load, memory utilization rate, disk capacity ratio, disk load and node file number;
the predicting and obtaining the fault probability of each node through a machine learning model according to the data indexes of each node comprises the following steps:
detecting an outlier and an invalid value in a data index of each node through an iterative self-organizing data analysis algorithm, and eliminating the outlier and the invalid value to obtain target data;
performing correlation analysis on the features in the data indexes, determining the correlation among the features in the data indexes, and taking the correlation as a feature correlation index;
and inputting the target data and the characteristic correlation index into the machine learning model, and predicting to obtain the fault probability of the node.
3. The method of claim 2, further comprising:
periodically collecting data indexes of each node in each cluster in a plurality of clusters, and taking the data indexes of each node in each cluster as historical indexes;
preprocessing the historical indexes aiming at each node in each cluster to obtain target historical data, and generating historical characteristic association indexes according to the historical indexes;
acquiring a historical analysis result of whether each node in each cluster is a slow node;
and taking the historical indexes of the nodes in each cluster, the corresponding historical characteristic correlation indexes and historical analysis results as samples, and training a machine learning model through the samples.
4. The method of any one of claims 1-3, wherein the target cluster is at least one cluster; the method further comprises the following steps:
storing periodically acquired data indexes of each node in a target cluster, data index acquisition time, feature association indexes, metadata of the node and corresponding fault probability;
and generating a cluster slow node monitoring database according to the data indexes of all nodes in the target cluster, the data index acquisition time, the characteristic association indexes, the metadata of the nodes and the corresponding fault probability, so as to support the unified management of the node data and the retroactive investigation of the analysis result containing the fault probability.
5. The method of claim 4, wherein determining the slow nodes in the target cluster based on the predicted failure probability of each of the nodes comprises:
for each node, if the predicted failure probability of the node is greater than or equal to a preset probability threshold, determining the node as a slow node in the target cluster;
correspondingly, the method further comprises the following steps:
triggering alarm operation aiming at the slow node, and counting alarm times;
and generating a slow node analysis result report according to the data index acquisition time, the alarming times and the data index corresponding to the slow node, so as to support the judgment of whether node isolation and repair are carried out.
6. The method of claim 5, further comprising:
according to the cluster slow node monitoring database, searching historical related data corresponding to the slow nodes through the identification of the slow nodes;
and determining the reason for forming the slow node according to the historical related data and the slow node analysis result report.
7. A data analysis apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring data indexes of each node in the target cluster; the data indicator includes at least one of: node task state index, network flow and operating system index;
the prediction module is used for predicting the fault probability of each node through a machine learning model according to the data indexes of each node;
and the data analysis module is used for determining the slow nodes in the target cluster according to the predicted fault probability of each node.
8. An electronic device, comprising: at least one processor and a memory;
the memory stores computer-executable instructions;
the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the data analysis method of any of claims 1 to 6.
9. A computer-readable storage medium having computer-executable instructions stored therein which, when executed by a processor, implement the data analysis method of any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, carries out the data analysis method according to any one of claims 1 to 6.
CN202211182174.5A 2022-09-27 2022-09-27 Data analysis method, device, equipment, storage medium and program product Pending CN115543671A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211182174.5A CN115543671A (en) 2022-09-27 2022-09-27 Data analysis method, device, equipment, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211182174.5A CN115543671A (en) 2022-09-27 2022-09-27 Data analysis method, device, equipment, storage medium and program product

Publications (1)

Publication Number Publication Date
CN115543671A true CN115543671A (en) 2022-12-30

Family

ID=84730295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211182174.5A Pending CN115543671A (en) 2022-09-27 2022-09-27 Data analysis method, device, equipment, storage medium and program product

Country Status (1)

Country Link
CN (1) CN115543671A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116992235A (en) * 2023-08-09 2023-11-03 哈尔滨天君科技有限公司 Big data analysis system and method for computer parallelization synchronization


Similar Documents

Publication Publication Date Title
KR101984730B1 (en) Automatic predicting system for server failure and automatic predicting method for server failure
CN108923952B (en) Fault diagnosis method, equipment and storage medium based on service monitoring index
KR102522005B1 (en) Apparatus for VNF Anomaly Detection based on Machine Learning for Virtual Network Management and a method thereof
US10031829B2 (en) Method and system for it resources performance analysis
CN108052528A (en) A kind of storage device sequential classification method for early warning
CN113254255B (en) Cloud platform log analysis method, system, device and medium
CN112148561B (en) Method and device for predicting running state of business system and server
CN111611146B (en) Micro-service fault prediction method and device
Di et al. Exploring properties and correlations of fatal events in a large-scale hpc system
CN109857618B (en) Monitoring method, device and system
CN112465237B (en) Fault prediction method, device, equipment and storage medium based on big data analysis
CN114327964A (en) Method, device, equipment and storage medium for processing fault reasons of service system
CN112433928A (en) Fault prediction method, device, equipment and storage medium of storage equipment
CN114356734A (en) Service abnormity detection method and device, equipment and storage medium
CN115543671A (en) Data analysis method, device, equipment, storage medium and program product
US9116804B2 (en) Transient detection for predictive health management of data processing systems
CN115509797A (en) Method, device, equipment and medium for determining fault category
CN106951360B (en) Data statistical integrity calculation method and system
Turgeman et al. Context-aware incremental clustering of alerts in monitoring systems
Gu et al. Online failure forecast for fault-tolerant data stream processing
CN111614504A (en) Power grid regulation and control data center service characteristic fault positioning method and system based on time sequence and fault tree analysis
CN115480948A (en) Hard disk failure prediction method and related equipment
CN112732517B (en) Disk fault alarm method, device, equipment and readable storage medium
CN114566964A (en) Power distribution network feeder automation control method, device, equipment and storage medium
WO2022000285A1 (en) Health index of a service

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination