CN115509853A - Cluster data anomaly detection method and electronic equipment - Google Patents


Info

Publication number
CN115509853A
Authority
CN
China
Prior art keywords
cluster
application
different
data
threshold
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211151012.5A
Other languages
Chinese (zh)
Inventor
陆明
张彬
聂志远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd
Priority to CN202211151012.5A
Publication of CN115509853A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/3003: Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3034: Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a storage system, e.g. DASD based or network based
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/3055: Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/3065: Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3072: Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application provides a cluster data anomaly detection method, which comprises the following steps: determining a mapping relation between an application cluster and a storage resource pool based on configuration information of the application cluster; generating grouping information of different application clusters in different storage resource pools based on the mapping relation; aggregating key storage indexes of the different application clusters based on the grouping information, and determining threshold policies of the different application clusters in the different storage resource pools; and performing anomaly detection on the performance of a target object in a corresponding time window based on the threshold policy, wherein the target object comprises at least one of each application cluster or each application server in each application cluster. The application also provides an electronic device.

Description

Cluster data anomaly detection method and electronic equipment
Technical Field
The present disclosure relates to data processing technologies, and in particular, to a cluster data anomaly detection method and an electronic device.
Background
In the operation and maintenance of a storage system on a cloud platform or virtualization platform, excessive load generated by some applications or systems often causes resource contention in the storage system and degrades the performance of the storage platform. For example, in a big data cluster the load fluctuation of any single node is limited, but the load accumulated across the cluster can be excessive. This makes the problem harder to observe, and the application cluster or data cluster that causes the performance anomaly of the storage platform cannot be identified.
Disclosure of Invention
In view of this, embodiments of the present application provide a cluster data anomaly detection method and an electronic device.
To achieve this purpose, the technical solution of the application is implemented as follows:
According to an aspect of the present application, a cluster data anomaly detection method is provided, including:
determining a mapping relation between the application cluster and the storage resource pool based on the configuration information of the application cluster;
generating grouping information of different application clusters in different storage resource pools based on the mapping relation;
aggregating key storage indexes of different application clusters based on the grouping information to obtain threshold policies of the different application clusters in different storage resource pools;
and performing anomaly detection on the performance of a target object in a corresponding time window based on the threshold policy, wherein the target object comprises at least one of each application cluster or each application server in each application cluster.
In the foregoing solution, the performing, based on the threshold policy, anomaly detection on the performance of the target object in the corresponding time window includes one of:
performing anomaly detection on specific storage indexes of each application cluster in a historical time window based on the threshold policy, wherein the specific storage indexes comprise at least one of the input/output bandwidth traffic between each application cluster and each intra-cluster node and the number of read-write operations per second;
performing anomaly detection on the node correlation storage indexes of each application cluster in the historical time window based on the threshold policy, wherein the node correlation storage indexes comprise at least one of the input/output bandwidth traffic among the nodes in each cluster, the number of nodes carrying traffic among the nodes in each cluster, and the number of links among the nodes in each cluster.
In the foregoing solution, the performing anomaly detection on the node correlation storage indexes of each application cluster in the historical time window based on the threshold policy includes:
processing data of the node correlation storage indexes of the nodes in each cluster in a historical time window to obtain index data in each sliding window;
obtaining index correlation coefficients of different application clusters based on the nodes in each cluster and the index data;
and determining, based on the threshold policy, the application cluster whose index correlation coefficient meets the target condition as an abnormal application cluster, wherein the target condition indicates that the index correlation between nodes in the cluster is weak.
In the above scheme, the method further comprises:
and if the detection result indicates that the performance of at least one target object is abnormal, outputting alarm information.
In the above scheme, the detection result indicating that at least one target object has a performance anomaly includes:
acquiring feature aggregation data of each target object in a historical time window;
if the target value of the feature aggregation data exceeds a boundary threshold, determining that the at least one target object has a performance anomaly;
the target value is used to reflect a data characteristic of the feature aggregation data.
In the above scheme, the aggregating key storage indicators of different application clusters based on the grouping information to obtain threshold policies of different application clusters in different storage resource pools includes:
obtaining historical load data of different application clusters in different storage resource pools based on the grouping information;
performing aggregation calculation on the historical load data to obtain boundary thresholds of different application clusters in different storage resource pools;
generating the threshold policy based on the boundary threshold.
In the above solution, before the performing, based on the threshold policy, anomaly detection on the performance of the target object within a corresponding time window, the method further includes:
acquiring the start time of the performance anomaly of the storage resource pool;
performing anomaly detection on historical data of each application cluster in a time window corresponding to the start time based on a boundary threshold in the threshold policy; or, performing anomaly detection on historical data of each application cluster in a time window corresponding to a period of time before the start time.
In the foregoing solution, the method further includes:
ranking the application clusters for performance anomaly detection according to the index correlation coefficients of the different application clusters;
and displaying, according to the ranking result, the monitoring index data within the application cluster range corresponding to each index correlation coefficient.
In the above scheme, the method further comprises:
and outputting a change curve graph of the index correlation coefficient.
According to another aspect of the present application, there is provided an electronic apparatus, wherein the electronic apparatus includes:
the determining unit is used for determining the mapping relation between the application cluster and the storage resource pool based on the configuration information of the application cluster;
the generating unit is used for generating grouping information of different application clusters in different storage resource pools based on the mapping relation;
the aggregation unit is used for aggregating key storage indexes of different application clusters based on the grouping information and determining threshold policies of the different application clusters in different storage resource pools;
and the detection unit is used for performing anomaly detection on the performance of a target object in a corresponding time window based on the threshold policy, wherein the target object comprises at least one of each application cluster or each application server in each application cluster.
According to the cluster data anomaly detection method and the electronic device provided by the application, grouping information of different application clusters in different storage resource pools is generated from the mapping relation between the application clusters and the storage resource pools; key storage indexes of the different application clusters are aggregated based on the grouping information to determine threshold policies of the different application clusters in the different storage resource pools; and anomaly detection is performed on the performance of the target object in the corresponding time window based on the threshold policy. Therefore, in a big data cluster scenario where the performance fluctuation of any single cloud hard disk is small, when the storage platform is affected by cluster resource contention and a serious performance risk occurs, the target application or node causing the storage system anomaly can be located quickly.
Drawings
Fig. 1 is a first schematic flowchart of the cluster data anomaly detection method in the present application;
Fig. 2 is a schematic flowchart of a method for generating a threshold policy in the present application;
Fig. 3 is a second schematic flowchart of the cluster data anomaly detection method in the present application;
Fig. 4 is a first schematic structural diagram of an electronic device in the present application;
Fig. 5 is a schematic structural diagram of an electronic device in the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. In the present application, the embodiments and features of the embodiments may be arbitrarily combined with each other without conflict. The steps illustrated in the flow charts of the figures may be performed in a computer system such as a set of computer-executable instructions. Also, while a logical order is shown in the flow diagrams, in some cases, the steps shown or described may be performed in an order different than here.
As described above, in a big data cluster scenario, excessive load generated by some applications or systems often causes resource contention in the storage system and degrades the performance of the storage platform, and when the performance of the storage platform is abnormal, it is not possible to identify which application or data causes the anomaly. In the solution provided by the application, the key storage indexes of different application clusters are aggregated according to the grouping information of the different application clusters in different storage resource pools, so as to obtain the threshold policies of the different application clusters in the different storage resource pools; and anomaly detection is performed on the performance of the target object in the corresponding time window based on the threshold policy. Therefore, in a big data cluster scenario where the performance fluctuation of any single cloud hard disk is small, when the storage platform is affected by cluster resource contention and a serious performance risk occurs, the target application or node causing the storage system anomaly can be located quickly.
The technical solution of the present application is further described in detail with reference to the drawings and specific embodiments.
Fig. 1 is a first schematic flowchart of the cluster data anomaly detection method in the present application. The method may be applied to an electronic device, which may be a server or a client device; the server may be a cloud server or a physical server. As shown in Fig. 1, the method includes:
Step 101, determining a mapping relation between an application cluster and a storage resource pool based on configuration information of the application cluster;
Here, the electronic device may obtain the configuration information of the application cluster through a cloud platform or virtualization platform interface (or a cloud platform or virtualization platform database), a storage system interface, and a monitoring database, and may determine the mapping relation between the application cluster and the storage resource pool based on the configuration information.
Specifically, the electronic device may determine, through the storage system interface, a first corresponding relationship between the storage resource pool and the delivery disk (or delivery volume), and may determine, through the cloud platform or the virtualization platform interface (or the database), a second corresponding relationship between the delivery disk (or delivery volume) and the physical machine or the virtual machine, that is, on which physical machine or virtual machine the delivery disks (or delivery volumes) are deployed. The electronic device may determine a mapping relationship between a cloud hard disk and a storage resource pool in the application cluster based on the first corresponding relationship and the second corresponding relationship.
Because a plurality of application clusters may run on one platform, the electronic device may also obtain monitoring data from the monitoring database and, from the monitoring data, determine a third corresponding relationship between different applications, different databases or different data platform clusters and the cloud hard disks.
The electronic device may determine a mapping relationship of the application cluster and the storage resource pool based on the first correspondence, the second correspondence, and the third correspondence.
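The combination of the three corresponding relationships can be pictured as a chain of lookups. The following is a minimal sketch, assuming each relationship has already been read into an in-memory dictionary; the function name `build_cluster_pool_mapping` and all pool, volume, host and cluster identifiers are hypothetical and are not taken from the application.

```python
from collections import defaultdict


def build_cluster_pool_mapping(pool_to_volumes, volume_to_host, cluster_to_volumes):
    """Join the three relationships into {cluster: {pool: [(volume, host), ...]}}.

    pool_to_volumes    -- first relationship: storage pool -> delivered volumes
    volume_to_host     -- second relationship: volume -> physical/virtual machine
    cluster_to_volumes -- third relationship: application cluster -> cloud disks
    """
    # Invert the first relationship so each volume can be looked up directly.
    volume_to_pool = {}
    for pool, volumes in pool_to_volumes.items():
        for volume in volumes:
            volume_to_pool[volume] = pool

    mapping = defaultdict(lambda: defaultdict(list))
    for cluster, volumes in cluster_to_volumes.items():
        for volume in volumes:
            pool = volume_to_pool.get(volume)
            host = volume_to_host.get(volume)
            if pool is None or host is None:
                continue  # incomplete configuration data: skip this volume
            mapping[cluster][pool].append((volume, host))
    return {cluster: dict(pools) for cluster, pools in mapping.items()}
```

Volumes that are missing from either of the first two relationships are skipped here; a real system might instead report them as configuration gaps.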
Step 102, generating grouping information of different application clusters in different storage resource pools based on the mapping relation;
In the application, the electronic device may group the different application clusters in the different storage resource pools based on the mapping relations between the application clusters and the storage resource pools, so as to generate the grouping information of the different application clusters in the different storage resource pools.
Here, the grouping information includes, but is not limited to, first mapping information between each application cluster and each intra-cluster node (e.g., an Elasticsearch (ES) node; Elasticsearch is a Lucene-based search server), second mapping information between each intra-cluster node and its disks, and third mapping information between the disks and the storage resource pools.
The electronic device may determine on which nodes each application runs based on the first mapping information, which disks/volumes are on each node based on the second mapping information, and which storage resource pool each disk/volume belongs to based on the third mapping information.
In this application, the electronic device may further determine current distribution state information and historical distribution state information of each node in each application cluster according to the grouping information. Based on the historical and current distribution state information, it can be determined which nodes were added by expansion and which nodes existed originally.
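As an illustration of how the grouping information and the distribution states might be used, the following minimal sketch groups nodes by (pool, cluster) and separates expanded nodes from original ones. The function names and the dictionary layout are assumptions made for this example only.

```python
def group_nodes_by_pool(cluster_nodes, node_disks, disk_pool):
    """Return {(pool, cluster): sorted node list} from the three mappings.

    cluster_nodes -- first mapping: cluster -> nodes
    node_disks    -- second mapping: node -> disks
    disk_pool     -- third mapping: disk -> storage resource pool
    """
    groups = {}
    for cluster, nodes in cluster_nodes.items():
        for node in nodes:
            for disk in node_disks.get(node, []):
                pool = disk_pool.get(disk)
                if pool is None:
                    continue  # disk not assigned to any known pool
                groups.setdefault((pool, cluster), set()).add(node)
    return {key: sorted(nodes) for key, nodes in groups.items()}


def split_expanded_nodes(historical_nodes, current_nodes):
    """Compare historical and current distribution states of one cluster.

    Returns (original, expanded): nodes present in both states, and nodes
    that appear only in the current state.
    """
    historical, current = set(historical_nodes), set(current_nodes)
    return sorted(historical & current), sorted(current - historical)
```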
Step 103, aggregating key storage indexes of different application clusters based on the grouping information to obtain threshold policies of the different application clusters in different storage resource pools;
In this application, the key storage indexes include, but are not limited to, input/output (I/O) bandwidth and the number of read/write operations per second (IOPS).
In one implementation, the electronic device may obtain historical load data of different application clusters in different storage resource pools based on grouping information of the different application clusters in the different storage resource pools; by carrying out aggregation calculation on the historical load data, boundary thresholds of different application clusters in different storage resource pools can be obtained; the threshold policy is generated based on the boundary threshold.
For example, the electronic device may obtain I/O throughput of nodes in each application cluster, then accumulate the I/O throughput of the nodes in each application cluster to obtain total I/O throughput of each application cluster in different storage resource pools, and use the total I/O throughput as a boundary threshold of different application clusters in different storage resource pools, thereby implementing aggregation calculation of the I/O throughput of different application clusters in different storage resource pools.
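The accumulation described above can be sketched as follows. This is an illustrative example only: the per-interval sample layout, the function names, and the choice of the historical maximum of the accumulated throughput as the boundary threshold are assumptions, not details fixed by the application.

```python
def aggregate_throughput(samples, groups):
    """Sum per-node I/O throughput into one series per (pool, cluster) group.

    samples -- {node: [throughput per sampling interval]}
    groups  -- {(pool, cluster): [nodes]}
    """
    totals = {}
    for key, nodes in groups.items():
        series = [samples[node] for node in nodes if node in samples]
        if not series:
            continue
        # Align on the shortest series so every interval has all nodes.
        length = min(len(s) for s in series)
        totals[key] = [sum(s[i] for s in series) for i in range(length)]
    return totals


def boundary_thresholds(totals):
    """Assumption: use the largest accumulated value seen in history."""
    return {key: max(series) for key, series in totals.items() if series}
```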
Step 104, performing anomaly detection on the performance of a target object in a corresponding time window based on the threshold policy, wherein the target object comprises at least one of each application cluster or each application server in each application cluster.
In one implementation, the electronic device may perform, based on the threshold policy, anomaly detection on specific storage indexes of each application cluster in a historical time window.
Here, the specific storage indexes include at least one of the I/O bandwidth traffic and the IOPS between each application cluster and the nodes within each cluster. The electronic device may obtain the I/O bandwidth and the IOPS between each application cluster and the nodes in each cluster through the gateway. Anomaly detection is then performed on the specific storage indexes using a percentile-based or standard-deviation-based method.
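Both detection methods can be sketched in a few lines. This is a minimal example under stated assumptions: the 99th percentile, the multiplier `k=3.0`, the linear interpolation rule, and the function names are illustrative choices that the application does not specify.

```python
import statistics


def percentile(values, q):
    """q-th percentile (0 <= q <= 100) with linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of empty data")
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def detect_by_percentile(history, window, q=99.0):
    """Indexes of window samples above the q-th percentile of the history."""
    limit = percentile(history, q)
    return [i for i, value in enumerate(window) if value > limit]


def detect_by_stddev(history, window, k=3.0):
    """Indexes of window samples more than k standard deviations from the mean."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    return [i for i, value in enumerate(window) if abs(value - mean) > k * std]
```

The percentile variant makes no assumption about the shape of the load distribution, while the standard-deviation variant is simpler but more sensitive to outliers already present in the history.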
In another implementation, the electronic device may perform, based on the threshold policy, anomaly detection on the node correlation storage indexes of each application cluster in the historical time window.
Here, the node correlation storage indexes include at least one of the input/output bandwidth traffic among the nodes in each cluster, the number of nodes carrying traffic among the nodes in each cluster, and the number of links among the nodes in each cluster.
Here, anomaly detection on the traffic between the nodes within a cluster may be understood as detection of intra-cluster east-west traffic.
For example, a cloud platform or a virtualization platform runs a big data cluster such as Hadoop or Elasticsearch. When data rebalancing (rebalance) is performed inside the cluster, the performance fluctuation of any single disk may stay within the normal monitoring range, but the overall load on the platform rises sharply, which may cause a performance anomaly of the storage platform.
In the application, if the storage resource pool (also called the storage platform) is abnormal, and the detection result indicates that an anomaly occurred between the nodes in the cluster in the corresponding time window, or an anomaly is detected against a manually labeled threshold, the detection result is recorded as a suspected risk, and an alarm may be raised or an anomaly report output. This allows engineers to locate the anomaly quickly and speeds up diagnosis.
In the method, the different application clusters are grouped through the mapping relation between the application clusters and the storage resource pools, and the key storage indexes of the different application clusters are aggregated according to the grouping information of the different application clusters in the different storage resource pools to obtain the threshold policies of the different application clusters in the different storage resource pools; anomaly detection is then performed on the performance of the target object in the corresponding time window based on the threshold policy. Therefore, in a big data cluster scenario where the performance fluctuation of any single cloud hard disk is small, when the storage platform is affected by cluster resource contention and a serious performance risk occurs, problem identification can be accelerated and the target application or node causing the storage system anomaly can be located quickly.
In the application, when performing, based on the threshold policy, anomaly detection on the node correlation storage indexes of each application cluster in the historical time window, the electronic device may process data of the node correlation storage indexes of the nodes in each cluster in the historical time window to obtain index data in each sliding window; then obtain index correlation coefficients of different application clusters based on the nodes in each cluster and the index data; build an index correlation coefficient graph based on the index correlation coefficients; and, based on the threshold policy, determine the application cluster whose index correlation coefficient in the graph meets the target condition as an abnormal application cluster.
Here, the target condition characterizes that the index correlation coefficient between nodes in the cluster is weak.
Here, the index correlation coefficient may refer to a correlation coefficient between the same index, or between related indexes, of a plurality of nodes. The index correlation coefficient covers two cases:
1. the nodes are different, and the indexes are the same;
2. the nodes are the same or different, and the indexes are related.
It is emphasized that the index correlation coefficient characterizes a relationship among a plurality of indexes, not the correlation coefficient of a single index.
In the present application, anomaly detection on the traffic correlation between nodes specifically includes:
a) Smoothing each node's observations (e.g., I/O bandwidth, IOPS), for example by resampling, exponential smoothing, kernel smoothing, or convolution smoothing.
b) Applying sliding-window processing to the smoothed data to obtain the sliding windows. Optionally, the average value of each sliding window is taken.
c) Building two-dimensional data frames from the nodes and the index data, one data frame per index, whose two dimensions are the nodes and the node values in the different sliding windows.
d) Performing correlation analysis on the matrix from c), for example using Pearson correlation coefficients, to obtain the correlation coefficients of the related indexes between different nodes and the correlation coefficients of the loads of different nodes. The anomaly detection boundaries and values of the correlation coefficients are recorded using percentiles, and a correlation graph is built from the resulting relationships between the nodes.
e) Using dynamic thresholds, removing nodes with weaker correlations from the correlation graph and retaining only the range of nodes with higher correlations (this step reduces the computation required).
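Steps a) to e) can be sketched end to end for a single index. The example below is a simplified illustration under stated assumptions: exponential smoothing is picked from the listed smoothing options, a fixed cut-off `min_corr` stands in for the dynamic threshold of step e), and the function names, window size and node names are hypothetical.

```python
import math


def exponential_smooth(series, alpha=0.5):
    """Step a): simple exponential smoothing of one node's observations."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


def window_means(series, size, step=1):
    """Step b): mean of each sliding window."""
    return [sum(series[i:i + size]) / size
            for i in range(0, len(series) - size + 1, step)]


def pearson(x, y):
    """Step d): Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant series as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def correlation_graph(node_series, size=3, alpha=0.5, min_corr=0.8):
    """Steps c) to e): node-by-window table, pairwise correlation, pruning.

    Returns {(node_a, node_b): r} containing only pairs with r >= min_corr.
    """
    table = {node: window_means(exponential_smooth(series, alpha), size)
             for node, series in node_series.items()}
    nodes = sorted(table)
    edges = {}
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            r = pearson(table[a], table[b])
            if r >= min_corr:  # step e): drop weakly correlated pairs
                edges[(a, b)] = r
    return edges
```

In this sketch the retained edges describe nodes whose load moves together; a node that loses its edges to the rest of the cluster is a candidate for the weak-correlation target condition described above.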
In the application, the electronic device may also rank the application clusters for performance anomaly detection according to the index correlation coefficients of the different application clusters, and display, according to the ranking result, the monitoring index data within the application cluster range corresponding to each index correlation coefficient.
In this application, the electronic device may further output a change curve of the index correlation coefficient.
In the application, the electronic device may obtain feature aggregation data of each target object in a historical time window; if the target value of the feature aggregation data exceeds a boundary threshold, it is determined that the at least one target object has a performance anomaly; the target value is used to reflect a data characteristic of the feature aggregation data.
For example, the target value may be a peak, median, mean, or trough of the feature aggregation data. It may also be a value with geometric properties, such as slope, curvature, or radius.
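The comparison of a target value against a boundary threshold might look like the following minimal sketch. The feature names, the least-squares slope over the sample index, and the function names are assumptions chosen for illustration; curvature and radius are omitted.

```python
import statistics


def target_values(series):
    """Compute several candidate target values for one aggregated series."""
    n = len(series)
    mean_x = (n - 1) / 2
    mean_y = statistics.fmean(series)
    denom = sum((i - mean_x) ** 2 for i in range(n))
    # Least-squares slope of the series against its sample index.
    slope = (sum((i - mean_x) * (y - mean_y) for i, y in enumerate(series)) / denom
             if denom else 0.0)
    return {
        "peak": max(series),
        "trough": min(series),
        "median": statistics.median(series),
        "mean": mean_y,
        "slope": slope,
    }


def exceeds_boundary(series, boundary, feature="peak"):
    """True if the chosen target value is above the boundary threshold."""
    return target_values(series)[feature] > boundary
```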
In the application, an anomaly of the storage platform triggers the electronic device to execute cluster performance anomaly detection. Here, before performing anomaly detection on the performance of the target object within the corresponding time window based on the threshold policy, the electronic device may obtain the start time of the performance anomaly of the storage resource pool; then, based on the boundary threshold in the threshold policy, perform anomaly detection on historical data of each application cluster in a time window corresponding to the start time, or on historical data of each application cluster in a time window corresponding to a period of time before the start time. If the detection result indicates that the index correlation of the data of the currently related nodes breaks through the dynamic threshold, it is marked as abnormal. In this case, excessive traffic between nodes or rebalancing may be occurring.
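Selecting the time window relative to the start time can be sketched as below. The sample layout of (timestamp, value) pairs, the `lookback` and `lookahead` parameters, the function name, and the example data are assumptions introduced for this illustration.

```python
from datetime import datetime, timedelta


def anomalies_around_start(samples, start_time, boundary,
                           lookback=timedelta(0), lookahead=timedelta(0)):
    """Return samples inside [start - lookback, start + lookahead] that
    exceed the boundary threshold.

    samples -- list of (timestamp, value) pairs for one application cluster
    """
    begin = start_time - lookback
    end = start_time + lookahead
    return [(ts, value) for ts, value in samples
            if begin <= ts <= end and value > boundary]


# Hypothetical history: the storage pool anomaly starts at 12:00.
start = datetime(2022, 9, 21, 12, 0)
history = [
    (datetime(2022, 9, 21, 11, 40), 90),
    (datetime(2022, 9, 21, 11, 50), 120),
    (datetime(2022, 9, 21, 12, 0), 130),
    (datetime(2022, 9, 21, 12, 10), 140),
]
suspects = anomalies_around_start(history, start, boundary=100,
                                  lookback=timedelta(minutes=15))
```

With a 15-minute lookback and no lookahead, only the samples at 11:50 and 12:00 fall inside the window and exceed the boundary.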
In the application, if the detection result indicates that at least one target object has a performance anomaly, the electronic device outputs alarm information.
The anomaly detection indexes include the aggregated performance indexes, and also include indexes such as the number of nodes showing abnormal changes at the same time, the number of inter-node traffic links showing abnormal changes (rebalance scenario), the storage platform cache, and the storage platform CPU (central processing unit).
If the abnormal change in the cluster's cloud hard disk consumption occurs at the same time as the abnormal change of the storage platform, or precedes it within a limited time window, then (exception a and exception b) == true, which means the cluster anomaly may have caused the platform anomaly, and an alarm or report is triggered.
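The timing condition can be expressed as a small predicate. This is a sketch only: the function name, the `max_lead` parameter for the limited time window, and the use of `None` to mean "no anomaly detected" are assumptions.

```python
from datetime import datetime, timedelta


def cluster_may_cause_platform_anomaly(cluster_anomaly_time,
                                       platform_anomaly_time,
                                       max_lead):
    """True when both anomalies exist and the cluster anomaly coincides with
    the platform anomaly or precedes it by at most max_lead."""
    if cluster_anomaly_time is None or platform_anomaly_time is None:
        return False  # one of the two anomalies was not detected
    lead = platform_anomaly_time - cluster_anomaly_time
    return timedelta(0) <= lead <= max_lead


# Hypothetical timestamps for illustration.
platform_time = datetime(2022, 9, 21, 12, 0)
cluster_time = datetime(2022, 9, 21, 11, 55)
should_alarm = cluster_may_cause_platform_anomaly(
    cluster_time, platform_time, max_lead=timedelta(minutes=10))
```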
According to the cluster data anomaly detection method, the storage system resource pool used by each delivered disk is obtained through the cloud platform interface or database and the storage system interface. The cluster boundaries of different applications, databases, or big data platforms are identified through the monitoring system. Different clusters are grouped on different storage resource pools, and key storage index aggregation is calculated based on the grouping information. Anomaly detection is performed on the aggregation results over a period of time, or a threshold range is labeled manually. If the storage platform is abnormal and the anomaly detection or the manually labeled threshold also detects an anomaly in the same time window, the result is recorded as a suspected risk, and an alarm is raised or an anomaly report is output. Problem identification can thus be accelerated when the performance fluctuation of a single cloud hard disk is small but the storage platform is exposed to a serious performance risk by cluster resource contention, which improves end-to-end problem-solving efficiency. The method provided by the application has high reliability relative to its algorithmic complexity.
Fig. 2 is a schematic flow chart of a method for generating a threshold policy in the present application, and as shown in fig. 2, the method includes:
step 201, acquiring a mapping relation between a cloud hard disk and a resource pool through a cloud platform or a virtualization platform and a storage system interface;
step 202, identifying cluster boundaries of different applications, different databases or different big data platforms through a Configuration Management Database (CMDB), a cluster configuration database or a monitoring database;
step 203, grouping different application clusters on different storage resource pools to generate grouping information of different resource pools consumed by cloud hard disks of different application clusters;
here, the grouping information includes current distribution state information and distribution history information of cloud hard disks of different application clusters.
step 204, aggregating historical load data under different storage resource pools based on the grouping information;
step 205, taking the extreme value of the historical data in the aggregation result as a dynamic threshold boundary;
and step 206, determining the dynamic threshold of different clusters in different storage resource pools according to the dynamic threshold boundary.
Dynamic thresholds may also be referred to herein as threshold policies.
Here, the dynamic threshold range may also be manually labeled according to the aggregation result.
In this application, typical indexes aggregated by the electronic device during aggregation include I/O bandwidth and read-write IOPS.
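As a non-authoritative illustration, steps 201 to 206 could be sketched in Python as follows. The function name build_threshold_policy, the input shapes, and the margin parameter are hypothetical and are not taken from the application. The sketch assumes that per-disk samples are summed per timestamp within each (cluster, pool) group, and that the historical minimum and maximum of the summed series serve as the dynamic threshold boundary.

```python
from collections import defaultdict


def build_threshold_policy(disk_to_pool, disk_to_cluster, history, margin=0.0):
    """Derive per-(cluster, pool) dynamic thresholds from historical load.

    disk_to_pool:    {disk_id: pool_id}     (mapping from step 201)
    disk_to_cluster: {disk_id: cluster_id}  (cluster boundary from step 202)
    history:         iterable of (timestamp, disk_id, metric, value)
    margin:          hypothetical relative slack applied to the boundary
    """
    # Steps 203-204: group disks by (cluster, pool) and sum their samples
    # per metric and per timestamp.
    totals = defaultdict(float)
    for ts, disk, metric, value in history:
        group = (disk_to_cluster[disk], disk_to_pool[disk])
        totals[(group, metric, ts)] += value

    # Step 205: the extremes of each aggregated series form the boundary.
    extremes = {}
    for (group, metric, _ts), total in totals.items():
        low, high = extremes.get((group, metric), (total, total))
        extremes[(group, metric)] = (min(low, total), max(high, total))

    # Step 206: turn the boundary into a threshold policy per group.
    policy = defaultdict(dict)
    for (group, metric), (low, high) in extremes.items():
        policy[group][metric] = (low * (1 - margin), high * (1 + margin))
    return dict(policy)


if __name__ == "__main__":
    pools = {"d1": "poolA", "d2": "poolA", "d3": "poolB"}
    clusters = {"d1": "hadoop", "d2": "hadoop", "d3": "mysql"}
    samples = [
        (0, "d1", "iops", 100.0), (0, "d2", "iops", 50.0),
        (1, "d1", "iops", 200.0), (1, "d2", "iops", 100.0),
        (0, "d3", "iops", 10.0),
    ]
    print(build_threshold_policy(pools, clusters, samples))
```

A manually labeled threshold range, as mentioned above, could simply overwrite the corresponding entry of the returned dictionary.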
Fig. 3 is a schematic flowchart of a second method for detecting cluster data anomalies in the present application, where as shown in fig. 3, the method includes:
step 301, determining that the performance of the storage platform is abnormal;
herein, a storage platform may also be referred to as a storage resource pool.
Step 302, identifying the abnormal starting time of the storage platform based on the CMDB, the cluster configuration database or the monitoring database;
Step 303, acquiring historical grouping information of the different storage resource pools consumed by cloud hard disks of different application clusters;
here, the historical grouping information of the different storage resource pools consumed by cloud hard disks of different application clusters may be obtained based on the CMDB, the cluster configuration database, or the monitoring database.
Step 304, acquiring feature aggregation data;
here, the feature aggregation data includes historical aggregated data of cloud hard disks of different application clusters in a period of time before the platform abnormality starting time.
Step 305, performing dynamic threshold detection according to a threshold policy;
here, it is necessary to detect not only the abnormality occurrence time but also the history data within a time window before the abnormality occurrence time.
Step 306, if the storage platform abnormality and the cluster abnormality occur simultaneously, or the cluster abnormality occurs in a time window before the platform abnormality moment, an alarm is triggered;
Step 307, outputting an abnormality report or outputting alarm information carrying an abnormality message.
Here, the alarm information may be a report or a message sent to other business systems. The output form of the alarm information is not limited, as long as it can indicate that the storage platform or the cluster is abnormal.
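The detection flow of steps 301 to 307 can be illustrated with the following minimal Python sketch. It is not the implementation of the application: the function name detect_suspect_clusters, the lookback parameter, and the dictionary layouts are assumptions made for illustration. The sketch takes the platform abnormality starting time as given, checks each cluster's aggregated series against its threshold at that time and within a window before it, and returns one alarm record per violating cluster.

```python
def detect_suspect_clusters(platform_anomaly_start, lookback, cluster_series, policy):
    """Return alarm records for clusters that broke their threshold at, or
    within `lookback` time units before, the platform abnormality.

    cluster_series: {cluster_id: [(timestamp, aggregated_value), ...]}
    policy:         {cluster_id: (low, high)}  dynamic threshold per cluster
    """
    window_start = platform_anomaly_start - lookback
    alarms = []
    for cluster, samples in cluster_series.items():
        low, high = policy[cluster]
        # Step 305: check the abnormality moment and the window before it.
        violations = [
            (ts, value)
            for ts, value in samples
            if window_start <= ts <= platform_anomaly_start
            and not (low <= value <= high)
        ]
        # Step 306: a cluster abnormality in the window triggers an alarm.
        if violations:
            alarms.append({
                "cluster": cluster,
                "first_violation": min(violations)[0],
                "count": len(violations),
            })
    return alarms


if __name__ == "__main__":
    series = {
        "hadoop": [(90, 100.0), (95, 500.0), (100, 520.0)],
        "mysql": [(95, 20.0), (100, 25.0)],
    }
    thresholds = {"hadoop": (50.0, 300.0), "mysql": (0.0, 50.0)}
    # Step 307: here the "report" is just printed.
    for alarm in detect_suspect_clusters(100, 20, series, thresholds):
        print(alarm)
```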
Fig. 4 is a schematic structural composition diagram of an electronic device in the present application, as shown in fig. 4, the electronic device includes:
a determining unit 401, configured to determine a mapping relationship between an application cluster and a storage resource pool based on configuration information of the application cluster;
a generating unit 402, configured to generate grouping information of different application clusters in different storage resource pools based on the mapping relationship;
an aggregation unit 403, configured to aggregate key storage indicators of different application clusters based on the grouping information, and determine threshold policies of the different application clusters in different storage resource pools;
a detecting unit 404, configured to perform anomaly detection on performance of a target object within a corresponding time window based on the threshold policy, where the target object includes at least one of each application cluster or each application server in each application cluster.
In a preferred embodiment, the detecting unit 404 is specifically configured to perform anomaly detection on a specific storage index of each application cluster in a historical time window based on the threshold policy, where the specific storage index includes at least one of the input/output bandwidth traffic and the number of read/write operations per second between each application cluster and each node in the cluster; and/or
to perform anomaly detection on the node correlation storage indexes of each application cluster in the historical time window based on the threshold policy, where the node correlation storage index includes at least one of the input/output bandwidth traffic among nodes in each cluster, the number of traffic nodes among nodes in each cluster, and the number of links among nodes in each cluster.
In a preferred embodiment, the electronic device further includes: a processing unit 405 and an establishing unit 406;
the processing unit 405 is configured to process data of node correlation storage indexes of nodes in each cluster in a historical time window to obtain index data in each sliding window;
the determining unit 401 is further configured to obtain index correlation coefficients of different application clusters based on nodes in each cluster and each index data;
an establishing unit 406, configured to establish an index correlation coefficient map based on the index correlation coefficient;
the determining unit 401 is further configured to determine, based on the threshold policy, an application cluster corresponding to an index correlation coefficient that meets a target condition in the index correlation coefficient map as an abnormal application cluster, where the target condition indicates that an index correlation coefficient between nodes in a cluster is weak.
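The application does not state how the index correlation coefficient is computed. The following Python sketch is therefore only one possible reading: it assumes the Pearson correlation coefficient, averaged over all node pairs of a cluster in each sliding window, and it flags the cluster when any window falls below a hypothetical weak_threshold. All function and parameter names are illustrative.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant series counts as uncorrelated
    return cov / sqrt(var_x * var_y)


def window_correlations(node_series, window, step):
    """Mean pairwise correlation between nodes for each sliding window.

    node_series: {node_id: [metric value per sampling interval, ...]}
    """
    if len(node_series) < 2:
        return []
    length = min(len(series) for series in node_series.values())
    coefficients = []
    for start in range(0, length - window + 1, step):
        pairs = [
            pearson(a[start:start + window], b[start:start + window])
            for a, b in combinations(node_series.values(), 2)
        ]
        coefficients.append(sum(pairs) / len(pairs))
    return coefficients


def is_abnormal_cluster(node_series, window, step, weak_threshold):
    """Target condition: correlation between nodes is weak in some window."""
    return any(c < weak_threshold
               for c in window_correlations(node_series, window, step))


if __name__ == "__main__":
    nodes = {"n1": [1, 2, 3, 4, 5, 6], "n2": [2, 4, 6, 12, 10, 8]}
    print(window_correlations(nodes, window=3, step=3))
    print(is_abnormal_cluster(nodes, window=3, step=3, weak_threshold=0.5))
```

The list returned by window_correlations is also the kind of series from which an index correlation coefficient map could be built.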
In a preferred embodiment, the electronic device further includes an output unit 407, configured to output alarm information if the detection result indicates that at least one target object has a performance abnormality.
In a preferred embodiment, the electronic device further includes:
an obtaining unit 408, configured to obtain feature aggregation data of each target object in a historical time window;
a determining unit 401, configured to determine that there is a performance anomaly in the at least one target object if the target value of the feature aggregation data exceeds a boundary threshold;
the target value is used to reflect a data characteristic of the feature aggregation data.
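A minimal sketch of this check is given below. Which data characteristic serves as the target value is not specified in the application; the maximum and the mean used here, and the names TARGET_VALUE_FUNCS and has_performance_anomaly, are assumptions chosen for illustration.

```python
from statistics import mean

# Hypothetical choices for the data characteristic that the target value reflects.
TARGET_VALUE_FUNCS = {"max": max, "mean": mean}


def has_performance_anomaly(feature_aggregation_data, boundary_threshold,
                            characteristic="max"):
    """True if the target value of the aggregated data exceeds the boundary."""
    target_value = TARGET_VALUE_FUNCS[characteristic](feature_aggregation_data)
    return target_value > boundary_threshold


if __name__ == "__main__":
    iops_in_window = [100, 250, 900]
    print(has_performance_anomaly(iops_in_window, 800))          # peak-based
    print(has_performance_anomaly(iops_in_window, 800, "mean"))  # mean-based
```

A peak-based target value reacts to short bursts, while a mean-based one only reacts to sustained load, so the choice changes which clusters are reported.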
In a preferred embodiment, the determining unit 401 is further configured to obtain historical load data of different application clusters in different storage resource pools based on the grouping information;
an aggregation unit 403, specifically configured to perform aggregation calculation on the historical load data to obtain boundary thresholds of different application clusters in different storage resource pools;
a generating unit 402, configured to generate the threshold policy based on the boundary threshold.
In a preferred embodiment, the obtaining unit 408 is further configured to obtain a starting time when the performance of the storage resource pool is abnormal;
a detecting unit 404, configured to perform anomaly detection on historical data of each application cluster in a time window corresponding to the start time based on a boundary threshold in the threshold policy; or, performing anomaly detection on historical data in a time window corresponding to a period of time before the starting time of each application cluster.
In a preferred embodiment, the electronic device further includes:
a sorting unit 409, configured to sort the performance anomaly detection results of the application clusters according to the index correlation coefficients of the different application clusters;
a display unit 410, configured to present, according to the sorting result, the monitoring index data within the application cluster range corresponding to each index correlation coefficient.
Preferably, the output unit 407 is further configured to output a variation graph of the index correlation coefficient.
Based on the variation graph, operation and maintenance engineers can make decisions more quickly.
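The sorting and presentation described above might look like the following sketch. The sort order is an assumption: since a weak index correlation marks a cluster as abnormal, the sketch lists the weakest correlation first. The function name rank_clusters and the sample values are hypothetical.

```python
def rank_clusters(correlations):
    """Order clusters from weakest to strongest index correlation coefficient.

    correlations: {cluster_id: index correlation coefficient}
    """
    return sorted(correlations.items(), key=lambda item: item[1])


if __name__ == "__main__":
    ranked = rank_clusters({"hadoop": 0.2, "mysql": 0.9, "kafka": -0.1})
    for position, (cluster, coefficient) in enumerate(ranked, start=1):
        print(f"{position}. {cluster}: correlation {coefficient:+.2f}")
```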
According to the scheme provided by the application, the key storage indexes of different application clusters are aggregated based on the grouping information of the different application clusters in different storage resource pools, so that threshold policies of the different application clusters in the different storage resource pools are obtained; anomaly detection is then performed on the performance of the target object in the corresponding time window based on the threshold policy. Therefore, in a big data cluster scenario where the performance fluctuation of any single cloud hard disk is small but the storage platform faces a serious performance risk caused by cluster resource competition, the target application or nodes causing the storage system abnormality can be quickly located.
It should be noted that the division into program modules described in the foregoing embodiment, for the case where the electronic device detects abnormal cluster data, is only an example. In practical applications, the processing may be distributed to different program modules as needed; that is, the internal structure of the electronic device may be divided into different program modules to complete all or part of the processing described above. In addition, the electronic device provided by the foregoing embodiment and the embodiment of the cluster data anomaly detection method belong to the same concept; for the specific implementation process of the electronic device, refer to the method embodiment, which is not repeated here.
An embodiment of the present application further provides an electronic device, including: a processor and a memory for storing a computer program capable of running on the processor,
wherein the processor is configured to perform any one of the method steps of the above-mentioned processing method when running the computer program.
Fig. 5 is a schematic structural component diagram of an electronic device 500 in the present application, where the electronic device may be a mobile phone, a computer, a digital broadcast terminal, an information transceiver device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or other terminals. The electronic device 500 shown in fig. 5 includes: at least one processor 501, memory 502, at least one network interface 504, and a user interface 503. The various components in the electronic device 500 are coupled together by a bus system 505. It is understood that the bus system 505 is used to enable connection communications between these components. The bus system 505 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 505 in FIG. 5.
The user interface 503 may include a display, a keyboard, a mouse, a trackball, a click wheel, a key, a button, a touch pad, a touch screen, or the like, among others.
It will be appreciated that the memory 502 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory can be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 502 described in embodiments herein is intended to comprise, without being limited to, these and any other suitable types of memory.
The memory 502 in embodiments of the present application is used to store various types of data to support the operation of the electronic device 500. Examples of such data include: any computer programs for operating on the electronic device 500, such as an operating system 5021 and application programs 5022; contact data; telephone book data; a message; a picture; audio, etc. The operating system 5021 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, for implementing various basic services and processing hardware-based tasks. The application 5022 may contain various applications such as a Media Player (Media Player), a Browser (Browser), etc. for implementing various application services. A program for implementing the method according to the embodiment of the present application may be included in the application 5022.
The method disclosed in the embodiments of the present application may be applied to the processor 501, or implemented by the processor 501. The processor 501 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in software form in the processor 501. The Processor 501 may be a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, etc. The processor 501 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium located in the memory 502, and the processor 501 reads the information in the memory 502 and performs the steps of the aforementioned methods in conjunction with its hardware.
In an exemplary embodiment, the electronic device 500 may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, Micro Controller Units (MCUs), microprocessors, or other electronic components for performing the aforementioned methods.
In an exemplary embodiment, the present application further provides a computer readable storage medium, such as the memory 502 comprising a computer program, which can be executed by the processor 501 of the electronic device 500 to perform the steps of the foregoing method. The computer readable storage medium can be Memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash Memory, magnetic surface Memory, optical disk, or CD-ROM; or may be a variety of devices including one or any combination of the above memories, such as a mobile phone, computer, tablet device, personal digital assistant, etc.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program carries out the method steps of any one of the above-mentioned processing methods.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to arrive at new method embodiments.
The features disclosed in the several product embodiments presented in this application can be combined arbitrarily, without conflict, to arrive at new product embodiments.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art can easily conceive of within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A cluster data anomaly detection method comprises the following steps:
determining a mapping relation between the application cluster and the storage resource pool based on the configuration information of the application cluster;
generating grouping information of different application clusters in different storage resource pools based on the mapping relation;
aggregating key storage indexes of different application clusters based on the grouping information, and determining threshold value strategies of the different application clusters in different storage resource pools;
and performing anomaly detection on the performance of a target object in a corresponding time window based on the threshold strategy, wherein the target object comprises at least one of each application cluster or each application server in each application cluster.
2. The method of claim 1, wherein the detecting of anomalies in the performance of the target object within the corresponding time window based on the threshold policy comprises one of:
performing anomaly detection on specific storage indexes of each application cluster in a historical time window based on the threshold policy, wherein the specific storage indexes comprise at least one of input/output bandwidth traffic and the number of read/write operations per second between each application cluster and each intra-cluster node;
performing anomaly detection on the node correlation storage indexes of each application cluster in the historical time window based on the threshold policy, wherein the node correlation storage indexes comprise at least one of input/output bandwidth traffic among nodes in each cluster, the number of traffic nodes among nodes in each cluster, and the number of links among nodes in each cluster.
3. The method of claim 2, wherein the performing anomaly detection on the node correlation storage indexes of each application cluster in the historical time window based on the threshold policy comprises:
processing data of node correlation storage indexes of nodes in each cluster in a historical time window to obtain index data in each sliding window;
obtaining index correlation coefficients of different application clusters based on nodes in each cluster and the index data;
and determining the application cluster corresponding to the index correlation coefficient meeting the target condition as an abnormal application cluster based on the threshold strategy, wherein the target condition represents that the index correlation coefficient between nodes in the cluster is weak.
4. The method of claim 1, wherein the method further comprises:
and if the detection result represents that at least one target object has abnormal performance, outputting alarm information.
5. The method of claim 4, wherein the detection result characterizes the presence of a performance anomaly in at least one target object, comprising:
acquiring feature aggregation data of each target object in a historical time window;
if the target value of the feature aggregation data exceeds a boundary threshold value, determining that performance abnormity exists in the at least one target object;
the target value is used to reflect a data characteristic of the feature aggregation data.
6. The method of claim 1, wherein the aggregating key storage metrics of different application clusters based on the grouping information to obtain threshold policies of different application clusters in different storage resource pools comprises:
obtaining historical load data of different application clusters in different storage resource pools based on the grouping information;
performing aggregation calculation on the historical load data to obtain boundary thresholds of different application clusters in different storage resource pools;
generating the threshold policy based on the boundary threshold.
7. The method of claim 1, wherein prior to the anomaly detecting performance of a target object within a corresponding time window based on the threshold policy, the method further comprises:
acquiring the starting time of the storage resource pool with performance abnormity;
performing anomaly detection on historical data of each application cluster in a time window corresponding to the starting time based on a boundary threshold in the threshold strategy; or, performing anomaly detection on historical data in a time window corresponding to a period of time before the starting time of each application cluster.
8. The method of claim 3, wherein the method further comprises:
sequencing the performance anomaly detection of the application clusters according to the index correlation coefficients of the different application clusters;
and displaying the monitoring index data in the application cluster range corresponding to each index correlation coefficient according to the sequencing result.
9. The method of claim 3, wherein the method further comprises:
and outputting a change curve graph of the index correlation coefficient.
10. An electronic device, wherein the electronic device comprises:
the determining unit is used for determining the mapping relation between the application cluster and the storage resource pool based on the configuration information of the application cluster;
the generating unit is used for generating grouping information of different application clusters in different storage resource pools based on the mapping relation;
the aggregation unit is used for aggregating key storage indexes of different application clusters based on the grouping information and determining threshold strategies of the different application clusters in different storage resource pools;
and the detection unit is used for carrying out abnormal detection on the performance of a target object in a corresponding time window based on the threshold value strategy, wherein the target object comprises at least one of each application cluster or each application server in each application cluster.
CN202211151012.5A 2022-09-21 2022-09-21 Cluster data anomaly detection method and electronic equipment Pending CN115509853A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211151012.5A CN115509853A (en) 2022-09-21 2022-09-21 Cluster data anomaly detection method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211151012.5A CN115509853A (en) 2022-09-21 2022-09-21 Cluster data anomaly detection method and electronic equipment

Publications (1)

Publication Number Publication Date
CN115509853A true CN115509853A (en) 2022-12-23

Family

ID=84504319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211151012.5A Pending CN115509853A (en) 2022-09-21 2022-09-21 Cluster data anomaly detection method and electronic equipment

Country Status (1)

Country Link
CN (1) CN115509853A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118012718A (en) * 2024-04-02 2024-05-10 北京大道云行科技有限公司 Real-time monitoring method for distributed storage system

Similar Documents

Publication Publication Date Title
US20190286510A1 (en) Automatic correlation of dynamic system events within computing devices
CN107025153B (en) Disk failure prediction method and device
US20190095266A1 (en) Detection of Misbehaving Components for Large Scale Distributed Systems
US10809936B1 (en) Utilizing machine learning to detect events impacting performance of workloads running on storage systems
CN110471821B (en) Abnormality change detection method, server, and computer-readable storage medium
CN109976971B (en) Hard disk state monitoring method and device
CN116049146B (en) Database fault processing method, device, equipment and storage medium
CN109992473A (en) Monitoring method, device, equipment and the storage medium of application system
CN113837596A (en) Fault determination method and device, electronic equipment and storage medium
CN116010220A (en) Alarm diagnosis method, device, equipment and storage medium
CN110807050B (en) Performance analysis method, device, computer equipment and storage medium
CN115509853A (en) Cluster data anomaly detection method and electronic equipment
CN108667740A (en) The method, apparatus and system of flow control
CN116668264A (en) Root cause analysis method, device, equipment and storage medium for alarm clustering
CN115580528A (en) Fault root cause positioning method, device, equipment and readable storage medium
CN109144816A (en) A kind of node health degree detection method and system
CN114760190A (en) Service-oriented converged network performance anomaly detection method
US20210208962A1 (en) Failure detection and correction in a distributed computing system
Bayram et al. Improving reliability with dynamic syndrome allocation in intelligent software defined data centers
CN116057902A (en) Health index of service
CN108810230B (en) Method, device and equipment for acquiring incoming call prompt information
CN111581044A (en) Cluster optimization method, device, server and medium
US20190018723A1 (en) Aggregating metric scores
CN111815442B (en) Link prediction method and device and electronic equipment
CN115174667B (en) Big data pushing method, system and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination