WO2022257421A1 - Cluster anomaly detection method, apparatus and related device - Google Patents
Cluster anomaly detection method, apparatus and related device
- Publication number
- WO2022257421A1 (PCT/CN2021/140203)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- log
- category
- log data
- vector
- log category
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- the present disclosure relates to the technical field of computers and the Internet, and in particular to a cluster anomaly detection method and device, electronic equipment, and a computer-readable storage medium.
- the purpose of the present disclosure is to provide a cluster anomaly detection method, device, electronic device, and computer-readable storage medium, which can quickly and effectively perform anomaly detection on nodes in the cluster.
- An embodiment of the present disclosure provides a cluster anomaly detection method, including: obtaining multiple pieces of log data and multiple performance indicators from a target node in the cluster; clustering the multiple pieces of log data to determine the log category of each piece of log data; generating a log category matrix of the target node according to the log category of each piece of log data; performing feature extraction on the log category matrix through an anomaly detection model to obtain a log category vector; performing feature extraction on the multiple performance indicators through the anomaly detection model to obtain a performance indicator vector; performing vector fusion of the log category vector and the performance indicator vector through the anomaly detection model to obtain a node feature vector of the target node; and classifying the node feature vector through the anomaly detection model to determine the predicted anomaly type of the target node in the cluster.
- In some embodiments, the target node includes a first node and a second node, the multiple pieces of log data include multiple pieces of first log data from the first node and multiple pieces of second log data from the second node, and the log category matrix includes a category dimension. Generating the log category matrix of the target node according to the log category of each piece of log data includes: determining the log category corresponding to each piece of first log data, and generating a first log category sequence according to the log category corresponding to each piece of first log data; determining the log category corresponding to each piece of second log data, and generating a second log category sequence according to the log category corresponding to each piece of second log data; and splicing the first log category sequence and the second log category sequence according to the category dimension to generate the log category matrix of the target node.
- In some embodiments, clustering the multiple pieces of log data to determine the log category of each piece of log data includes: determining, in the multiple pieces of log data, the high-frequency words whose occurrence counts are greater than a target count threshold and the non-high-frequency words whose occurrence counts are less than or equal to the target count threshold; keeping the high-frequency words in the multiple pieces of log data unchanged and performing placeholder processing on the non-high-frequency words to obtain multiple log trunks; clustering the multiple pieces of log data according to the multiple log trunks to determine multiple log clusters; and determining the log category of the log data in each log cluster.
- In some embodiments, keeping the high-frequency words in the multiple pieces of log data unchanged and performing placeholder processing on the non-high-frequency words to obtain multiple log trunks includes: taking, as high-frequency associated words, the non-high-frequency words whose probability of co-occurring with the high-frequency words in the multiple pieces of log data is greater than a preset probability threshold; removing the high-frequency associated words from the non-high-frequency words; and keeping the high-frequency words and the high-frequency associated words unchanged while performing placeholder processing on the remaining non-high-frequency words, to obtain the multiple log trunks.
- In some embodiments, the multiple pieces of log data include multiple pieces of third log data collected in a first time period and multiple pieces of fourth log data collected in a second time period, and the log category matrix includes a time dimension. Generating the log category matrix of the target node according to the log category of each piece of log data includes: determining the log category corresponding to each piece of third log data, and generating a third log category sequence according to the log category corresponding to each piece of third log data; determining the log category corresponding to each piece of fourth log data, and generating a fourth log category sequence according to the log category corresponding to each piece of fourth log data; and splicing the third log category sequence and the fourth log category sequence according to the time dimension to generate the log category matrix of the target node.
- In some embodiments, performing feature extraction on the log category matrix through the anomaly detection model to obtain a log category vector includes: performing convolution processing on the log category matrix to obtain a log category convolution feature matrix; and performing pooling processing on the log category convolution feature matrix to obtain the log category vector.
- In some embodiments, the predicted anomaly type includes multiple predicted anomaly types, and the method further includes: acquiring multiple anomaly type labels of the target node; determining the loss function value corresponding to each predicted anomaly type according to the multiple predicted anomaly types and the multiple anomaly type labels; normalizing the loss function values according to the value of each predicted anomaly type to obtain a normalized loss function value; and training the anomaly detection model with the normalized loss function value.
- An embodiment of the present disclosure provides a cluster anomaly detection apparatus, including: a log data acquisition module, a log category determination module, a log category matrix determination module, a log category vector generation module, a performance indicator vector acquisition module, a node feature vector determination module, and a prediction module.
- The log data acquisition module is used to obtain multiple pieces of log data and multiple performance indicators from the target node in the cluster; the log category determination module is used to cluster the multiple pieces of log data to determine the log category of each piece of log data; the log category matrix determination module is used to generate the log category matrix of the target node according to the log category of each piece of log data; the log category vector generation module is used to perform feature extraction on the log category matrix through the anomaly detection model to obtain a log category vector; the performance indicator vector acquisition module is used to perform feature extraction on the multiple performance indicators through the anomaly detection model to obtain a performance indicator vector; the node feature vector determination module is used to perform vector fusion of the log category vector and the performance indicator vector through the anomaly detection model to obtain the node feature vector of the target node; and the prediction module is used to classify the node feature vector through the anomaly detection model to determine the predicted anomaly type of the target node in the cluster.
- An embodiment of the present disclosure provides an electronic device, which includes: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement any of the cluster anomaly detection methods described above.
- An embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, any of the cluster anomaly detection methods described above is implemented.
- An embodiment of the present disclosure provides a computer program product or computer program, where the computer program product or computer program includes computer instructions stored in a computer-readable storage medium.
- The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the cluster anomaly detection method described above.
- With the anomaly detection method, apparatus, electronic device, and computer-readable storage medium provided by the embodiments of the present disclosure, on the one hand, anomaly detection of the target node in the cluster is completed simply and conveniently through the log data and performance data of the target node; on the other hand, feature extraction and classification are performed on each piece of log data and each performance indicator of the target node through the anomaly detection model, so the anomaly type of the target node in the cluster is determined efficiently and accurately. In addition, before feature extraction, each piece of log data is first classified by clustering, and the anomaly detection model then extracts features from the categories of the log data rather than from the raw logs.
- This reduces the amount of data involved in feature extraction, makes it easier to handle large volumes of log data, and improves data processing efficiency.
- Fig. 1 shows a schematic diagram of an exemplary system architecture of a cluster anomaly detection method or a cluster anomaly detection apparatus applied to an embodiment of the present disclosure.
- Fig. 2 is a flowchart of a cluster anomaly detection method in an exemplary embodiment of the present disclosure.
- Fig. 3 is a schematic diagram of a data vectorization method according to an exemplary embodiment.
- Fig. 4 is a flowchart showing a method for determining a log category matrix according to an exemplary embodiment.
- Fig. 5 is a flowchart showing a method for determining a log category according to an exemplary embodiment.
- Fig. 6 is a flow chart showing a method for determining a log category matrix according to an exemplary embodiment.
- Fig. 7 is a schematic diagram of a network structure of an anomaly detection model according to an exemplary embodiment.
- Fig. 8 shows a block diagram of a cluster anomaly detection device according to an exemplary embodiment.
- Fig. 9 shows a schematic structural diagram of an electronic device suitable for implementing a terminal device or a server according to an embodiment of the present disclosure.
- Example embodiments will now be described more fully with reference to the accompanying drawings.
- Example embodiments may, however, be embodied in many forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
- the same reference numerals denote the same or similar parts in the drawings, and thus their repeated descriptions will be omitted.
- The terms "a", "an", "the", "said" and "at least one" are used to indicate the presence of one or more elements/components/etc.; the terms "comprising", "including" and "having" are used in an open-ended, inclusive sense and mean that there may be additional elements/components/etc. in addition to those listed; the terms "first", "second", "third", etc. are used only as labels, not as limits on the number of their objects.
- Fig. 1 shows a schematic diagram of an exemplary system architecture of a cluster anomaly detection method or a cluster anomaly detection apparatus that can be applied to an embodiment of the present disclosure.
- a system architecture 100 may include terminal devices 101 , 102 , 103 , a network 104 and a server 105 .
- the network 104 is used as a medium for providing communication links between the terminal devices 101 , 102 , 103 and the server 105 .
- Network 104 may include various connection types, such as wires, wireless communication links, or fiber optic cables, among others.
- Users can use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like.
- The terminal devices 101, 102, 103 can be various electronic devices with display screens that support web browsing, including but not limited to smartphones, tablet computers, laptop computers, desktop computers, wearable devices, virtual reality devices, smart home devices, and so on.
- the server 105 may be a server that provides various services, for example, a background management server that provides support for devices operated by users using the terminal devices 101 , 102 , 103 .
- the background management server can analyze and process the received data such as requests, and feed back the processing results to the terminal device.
- The server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, and the like; the present disclosure does not limit this.
- The server 105 may, for example, obtain multiple pieces of log data and multiple performance indicators from the target node in the cluster; the server 105 may, for example, cluster the multiple pieces of log data to determine the log category of each piece of log data; the server 105 may, for example, generate the log category matrix of the target node according to the log category of each piece of log data; the server 105 may, for example, perform feature extraction on the log category matrix through the anomaly detection model to obtain a log category vector; the server 105 may, for example, perform feature extraction on the multiple performance indicators through the anomaly detection model to obtain a performance indicator vector; the server 105 may, for example, perform vector fusion of the log category vector and the performance indicator vector through the anomaly detection model to obtain the node feature vector of the target node; and the server 105 may, for example, classify the node feature vector through the anomaly detection model to determine the predicted anomaly type of the target node in the cluster.
- The numbers of terminal devices, networks, and servers in FIG. 1 are only illustrative; the server 105 may be one physical server or may be composed of multiple servers, and there may be any number of terminal devices, networks, and servers according to actual needs.
- Fig. 2 is a flow chart showing a cluster anomaly detection method according to an exemplary embodiment.
- the method provided by the embodiments of the present disclosure can be performed by any electronic device with computing and processing capabilities.
- the method can be performed by the server or the terminal device in the above embodiment in FIG. 1 , or can be performed jointly by the server and the terminal device.
- a server is used as an example for illustration, but the disclosure is not limited thereto.
- A cluster is a group of computers that, as a whole, provides users with a set of network resources; each of these individual computers is a node of the cluster.
- the present disclosure will take the Ceph cluster (a unified distributed storage system) as an example for explanation, but the present disclosure does not limit this.
- cluster anomaly detection includes anomaly detection of nodes in the cluster.
- the cluster anomaly detection method provided by the embodiment of the present disclosure may include the following steps.
- Step S202 acquiring multiple pieces of log data and multiple performance indicators from the target node in the cluster.
- The target node can be a physical node such as any computer in the cluster, or a functional node such as an OSD (Object Storage Device) node or a MON (Monitor) node; the present disclosure does not limit this.
- the present disclosure will take the target node as an OSD node as an example for description, but the present disclosure does not limit this.
- the OSD node can be the object storage and search process of the cluster, which can be responsible for storing objects on the local file system and providing access to these objects through the network.
- the MON node can be the manager of the cluster state and maintain the state of the entire cluster.
- The multiple performance indicators may include, but are not limited to, indicators related to node performance such as CPU (Central Processing Unit) utilization, memory utilization, swap memory utilization, disk IO (Input/Output) read and write speed, and data packet sending and receiving volume.
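- As a hedged illustration only (the patent does not prescribe a collection tool), the indicators listed above could be sampled on a node with a library such as psutil; the field names in the sketch below are assumptions, not part of the disclosure.

```python
import psutil

def sample_performance_indicators():
    """Sample node-level performance indicators once (illustrative sketch only)."""
    disk = psutil.disk_io_counters()
    net = psutil.net_io_counters()
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),    # CPU utilization
        "mem_percent": psutil.virtual_memory().percent,   # memory utilization
        "swap_percent": psutil.swap_memory().percent,     # swap memory utilization
        "disk_read_bytes": disk.read_bytes,               # disk IO read volume
        "disk_write_bytes": disk.write_bytes,             # disk IO write volume
        "net_bytes_sent": net.bytes_sent,                 # data packets sent
        "net_bytes_recv": net.bytes_recv,                 # data packets received
    }
```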
- the target node can generate log data in real time.
- the multiple pieces of log data obtained in the present disclosure may include log data obtained from multiple target nodes at the same time, or may include log data obtained from the same node at different times, which is not limited in the present disclosure.
- The multiple performance indicators obtained in the present disclosure may include performance indicators obtained from multiple target nodes at the same time, or performance indicators obtained from the same node at different times; the present disclosure does not limit this.
- Step S204 clustering the multiple pieces of log data to determine the log category of each piece of log data.
- the logs may be clustered according to the log form, log content, etc., so as to divide the logs into multiple clusters, and then assign the same log category to the logs in each cluster.
- Step S206 generating a log category matrix of the target node according to the log category of each piece of log data.
- The log category sequences of the log data of the same target node may be arranged along a certain direction; for example, the log category sequence of each node's log data may be arranged as a row.
- The log category sequences of different target nodes may then be arranged along the other direction, for example stacked as columns.
- the disclosure does not limit the method for generating the log category matrix, and those skilled in the art can make adjustments according to requirements.
- Step S208 performing feature extraction on the log category matrix through an anomaly detection model to obtain a log category vector.
- The log sequence extracted from the target node is discontinuous one-hot (one-bit effective encoding) data, and it also needs to be converted into continuous vectors by word embedding (Embedding).
- Item2Vec (a bag-of-words style model) can be used for this: each log category is initialized with a random N-dimensional vector (N is an integer greater than or equal to 1, such as 50), and a window of length M (M is an integer greater than or equal to 1, such as 10) is slid over the sequence. Categories that fall within the same window are taken as positive examples and their vectors are pulled closer together, while categories outside the window are randomly sampled as negative examples and their vectors are pushed farther apart.
- The distances between these vectors thus reflect the temporal relationship between the various categories.
- Let the longest sequence segment be Lmax. The time sequence segment on each target OSD then forms a (50, Lmax) matrix, and the data of n OSDs (n is an integer greater than or equal to 1) are spliced along the first dimension (the row dimension) to form a (50×n, Lmax) matrix.
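- A minimal sketch of this embedding step, using gensim's Word2Vec skip-gram with negative sampling as a stand-in for Item2Vec (the patent does not name a library); N=50 and a window of 10 follow the example values above, and the zero-padding scheme is an assumption.

```python
import numpy as np
from gensim.models import Word2Vec

def embed_log_category_sequences(osd_sequences, n_dim=50, window=10):
    """osd_sequences: one log category sequence (list of category ids as strings) per OSD."""
    # Categories inside the window are positives, randomly sampled categories are negatives.
    model = Word2Vec(osd_sequences, vector_size=n_dim, window=window,
                     min_count=1, sg=1, negative=5)
    l_max = max(len(seq) for seq in osd_sequences)        # longest sequence segment Lmax
    per_osd = []
    for seq in osd_sequences:
        mat = np.zeros((n_dim, l_max), dtype=np.float32)  # (N, Lmax) matrix for one OSD
        for i, category in enumerate(seq):
            mat[:, i] = model.wv[category]
        per_osd.append(mat)
    return np.concatenate(per_osd, axis=0)                # (N * n, Lmax) matrix for n OSDs
```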
- The anomaly detection model can be any network model capable of feature extraction and classification, such as a convolutional neural network (CNN) or a recurrent neural network (RNN); the present disclosure does not limit this.
- Performing feature extraction on the log category matrix through the anomaly detection model may include: performing convolution processing on the log category matrix to obtain a log category convolution feature matrix; and performing pooling processing on the log category convolution feature matrix to obtain the log category vector.
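- A hedged PyTorch sketch of the convolution-plus-pooling feature extraction described above; the kernel size and channel count are illustrative assumptions, not values given in the disclosure.

```python
import torch
import torch.nn as nn

class LogCategoryEncoder(nn.Module):
    """Convolve the log category matrix, then pool it into a fixed-length log category vector."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.conv = nn.Conv2d(1, out_dim, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveMaxPool2d(1)      # global pooling over the feature map

    def forward(self, log_category_matrix):      # shape: (batch, rows, columns)
        x = self.conv(log_category_matrix.unsqueeze(1))   # log category convolution feature matrix
        return self.pool(x).flatten(1)                    # log category vector: (batch, out_dim)
```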
- Step S210 performing feature extraction on the multiple performance indicators through the abnormality detection model to obtain a performance indicator vector.
- the feature extraction process of the performance index is similar to the feature extraction process of the log category matrix, which is not limited in the present disclosure.
- Step S212 performing vector fusion of the log category vector and the performance index vector through the anomaly detection model to obtain a node feature vector of the target node.
- The fusion of the log category vector and the performance indicator vector can be completed along the dimension in which the log categories are arranged; for example, it can be completed along the row dimension, and the present disclosure does not limit this.
- vector fusion may be performed after feature extraction, or information fusion may be performed before feature extraction, so as to fuse log category information and performance index information.
- Step S214 performing classification processing on the node feature vector through the anomaly detection model to determine the predicted anomaly type of the target node in the cluster.
- the predicted anomaly category may include one or multiple categories, which is not limited in the present disclosure.
- The predicted anomaly types may include network disconnection anomalies, CPU full anomalies, memory full anomalies, and so on, which is not limited in the present disclosure.
- After classification, the probability of each predicted anomaly type can be obtained, for example, 90% for a network disconnection anomaly, 9% for a CPU full anomaly, and 1% for a memory full anomaly; when the probability of a predicted anomaly type exceeds a certain threshold (for example, 60%), the target node can be determined to have that anomaly type.
- the abnormal position of the target node can be located according to the corresponding log data when the abnormality occurs, so as to perform maintenance and processing, etc., and this disclosure does not limit this.
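- A minimal sketch of the classification step under the 60% threshold example above; the feature size, the label names, and the use of softmax are assumptions for illustration only.

```python
import torch
import torch.nn as nn

ANOMALY_TYPES = ["network_disconnection", "cpu_full", "memory_full"]   # illustrative labels

classifier = nn.Linear(128, len(ANOMALY_TYPES))   # 128 = assumed node feature vector size

def predict_anomaly_types(node_feature_vector, threshold=0.6):
    """Return the predicted anomaly types whose probability exceeds the threshold."""
    probs = torch.softmax(classifier(node_feature_vector), dim=-1)
    return [(name, p.item()) for name, p in zip(ANOMALY_TYPES, probs) if p.item() >= threshold]
```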
- With the technical solution provided by the embodiments of the present disclosure, on the one hand, anomaly detection of the target node in the cluster is completed simply and conveniently through the log data and performance data of the target node; on the other hand, feature extraction and classification are performed on the log data and the various performance indicators through the anomaly detection model, so that the anomaly type of the target node in the cluster is determined efficiently and accurately. In addition, the log data are first grouped into categories by clustering, and the anomaly detection model then extracts features from these categories rather than from the raw log data.
- This reduces the amount of data involved in feature extraction, facilitates the processing of large volumes of log data, and improves data processing efficiency.
- Fig. 4 is a flowchart showing a method for determining a log category matrix according to an exemplary embodiment.
- the target node may include a first node and a second node
- The multiple pieces of log data include multiple pieces of first log data from the first node and multiple pieces of second log data from the second node, and the log category matrix includes a category dimension.
- the category dimension may refer to the dimension of the log category arrangement of each piece of log data of a single node. For example, if the log category of each piece of log data of each node is arranged in columns, then the category dimension may be the column dimension.
- the above-mentioned method for determining a log category matrix may include the following steps.
- Step S402 determining the log category corresponding to each piece of first log data, and generating a first log category sequence according to the log category corresponding to each piece of first log data.
- the first log category corresponding to each piece of first log data may be determined through clustering processing on all log data of the target node, so as to generate the first log category sequence.
- Step S404 determining the log category corresponding to each piece of second log data, and generating a second log category sequence according to the log category corresponding to each piece of second log data.
- the second log category corresponding to each piece of second log data may be determined through clustering processing on all log data of the target node, so as to generate a second log category sequence.
- Step S406 splicing the first log category sequence and the second log category sequence according to the category dimension, so as to generate the log category matrix of the target node.
- the first log category sequence and the second log category sequence may be spliced according to category dimensions to generate the log category matrix.
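- A small numpy sketch of this splicing step (and, analogously, of the time-dimension variant described next); padding the sequences to a common length is an assumption made for the sketch.

```python
import numpy as np

def build_log_category_matrix(category_sequences, pad_value=0):
    """Stack per-node (or per-time-segment) log category sequences into a matrix."""
    l_max = max(len(seq) for seq in category_sequences)
    padded = [list(seq) + [pad_value] * (l_max - len(seq)) for seq in category_sequences]
    return np.asarray(padded)      # one row per sequence, one column per log position

first_sequence = [3, 1, 1, 7]      # log categories of the first node's log data
second_sequence = [3, 2, 5]        # log categories of the second node's log data
log_category_matrix = build_log_category_matrix([first_sequence, second_sequence])
```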
- Fig. 5 is a flowchart showing a method for determining a log category matrix according to an exemplary embodiment.
- the multiple pieces of log data may include multiple pieces of third log data collected in the first time period and multiple pieces of fourth log data collected in the second time period, and the log category matrix may include a time dimension.
- the above-mentioned method for determining a log category matrix may include the following steps.
- Step S502 determining the log category corresponding to each piece of third log data, and generating a third log category sequence according to the log category corresponding to each piece of third log data.
- Step S504 determining the log category corresponding to each piece of fourth log data, and generating a fourth log category sequence according to the log category corresponding to each piece of fourth log data.
- Step S506 performing concatenation processing on the third log category sequence and the fourth log category sequence according to the time dimension, so as to generate a log category matrix of the target node.
- the time dimension can refer to the dimension in which the log categories of multiple log data in a single node are arranged according to time. For example, if the log categories of log data in different times of each node are arranged in rows, then the time dimension can be the row dimension.
- The methods for determining the log category matrix provided in the foregoing embodiments may fuse the log categories of log data from different nodes, or fuse the log categories of log data collected at different times.
- With the technical solution provided by these embodiments, extracting features from the log categories of the log data not only allows the anomaly type of the target node to be predicted accurately, but also greatly reduces the amount of data for feature extraction compared with extracting features from the log data itself, which saves computational resources.
- Fig. 6 is a flow chart showing a method for determining a log category according to an exemplary embodiment.
- the above method for determining a log category may include the following steps.
- Step S602 determining the high-frequency words whose occurrence frequency is greater than the target frequency threshold and the non-high-frequency words whose occurrence frequency is less than or equal to the target frequency threshold in the plurality of pieces of log data.
- Word frequency statistics can be carried out over all log data of the target node to determine how often each word appears. When a word appears in the log data more often than a target count threshold (which can be set manually as needed), the word is regarded as a high-frequency word; when a word appears no more often than the target count threshold, it is regarded as a non-high-frequency word.
- Step S604 keeping the high-frequency words in the multiple pieces of log data unchanged and performing placeholder processing on the non-high-frequency words, so as to obtain multiple pieces of log trunks.
- The high-frequency words in each piece of log data can be kept unchanged, and placeholder processing can then be performed on the non-high-frequency words in each piece of log data to obtain the log trunk corresponding to that piece of log data. For example, a counter can be used as the placeholder for non-high-frequency words: if non-high-frequency words appear at a certain position in the log data, a counter can be placed at that position to record the minimum and maximum number of non-high-frequency words occurring there.
- the log data of the target node includes the following three log data.
- the second log data log_channel(cluster)log[INF]:mon.03 calling monitor election.
- the third log data log_channel(cluster)log[WRN]: Health check update: 1/5 mons down.
- The log trunks of the above three log data can then be:
- the first log trunk: log_channel(cluster)log (high-frequency word or log key) *{1, 6} (counter).
- the second log trunk: log_channel(cluster)log (high-frequency word or log key) *{1, 8} (counter).
- the third log trunk: log_channel(cluster)log (high-frequency word or log key) *{4, 8} (counter).
- the backbone of each piece of log data may also be generated by the following method.
- non-high-frequency words whose probability of appearing simultaneously with the high-frequency words in the multiple pieces of log data is greater than the preset probability threshold as high-frequency associated words; removing the high-frequency associated words from the non-high-frequency words; keeping The high-frequency words and the high-frequency associated words in the multiple pieces of log data remain unchanged, and the non-high-frequency words are subjected to placeholder processing to obtain multiple log trunks.
- For example, if [DBG], [INF], and [WRN] each appear together with the high-frequency word log_channel(cluster)log many times in the multiple pieces of log data of the target node, [DBG], [INF], and [WRN] can be taken as high-frequency associated words of the high-frequency word log_channel(cluster)log.
- [DBG], [INF], and [WRN] can then be removed from the non-high-frequency words; when generating the log trunks, the high-frequency words and the high-frequency associated words are kept unchanged, and only the remaining non-high-frequency words are replaced by placeholders.
- In this case, the above three log data generate the following log trunks:
- the first log trunk: log_channel(cluster)log (high-frequency word or log key) *[DBG]* {1, 6} (counter).
- the second log trunk: log_channel(cluster)log (high-frequency word or log key) *[INF]* {1, 8} (counter).
- the third log trunk: log_channel(cluster)log (high-frequency word or log key) *[WRN]* {4, 8} (counter).
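- A hedged sketch of the trunk extraction and clustering described above: count word frequencies, keep the high-frequency words of each line, and summarize the replaced low-frequency words with a {min, max} counter; the threshold value and the whitespace tokenization are assumptions.

```python
from collections import Counter

def build_log_trunks(log_lines, freq_threshold=100):
    """Group log lines by their trunk of high-frequency words; each group is one log category."""
    freq = Counter(word for line in log_lines for word in line.split())
    high_freq = {w for w, c in freq.items() if c > freq_threshold}

    groups = {}
    for line in log_lines:
        words = line.split()
        trunk = tuple(w for w in words if w in high_freq)     # high-frequency words, order kept
        replaced = len(words) - len(trunk)                    # low-frequency words replaced
        groups.setdefault(trunk, []).append(replaced)

    # The counter records the minimum and maximum number of replaced low-frequency words,
    # and all lines that share a trunk fall into the same cluster (the same log category).
    return {trunk: (min(counts), max(counts)) for trunk, counts in groups.items()}
```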
- Step S606 clustering the multiple log data according to the multiple log backbones to determine multiple log clusters.
- For example, log data with the same log trunk can be clustered into one cluster, but the present disclosure does not limit the log clustering method.
- Step S608 determining the log category of the data in each log cluster.
- the category of each cluster may be used to represent the log category of each piece of log data in the clustering result, and the present disclosure does not limit the manner of determining the log category in each log cluster.
- the technical solutions provided in FIG. 2 , FIG. 4 , FIG. 5 and FIG. 6 can be used in the training process of the anomaly detection model, and can also be used in the process of cluster anomaly detection, which is not limited in the present disclosure.
- the following method can be used to determine the loss function.
- Cluster anomalies are relatively rare. If the cluster anomaly detection model is trained directly on measured data, the small number of negative samples corresponding to cluster anomalies will make the training results inaccurate, which in turn lowers the accuracy of determining the predicted anomaly type.
- the present disclosure proposes the following method to determine the loss function of the anomaly detection model, which can be explained in combination with formula (1).
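- Formula (1) itself is not reproduced in this text; the sketch below is only one hedged way to normalize a per-type loss so that rare anomaly classes are not drowned out, and it should not be read as the disclosure's actual formula.

```python
import torch
import torch.nn.functional as F

def normalized_multilabel_loss(logits, labels):
    """Per-type binary cross-entropy, reweighted so each predicted anomaly type contributes
    comparably despite the scarcity of anomalous samples (illustrative stand-in only)."""
    per_type = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    per_type = per_type.mean(dim=0)                    # loss value per anomaly type
    weights = per_type / (per_type.sum() + 1e-8)       # normalize across the anomaly types
    return (weights.detach() * per_type).sum()
```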
- the present disclosure also provides the following technical solutions to realize cluster anomaly detection.
- Log data is a kind of text data, but unlike natural language text, log formats are comparatively loose and do not strictly follow grammar. Log data is generally written in a specific format (such as timestamp, event, variable), and its structure is simple and recurring, so it is convenient to analyze it with statistical methods.
- The log feature extraction algorithm is an unsupervised clustering algorithm for logs. First, the frequency of each word in the logs is counted and a frequency threshold is set manually: when the frequency of a word is higher than the threshold, it is considered a high-frequency word; when it is lower, it is considered a low-frequency word. The high-frequency words are used as the trunk of the log, and the high-frequency words are then merged to a certain extent: when the probability that a word (such as key_n) appears at the same time as the other words in the trunk (such as key_{n-1} ... key_2 key_1) is greater than a certain threshold, that is, when P(key_n | key_{n-1} ... key_2 key_1) exceeds the threshold, key_n is also merged into the trunk.
- the algorithm uses a counter to describe the low-frequency word, and the counter records the minimum and maximum occurrence times of the low-frequency word.
- Logs are clustered according to the trunk of each log, and logs with the same trunk are grouped into one category.
- For example, time can be divided into segments with a granularity of 5 minutes, and the log sequence is truncated into sequence segments of varying length according to the timestamps of the logs in each time segment; at the same time, the performance indicators within each time segment (including CPU utilization, memory utilization, swap memory utilization, disk IO reads and writes, data packet sending and receiving, and so on) are collected.
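- A small sketch of cutting the log stream into 5-minute segments by timestamp; the record format (Unix timestamp plus log category) is an assumption for illustration.

```python
from collections import defaultdict

SEGMENT_SECONDS = 5 * 60   # 5-minute granularity

def segment_log_categories(records):
    """records: iterable of (unix_timestamp, log_category); returns per-segment sequences."""
    segments = defaultdict(list)
    for timestamp, category in sorted(records):
        segments[int(timestamp) // SEGMENT_SECONDS].append(category)
    return segments            # {segment index: log category sequence for that 5-minute window}
```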
- the overall framework of the deep learning model we designed is shown in Figure 7.
- the log sequence extracted from the OSD is discontinuous one-hot data, and we need to use the word embedding (Embedding) method to convert it into a continuous vector.
- The vector extracted from the logs is concatenated with the normalized indicator vectors of multiple OSDs as the input of the final fully connected layer; this vector covers the information of both the log data and the indicator data.
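- A hedged PyTorch sketch of the overall structure outlined in Figure 7: a log branch, an indicator branch, concatenation of the two vectors, and a final fully connected layer; every dimension and layer choice below is an assumption, not the patent's exact network.

```python
import torch
import torch.nn as nn

class ClusterAnomalyDetector(nn.Module):
    def __init__(self, log_dim=64, indicator_dim=16, num_anomaly_types=3):
        super().__init__()
        self.log_branch = nn.Sequential(          # feature extraction on the log category matrix
            nn.Conv2d(1, log_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool2d(1),
            nn.Flatten(),
        )
        self.indicator_branch = nn.Sequential(    # feature extraction on the performance indicators
            nn.Linear(indicator_dim, indicator_dim),
            nn.ReLU(),
        )
        self.head = nn.Linear(log_dim + indicator_dim, num_anomaly_types)

    def forward(self, log_category_matrix, indicators):
        log_vec = self.log_branch(log_category_matrix.unsqueeze(1))   # log category vector
        ind_vec = self.indicator_branch(indicators)                   # performance indicator vector
        node_feature = torch.cat([log_vec, ind_vec], dim=-1)          # node feature vector
        return self.head(node_feature)                                # score per anomaly type
```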
- Fig. 8 shows a block diagram of a cluster anomaly detection device according to an exemplary embodiment.
- the cluster anomaly detection device 800 provided by the embodiment of the present disclosure may include: a log data acquisition module 801 , a log category determination module 802 , a log category matrix determination module 803 , a log category vector generation module 804 , and a performance index vector acquisition module 805 , a node feature vector determination module 806 and a prediction module 807 .
- The log data acquisition module 801 can be used to obtain multiple pieces of log data and multiple performance indicators from the target node in the cluster; the log category determination module 802 can be used to cluster the multiple pieces of log data to determine the log category of each piece of log data; the log category matrix determination module 803 can be used to generate the log category matrix of the target node according to the log category of each piece of log data; the log category vector generation module 804 can be used to perform feature extraction on the log category matrix through the anomaly detection model to obtain a log category vector; the performance indicator vector acquisition module 805 can be used to perform feature extraction on the multiple performance indicators through the anomaly detection model to obtain a performance indicator vector; the node feature vector determination module 806 can be used to perform vector fusion of the log category vector and the performance indicator vector through the anomaly detection model to obtain the node feature vector of the target node; and the prediction module 807 can be used to classify the node feature vector through the anomaly detection model to determine the predicted anomaly type of the target node in the cluster.
- In some embodiments, the target node includes a first node and a second node, the multiple pieces of log data include multiple pieces of first log data from the first node and multiple pieces of second log data from the second node, and the log category matrix includes a category dimension. The log category matrix determination module 803 may include: a first log category sequence generation unit, a second log category sequence generation unit, and a first splicing unit.
- The first log category sequence generation unit can be used to determine the log category corresponding to each piece of first log data and generate the first log category sequence according to the log category corresponding to each piece of first log data; the second log category sequence generation unit can be used to determine the log category corresponding to each piece of second log data and generate the second log category sequence according to the log category corresponding to each piece of second log data; and the first splicing unit can be used to splice the first log category sequence and the second log category sequence according to the category dimension to generate the log category matrix of the target node.
- the log category determination module 802 may include: a high-frequency word determination unit, a log trunk determination unit, a log clustering unit, and a log category determination unit.
- The high-frequency word determination unit can be used to determine, in the multiple pieces of log data, the high-frequency words whose occurrence counts are greater than the target count threshold and the non-high-frequency words whose occurrence counts are less than or equal to the target count threshold; the log trunk determination unit can be used to keep the high-frequency words in the multiple pieces of log data unchanged and perform placeholder processing on the non-high-frequency words to obtain multiple log trunks; the log clustering unit can be used to cluster the multiple pieces of log data according to the multiple log trunks to determine multiple log clusters; and the log category determination unit can be used to determine the log category of the log data in each log cluster.
- In some embodiments, the log trunk determination unit may include: a high-frequency associated word determination subunit, a removal subunit, and a placeholder subunit.
- The high-frequency associated word determination subunit can be used to take, as high-frequency associated words, the non-high-frequency words whose probability of co-occurring with the high-frequency words in the multiple pieces of log data is greater than a preset probability threshold; the removal subunit can be used to remove the high-frequency associated words from the non-high-frequency words; and the placeholder subunit can be used to keep the high-frequency words and the high-frequency associated words in the multiple pieces of log data unchanged and perform placeholder processing on the remaining non-high-frequency words to obtain the multiple log trunks.
- In some embodiments, the multiple pieces of log data include multiple pieces of third log data collected in the first time period and multiple pieces of fourth log data collected in the second time period, and the log category matrix includes a time dimension.
- The log category matrix determination module 803 may include: a third log category sequence determination unit, a fourth log category sequence determination unit, and a second splicing unit.
- The third log category sequence determination unit can be used to determine the log category corresponding to each piece of third log data and generate a third log category sequence according to the log category corresponding to each piece of third log data; the fourth log category sequence determination unit can be used to determine the log category corresponding to each piece of fourth log data and generate a fourth log category sequence according to the log category corresponding to each piece of fourth log data; and the second splicing unit can be used to splice the third log category sequence and the fourth log category sequence according to the time dimension to generate the log category matrix of the target node.
- the log category vector generating module 804 may include: a convolution unit and a pooling unit.
- The convolution unit can be used to perform convolution processing on the log category matrix to obtain a log category convolution feature matrix; the pooling unit can be used to perform pooling processing on the log category convolution feature matrix to obtain the log category vector.
- the predicted anomaly type includes multiple predicted anomaly types; wherein, the cluster anomaly detection apparatus 800 further includes: a label acquisition module, a loss function value acquisition module, a normalization module and a training module.
- The label acquisition module can be used to obtain multiple anomaly type labels of the target node; the loss function value acquisition module can be used to determine the loss function value corresponding to each predicted anomaly type according to the multiple predicted anomaly types and the multiple anomaly type labels; the normalization module can be used to normalize the loss function values according to the value of each predicted anomaly type to obtain a normalized loss function value; and the training module can be used to train the anomaly detection model with the normalized loss function value.
- modules and/or units and/or subunits involved in the embodiments described in the present application may be implemented by software or by hardware.
- the described modules and/or units and/or subunits may also be provided in a processor. Wherein, the names of these modules and/or units and/or subunits do not constitute limitations on the modules and/or units and/or subunits themselves under certain circumstances.
- Each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
- the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
- Each block in the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
- Fig. 9 shows a schematic structural diagram of an electronic device suitable for implementing a terminal device or a server according to an embodiment of the present disclosure. It should be noted that the electronic device 900 shown in FIG. 9 is only an example, and should not limit the functions and application scope of the embodiments of the present disclosure.
- An electronic device 900 includes a central processing unit (CPU) 901, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage section 908 into a random access memory (RAM) 903.
- In the RAM 903, various programs and data necessary for the operation of the electronic device 900 are also stored.
- the CPU 901, ROM 902, and RAM 903 are connected to each other via a bus 904.
- An input/output (I/O) interface 905 is also connected to the bus 904 .
- The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, etc.; an output section 907 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker; a storage section 908 including a hard disk, etc.; and a communication section 909 including a network interface card such as a LAN card, a modem, and the like.
- the communication section 909 performs communication processing via a network such as the Internet.
- a drive 910 is also connected to the I/O interface 905 as needed.
- a removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc. is mounted on the drive 910 as necessary so that a computer program read therefrom is installed into the storage section 908 as necessary.
- the processes described above with reference to the flowcharts can be implemented as computer software programs.
- the embodiments of the present disclosure include a computer program product, which includes a computer program carried on a computer-readable storage medium, where the computer program includes program codes for executing the methods shown in the flowcharts.
- the computer program may be downloaded and installed from a network via communication portion 909 and/or installed from removable media 911 .
- When this computer program is executed by the central processing unit (CPU) 901, the above-mentioned functions defined in the system of the present application are performed.
- the computer-readable storage medium shown in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two.
- a computer readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable Programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
- a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
- a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program codes are carried. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
- A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
- Program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wires, optical cables, RF, etc., or any suitable combination of the foregoing.
- The present application also provides a computer-readable storage medium, which may be included in the device described in the above embodiments, or may exist independently without being assembled into the device.
- The above computer-readable storage medium carries one or more programs; when the one or more programs are executed by the device, the device can implement functions including: obtaining multiple pieces of log data and multiple performance indicators from the target node in the cluster; clustering the multiple pieces of log data to determine the log category of each piece of log data; generating the log category matrix of the target node according to the log category of each piece of log data; performing feature extraction on the log category matrix through the anomaly detection model to obtain a log category vector; performing feature extraction on the multiple performance indicators through the anomaly detection model to obtain a performance indicator vector; performing vector fusion of the log category vector and the performance indicator vector through the anomaly detection model to obtain the node feature vector of the target node; and classifying the node feature vector through the anomaly detection model to determine the predicted anomaly type of the target node in the cluster.
- a computer program product or computer program comprising computer instructions stored in a computer readable storage medium.
- the processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the methods provided in various optional implementation manners of the foregoing embodiments.
Abstract
A cluster anomaly detection method, apparatus, and related device. The method includes: obtaining multiple pieces of log data and multiple performance indicators from a target node in the cluster (S202); clustering the multiple pieces of log data to determine the log category of each piece of log data (S204); generating a log category matrix of the target node according to the log category of each piece of log data (S206); performing feature extraction on the log category matrix through the anomaly detection model to obtain a log category vector (S208); performing feature extraction on the multiple performance indicators through the anomaly detection model to obtain a performance indicator vector (S210); performing vector fusion of the log category vector and the performance indicator vector through the anomaly detection model to obtain a node feature vector of the target node (S212); and classifying the node feature vector through the anomaly detection model to determine the predicted anomaly type of the target node in the cluster (S214).
Description
The present disclosure claims priority to the Chinese invention patent application with a filing date of 2021.06.10, application number 202110648870.X, and the title "Cluster anomaly detection method, apparatus and related device".
The present disclosure relates to the technical field of computers and the Internet, and in particular to a cluster anomaly detection method and apparatus, an electronic device, and a computer-readable storage medium.
With the rapid development of the Internet and the continuous increase of Internet users, Internet enterprises have ever higher requirements for computing and storage capacity. For an enterprise of a certain scale, the computing and storage capacity of a single server is far from sufficient, and the enterprise needs to purchase and build large-scale clusters.
In the daily operation and maintenance of a cluster, detection methods based on a single indicator cannot perform comprehensive anomaly detection on the cluster. As the cluster scale grows rapidly, the traditional approach of discovering cluster anomalies through manual operation and maintenance places an ever-increasing workload on operation and maintenance personnel.
Therefore, a simple and effective cluster anomaly detection method is very important for cluster operation and maintenance.
It should be noted that the information disclosed in the background section above is only intended to enhance the understanding of the background of the present disclosure.
Summary of the invention
The purpose of the present disclosure is to provide a cluster anomaly detection method, apparatus, electronic device, and computer-readable storage medium that can quickly and effectively perform anomaly detection on the nodes in a cluster.
Other features and advantages of the present disclosure will become apparent from the following detailed description, or may be learned in part through practice of the present disclosure.
本公开实施例提供了一种集群异常检测方法,包括:从所述集群中的目标节点获取多条日志数据和多个性能指标;对所述多条日志数据进行聚类处理,以确定各条日志数据的日志类别;根据各条日志数据的日志类别生成所述目标节点的日志类别矩阵;通过所述异常检测模型对所述日志类别矩阵进行特征提取,以获得日志类别向量;通过所述异常检测模型对所述多个性能指标进行特征提取,以获得性能指标向量;通过所述异常检测模型将所述日志类别向量和所述性能指标向量进行向量融合,以获得所述目标节点的节点特征向量;通过所述异常检测模型对所述节点特征向量进行分类处理,以确定所 述集群中的目标节点的预测异常类型。
In some embodiments, the target node includes a first node and a second node, the multiple pieces of log data include multiple pieces of first log data from the first node and multiple pieces of second log data from the second node, and the log category matrix includes a category dimension. Generating the log category matrix of the target node according to the log category of each piece of log data includes: determining the log category corresponding to each piece of first log data, and generating a first log category sequence according to the log categories corresponding to the pieces of first log data; determining the log category corresponding to each piece of second log data, and generating a second log category sequence according to the log categories corresponding to the pieces of second log data; and concatenating the first log category sequence and the second log category sequence along the category dimension to generate the log category matrix of the target node.
In some embodiments, clustering the multiple pieces of log data to determine the log category of each piece of log data includes: determining, in the multiple pieces of log data, high-frequency words whose occurrence count is greater than a target count threshold and non-high-frequency words whose occurrence count is less than or equal to the target count threshold; keeping the high-frequency words in the multiple pieces of log data unchanged and applying placeholder processing to the non-high-frequency words to obtain multiple log skeletons; clustering the multiple pieces of log data according to the multiple log skeletons to determine multiple log clusters; and determining the log category of the log data in each log cluster.
In some embodiments, keeping the high-frequency words in the multiple pieces of log data unchanged and applying placeholder processing to the non-high-frequency words to obtain multiple log skeletons includes: taking, as high-frequency associated words, the non-high-frequency words whose probability of co-occurring with the high-frequency words in the multiple pieces of log data is greater than a preset probability threshold; removing the high-frequency associated words from the non-high-frequency words; and keeping the high-frequency words and the high-frequency associated words in the multiple pieces of log data unchanged while applying placeholder processing to the remaining non-high-frequency words, so as to obtain multiple log skeletons.
In some embodiments, the multiple pieces of log data include multiple pieces of third log data collected in a first time period and multiple pieces of fourth log data collected in a second time period, and the log category matrix includes a time dimension. Generating the log category matrix of the target node according to the log category of each piece of log data includes: determining the log category corresponding to each piece of third log data, and generating a third log category sequence according to the log categories corresponding to the pieces of third log data; determining the log category corresponding to each piece of fourth log data, and generating a fourth log category sequence according to the log categories corresponding to the pieces of fourth log data; and concatenating the third log category sequence and the fourth log category sequence along the time dimension to generate the log category matrix of the target node.
In some embodiments, performing feature extraction on the log category matrix through the anomaly detection model to obtain a log category vector includes: performing convolution on the log category matrix to obtain a log category convolution feature matrix; and performing pooling on the log category convolution feature matrix to obtain the log category vector.
In some embodiments, the predicted anomaly type includes multiple predicted anomaly types, and the method further includes: obtaining multiple anomaly type labels of the target node; determining the loss function value corresponding to each predicted anomaly type according to the multiple predicted anomaly types and the multiple anomaly type labels; normalizing the loss function values according to the count of each predicted anomaly type to obtain normalized loss function values; and training the anomaly detection model with the normalized loss function values.
An embodiment of the present disclosure provides a cluster anomaly detection apparatus, including: a log data obtaining module, a log category determining module, a log category matrix determining module, a log category vector generating module, a performance indicator vector obtaining module, a node feature vector determining module, and a prediction module.
The log data obtaining module is configured to obtain multiple pieces of log data and multiple performance indicators from a target node in the cluster; the log category determining module is configured to cluster the multiple pieces of log data to determine the log category of each piece of log data; the log category matrix determining module is configured to generate a log category matrix of the target node according to the log category of each piece of log data; the log category vector generating module is configured to perform feature extraction on the log category matrix through the anomaly detection model to obtain a log category vector; the performance indicator vector obtaining module is configured to perform feature extraction on the multiple performance indicators through the anomaly detection model to obtain a performance indicator vector; the node feature vector determining module is configured to fuse the log category vector and the performance indicator vector through the anomaly detection model to obtain a node feature vector of the target node; and the prediction module is configured to classify the node feature vector through the anomaly detection model to determine the predicted anomaly type of the target node in the cluster.
An embodiment of the present disclosure provides an electronic device, including: one or more processors; and a storage apparatus configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the cluster anomaly detection method described in any of the above.
An embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, and the program, when executed by a processor, implements the cluster anomaly detection method described in any of the above.
An embodiment of the present disclosure provides a computer program product or computer program including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the above cluster anomaly detection method.
With the anomaly detection method, apparatus, electronic device and computer-readable storage medium provided by the embodiments of the present disclosure, on the one hand, anomaly detection of the target node in the cluster is accomplished simply and conveniently through the log data and performance data of the target node; on the other hand, the anomaly detection model performs feature extraction and classification on the pieces of log data and performance indicators of the target node, so that the anomaly type of the target node in the cluster is determined efficiently and accurately. In addition, before feature extraction, the pieces of log data are first assigned categories through clustering, and the anomaly detection model then extracts features from those log categories; this reduces the amount of data involved in feature extraction, makes it practical to process large volumes of log data, and improves processing efficiency.
It should be understood that the above general description and the following detailed description are merely exemplary and do not limit the present disclosure.
The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and serve, together with the description, to explain the principles of the present disclosure. Obviously, the drawings described below show only some embodiments of the present disclosure, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an exemplary system architecture to which the cluster anomaly detection method or apparatus of the embodiments of the present disclosure can be applied.
Fig. 2 is a flowchart of a cluster anomaly detection method in an exemplary embodiment of the present disclosure.
Fig. 3 is a schematic diagram of a data vectorization method according to an exemplary embodiment.
Fig. 4 is a flowchart of a log category matrix determination method according to an exemplary embodiment.
Fig. 5 is a flowchart of another log category matrix determination method according to an exemplary embodiment.
Fig. 6 is a flowchart of a log category determination method according to an exemplary embodiment.
Fig. 7 is a schematic diagram of the network structure of an anomaly detection model according to an exemplary embodiment.
Fig. 8 is a block diagram of a cluster anomaly detection apparatus according to an exemplary embodiment.
Fig. 9 is a schematic structural diagram of an electronic device suitable for implementing a terminal device or server of the embodiments of the present disclosure.
Exemplary embodiments will now be described more fully with reference to the accompanying drawings. However, exemplary embodiments can be implemented in many forms and should not be construed as being limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the concepts of the exemplary embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar parts, and their repeated description will be omitted.
The features, structures or characteristics described in the present disclosure may be combined in one or more embodiments in any suitable manner. In the following description, numerous specific details are provided to give a full understanding of the embodiments of the present disclosure. However, those skilled in the art will appreciate that the technical solutions of the present disclosure may be practiced with one or more of the specific details omitted, or with other methods, components, apparatuses, steps, etc. In other cases, well-known methods, apparatuses, implementations or operations are not shown or described in detail so as not to obscure aspects of the present disclosure.
The drawings are merely schematic illustrations of the present disclosure; the same reference numerals in the drawings denote the same or similar parts, and their repeated description will be omitted. Some of the block diagrams shown in the drawings do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor apparatuses and/or microcontroller apparatuses.
The flowcharts shown in the drawings are only exemplary illustrations; they do not necessarily include all contents and steps, nor do they have to be performed in the order described. For example, some steps may be further decomposed, while others may be merged or partially merged, so the actual execution order may change according to the actual situation.
In this specification, the terms "a", "an", "the", "said" and "at least one" are used to indicate the presence of one or more elements/components/etc.; the terms "comprise", "include" and "have" are used in an open-ended, inclusive sense and mean that there may be additional elements/components/etc. besides those listed; the terms "first", "second", "third", etc. are used only as labels and are not intended to limit the number of their objects.
Exemplary embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an exemplary system architecture to which the cluster anomaly detection method or apparatus of the embodiments of the present disclosure can be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102 and 103, a network 104, and a server 105. The network 104 is the medium providing communication links between the terminal devices 101, 102, 103 and the server 105, and may include various connection types, such as wired or wireless communication links, or optical fiber cables.
Users may use the terminal devices 101, 102 and 103 to interact with the server 105 through the network 104 to receive or send messages, etc. The terminal devices 101, 102 and 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, laptop computers, desktop computers, wearable devices, virtual reality devices, smart home devices and the like.
The server 105 may be a server providing various services, for example a back-end management server that supports the devices operated by the users of the terminal devices 101, 102 and 103. The back-end management server may analyze and otherwise process received data such as requests, and feed the processing results back to the terminal devices.
The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms, which is not limited in the present disclosure. The server 105 may, for example, obtain multiple pieces of log data and multiple performance indicators from a target node in the cluster; cluster the multiple pieces of log data to determine the log category of each piece of log data; generate a log category matrix of the target node according to the log category of each piece of log data; perform feature extraction on the log category matrix through the anomaly detection model to obtain a log category vector; perform feature extraction on the multiple performance indicators through the anomaly detection model to obtain a performance indicator vector; fuse the log category vector and the performance indicator vector through the anomaly detection model to obtain a node feature vector of the target node; and classify the node feature vector through the anomaly detection model to determine the predicted anomaly type of the target node in the cluster.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 1 are merely illustrative; the server 105 may be a single physical server or may consist of multiple servers, and there may be any number of terminal devices, networks and servers according to actual needs.
Fig. 2 is a flowchart of a cluster anomaly detection method according to an exemplary embodiment. The method provided by the embodiments of the present disclosure may be executed by any electronic device with computing and processing capabilities, for example by the server or terminal device in the embodiment of Fig. 1 above, or jointly by the server and the terminal device. In the following embodiments, the server is taken as the executing entity by way of example, but the present disclosure is not limited thereto.
A cluster is a group of computers that, as a whole, provides a set of network resources to users; each of these individual computers is a node of the cluster.
The present disclosure uses a Ceph cluster (a unified distributed storage system) as an example for explanation, but the present disclosure is not limited thereto.
It can be understood that cluster anomaly detection includes anomaly detection of the nodes in the cluster.
Referring to Fig. 2, the cluster anomaly detection method provided by the embodiments of the present disclosure may include the following steps.
Step S202: obtain multiple pieces of log data and multiple performance indicators from a target node in the cluster.
The target node may be any physical node such as a computer in the cluster, or a functional node such as an OSD (Object Storage Device) node or a MON (Monitor) node, which is not limited in the present disclosure.
The present disclosure uses an OSD node as the target node by way of example, but is not limited thereto.
An OSD node may be the object storage and retrieval process of the cluster; it may be responsible for storing objects on the local file system and providing access to these objects over the network.
A MON node may be the manager of the cluster state, maintaining the state of the entire cluster.
The multiple performance indicators may include, but are not limited to, indicators related to node performance such as CPU (Central Processing Unit) utilization, memory utilization, swap memory utilization, disk I/O (Input/Output) read/write speed, and the volume of packets sent and received.
In some embodiments, there may be one or multiple target nodes in the cluster, which is not limited in the present disclosure.
It can be understood that a target node can produce log data in real time. The multiple pieces of log data obtained in the present disclosure may include log data obtained from multiple target nodes at the same moment, or log data obtained from the same node at different moments, which is not limited in the present disclosure.
It can be understood that the multiple performance indicators obtained in the present disclosure may include multiple performance indicators obtained from multiple target nodes at the same moment, or multiple performance indicators obtained from the same node at different moments, which is not limited in the present disclosure.
Step S204: cluster the multiple pieces of log data to determine the log category of each piece of log data.
In some embodiments, the logs may be clustered according to log format, log content, etc., so as to divide the logs into multiple clusters, and the logs in each cluster are then assigned the same log category.
Step S206: generate a log category matrix of the target node according to the log category of each piece of log data.
In some embodiments, the log category sequence of the log data of the same target node may be arranged along one direction, for example by rows.
In some embodiments, the log category sequences of the pieces of log data may be arranged along another direction; for example, the log category sequences of different target nodes may be arranged by columns.
The present disclosure does not limit the method of generating the log category matrix, and those skilled in the art may adjust it as needed; a minimal sketch of one possible arrangement is given below.
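The following NumPy sketch shows one possible arrangement only (not the disclosed layout itself): each node's log-category-id sequence forms one row, and rows from different nodes are stacked. The variable names and values are illustrative assumptions.

```python
import numpy as np

# Toy category-id sequences from two target nodes over the same observation window.
node_a_categories = np.array([3, 3, 7, 1, 0, 7])
node_b_categories = np.array([2, 2, 2, 9, 1, 1])

# Stack the per-node sequences along the first axis to form the log category matrix.
log_category_matrix = np.stack([node_a_categories, node_b_categories], axis=0)
print(log_category_matrix.shape)  # (2, 6): nodes x time steps
```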
Step S208: perform feature extraction on the log category matrix through an anomaly detection model to obtain a log category vector.
In some embodiments, the log category sequence extracted from the target node consists of discrete one-hot data, which still needs to be converted into continuous vectors using a word embedding method. For example, an Item2Vec model (a bag-of-words-style model) can effectively extract a vector for each log category. As shown in Fig. 3, each log category may first be assigned a random N-dimensional vector (N is an integer greater than or equal to 1, for example 50); a window of length M (M is an integer greater than or equal to 1, for example 10) is slid over the sequence; categories inside the window are taken as positive examples and their vectors are pulled closer together, while some categories outside the window are randomly sampled as negative examples and their vectors are pushed apart. In this way each log category is converted into a vector, and the distances between these vectors reflect the temporal relationships between the categories. Let the longest sequence segment be Lmax; then the sequence segment on each target OSD is a (50, Lmax) matrix, and the data of n OSDs (n is an integer greater than or equal to 1) are concatenated along the first dimension (for example, the row dimension) to form a (50×n, Lmax) matrix. One-dimensional convolution kernels of lengths 3 and 5 are applied to the log matrix along the second dimension, and max pooling then converts the log matrix into two one-dimensional vectors of length 50×n, thereby obtaining the log category vector.
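The PyTorch sketch below illustrates the convolution-plus-max-pooling step just described; it is a simplified reading, and the class name, the padding choice and the example sizes (n = 4 OSDs, embedding dimension 50, segment length 120) are assumptions for illustration rather than the disclosed implementation.

```python
import torch
import torch.nn as nn

class LogCategoryEncoder(nn.Module):
    """1-D convolutions with kernel sizes 3 and 5 over the (50*n, Lmax) log matrix,
    each followed by max pooling over the sequence axis."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv5 = nn.Conv1d(channels, channels, kernel_size=5, padding=2)

    def forward(self, x):                       # x: (batch, channels, seq_len)
        v3 = self.conv3(x).max(dim=-1).values   # (batch, channels)
        v5 = self.conv5(x).max(dim=-1).values   # (batch, channels)
        return v3, v5                           # two vectors of length 50*n

encoder = LogCategoryEncoder(channels=50 * 4)   # n = 4 OSDs (assumed)
log_matrix = torch.randn(1, 50 * 4, 120)        # Lmax = 120 (assumed)
v3, v5 = encoder(log_matrix)
```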
In some embodiments, the anomaly detection model may be any network model capable of feature extraction and classification, for example a convolutional neural network (CNN) or a recurrent neural network (RNN), which is not limited in the present disclosure.
Performing feature extraction on the log category matrix through the anomaly detection model may include: performing convolution on the log category matrix to obtain a log category convolution feature matrix; and performing pooling on the log category convolution feature matrix to obtain the log category vector.
Step S210: perform feature extraction on the multiple performance indicators through the anomaly detection model to obtain a performance indicator vector.
The feature extraction process for the performance indicators is similar to that for the log category matrix, which is not limited in the present disclosure.
Step S212: fuse the log category vector and the performance indicator vector through the anomaly detection model to obtain a node feature vector of the target node.
In some embodiments, the fusion of the log category vector and the performance indicator vector may be performed along the dimension in which the log categories lie, for example along the row dimension; the present disclosure does not limit the fusion method.
It can be understood that vector fusion may be performed after feature extraction, or information fusion may be performed before feature extraction, so as to fuse the log category information with the performance indicator information; a simple concatenation sketch is given below.
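A minimal NumPy sketch of post-extraction fusion by simple concatenation; the vector lengths are illustrative assumptions.

```python
import numpy as np

log_category_vector = np.random.rand(200)    # output of the log branch (assumed length)
performance_vector = np.random.rand(12)      # e.g. CPU, memory, swap, disk I/O, packets
node_feature_vector = np.concatenate([log_category_vector, performance_vector])
print(node_feature_vector.shape)             # (212,)
```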
Step S214: classify the node feature vector through the anomaly detection model to determine the predicted anomaly type of the target node in the cluster.
In some embodiments, there may be one or multiple predicted anomaly types, which is not limited in the present disclosure. For example, the predicted anomaly types may include a network disconnection anomaly, a CPU-full anomaly, a memory-full anomaly, and so on.
In some embodiments, after the node feature vector is classified by the anomaly detection model, the probability of each predicted anomaly type can be obtained, for example 90% network disconnection anomaly, 9% CPU-full anomaly and 1% memory-full anomaly.
It can be understood that only when the probability of a certain predicted anomaly type is greater than a certain threshold (for example 60%) can the target node be considered abnormal, and the cluster in turn be judged abnormal; a small sketch of this thresholding rule follows.
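The sketch below mirrors the worked example in the text; the anomaly-type names and the 60% threshold are the illustrative values mentioned there.

```python
# Softmax outputs from the anomaly detection model for one target node (toy values).
probs = {"network_down": 0.90, "cpu_full": 0.09, "memory_full": 0.01}
THRESHOLD = 0.6

# The node is considered abnormal only if some predicted type exceeds the threshold.
anomalies = {name: p for name, p in probs.items() if p > THRESHOLD}
node_is_abnormal = bool(anomalies)
print(anomalies, node_is_abnormal)   # {'network_down': 0.9} True
```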
Generally speaking, after a target node is judged abnormal, the location of the anomaly on the target node can be pinpointed based on the log data corresponding to the time when the anomaly occurred, so that repair and other handling can be carried out; the present disclosure does not limit this.
In the technical solution provided by the embodiments of the present disclosure, on the one hand, anomaly detection of the target node in the cluster is accomplished simply and conveniently through the log data and performance data of the target node; on the other hand, the anomaly detection model performs feature extraction and classification on the pieces of log data and performance indicators of the target node, so that the anomaly type of the target node in the cluster is determined efficiently and accurately. In addition, before feature extraction, the pieces of log data are first assigned categories through clustering, and the anomaly detection model then extracts features from those categories; this reduces the amount of data involved in feature extraction, makes it practical to process large volumes of log data, and improves processing efficiency.
Fig. 4 is a flowchart of a log category matrix determination method according to an exemplary embodiment.
In some embodiments, the target node may include a first node and a second node, the multiple pieces of log data include multiple pieces of first log data from the first node and multiple pieces of second log data from the second node, and the log category matrix includes a category dimension.
The category dimension may refer to the dimension along which the log categories of the pieces of log data of a single node are arranged; for example, if the log categories of the pieces of log data of each node are arranged by columns, the category dimension is the column dimension.
Referring to Fig. 4, the above log category matrix determination method may include the following steps.
Step S402: determine the log category corresponding to each piece of first log data, and generate a first log category sequence according to the log categories corresponding to the pieces of first log data.
In some embodiments, the first log category corresponding to each piece of first log data may be determined by clustering all the log data of the target node, so as to generate the first log category sequence.
Step S404: determine the log category corresponding to each piece of second log data, and generate a second log category sequence according to the log categories corresponding to the pieces of second log data.
In some embodiments, the second log category corresponding to each piece of second log data may be determined by clustering all the log data of the target node, so as to generate the second log category sequence.
Step S406: concatenate the first log category sequence and the second log category sequence along the category dimension to generate the log category matrix of the target node.
In some embodiments, the first log category sequence and the second log category sequence may be concatenated along the category dimension to generate the log category matrix.
Fig. 5 is a flowchart of another log category matrix determination method according to an exemplary embodiment.
In some embodiments, the multiple pieces of log data may include multiple pieces of third log data collected in a first time period and multiple pieces of fourth log data collected in a second time period, and the log category matrix may include a time dimension.
Referring to Fig. 5, the above log category matrix determination method may include the following steps.
Step S502: determine the log category corresponding to each piece of third log data, and generate a third log category sequence according to the log categories corresponding to the pieces of third log data.
Step S504: determine the log category corresponding to each piece of fourth log data, and generate a fourth log category sequence according to the log categories corresponding to the pieces of fourth log data.
Step S506: concatenate the third log category sequence and the fourth log category sequence along the time dimension to generate the log category matrix of the target node.
The time dimension may refer to the dimension along which the log categories of the multiple pieces of log data of a single node are arranged in time order; for example, if the log categories of a node's log data from different times are arranged by rows, the time dimension is the row dimension.
The log category matrix determination methods provided by the above embodiments can fuse the log categories of the log data of different nodes, or fuse the log categories of the log data of different times. In the technical solution provided by this embodiment, extracting features from the log categories of the log data not only accurately predicts the anomaly type of the target node, but also greatly reduces the amount of data involved in feature extraction compared with extracting features from the log data itself, thereby saving computing resources; a minimal sketch of the time-dimension concatenation is given below.
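A minimal NumPy sketch of the time-dimension concatenation; the layout (rows for nodes, columns for time) and the toy values are assumptions for illustration.

```python
import numpy as np

period_1 = np.array([[3, 3, 7],
                     [2, 2, 9]])   # category ids collected in the first time period
period_2 = np.array([[1, 0, 7],
                     [9, 1, 1]])   # category ids collected in the second time period

# Concatenate along the time dimension (columns here) to form the matrix.
log_category_matrix = np.concatenate([period_1, period_2], axis=1)
print(log_category_matrix.shape)   # (2, 6)
```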
Fig. 6 is a flowchart of a log category determination method according to an exemplary embodiment.
Referring to Fig. 6, the above log category determination method may include the following steps.
Step S602: determine, in the multiple pieces of log data, the high-frequency words whose occurrence count is greater than a target count threshold and the non-high-frequency words whose occurrence count is less than or equal to the target count threshold.
In some embodiments, word frequency statistics may be computed over all the log data of the target node to determine the frequency of each word. When the frequency of a word in all the log data of the target node is higher than the target count threshold (which can be set manually as needed), the word may be regarded as a high-frequency word; when its frequency is less than or equal to the target count threshold, the word may be regarded as a non-high-frequency word.
Step S604: keep the high-frequency words in the multiple pieces of log data unchanged and apply placeholder processing to the non-high-frequency words to obtain multiple log skeletons.
In some embodiments, the high-frequency words in each piece of log data may be kept unchanged, and placeholder processing is then applied to the non-high-frequency words in each piece of log data to obtain the log skeleton of each piece of log data; for example, a counter may be used as the placeholder for the non-high-frequency words. For instance, if non-high-frequency words appear at a certain position in a piece of log data, a counter may be placed at that position to indicate the minimum and maximum numbers of non-high-frequency words appearing there.
For example, suppose the log data of the target node includes the following three pieces of log data.
First piece of log data: log_channel(cluster)log[DBG]:osdmap e7729:12 total,12 up,11 in.
Second piece of log data: log_channel(cluster)log[INF]:mon.03 calling monitor election.
Third piece of log data: log_channel(cluster)log[WRN]:Health check update:1/5 mons down.
Then the log skeletons of the above three pieces of log data may be:
First log skeleton: log_channel(cluster)log (high-frequency words, i.e. log keys) *{1,6} (counter).
Second log skeleton: log_channel(cluster)log (high-frequency words, i.e. log keys) *{1,8} (counter).
Third log skeleton: log_channel(cluster)log (high-frequency words, i.e. log keys) *{4,8} (counter).
It should be noted that the counter values in the above example are set arbitrarily and may not correspond to reality.
In other embodiments, the skeleton of each piece of log data may also be generated by the following method.
The non-high-frequency words whose probability of co-occurring with the high-frequency words in the multiple pieces of log data is greater than a preset probability threshold are taken as high-frequency associated words; the high-frequency associated words are removed from the non-high-frequency words; the high-frequency words and the high-frequency associated words in the multiple pieces of log data are kept unchanged, and placeholder processing is applied to the remaining non-high-frequency words, so as to obtain multiple log skeletons.
For example, suppose that in the multiple pieces of log data of the target node, [DBG], [INF] and [WRN] each co-occur many times with the high-frequency words log_channel(cluster)log. Then [DBG], [INF] and [WRN] can be taken as high-frequency associated words of log_channel(cluster)log and removed from the non-high-frequency words; when the log skeletons are generated, the high-frequency words and the high-frequency associated words are kept unchanged and only the remaining non-high-frequency words receive placeholder processing.
With the above method, the above three pieces of log data can generate the following log skeletons.
First log skeleton: log_channel(cluster)log (high-frequency words, i.e. log keys) *[DBG]* {1,6} (counter).
Second log skeleton: log_channel(cluster)log (high-frequency words, i.e. log keys) *[INF]* {1,8} (counter).
Third log skeleton: log_channel(cluster)log (high-frequency words, i.e. log keys) *[WRN]* {4,8} (counter).
In addition, it can be observed that the log format and high-frequency words of the above three log skeletons are roughly the same, so they can be merged into:
log_channel(cluster)log (high-frequency words, i.e. log keys) *[DBG][INF][WRN]* {1,6} (counter). A runnable sketch of the basic skeleton extraction is given after this example.
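The following Python sketch covers only the basic skeleton extraction of step S604 (the associated-word refinement illustrated just above is omitted for brevity); the frequency threshold, whitespace tokenization and toy log lines are simplifying assumptions, not the exact disclosed algorithm.

```python
from collections import Counter, defaultdict

def log_skeletons(logs, freq_threshold=2):
    """Keep words above the frequency threshold verbatim and summarize the
    low-frequency words of each line with a {min,max} occurrence counter;
    lines sharing the same skeleton fall into the same cluster."""
    tokens = [line.split() for line in logs]
    counts = Counter(word for line in tokens for word in line)
    high = {word for word, c in counts.items() if c > freq_threshold}

    clusters = defaultdict(list)
    for line in tokens:
        skeleton = tuple(w for w in line if w in high)
        clusters[skeleton].append(sum(1 for w in line if w not in high))

    # skeleton -> (min, max) number of low-frequency words, i.e. the counter
    return {skel: (min(runs), max(runs)) for skel, runs in clusters.items()}

logs = [
    "log_channel(cluster) log [DBG] : osdmap e7729 : 12 total , 12 up , 11 in",
    "log_channel(cluster) log [INF] : mon.03 calling monitor election",
    "log_channel(cluster) log [WRN] : Health check update : 1/5 mons down",
]
print(log_skeletons(logs))
```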
Step S606: cluster the multiple pieces of log data according to the multiple log skeletons to determine multiple log clusters.
In some embodiments, log data with the same skeleton (including but not limited to the same content and positions of the high-frequency words (and high-frequency associated words)) may be clustered together, but the present disclosure does not limit the clustering manner.
Step S608: determine the log category of the data in each log cluster.
In some embodiments, the category representation of each cluster may be used as the log category of each piece of log data in that clustering result; the present disclosure does not limit the way the log category of each log cluster is determined.
In some embodiments, the technical solutions provided in Fig. 2, Fig. 4, Fig. 5 and Fig. 6 may be used during the training of the anomaly detection model, or during cluster anomaly detection, which is not limited in the present disclosure.
When the technical solution provided by the present disclosure is used during the training of the cluster anomaly detection model, the loss function may be determined by the following method.
It can be understood that cluster anomalies occur relatively rarely. If measured data were used directly to train the cluster anomaly detection model, the small number of negative samples corresponding to cluster anomalies would lead to inaccurate training results, and hence to low accuracy in determining the predicted anomaly type.
Therefore, the present disclosure proposes the following method of determining the loss function of the anomaly detection model, which can be explained with reference to Formula (1): obtain the multiple anomaly type labels of the target node; determine the loss function value corresponding to each predicted anomaly type according to the multiple predicted anomaly types y_(ij) and the multiple anomaly type labels; normalize the loss function values according to Formula (1) to obtain normalized loss function values; and train the anomaly detection model with the normalized loss function values.
Combining the above embodiments, the present disclosure further provides the following technical solution for cluster anomaly detection.
1. Based on the experience of operation and maintenance personnel, manually label the anomaly types of the Ceph cluster.
2. Select the log data of multiple OSDs or MONs in the Ceph cluster, and use the proposed log feature extraction algorithm to cluster the unstructured log data. Then represent the log data as a sequence of categories according to the clustering result.
3. Use the Item2Vec model to convert the sequence of log categories from discrete one-hot data into continuous vectors, concatenate these vectors into a matrix according to the log sequence, and then concatenate the matrices extracted from different OSDs or MONs along the first dimension.
4. Use a convolutional neural network (CNN) to extract the contextual information in the log matrix, and then use max pooling to convert the matrix into a one-dimensional vector.
5. Concatenate the vector extracted from the log data with the vector composed of the performance indicators.
6. Pass the result through several fully connected layers with ReLU (an activation function) as the activation function and pooling layers, and finally through a fully connected layer with Softmax (a classifier) as the activation function (see the sketch after this list).
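A PyTorch sketch of the classification head described in step 6; the layer widths, the number of hidden layers and the input size are assumptions, and the pooling layers mentioned in the text are omitted here for simplicity.

```python
import torch
import torch.nn as nn

class AnomalyClassifierHead(nn.Module):
    """Fully connected layers with ReLU activations followed by a softmax output."""
    def __init__(self, in_dim: int, num_anomaly_types: int, hidden: int = 128):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_anomaly_types),
        )

    def forward(self, node_feature_vector):
        return torch.softmax(self.layers(node_feature_vector), dim=-1)

head = AnomalyClassifierHead(in_dim=212, num_anomaly_types=3)
probabilities = head(torch.randn(1, 212))   # probabilities over the anomaly types
```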
Log data is a kind of text data, but unlike natural-language text, logs are written in a rather loose format and do not strictly follow grammar. Log data is always written in certain fixed formats (such as timestamp, event, variables); its structure is simple and recurs repeatedly, which makes it convenient to analyze with statistical methods.
The proposed log feature extraction algorithm is an unsupervised clustering algorithm for logs. First, the frequency of every word in the logs is counted and a frequency threshold is set manually: a word whose frequency is above the threshold is regarded as a high-frequency word, and a word whose frequency is below the threshold is regarded as a low-frequency word. The high-frequency words form the skeleton of the log. The high-frequency words are then merged to a certain extent: when the probability of some word (for example key_n) co-occurring with the other words of the skeleton (for example key_(n-1) ... key_2 key_1) is greater than a certain threshold, that is,
p(key_n | key_(n-1) ... key_2 key_1) > shield    (2)
the word is taken as a high-frequency associated word. The algorithm describes the low-frequency words with a counter, which records the minimum and maximum numbers of occurrences of low-frequency words. The logs are then clustered according to the skeleton of each log, and logs with the same skeleton fall into the same category.
The above algorithm can be used to cluster the logs of multiple OSDs, and each log is then represented by the id of the category it belongs to, thereby forming a log category sequence.
In some embodiments, time may be divided into time periods at a granularity of 5 minutes; within each time period, the log category sequence is cut into segments of varying lengths according to the log timestamps, and the performance indicators within the time period (including CPU utilization, memory utilization, swap memory utilization, disk I/O read/write, packet sending/receiving, etc.) are extracted at the same time. The log category sequence and performance indicators of each time period are used as the input data. Experienced operation and maintenance personnel label whether the cluster is abnormal in each time period, as well as the type of anomaly, as the labels of the input data; a small windowing sketch follows.
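A minimal sketch of the 5-minute windowing just described; representing each event as a (timestamp_in_seconds, category_id) pair is an assumption for illustration.

```python
from collections import defaultdict

def segment_by_window(events, window_seconds=300):
    """Cut a log-category stream into per-window sequences by timestamp."""
    windows = defaultdict(list)
    for timestamp, category in sorted(events):
        windows[int(timestamp // window_seconds)].append(category)
    return dict(windows)

events = [(10, 3), (70, 3), (320, 7), (610, 1)]   # toy data
print(segment_by_window(events))                   # {0: [3, 3], 1: [7], 2: [1]}
```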
The overall framework of the designed deep learning model is shown in Fig. 7. The log category sequence extracted from the OSDs consists of discrete one-hot data, which still needs to be converted into continuous vectors using a word embedding method; a hedged sketch of this step is given below.
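As a stand-in for the Item2Vec step, the sketch below uses gensim's Word2Vec in skip-gram mode over the category-id sequences; treating Item2Vec as skip-gram with negative sampling, and all hyperparameter values shown, are assumptions for illustration.

```python
from gensim.models import Word2Vec

# Each "sentence" is one node's log-category-id sequence within a time window (toy data).
category_sequences = [["3", "3", "7", "1", "0"], ["7", "2", "2", "9", "1"]]

model = Word2Vec(
    sentences=category_sequences,
    vector_size=50,   # N-dimensional embedding, e.g. 50
    window=10,        # window length M over the sequence, e.g. 10
    sg=1,             # skip-gram: in-window pairs act as positive examples
    negative=5,       # sampled negative examples pushed apart
    min_count=1,
)
vector_for_category_3 = model.wv["3"]   # 50-dimensional embedding of log category 3
```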
At this point, the vector extracted from the logs is concatenated with the normalized indicator vectors from the multiple OSDs as the input of the final fully connected layers; this vector covers the information of both the log data and the indicator data.
The result is passed through two fully connected layers with Leaky ReLU (an activation function) as the activation function, and finally through a fully connected layer with Softmax (a classifier) as the activation function, and the cross-entropy loss is taken between the output and the manually annotated labels. Because anomalies occur with low probability in the Ceph data, the data are skewed; therefore a normalized cross-entropy is used as the loss (as shown in Formula (1)): the number of each kind of label in each time period is counted and the cross-entropy is normalized accordingly. A hedged sketch of this loss follows.
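The sketch below is one possible interpretation of the label-count-normalized cross-entropy described above (Formula (1) itself is not reproduced in this text, so the exact normalization is an assumption): each sample's cross-entropy term is divided by the number of samples carrying the same label in the batch/time period, so that rare anomaly types are not drowned out by the dominant normal class.

```python
import torch

def normalized_cross_entropy(probs, labels):
    """probs: (N, C) softmax outputs; labels: (N,) integer anomaly-type labels."""
    n, c = probs.shape
    counts = torch.bincount(labels, minlength=c).clamp(min=1).float()
    log_p = torch.log(probs.clamp(min=1e-12))
    per_sample = -log_p[torch.arange(n), labels]     # standard cross-entropy terms
    return (per_sample / counts[labels]).mean()      # down-weight frequent labels

probs = torch.softmax(torch.randn(8, 3), dim=-1)     # toy predictions for 8 windows
labels = torch.tensor([0, 0, 0, 0, 0, 1, 2, 0])      # skewed labels (mostly "normal")
loss = normalized_cross_entropy(probs, labels)
```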
Fig. 8 is a block diagram of a cluster anomaly detection apparatus according to an exemplary embodiment. Referring to Fig. 8, the cluster anomaly detection apparatus 800 provided by the embodiments of the present disclosure may include: a log data obtaining module 801, a log category determining module 802, a log category matrix determining module 803, a log category vector generating module 804, a performance indicator vector obtaining module 805, a node feature vector determining module 806, and a prediction module 807.
The log data obtaining module 801 may be configured to obtain multiple pieces of log data and multiple performance indicators from a target node in the cluster; the log category determining module 802 may be configured to cluster the multiple pieces of log data to determine the log category of each piece of log data; the log category matrix determining module 803 may be configured to generate a log category matrix of the target node according to the log category of each piece of log data; the log category vector generating module 804 may be configured to perform feature extraction on the log category matrix through the anomaly detection model to obtain a log category vector; the performance indicator vector obtaining module 805 may be configured to perform feature extraction on the multiple performance indicators through the anomaly detection model to obtain a performance indicator vector; the node feature vector determining module 806 may be configured to fuse the log category vector and the performance indicator vector through the anomaly detection model to obtain a node feature vector of the target node; and the prediction module 807 may be configured to classify the node feature vector through the anomaly detection model to determine the predicted anomaly type of the target node in the cluster.
In some embodiments, the target node includes a first node and a second node, the multiple pieces of log data include multiple pieces of first log data from the first node and multiple pieces of second log data from the second node, and the log category matrix includes a category dimension. The log category matrix determining module 803 may include: a first log category sequence generating unit, a second log category sequence generating unit, and a first concatenating unit.
The first log category sequence generating unit may be configured to determine the log category corresponding to each piece of first log data and generate a first log category sequence according to the log categories corresponding to the pieces of first log data; the second log category sequence generating unit may be configured to determine the log category corresponding to each piece of second log data and generate a second log category sequence according to the log categories corresponding to the pieces of second log data; and the first concatenating unit may be configured to concatenate the first log category sequence and the second log category sequence along the category dimension to generate the log category matrix of the target node.
In some embodiments, the log category determining module 802 may include: a high-frequency word determining unit, a log skeleton determining unit, a log clustering unit, and a log category determining unit.
The high-frequency word determining unit may be configured to determine, in the multiple pieces of log data, the high-frequency words whose occurrence count is greater than a target count threshold and the non-high-frequency words whose occurrence count is less than or equal to the target count threshold; the log skeleton determining unit may be configured to keep the high-frequency words in the multiple pieces of log data unchanged and apply placeholder processing to the non-high-frequency words to obtain multiple log skeletons; the log clustering unit may be configured to cluster the multiple pieces of log data according to the multiple log skeletons to determine multiple log clusters; and the log category determining unit may be configured to determine the log category of the log data in each log cluster.
In some embodiments, the log skeleton determining unit may include: a high-frequency associated word determining subunit, an elimination subunit, and a placeholder subunit.
The high-frequency associated word determining subunit may be configured to take, as high-frequency associated words, the non-high-frequency words whose probability of co-occurring with the high-frequency words in the multiple pieces of log data is greater than a preset probability threshold;
the elimination subunit may be configured to remove the high-frequency associated words from the non-high-frequency words; and the placeholder subunit may be configured to keep the high-frequency words and the high-frequency associated words in the multiple pieces of log data unchanged and apply placeholder processing to the remaining non-high-frequency words, so as to obtain multiple log skeletons.
In some embodiments, the multiple pieces of log data include multiple pieces of third log data collected in a first time period and multiple pieces of fourth log data collected in a second time period, and the log category matrix includes a time dimension. The log category matrix determining module 803 may include: a third log category sequence determining unit, a fourth log category sequence determining unit, and a second concatenating unit.
The third log category sequence determining unit may be configured to determine the log category corresponding to each piece of third log data and generate a third log category sequence according to the log categories corresponding to the pieces of third log data; the fourth log category sequence determining unit may be configured to determine the log category corresponding to each piece of fourth log data and generate a fourth log category sequence according to the log categories corresponding to the pieces of fourth log data; and the second concatenating unit may be configured to concatenate the third log category sequence and the fourth log category sequence along the time dimension to generate the log category matrix of the target node.
In some embodiments, the log category vector generating module 804 may include: a convolution unit and a pooling unit.
The convolution unit may be configured to perform convolution on the log category matrix to obtain a log category convolution feature matrix; and the pooling unit may be configured to perform pooling on the log category convolution feature matrix to obtain the log category vector.
In some embodiments, the predicted anomaly type includes multiple predicted anomaly types, and the cluster anomaly detection apparatus 800 further includes: a label obtaining module, a loss function value obtaining module, a normalization module, and a training module.
The label obtaining module may be configured to obtain multiple anomaly type labels of the target node; the loss function value obtaining module may be configured to determine the loss function value corresponding to each predicted anomaly type according to the multiple predicted anomaly types and the multiple anomaly type labels; the normalization module may be configured to normalize the loss function values according to the count of each predicted anomaly type to obtain normalized loss function values; and the training module may be configured to train the anomaly detection model with the normalized loss function values.
Since each function of the apparatus 800 has been described in detail in the corresponding method embodiments, the present disclosure does not repeat them here.
The modules and/or units and/or subunits described in the embodiments of the present application may be implemented in software or in hardware. The described modules and/or units and/or subunits may also be provided in a processor, and in some cases their names do not constitute a limitation of the modules and/or units and/or subunits themselves.
The flowcharts and block diagrams in the drawings illustrate the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowcharts or block diagrams may represent a module, program segment or portion of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings; for example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functionality involved. It should also be noted that each block in the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
In addition, the above drawings are merely schematic illustrations of the processing included in the methods according to the exemplary embodiments of the present disclosure and are not intended for limitation. It is easy to understand that the processing shown in the above drawings does not indicate or limit the chronological order of these processes; it is also easy to understand that these processes may be executed, for example, synchronously or asynchronously in multiple modules.
Fig. 9 is a schematic structural diagram of an electronic device suitable for implementing a terminal device or server of the embodiments of the present disclosure. It should be noted that the electronic device 900 shown in Fig. 9 is only an example and should not impose any limitation on the functionality and scope of use of the embodiments of the present disclosure.
As shown in Fig. 9, the electronic device 900 includes a central processing unit (CPU) 901, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage section 908 into a random access memory (RAM) 903. The RAM 903 also stores various programs and data required for the operation of the electronic device 900. The CPU 901, the ROM 902 and the RAM 903 are connected to each other through a bus 904, and an input/output (I/O) interface 905 is also connected to the bus 904.
The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, etc.; an output section 907 including a cathode-ray tube (CRT), a liquid crystal display (LCD), a speaker, etc.; a storage section 908 including a hard disk, etc.; and a communication section 909 including a network interface card such as a LAN card or a modem. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disc, a magneto-optical disc or a semiconductor memory, is mounted on the drive 910 as needed, so that a computer program read from it can be installed into the storage section 908 as needed.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present disclosure includes a computer program product that includes a computer program carried on a computer-readable storage medium, and the computer program contains program code for executing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909 and/or installed from the removable medium 911. When the computer program is executed by the central processing unit (CPU) 901, the above functions defined in the system of the present application are executed.
It should be noted that the computer-readable storage medium shown in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in connection with an instruction execution system, apparatus or device. A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device. The program code contained on the computer-readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wires, optical cables, RF, etc., or any suitable combination of the above.
As another aspect, the present application also provides a computer-readable storage medium, which may be included in the device described in the above embodiments, or may exist independently without being assembled into the device. The above computer-readable storage medium carries one or more programs, and when the one or more programs are executed by the device, the device implements functions including: obtaining multiple pieces of log data and multiple performance indicators from a target node in the cluster; clustering the multiple pieces of log data to determine the log category of each piece of log data; generating a log category matrix of the target node according to the log category of each piece of log data; performing feature extraction on the log category matrix through the anomaly detection model to obtain a log category vector; performing feature extraction on the multiple performance indicators through the anomaly detection model to obtain a performance indicator vector; fusing the log category vector and the performance indicator vector through the anomaly detection model to obtain a node feature vector of the target node; and classifying the node feature vector through the anomaly detection model to determine the predicted anomaly type of the target node in the cluster.
According to one aspect of the present application, a computer program product or computer program is provided, including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the methods provided in the various optional implementations of the above embodiments.
Through the description of the above embodiments, those skilled in the art can easily understand that the exemplary embodiments described here may be implemented in software, or in software combined with necessary hardware. Therefore, the technical solutions of the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) and includes several instructions to cause a computing device (which may be a personal computer, a server, a mobile terminal, a smart device, etc.) to perform the methods according to the embodiments of the present disclosure, for example the steps shown in one or more figures such as Fig. 2.
Other embodiments of the present disclosure will be readily apparent to those skilled in the art after considering the specification and practicing the disclosure disclosed here. The present disclosure is intended to cover any variations, uses or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the technical field not disclosed in the present disclosure. The specification and the embodiments are to be regarded as exemplary only, and the true scope and spirit of the present disclosure are indicated by the claims.
It should be understood that the present disclosure is not limited to the precise structures, drawings or implementation methods already shown here; on the contrary, the present disclosure is intended to cover various modifications and equivalent arrangements falling within the spirit and scope of the appended claims.
Claims (10)
- A cluster anomaly detection method, wherein the method comprises: obtaining multiple pieces of log data and multiple performance indicators from a target node in the cluster; clustering the multiple pieces of log data to determine the log category of each piece of log data; generating a log category matrix of the target node according to the log category of each piece of log data; performing feature extraction on the log category matrix through an anomaly detection model to obtain a log category vector; performing feature extraction on the multiple performance indicators through the anomaly detection model to obtain a performance indicator vector; fusing the log category vector and the performance indicator vector through the anomaly detection model to obtain a node feature vector of the target node; and classifying the node feature vector through the anomaly detection model to determine a predicted anomaly type of the target node in the cluster.
- The method according to claim 1, wherein the target node comprises a first node and a second node, the multiple pieces of log data comprise multiple pieces of first log data from the first node and multiple pieces of second log data from the second node, and the log category matrix comprises a category dimension; wherein generating the log category matrix of the target node according to the log category of each piece of log data comprises: determining the log category corresponding to each piece of first log data, and generating a first log category sequence according to the log categories corresponding to the pieces of first log data; determining the log category corresponding to each piece of second log data, and generating a second log category sequence according to the log categories corresponding to the pieces of second log data; and concatenating the first log category sequence and the second log category sequence along the category dimension to generate the log category matrix of the target node.
- The method according to claim 1 or 2, wherein clustering the multiple pieces of log data to determine the log category of each piece of log data comprises: determining, in the multiple pieces of log data, high-frequency words whose occurrence count is greater than a target count threshold and non-high-frequency words whose occurrence count is less than or equal to the target count threshold; keeping the high-frequency words in the multiple pieces of log data unchanged and applying placeholder processing to the non-high-frequency words to obtain multiple log skeletons; clustering the multiple pieces of log data according to the multiple log skeletons to determine multiple log clusters; and determining the log category of the log data in each log cluster.
- The method according to claim 3, wherein keeping the high-frequency words in the multiple pieces of log data unchanged and applying placeholder processing to the non-high-frequency words to obtain multiple log skeletons comprises: taking, as high-frequency associated words, the non-high-frequency words whose probability of co-occurring with the high-frequency words in the multiple pieces of log data is greater than a preset probability threshold; removing the high-frequency associated words from the non-high-frequency words; and keeping the high-frequency words and the high-frequency associated words in the multiple pieces of log data unchanged while applying placeholder processing to the remaining non-high-frequency words, so as to obtain multiple log skeletons.
- The method according to claim 1, wherein the multiple pieces of log data comprise multiple pieces of third log data collected in a first time period and multiple pieces of fourth log data collected in a second time period, and the log category matrix comprises a time dimension; wherein generating the log category matrix of the target node according to the log category of each piece of log data comprises: determining the log category corresponding to each piece of third log data, and generating a third log category sequence according to the log categories corresponding to the pieces of third log data; determining the log category corresponding to each piece of fourth log data, and generating a fourth log category sequence according to the log categories corresponding to the pieces of fourth log data; and concatenating the third log category sequence and the fourth log category sequence along the time dimension to generate the log category matrix of the target node.
- The method according to claim 5, wherein performing feature extraction on the log category matrix through the anomaly detection model to obtain the log category vector comprises: performing convolution on the log category matrix to obtain a log category convolution feature matrix; and performing pooling on the log category convolution feature matrix to obtain the log category vector.
- The method according to claim 1, wherein the predicted anomaly type comprises multiple predicted anomaly types; and wherein the method further comprises: obtaining multiple anomaly type labels of the target node; determining the loss function value corresponding to each predicted anomaly type according to the multiple predicted anomaly types and the multiple anomaly type labels; normalizing the loss function values according to the count of each predicted anomaly type to obtain normalized loss function values; and training the anomaly detection model with the normalized loss function values.
- A cluster anomaly detection apparatus, comprising: a log data obtaining module configured to obtain multiple pieces of log data and multiple performance indicators from a target node in the cluster; a log category determining module configured to cluster the multiple pieces of log data to determine the log category of each piece of log data; a log category matrix determining module configured to generate a log category matrix of the target node according to the log category of each piece of log data; a log category vector generating module configured to perform feature extraction on the log category matrix through an anomaly detection model to obtain a log category vector; a performance indicator vector obtaining module configured to perform feature extraction on the multiple performance indicators through the anomaly detection model to obtain a performance indicator vector; a node feature vector determining module configured to fuse the log category vector and the performance indicator vector through the anomaly detection model to obtain a node feature vector of the target node; and a prediction module configured to classify the node feature vector through the anomaly detection model to determine a predicted anomaly type of the target node in the cluster.
- An electronic device, comprising: a memory; and a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the cluster anomaly detection method according to any one of claims 1-7.
- A computer-readable storage medium on which a program is stored, wherein the program, when executed by a processor, implements the cluster anomaly detection method according to any one of claims 1-7.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110648870.XA CN113282433B (zh) | 2021-06-10 | 2021-06-10 | 集群异常检测方法、装置和相关设备 |
CN202110648870.X | 2021-06-10 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022257421A1 true WO2022257421A1 (zh) | 2022-12-15 |
Family
ID=77284110
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/140203 WO2022257421A1 (zh) | 2021-06-10 | 2021-12-21 | 集群异常检测方法、装置和相关设备 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113282433B (zh) |
WO (1) | WO2022257421A1 (zh) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113282433B (zh) * | 2021-06-10 | 2023-04-28 | 天翼云科技有限公司 | 集群异常检测方法、装置和相关设备 |
CN114117418B (zh) * | 2021-11-03 | 2023-03-14 | 中国电信股份有限公司 | 基于社群检测异常账户的方法、系统、设备及存储介质 |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110162445A (zh) * | 2019-05-23 | 2019-08-23 | 中国工商银行股份有限公司 | 基于主机日志及性能指标的主机健康评价方法及装置 |
US20190354457A1 (en) * | 2018-05-21 | 2019-11-21 | Oracle International Corporation | Anomaly detection based on events composed through unsupervised clustering of log messages |
CN111984499A (zh) * | 2020-08-04 | 2020-11-24 | 中国建设银行股份有限公司 | 一种大数据集群的故障检测方法和装置 |
CN111984442A (zh) * | 2019-05-22 | 2020-11-24 | 中兴通讯股份有限公司 | 计算机集群系统的异常检测方法及装置、存储介质 |
CN112306981A (zh) * | 2020-11-03 | 2021-02-02 | 广州科泽云天智能科技有限公司 | 一种面向高性能计算系统故障日志的故障预测方法 |
CN112367222A (zh) * | 2020-10-30 | 2021-02-12 | 中国联合网络通信集团有限公司 | 网络异常检测方法和装置 |
CN113282433A (zh) * | 2021-06-10 | 2021-08-20 | 中国电信股份有限公司 | 集群异常检测方法、装置和相关设备 |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10176435B1 (en) * | 2015-08-01 | 2019-01-08 | Shyam Sundar Sarkar | Method and apparatus for combining techniques of calculus, statistics and data normalization in machine learning for analyzing large volumes of data |
CN106982196B (zh) * | 2016-01-19 | 2020-07-31 | 阿里巴巴集团控股有限公司 | 一种异常访问检测方法及设备 |
US9961496B2 (en) * | 2016-06-17 | 2018-05-01 | Qualcomm Incorporated | Methods and systems for context based anomaly monitoring |
CN108228442B (zh) * | 2016-12-14 | 2020-10-27 | 华为技术有限公司 | 一种异常节点的检测方法及装置 |
FR3061324B1 (fr) * | 2016-12-22 | 2019-05-31 | Electricite De France | Procede de caracterisation d'une ou plusieurs defaillances d'un systeme |
CN106845526B (zh) * | 2016-12-29 | 2019-12-03 | 北京航天测控技术有限公司 | 一种基于大数据融合聚类分析的关联参数故障分类方法 |
CN109397703B (zh) * | 2018-10-29 | 2020-08-07 | 北京航空航天大学 | 一种故障检测方法及装置 |
US10802942B2 (en) * | 2018-12-28 | 2020-10-13 | Intel Corporation | Methods and apparatus to detect anomalies of a monitored system |
CN112084105A (zh) * | 2019-06-13 | 2020-12-15 | 中兴通讯股份有限公司 | 日志文件监测预警方法、装置、设备及存储介质 |
CN112882909A (zh) * | 2019-11-29 | 2021-06-01 | 北京博瑞华通科技有限公司 | 燃料电池系统故障预测方法、装置 |
CN111858242B (zh) * | 2020-07-10 | 2023-05-30 | 苏州浪潮智能科技有限公司 | 一种系统日志异常检测方法、装置及电子设备和存储介质 |
-
2021
- 2021-06-10 CN CN202110648870.XA patent/CN113282433B/zh active Active
- 2021-12-21 WO PCT/CN2021/140203 patent/WO2022257421A1/zh active Application Filing
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190354457A1 (en) * | 2018-05-21 | 2019-11-21 | Oracle International Corporation | Anomaly detection based on events composed through unsupervised clustering of log messages |
CN111984442A (zh) * | 2019-05-22 | 2020-11-24 | 中兴通讯股份有限公司 | 计算机集群系统的异常检测方法及装置、存储介质 |
CN110162445A (zh) * | 2019-05-23 | 2019-08-23 | 中国工商银行股份有限公司 | 基于主机日志及性能指标的主机健康评价方法及装置 |
CN111984499A (zh) * | 2020-08-04 | 2020-11-24 | 中国建设银行股份有限公司 | 一种大数据集群的故障检测方法和装置 |
CN112367222A (zh) * | 2020-10-30 | 2021-02-12 | 中国联合网络通信集团有限公司 | 网络异常检测方法和装置 |
CN112306981A (zh) * | 2020-11-03 | 2021-02-02 | 广州科泽云天智能科技有限公司 | 一种面向高性能计算系统故障日志的故障预测方法 |
CN113282433A (zh) * | 2021-06-10 | 2021-08-20 | 中国电信股份有限公司 | 集群异常检测方法、装置和相关设备 |
Also Published As
Publication number | Publication date |
---|---|
CN113282433B (zh) | 2023-04-28 |
CN113282433A (zh) | 2021-08-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022257421A1 (zh) | 集群异常检测方法、装置和相关设备 | |
CN111431819B (zh) | 一种基于序列化的协议流特征的网络流量分类方法和装置 | |
CN113342564A (zh) | 日志审计方法、装置、电子设备和介质 | |
CN111177319A (zh) | 风险事件的确定方法、装置、电子设备和存储介质 | |
WO2023284132A1 (zh) | 一种云平台日志的分析方法、系统、设备及介质 | |
US8027949B2 (en) | Constructing a comprehensive summary of an event sequence | |
CN112883730B (zh) | 相似文本匹配方法、装置、电子设备及存储介质 | |
CN114398557B (zh) | 基于双画像的信息推荐方法、装置、电子设备及存储介质 | |
US20230038091A1 (en) | Method of extracting table information, electronic device, and storage medium | |
US20200320253A1 (en) | Method and apparatus for generating commentary | |
CN117131281A (zh) | 舆情事件处理方法、装置、电子设备和计算机可读介质 | |
CN114969332A (zh) | 训练文本审核模型的方法和装置 | |
CN114970540A (zh) | 训练文本审核模型的方法和装置 | |
CN113487103A (zh) | 模型更新方法、装置、设备及存储介质 | |
CN115048524B (zh) | 文本分类展示方法、装置、电子设备和计算机可读介质 | |
CN115758211B (zh) | 文本信息分类方法、装置、电子设备和存储介质 | |
CN116127400A (zh) | 基于异构计算的敏感数据识别系统、方法及存储介质 | |
US11636004B1 (en) | Method, electronic device, and computer program product for training failure analysis model | |
CN116155541A (zh) | 面向网络安全应用的自动化机器学习平台以及方法 | |
WO2023070424A1 (zh) | 一种数据库数据的压缩方法及存储设备 | |
CN115329082A (zh) | 基于深度混合神经网络的日志序列异常检测方法 | |
CN113946648A (zh) | 结构化信息生成方法、装置、电子设备和介质 | |
CN114610953A (zh) | 一种数据分类方法、装置、设备及存储介质 | |
Zhang et al. | Tanbih: Get to know what you are reading | |
Meng et al. | Classification of customer service tickets in power system based on character and word level semantic understanding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21944917 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21944917 Country of ref document: EP Kind code of ref document: A1 |