CN114861753A - Data classification method and device based on large-scale network - Google Patents

Info

Publication number
CN114861753A
CN114861753A (application CN202210306441.9A)
Authority
CN
China
Prior art keywords
data
feature
features
classification
clustering
Prior art date
Legal status
Pending
Application number
CN202210306441.9A
Other languages
Chinese (zh)
Inventor
张圣林
李东闻
孙永谦
钟震宇
张玉志
Current Assignee
Nankai University
Original Assignee
Nankai University
Priority date
Filing date
Publication date
Application filed by Nankai University
Priority to CN202210306441.9A
Publication of CN114861753A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a data classification method and device based on a large-scale network, wherein the method comprises the following steps: acquiring data to be detected, the data to be detected comprising system-level indicators and user-level indicators; performing smoothing and normalization preprocessing on the multivariate time series of the data to be detected to obtain preprocessed data; inputting the preprocessed data into a one-dimensional convolutional autoencoder trained through offline clustering for data compression, performing feature selection using the feature indices obtained through offline clustering, and performing distance calculation on the result of the feature selection for online data classification; and outputting an online classification result of the data to be detected based on the online data classification. The method and device can accurately and efficiently cluster system instances according to their normal patterns, and significantly reduce the overhead of anomaly detection training.

Description

Data classification method and device based on large-scale network
Technical Field
The invention relates to the technical field of data detection and classification and the like, in particular to a data classification method and device based on a large-scale network.
Background
Web services are becoming larger and larger, often running thousands or even hundreds of thousands of system instances on different containers, virtual machines, or physical machines. The reliability of these system instances is crucial to the Web service, and abnormal behavior occurring on the system instances may reduce the availability of the Web service, affect the user experience, and even cause huge economic loss. Real-world monitoring index data is typically recorded to form a Multivariate Time Series (MTS). A series of methods based on deep learning can accurately learn complex patterns in massive MTS data for MTS anomaly detection work.
However, there are a large number of system instances in a large-scale Web service (e.g., millions of system instances at companies such as Alibaba and ByteDance). Training an MTS anomaly detection model for each system instance consumes a large amount of computing resources; on the other hand, the complex data patterns in the MTS data of different system instances may differ greatly, so training a single anomaly detection model for all system instances may reduce the accuracy of anomaly detection for individual system instances. Therefore, deploying these MTS anomaly detection methods in large-scale Web services is a rather challenging problem.
Existing methods include Copulas, Mc2PCA, FCFW and TICC, which can cluster MTS data; CTF can cluster data first and then perform anomaly detection. Copulas considers the relationship between two variables in a single MTS and performs density-based non-parametric estimation by comparing the distance between two MTSs; Mc2PCA constructs a common projection axis for each cluster and assigns data to different clusters by calculating reconstruction errors on the corresponding common projection axis; FCFW generates a clustering result by comparing the distance between two MTSs based on two distance calculation methods, DTW and SBD; TICC focuses on subsequences in the MTS and provides a model-based clustering method in which each cluster is defined by a network describing the interdependence among different observations in the typical subsequences of that cluster. CTF is a framework designed for OmniAnomaly, aiming at improving training efficiency.
Copulas suffers from dimension explosion and has a high computational cost; Mc2PCA only considers intra-cluster similarity and ignores inter-cluster similarity, which may lead to an excessive number of clusters; the DTW and SBD algorithms adopted by FCFW have very high time complexity and cannot be applied to large-scale data; TICC segments and clusters the MTS data simultaneously, which consumes time and memory, so it likewise cannot be applied to large-scale data. Moreover, these four algorithms are designed for ideally smooth data and do not account for the noise and anomalies present in data collected in real scenarios, which can greatly degrade the clustering effect. Overall, conventional clustering methods cannot efficiently and accurately cluster data that is large in scale (in the number of system instances, number of indicators, and number of time points) and contains noise and anomalies. CTF can only be used in combination with one specific anomaly detection algorithm and cannot be paired with other anomaly detection algorithms, which is a significant limitation.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, the invention aims to provide a data classification method based on a large-scale network, which embeds high-dimensional data into a low-dimensional representation using a one-dimensional convolutional autoencoder (1D-CAE) to extract the main features of the MTS (multivariate time series), thereby effectively reducing clustering overhead and eliminating the influence of noise and anomalies. In addition, an efficient and effective strategy is employed to select periodic and representative features to prevent certain features from interfering with the MTS clustering effect. The method is an efficient and robust scheme that can accurately and efficiently cluster the normal patterns of the MTS of system instances and effectively reduce the training overhead of the anomaly detection model.
Another object of the present invention is to provide a data classification device based on a large-scale network.
In order to achieve the above object, in one aspect, the present invention provides a data classification method based on a large-scale network, including:
acquiring data to be detected; the data to be detected comprises a system level index and a user level index; carrying out smooth and normalized data preprocessing on the multivariate time sequence of the data to be detected to obtain preprocessed data; inputting the preprocessed data into a one-dimensional convolution automatic encoder trained through offline clustering for data compression, performing feature selection by using a feature index obtained by offline clustering, and performing distance calculation according to a result of the feature selection to perform online data classification; and outputting an online classification result of the data to be detected based on the online data classification.
In addition, the data classification method based on the large-scale network according to the above embodiment of the present invention may further have the following additional technical features:
further, in one embodiment of the present invention, the system level metrics include: a plurality of CPU utilization, memory utilization, disk I/O and network throughput; the user-level metrics include: average response time, error rate, and page view times.
Further, in one embodiment of the present invention, training the one-dimensional convolution auto-encoder includes: performing the data preprocessing on the multivariable time sequence of the data to be detected off-line to obtain preprocessed data; training a one-dimensional convolution automatic encoder by utilizing the preprocessed data and compressing the time point number of each variable of the preprocessed data to obtain a first hidden representation; and executing the feature selection on the first hidden representation to obtain a feature index, and performing offline clustering in a clustering mode based on the feature index to obtain a cluster center.
Further, in an embodiment of the present invention, the performing feature selection by using the feature index obtained by the offline clustering, and performing distance calculation according to a result of the feature selection to perform online data classification includes: compressing the number of time points on each variable of the preprocessed data by using a one-dimensional convolution automatic encoder trained by offline clustering to obtain a second hidden representation; performing feature selection on the second hidden representation using the feature index to obtain a third hidden representation; and calculating the distance between the third hidden representation and the cluster center, and selecting the cluster corresponding to the cluster center with the shortest distance as the category of the online data classification.
Further, in an embodiment of the present invention, the data preprocessing includes:
filling deleted or missing values of the multivariate time series MTS by linear interpolation, extracting the baseline of the MTS curve with a sliding-window moving-average algorithm to smooth the MTS curve, and normalizing all data so that each data point is scaled into the range [0, 1], wherein the normalization formula is:

x'_t = (x_t - min(x)) / (max(x) - min(x))
further, in one embodiment of the present invention, the feature selection includes: deleting aperiodic features, constructing a redundant feature matrix and deleting redundant features.
Further, in an embodiment of the present invention, the deleting of aperiodic features includes: extracting periodicity information using YIN and deleting aperiodic features to obtain the retained features, wherein YIN(z_sm) > 0 indicates that feature z_sm is periodic and YIN(z_sm) = 0 indicates that feature z_sm has no periodic pattern. The constructing of the redundancy feature matrix includes: constructing a redundancy feature matrix R ∈ [0,1]^(M'×M') and calculating whether redundancy exists between two features using a normalized cross-correlation function, where M' denotes the number of features retained after deleting the aperiodic features, R_ij > 0 indicates that redundancy exists between feature i and feature j, and R_ij = 0 indicates that no redundancy exists between feature i and feature j. The deleting of redundant features includes: defining a set of unassigned features F containing the indices of all M' features, iteratively applying preset feature selection rules, from the first rule to the fourth rule, to F until all features are assigned to a selected-feature set SF or a deleted-feature set DF, concatenating all selected features in SF into z'', and using z'' as the input of the clustering or the classification.
Further, in an embodiment of the present invention, a hierarchical clustering manner is adopted to cluster z ″, each piece of data is initialized to be a cluster, the inter-cluster distances are iteratively calculated, and the clusters with the inter-cluster distances lower than a distance threshold are merged until all the inter-cluster distances are greater than the distance threshold.
Further, in one embodiment of the present invention, the inter-cluster distance is the averaged Euclidean distance:

d(C_a, C_b) = (1 / (|C_a| · |C_b|)) Σ_{u∈C_a} Σ_{v∈C_b} sqrt( Σ_{m=1}^{M''} (u_m - v_m)² )

where |·| represents the size of a set, and M'' is the number of indices in the SF.
The data classification method based on the large-scale network adopts an efficient and effective strategy to select the periodic and representative features, preventing certain features from interfering with the MTS clustering effect. The method is an efficient and robust scheme that can accurately and efficiently cluster the normal patterns of the MTS of system instances and effectively reduces the training overhead of the anomaly detection model.
In order to achieve the above object, another aspect of the present invention provides a data classification apparatus based on a large-scale network, including:
the data acquisition module is used for acquiring data to be detected; the data to be detected comprises a system level index and a user level index;
the data processing module is used for carrying out smooth and normalized data preprocessing on the multivariate time sequence of the data to be detected to obtain preprocessed data;
the data classification module is used for inputting the preprocessed data into a one-dimensional convolution automatic encoder trained through offline clustering to perform data compression processing, performing feature selection by using a feature index obtained by the offline clustering, and performing distance calculation according to a result of the feature selection to perform online data classification;
and the result output module is used for outputting the online classification result of the data to be detected based on the online data classification.
The data classification device based on the large-scale network adopts an efficient and effective strategy to select the periodic and representative features, preventing certain features from interfering with the MTS clustering effect. The device is an efficient and robust scheme that can accurately and efficiently cluster the normal patterns of the MTS of system instances and effectively reduces the training overhead of the anomaly detection model.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a large scale network-based data classification method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a cluster design according to an embodiment of the present invention;
FIG. 3 is an overall view of an abnormality detection section according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a 1D-CAE model according to an embodiment of the invention;
FIG. 5 is a schematic diagram of a feature selection process according to an embodiment of the invention;
fig. 6 is a schematic structural diagram of a large-scale network-based data classification device according to an embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The following describes a data classification method and apparatus based on a large-scale network according to an embodiment of the present invention with reference to the accompanying drawings.
FIG. 1 is a flow chart of a large-scale network-based data classification method according to an embodiment of the invention.
As shown in fig. 1, the method includes, but is not limited to, the following steps:
s1, acquiring data to be detected; the data to be detected comprises a system level index and a user level index;
s2, carrying out smooth and normalized data preprocessing on the multivariate time sequence of the data to be detected to obtain preprocessed data;
s3, inputting the preprocessed data into a one-dimensional convolution automatic encoder trained by offline clustering for data compression, performing feature selection by using feature indexes obtained by offline clustering, and performing distance calculation according to the result of the feature selection to perform online data classification;
and S4, outputting an online classification result of the data to be detected based on the online data classification.
It is understood that, in order to actively detect abnormal behavior of system instances and timely mitigate system failures, a Web service operator configures different types of system-level indicators (e.g., CPU utilization, memory utilization, disk I/O, network throughput) and user-level indicators (e.g., average response time, error rate, page views) in the system, and continuously collects monitoring data at predetermined time intervals.
Specifically, the obtained monitoring data is used for subsequent data clustering and classification.
As shown in FIG. 2, the clustering part of the invention consists of two main components, offline clustering and online classification; in the overall structure of FIG. 2, solid lines represent offline clustering and dotted lines represent online classification.
The offline clustering is divided into four stages. The first stage is data preprocessing, in which the MTS data is smoothed and normalized; the second stage trains the 1D-CAE and compresses the number of time points of each variable to obtain a hidden representation z; the third stage performs feature selection on z, reducing the number of variables in each piece of data to obtain z''; the last stage clusters the data using a hierarchical clustering method.
The online classification is likewise divided into four stages. The first stage is the same data preprocessing as in offline clustering, smoothing and normalizing the data; the second stage compresses the number of time points of each variable using the 1D-CAE encoder obtained in the second stage of offline clustering, yielding a hidden representation z; the third stage performs feature selection on z using the feature (variable) indices obtained in the third stage of offline clustering, yielding z''; the last stage calculates the distance between z'' and the cluster centers obtained in the fourth stage of offline clustering and selects the cluster whose center is closest as the category of the data.
Further, the abnormality detection section is integrally designed as shown in fig. 3:
the anomaly detection part is divided into an offline training anomaly detection model and an online detection data anomaly part.
The invention uniformly uses x_smt to denote the data at the t-th time point of the m-th variable of the s-th system instance; the s-th system instance x_s is an M×T matrix, containing M variables and T time points of monitored data.
The embodiments of the invention will be described in detail below with reference to the accompanying drawings:
and (4) preprocessing data. There are typically noise, anomalies, and missing values in the MTS that significantly affect the shape of the data, and their negative impact must be minimized. Since extreme values are generally more likely to be anomalous, the present invention processes the extreme values using a method that deletes the first 5% of the data that deviates from the mean; under the real production condition, errors possibly exist in the data collection process to cause that some missing values exist in the data, and the linear interpolation is used for filling the deleted or missing values; to deal with noise, the present invention smoothes the MTS curve by extracting the baseline of the MTS curve through a sliding window moving average algorithm. Finally, in order to deal with the amplitude difference existing in different data, normalization is adopted in all data, and each data point is scaled to be in a [0,1] range, wherein the specific normalization formula is as follows:
Figure BDA0003565425890000061
the same data preprocessing steps are adopted for both offline clustering and online classification.
1D-CAE data compression. To reduce the impact of excessive data dimensionality on clustering efficiency, the invention trains a model using the 1D-CAE with a reconstruction loss function, as shown in FIG. 4, to effectively reduce the data dimensionality while capturing the nonlinear characteristics of the data.
It is understood that an autoencoder (AE) includes two basic units: an encoder and a decoder. The encoder compresses the input into a latent space representation, and the decoder uses this representation to reconstruct the input data. The AE optimizes its model parameters by minimizing the difference between input and output (the reconstruction loss). A convolutional autoencoder (CAE) uses convolutional neural network (CNN) encoders and decoders. In the present invention, the 1D-CAE is used for feature extraction and dimensionality reduction; the convolutional encoder can learn the normal pattern of the input data while ignoring noise and anomalies. In the encoder of the invention, each variable of the MTS is fed into its own convolutional neural network, yielding M corresponding features. The convolutional decoder is composed of M one-dimensional deconvolution neural networks with independent parameters. The output of the encoder is the compressed feature z, an M×T' matrix, where T' is the compressed dimension of each variable; the output of the decoder is a reconstruction x̂ of the original data, whose size is consistent with the input data x. FIG. 4 shows a schematic diagram of the 1D-CAE model: when the input MTS has three variables, the encoder of the 1D-CAE consists of three convolutional neural networks, and the decoder consists of three one-dimensional deconvolution neural networks.
Further, the mean squared error is used as the loss function in the offline clustering process, and the 1D-CAE model is continuously updated by minimizing the difference between the input data x and the reconstruction x̂:

L(x, x̂) = (1 / (M·T)) Σ_{m=1}^{M} Σ_{t=1}^{T} (x_mt - x̂_mt)²

Finally, the encoder structure and parameters of the 1D-CAE are saved, and z is obtained from the data as the input of the next stage. The online classification uses the 1D-CAE encoder saved during offline clustering to obtain z as the input of its next stage.
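To make the encoder and decoder shapes concrete, the following untrained NumPy forward pass mimics the per-variable structure described above: each of the M variables passes through its own one-dimensional convolution (stride 2 halves the T time points to T'), and the decoder upsamples and convolves back to an M×T reconstruction. A real implementation would train such a model (e.g., in the TensorFlow environment the patent mentions) by minimizing the reconstruction loss; the random weights and all names below are an illustrative sketch only:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, stride=1):
    """'Same'-padded 1-D convolution of a single series, then striding."""
    pad = len(w) // 2
    xp = np.pad(x, pad, mode="edge")
    full = np.array([np.dot(xp[i:i + len(w)], w) for i in range(len(x))])
    return full[::stride]

M, T, K = 3, 16, 3                           # variables, time points, kernel size
x = rng.random((M, T))                       # one MTS instance
enc_w = rng.standard_normal((M, K)) * 0.1    # independent conv weights per variable
dec_w = rng.standard_normal((M, K)) * 0.1

# Encoder: each variable through its own 1-D conv; stride 2 compresses T to T' = T/2.
z = np.stack([conv1d(x[m], enc_w[m], stride=2) for m in range(M)])      # M x T'
# Decoder: upsample back to T, then an independent 1-D (de)conv per variable.
x_hat = np.stack([conv1d(np.repeat(z[m], 2), dec_w[m]) for m in range(M)])
mse = np.mean((x - x_hat) ** 2)              # the reconstruction loss to minimize
```

With training, z would become the hidden representation passed to feature selection; here it only demonstrates the M×T' latent shape and the matching M×T reconstruction.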
Feature selection. The invention implements a robust and general feature selection method to reduce the number of features in the metric dimension and improve clustering performance. The feature selection process includes three steps: deleting aperiodic features, constructing a redundancy feature matrix, and deleting redundant features; a schematic diagram is shown in FIG. 5:
1) deleting non-periodic characteristics:
first, periodic information, YIN (z), is extracted using YIN sm )>0 represents a feature z sm There is a periodicity, and YIN (z) sm ) 0 denotes the feature z sm There is no apparent periodic pattern. The index corresponding to features that are aperiodic in most system instances will be removed, as shown in algorithm 1. The feature that is retained after deletion of the aperiodic feature is denoted z'.
Algorithm 1 (presented as an image in the original publication)
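Algorithm 1 itself appears only as an image in the original publication. As a rough stand-in for the YIN periodicity test, the following sketch flags a feature as periodic when its normalized autocorrelation at some lag exceeds a threshold; the threshold, the lag range, and the per-instance aggregation are our own assumptions:

```python
import numpy as np

def is_periodic(z, min_lag=2, threshold=0.8):
    """Autocorrelation-based periodicity check, a simplified stand-in for YIN.
    Returns True when some lag's normalized autocorrelation exceeds `threshold`."""
    z = (z - z.mean()) / (z.std() + 1e-12)
    n = len(z)
    ac = np.correlate(z, z, mode="full")[n - 1:] / n   # lags 0 .. n-1
    return bool(np.any(ac[min_lag: n // 2] > threshold))

def drop_aperiodic(Z):
    """Keep only the features (rows of Z) judged periodic, as Algorithm 1 does
    with the YIN scores; returns the retained rows and their indices."""
    keep = [m for m in range(Z.shape[0]) if is_periodic(Z[m])]
    return Z[keep], keep
```

A production implementation would use YIN proper, which estimates the fundamental period rather than thresholding raw autocorrelation.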
2) Constructing a redundancy characteristic matrix:
A redundancy feature matrix R ∈ [0,1]^(M'×M') is constructed (M' denotes the number of features retained after deleting the aperiodic features), where R_ij > 0 indicates that redundancy exists between feature i and feature j, and R_ij = 0 indicates that no redundancy exists between feature i and feature j. The normalized cross-correlation (NCC) function is used to calculate whether redundancy exists between two features. The specific construction scheme is shown in Algorithm 2.
Algorithm 2 (presented as an image in the original publication)
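Algorithm 2 is likewise shown only as an image. A plausible NumPy construction of the redundancy matrix, reducing the normalized cross-correlation (NCC) between two features to its maximum over all shifts and thresholding it, could look like this; the 0.9 threshold and the max-over-shifts reduction are assumptions, not values taken from the filing:

```python
import numpy as np

def ncc_max(a, b):
    """Maximum normalized cross-correlation over all shifts (an assumed
    scalar reduction of the NCC function)."""
    a = (a - a.mean()) / (np.linalg.norm(a - a.mean()) + 1e-12)
    b = (b - b.mean()) / (np.linalg.norm(b - b.mean()) + 1e-12)
    return float(np.max(np.correlate(a, b, mode="full")))

def redundancy_matrix(Z, threshold=0.9):
    """Build R: R[i, j] > 0 marks features i and j as mutually redundant."""
    Mp = Z.shape[0]
    R = np.zeros((Mp, Mp))
    for i in range(Mp):
        for j in range(i + 1, Mp):
            c = ncc_max(Z[i], Z[j])
            if c > threshold:
                R[i, j] = R[j, i] = c   # store the NCC value, keep R symmetric
    return R
```

Using the NCC maximum over shifts makes the test phase-invariant, so two features that are identical up to a time offset are still detected as redundant.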
3) Deleting redundant features:
the present invention applies a feature selection rule to exploit redundant features in the redundancy matrix R. First a set of unassigned features F is defined, F containing the indices of all M' features, and then the following feature selection rules are applied iteratively in order from rule 1 to rule 4 to F until all features are assigned to the set of selection features SF or the set of deletion features DF. Finally, the invention concatenates all selected features in the SF into z "as input to the clustering or classification step.
Rule 1: if R_i is completely unrelated to the other rows in R, i.e., R_i contains only zeros:
(a) add i to the selected-feature set SF: SF = SF ∪ {i};
(b) delete i from F and remove the entries in R related to i.
rule 2: if R is i Features that are all related to other rows in R and at least one of R is not all related to other features (i.e., there are instances where R has a value of 0 off-diagonal):
(a) add i to the deletion feature set DF: DF ═ DF { u { i };
(b) deleting i from F and removing the item in R related to i;
rule 3: if all features in F are correlated (i.e., R contains only non-zero off-diagonal values):
(a) selecting a feature i with the minimum correlation with the feature contained in the Sf;
(b) add i to the selection feature set SF: SF ═ u { i };
(c) deleting i from F and removing the item in R related to i;
(d) moving the remaining features in F to the deletion feature set DF: DF ═ DF { i } tautome, and terminate.
Rule 4: if neither rule 2 nor rule 3 applies:
(a) select the feature i with the minimum correlation with the features contained in F;
(b) define F_i ⊆ F as the set of features in F related to i, then select the feature j ∈ F_i with the maximum correlation with the features contained in SF;
(c) add i to the selected-feature set SF: SF = SF ∪ {i}; add j to the deleted-feature set DF: DF = DF ∪ {j};
(d) delete i and j from F and remove the entries in R related to i and j.
In rule 3(a), rule 4(a), and rule 4(b), the correlation between a feature i and a feature set S is measured by summing the corresponding entries of R:

corr(i, S) = Σ_{j∈S} R_ij

and the feature minimizing (or, in rule 4(b), maximizing) this quantity is selected, where R is the constructed redundancy feature matrix.
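The four rules above can be sketched as a single loop over the unassigned set F. The tie-breaking order (lowest index first) and the use of summed R entries as the correlation measure are our own assumptions where the original leaves details to an image:

```python
import numpy as np

def select_features(R):
    """Sketch of rules 1-4 applied to the redundancy matrix R of the M'
    retained features; returns (SF, DF) as sorted index lists."""
    F = list(range(R.shape[0]))
    SF, DF = [], []

    def corr(i, S):
        # Correlation of feature i with a feature set S (sum of R entries).
        return sum(R[i, j] for j in S)

    while F:
        # Rule 1: a feature unrelated to every other remaining feature.
        r1 = [i for i in F if all(R[i, j] == 0 for j in F if j != i)]
        if r1:
            SF.append(r1[0]); F.remove(r1[0]); continue
        has_zero = any(R[i, j] == 0 for i in F for j in F if i != j)
        # Rule 2: a fully-correlated feature while some pair is still unrelated.
        r2 = [i for i in F if all(R[i, j] > 0 for j in F if j != i)]
        if r2 and has_zero:
            DF.append(r2[0]); F.remove(r2[0]); continue
        if not has_zero:
            # Rule 3: all remaining features are mutually correlated.
            i = min(F, key=lambda k: corr(k, SF))
            SF.append(i); F.remove(i)
            DF.extend(F); F = []
            continue
        # Rule 4: keep the feature least correlated within F, delete the
        # related feature most correlated with the already-selected set.
        i = min(F, key=lambda k: corr(k, [m for m in F if m != k]))
        related = [m for m in F if m != i and R[i, m] > 0]
        j = max(related, key=lambda m: corr(m, SF))
        SF.append(i); DF.append(j)
        F.remove(i); F.remove(j)
    return sorted(SF), sorted(DF)
```

For a three-feature matrix where features 0 and 1 are redundant with each other and feature 2 is independent, rule 1 selects feature 2 and rule 3 then keeps feature 0 and deletes feature 1.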
Feature selection is performed during offline clustering; the selected-feature set SF is obtained and stored, and the selected features are concatenated into z'' as the input of the next stage. The online classification uses the selected-feature set SF stored during offline clustering to obtain z'' of the data as the input of its next stage.
Clustering and classification. In the offline clustering stage, the invention clusters z'' using a hierarchical clustering scheme. Each piece of data is first initialized as its own cluster, and the Euclidean inter-cluster distance is then iteratively calculated as

d(C_a, C_b) = (1 / (|C_a| · |C_b|)) Σ_{u∈C_a} Σ_{v∈C_b} sqrt( Σ_{m=1}^{M''} (u_m - v_m)² )

(where |·| represents the size of a set, and M'' is the number of indices in SF, i.e., the number of retained features); clusters whose inter-cluster distance is below a distance threshold are merged, until the distance between all clusters is greater than the threshold. The offline clustering stage retains the data of all cluster centers.
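The offline hierarchical clustering step, merging clusters bottom-up under the averaged Euclidean inter-cluster distance until every distance exceeds the threshold, can be sketched as follows; the O(n³) merge loop is written for clarity rather than efficiency, and the names are illustrative:

```python
import numpy as np

def average_linkage(a, b):
    """Mean pairwise Euclidean distance between two clusters (lists of vectors)."""
    return float(np.mean([np.linalg.norm(u - v) for u in a for v in b]))

def hierarchical_cluster(Z, threshold):
    """Merge clusters until every inter-cluster distance exceeds `threshold`.
    Z holds one z'' vector per system instance."""
    clusters = [[z] for z in Z]          # each instance starts as its own cluster
    while len(clusters) > 1:
        best, pair = None, None
        for i in range(len(clusters)):   # find the closest pair of clusters
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j])
                if best is None or d < best:
                    best, pair = d, (i, j)
        if best > threshold:             # all remaining distances exceed threshold
            break
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    # Retain one center per cluster for the online-classification stage.
    centers = [np.mean(c, axis=0) for c in clusters]
    return clusters, centers
```

On large data a library routine (e.g., SciPy's agglomerative clustering) would replace the explicit pair search.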
In the online classification stage, the Euclidean distance between the features extracted from the data and all cluster-center data retained in the offline clustering stage is calculated, and the cluster whose center is closest is selected as the category of the data. Notably, if the distance between the nearest cluster center and the data is greater than the distance threshold, the data is not classified but is instead reported to a human operator as anomalous data, which further enhances the robustness of the invention.
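The online classification step reduces to a nearest-center lookup, with the same distance threshold acting as an anomaly guard. A minimal sketch, with hypothetical names:

```python
import numpy as np

def classify_online(z, centers, threshold):
    """Assign z'' to the nearest retained cluster center; if even the nearest
    center is farther than `threshold`, flag the instance as anomalous for
    manual inspection instead of classifying it."""
    dists = [np.linalg.norm(z - c) for c in centers]
    k = int(np.argmin(dists))
    if dists[k] > threshold:
        return None              # reported as anomalous data, not classified
    return k                     # index of the assigned cluster
```

Returning a sentinel for out-of-threshold instances mirrors the patent's choice to route unmatched data to a human operator rather than force a cluster assignment.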
Anomaly detection. In the offline training part of anomaly detection, the invention trains an anomaly detection model for each cluster center obtained by clustering (this can be any existing anomaly detection model), using a training scheme consistent with the chosen anomaly detection model.
The method collects real-time online data, and performs anomaly detection on the online data by using an anomaly detection model corresponding to the data category obtained by clustering classification.
Further, after investigating thousands of real-world system instances, the invention can use clustering to automatically group system instances into different clusters, with the system instances of each cluster sharing a similar pattern. Therefore, an MTS anomaly detection model can be trained per cluster instead of per system instance; since the number of clusters is much smaller than the number of system instances, the training overhead is significantly reduced.
As one implementation, the existing traditional K-Means algorithm, or Copulas, Mc2PCA, FCFW and TICC, may be used as alternatives to the present invention, but neither their effect nor their efficiency is as desirable.
As an implementation mode, the clustering method used in the invention is hierarchical clustering, for which DBSCAN can be substituted; the Euclidean distance used when calculating the distance between data can be replaced with the Manhattan distance, SBD, or the like.
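As a sketch of one such substitution (parameters and data are illustrative), DBSCAN can replace hierarchical clustering and the Manhattan distance can replace the Euclidean distance:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight synthetic groups; eps/min_samples are illustrative choices.
rng = np.random.default_rng(2)
feats = np.vstack([rng.normal(0, 0.1, (10, 4)), rng.normal(5, 0.1, (10, 4))])
labels = DBSCAN(eps=1.0, min_samples=3, metric="manhattan").fit_predict(feats)
```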
Furthermore, as an implementation mode, any computer language can be adopted, and there are no special requirements on the software or hardware environment.
Preferably, the present invention is implemented in the computer language Python 3.8 with the software environment TensorFlow 2.2; a hardware environment consisting of a 16C32T Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz and 192GB of RAM can be used as a recommendation.
According to the data classification method based on a large-scale network, the one-dimensional convolutional autoencoder (1DCAE) embeds high-dimensional data into a low-dimensional space to extract the main features of the MTS, which effectively reduces the clustering overhead and eliminates the influence of noise and anomalies. In addition, an efficient and effective strategy is employed to select periodic and representative features, preventing certain features from interfering with the MTS clustering effect. The method is an efficient and robust scheme that achieves accurate and efficient clustering of the normal patterns of the MTS of system instances and effectively reduces the training overhead of anomaly detection models.
In order to implement the foregoing embodiment, as shown in fig. 6, this embodiment further provides a large-scale network-based data classification apparatus 10, where the apparatus 10 includes: a data acquisition module 100, a data processing module 200, a data classification module 300 and a result output module 400.
A data acquisition module 100, configured to acquire data to be detected; the data to be detected comprises a system level index and a user level index;
the data processing module 200 is configured to perform smoothing and normalization data preprocessing on the multivariate time series of the data to be detected to obtain preprocessed data;
a data classification module 300, configured to input the preprocessed data into a one-dimensional convolution automatic encoder trained by offline clustering to perform data compression processing, perform feature selection using a feature index obtained by offline clustering, and perform distance calculation according to a result of the feature selection to perform online data classification;
and a result output module 400, configured to output an online classification result of the to-be-detected data based on online data classification.
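The preprocessing performed by the data processing module 200 can be sketched as follows (a minimal single-variable illustration; the window size and input data are assumed for the example):

```python
import numpy as np

def preprocess(series, window=3):
    """Linear interpolation for missing values, sliding-window moving
    average to extract the baseline, then min-max scaling to [0, 1]."""
    x = np.asarray(series, dtype=float)
    idx = np.arange(len(x))
    mask = np.isnan(x)
    # Fill missing points by linear interpolation over observed ones.
    x[mask] = np.interp(idx[mask], idx[~mask], x[~mask])
    # Moving average smooths out short-term fluctuations.
    x = np.convolve(x, np.ones(window) / window, mode="same")
    # Scale every data point into the range [0, 1].
    return (x - x.min()) / (x.max() - x.min())

out = preprocess([1.0, np.nan, 3.0, 4.0, 100.0])
```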
According to the data classification device based on a large-scale network, the one-dimensional convolutional autoencoder (1DCAE) embeds high-dimensional data into a low-dimensional space to extract the main features of the MTS, which effectively reduces the clustering overhead and eliminates the influence of noise and anomalies. In addition, an efficient and effective strategy is employed to select periodic and representative features, preventing certain features from interfering with the MTS clustering effect. The device is an efficient and robust scheme that achieves accurate and efficient clustering of the normal patterns of the MTS of system instances and effectively reduces the training overhead of anomaly detection models.
It should be noted that the foregoing explanation of the embodiment of the data classification method based on the large-scale network is also applicable to the data classification device based on the large-scale network of the embodiment, and details are not repeated here.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A data classification method based on a large-scale network is characterized by comprising the following steps:
acquiring data to be detected; the data to be detected comprises a system level index and a user level index;
carrying out smoothing and normalization data preprocessing on the multivariate time series of the data to be detected to obtain preprocessed data;
inputting the preprocessed data into a one-dimensional convolution automatic encoder trained through offline clustering for data compression, performing feature selection by using a feature index obtained by offline clustering, and performing distance calculation according to a result of the feature selection to perform online data classification;
and outputting an online classification result of the data to be detected based on the online data classification.
2. The method of claim 1, wherein the system level metrics comprise: CPU utilization, memory utilization, disk I/O and network throughput; and the user-level metrics comprise: average response time, error rate, and page views.
3. The method of claim 1, wherein training the one-dimensional convolutional auto-encoder comprises:
performing the data preprocessing offline on the multivariate time series of the data to be detected to obtain preprocessed data;
training a one-dimensional convolution automatic encoder by using the preprocessed data and compressing the number of time points of each variable of the preprocessed data to obtain a first hidden representation;
and executing the feature selection on the first hidden representation to obtain a feature index, and performing offline clustering in a clustering mode based on the feature index to obtain a cluster center.
4. The method of claim 3, wherein the performing feature selection using the feature index obtained by the offline clustering, and performing distance calculation according to the result of the feature selection for online data classification comprises:
compressing the number of time points on each variable of the preprocessed data by using a one-dimensional convolution automatic encoder trained by offline clustering to obtain a second hidden representation;
performing feature selection on the second hidden representation using the feature index to obtain a third hidden representation; and calculating the distance between the third hidden representation and the cluster center, and selecting the cluster corresponding to the cluster center with the shortest distance as the category of the online data classification.
5. The method of claim 1, wherein the data preprocessing comprises:
filling missing values of the multivariate time series MTS by means of linear interpolation, extracting the baseline of the MTS curve with a sliding-window moving-average algorithm to smooth the MTS curve, and applying normalization to all data to scale each data point into the range [0,1], wherein the formula of the normalization is as follows:
x̂_i = (x_i − min(x)) / (max(x) − min(x))
6. the method of claim 1, wherein the feature selection comprises: deleting aperiodic features, constructing a redundant feature matrix and deleting redundant features.
7. The method of claim 6, wherein:
the deleting of aperiodic features comprises: extracting periodicity information by using YIN, and deleting aperiodic features to obtain the retained features; wherein YIN(z_sm) > 0 indicates that the feature z_sm is periodic, and YIN(z_sm) = 0 indicates that the feature z_sm has no periodic pattern;
the constructing of the redundant feature matrix comprises: constructing a redundant feature matrix R ∈ [0,1]^(M′×M′), and calculating with a normalized cross-correlation function whether redundancy exists between two features; wherein M′ represents the number of features remaining after the deletion of aperiodic features, R_ij > 0 indicates that redundancy exists between feature i and feature j, and R_ij = 0 indicates that no redundancy exists between feature i and feature j;
the deleting of redundant features comprises: defining a set of unassigned features F, wherein F comprises the indices of all M′ features, and iteratively applying preset feature selection rules, from the first rule to the fourth rule, to F until every feature is assigned to either the set of selected features SF or the set of deleted features DF, and all the selected features in SF are concatenated into z″ as the input for said clustering or said classification.
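The redundancy-matrix step of this claim can be sketched with the maximum normalized cross-correlation over all lags as the redundancy score (the exact NCC variant and the synthetic feature data are assumptions for illustration):

```python
import numpy as np

def ncc_max(a, b):
    """Maximum normalized cross-correlation over all lags, in [-1, 1]."""
    a = (a - a.mean()) / (a.std() * len(a))
    b = (b - b.mean()) / b.std()
    return float(np.correlate(a, b, mode="full").max())

def redundancy_matrix(features):
    """R in [0,1]^(M' x M'): R_ij > 0 marks features i and j redundant."""
    m = len(features)
    r = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if i != j:
                # Clamp negatives so R stays in [0, 1].
                r[i, j] = max(0.0, ncc_max(features[i], features[j]))
    return r

t = np.linspace(0, 4 * np.pi, 100)
# A sine, a slightly shifted copy (redundant pair), and a fast cosine.
feats = [np.sin(t), np.sin(t + 0.1), np.cos(7 * t)]
R = redundancy_matrix(feats)
```

The two nearly identical sine features score close to 1 against each other, while the unrelated fast cosine scores much lower.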
8. The method of claim 7, wherein z "is clustered in a hierarchical clustering manner, each piece of data is initialized to a cluster, inter-cluster distances are iteratively calculated, and clusters with inter-cluster distances below a distance threshold are merged until all inter-cluster distances are greater than the distance threshold.
9. The method of claim 8, wherein the inter-cluster distance is the Euclidean distance:
D(C_a, C_b) = (1 / (|C_a| · |C_b|)) · Σ_{x∈C_a} Σ_{y∈C_b} √( Σ_{m=1}^{M″} (x_m − y_m)² )
wherein |·| represents the size of a set, and M″ is the number of indices in SF.
10. A large-scale network-based data classification device, comprising:
the data acquisition module is used for acquiring data to be detected; the data to be detected comprises a system level index and a user level index;
the data processing module is used for carrying out smoothing and normalization data preprocessing on the multivariate time series of the data to be detected to obtain preprocessed data;
the data classification module is used for inputting the preprocessed data into a one-dimensional convolution automatic encoder trained by offline clustering to perform data compression processing, performing feature selection by using a feature index obtained by the offline clustering, and performing distance calculation according to a result of the feature selection to perform online data classification;
and the result output module is used for outputting the online classification result of the data to be detected based on the online data classification.
CN202210306441.9A 2022-03-25 2022-03-25 Data classification method and device based on large-scale network Pending CN114861753A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210306441.9A CN114861753A (en) 2022-03-25 2022-03-25 Data classification method and device based on large-scale network


Publications (1)

Publication Number Publication Date
CN114861753A true CN114861753A (en) 2022-08-05

Family

ID=82628991

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210306441.9A Pending CN114861753A (en) 2022-03-25 2022-03-25 Data classification method and device based on large-scale network

Country Status (1)

Country Link
CN (1) CN114861753A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116720063A (en) * 2023-05-24 2023-09-08 石家庄铁道大学 Rolling bearing consistency degradation characteristic extraction method based on DTW-CAE
CN116720063B (en) * 2023-05-24 2024-01-26 石家庄铁道大学 Rolling bearing consistency degradation characteristic extraction method based on DTW-CAE

Similar Documents

Publication Publication Date Title
Lemhadri et al. Lassonet: Neural networks with feature sparsity
CN107798235B (en) Unsupervised abnormal access detection method and unsupervised abnormal access detection device based on one-hot coding mechanism
JP6811276B2 (en) Sparse neural network-based anomaly detection in multidimensional time series
US20160110651A1 (en) Method of Sequential Kernel Regression Modeling for Forecasting and Prognostics
AU2012284460B2 (en) System of sequential kernel regression modeling for forecasting and prognostics
US9235208B2 (en) System of sequential kernel regression modeling for forecasting financial data
US20080091630A1 (en) System and method for defining normal operating regions and identifying anomalous behavior of units within a fleet, operating in a complex, dynamic environment
CN111709491A (en) Anomaly detection method, device and equipment based on self-encoder and storage medium
WO2013012535A1 (en) Monitoring method using kernel regression modeling with pattern sequences
EP1958034B1 (en) Use of sequential clustering for instance selection in machine condition monitoring
Ayodeji et al. Causal augmented ConvNet: A temporal memory dilated convolution model for long-sequence time series prediction
US20230085991A1 (en) Anomaly detection and filtering of time-series data
Al-Dahidi et al. A framework for reconciliating data clusters from a fleet of nuclear power plants turbines for fault diagnosis
CN110261080A (en) The rotary-type mechanical method for detecting abnormality of isomery based on multi-modal data and system
Sharma et al. A semi-supervised generalized vae framework for abnormality detection using one-class classification
CN114861753A (en) Data classification method and device based on large-scale network
CN110543869A (en) Ball screw service life prediction method and device, computer equipment and storage medium
CN113360656A (en) Abnormal data detection method, device, equipment and storage medium
CN113590396A (en) Method and system for diagnosing defect of primary device, electronic device and storage medium
Xie et al. Adaptive symbolic transfer entropy and its applications in modeling for complex industrial systems
CN114912109A (en) Abnormal behavior sequence identification method and system based on graph embedding
CN114781779A (en) Unsupervised energy consumption abnormity detection method and device and storage medium
Tinawi Machine learning for time series anomaly detection
CN113535522A (en) Abnormal condition detection method, device and equipment
CN115034278A (en) Performance index abnormality detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination