CN113988156A - Time series clustering method, system, equipment and medium - Google Patents

Time series clustering method, system, equipment and medium

Info

Publication number
CN113988156A
CN113988156A (Application No. CN202111157610.9A)
Authority
CN
China
Prior art keywords
clustering
centroid
function
feature vectors
calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111157610.9A
Other languages
Chinese (zh)
Inventor
陈静静
吴睿振
王凛
黄萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Original Assignee
Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd filed Critical Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Priority to CN202111157610.9A priority Critical patent/CN113988156A/en
Publication of CN113988156A publication Critical patent/CN113988156A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions, with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods


Abstract

The invention discloses a time series clustering method, which comprises the following steps: acquiring a plurality of time series and extracting a feature vector corresponding to each time series; acquiring a clustering K value for the plurality of feature vectors; clustering the plurality of feature vectors by using a preset clustering algorithm and the clustering K value; and clustering the plurality of time series according to the clustering result. The invention also discloses a system, a computer device and a readable storage medium. In the scheme provided by the invention, the dimensionality of each time series is reduced by extracting its feature vector, and the reduced data are then clustered with a clustering algorithm, finally realizing cluster analysis of the time series. This ensures that the feature spaces of all time series have the same dimension, imposes no requirement on the sampling step of the time series, and takes temporal information into account when the feature vectors are extracted.

Description

Time series clustering method, system, equipment and medium
Technical Field
The invention relates to the field of clustering, and in particular to a time series clustering method, system, device, and storage medium.
Background
Time series are among the most common forms of data, and most current time series analysis focuses on prediction. For some problems, however, comparing the morphology of time series is also important. For example, the daily average prices of various commodities (or the daily closing prices of stocks) form time series, and evaluating how consistent the price trends of the commodities are can be cast as a time series morphological clustering problem.
To ensure the reliability and stability of systems and services, a monitoring system has become indispensable for every company and enterprise. As the number of services, machines and the like keeps growing, analyzing a large number of time series KPIs has become a problem that urgently needs to be solved in the field of intelligent operation and maintenance. Among the many time series, some have strong correlations; if the time series data can be clustered quickly and accurately, only the distinct types of data need to be analyzed, which greatly reduces the cost of subsequent data analysis and mining work.
However, clustering time series is difficult. First, the dimensionality of a time series (the number of sampling time points) is generally high, sometimes reaching thousands of dimensions. Second, a time series evolves over time and therefore carries temporal information, which is ignored if similarity is computed naively. Third, the sampling steps of different time series are not necessarily the same: some are sampled every second, others every minute, so their dimensions are inconsistent.
Therefore, the clustering problem of the time series needs to be solved.
Disclosure of Invention
In view of the above, in order to overcome at least one aspect of the above problems, an embodiment of the present invention provides a time series clustering method, including the following steps:
acquiring a plurality of time series and extracting a feature vector corresponding to each time series;
acquiring a clustering K value for the plurality of feature vectors;
clustering the plurality of feature vectors by using a preset clustering algorithm and the clustering K value;
and clustering the plurality of time series according to the clustering result.
In some embodiments, extracting the feature vector corresponding to each time series further includes:
constructing an encoding function and a decoding function;
training the encoding function and the decoding function;
and extracting the feature vector of each time series by using the trained encoding function.
In some embodiments, training the encoding function and the decoding function further comprises:
inputting the time series into the encoding function to obtain an abstract feature vector;
inputting the abstract feature vector into the decoding function to obtain an output vector;
calculating a loss value using the output vector and the time series input into the encoding function;
and adjusting the number of hidden layers in the encoding function and the input/output dimension of each hidden layer according to the loss value, and likewise adjusting the number of hidden layers in the decoding function and the input/output dimension of each hidden layer, until the loss value meets the preset requirement.
In some embodiments, clustering a plurality of the feature vectors using a preset clustering algorithm and the clustering K value further includes:
randomly selecting K feature vectors from the plurality of feature vectors as initial centroids;
taking the K initial centroids as clustering centroids;
calculating the distance between each feature vector and each clustering centroid, and assigning each feature vector to the set of the clustering centroid with the minimum distance;
calculating the centroid of each current set, and calculating the distance between the centroid of each set and the corresponding clustering centroid;
and in response to the distance between the centroid of a set and the corresponding clustering centroid being larger than a threshold, taking the set centroid as the new clustering centroid, and returning to the step of calculating the distance between each feature vector and each clustering centroid and assigning each feature vector to the set of the clustering centroid with the minimum distance.
Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a time series clustering system, including:
an extraction module configured to acquire a plurality of time series and extract a feature vector corresponding to each time series;
an obtaining module configured to obtain a clustering K value for the plurality of feature vectors;
a first clustering module configured to cluster the plurality of feature vectors by using a preset clustering algorithm and the clustering K value;
and a second clustering module configured to cluster the plurality of time series according to the clustering result.
In some embodiments, the extraction module is further configured to:
construct an encoding function and a decoding function;
train the encoding function and the decoding function;
and extract the feature vector of each time series by using the trained encoding function.
In some embodiments, the extraction module is further configured to:
input the time series into the encoding function to obtain an abstract feature vector;
input the abstract feature vector into the decoding function to obtain an output vector;
calculate a loss value using the output vector and the time series input into the encoding function;
and adjust the number of hidden layers in the encoding function and the input/output dimension of each hidden layer according to the loss value, and likewise adjust the number of hidden layers in the decoding function and the input/output dimension of each hidden layer, until the loss value meets the preset requirement.
In some embodiments, the first clustering module is further configured to:
randomly select K feature vectors from the plurality of feature vectors as initial centroids;
take the K initial centroids as clustering centroids;
calculate the distance between each feature vector and each clustering centroid, and assign each feature vector to the set of the clustering centroid with the minimum distance;
calculate the centroid of each current set, and calculate the distance between the centroid of each set and the corresponding clustering centroid;
and in response to the distance between the centroid of a set and the corresponding clustering centroid being larger than a threshold, take the set centroid as the new clustering centroid, and return to the step of calculating the distance between each feature vector and each clustering centroid and assigning each feature vector to the set of the clustering centroid with the minimum distance.
Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a computer apparatus, including:
at least one processor; and
a memory storing a computer program operable on the processor, wherein the processor executes the program to perform any of the steps of the time series clustering method as described above.
Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of any one of the time-series clustering methods described above.
The invention has one of the following beneficial technical effects: the scheme provided by the embodiment of the invention reduces the dimensionality of each time series by extracting its feature vector, and then clusters the reduced data with a clustering algorithm, finally realizing cluster analysis of the time series. This ensures that the feature spaces of all time series have the same dimension, imposes no requirement on the sampling step of the time series, and takes temporal information into account when the feature vectors are extracted.
Drawings
To more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can derive other embodiments from these drawings without creative effort.
Fig. 1 is a schematic flow chart of a time series clustering method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an Autoencoder dimension reduction process provided in an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an encoder function and a decoder function provided in an embodiment of the present invention;
fig. 4 is a schematic diagram of 3 randomly selected time series after clustering with K = 1;
fig. 5 is a schematic diagram of 3 randomly selected time series after clustering with K = 2;
fig. 6 is a schematic diagram of 3 randomly selected time series after clustering with K = 3;
fig. 7 is a schematic diagram of 3 randomly selected time series after clustering with K = 4;
fig. 8 is a schematic structural diagram of a time series clustering system according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a computer device provided in an embodiment of the present invention;
fig. 10 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two different entities or parameters that share the same name. "First" and "second" are merely for convenience of description and should not be construed as limiting the embodiments of the present invention, and the following embodiments will not repeat this note.
According to an aspect of the present invention, an embodiment of the present invention provides a time series clustering method, as shown in fig. 1, which may include the steps of:
S1, acquiring a plurality of time series and extracting a feature vector corresponding to each time series;
S2, acquiring a clustering K value for the plurality of feature vectors;
S3, clustering the plurality of feature vectors by using a preset clustering algorithm and the clustering K value;
and S4, clustering the plurality of time series according to the clustering result.
The scheme provided by the embodiment of the invention reduces the dimensionality of each time series by extracting its feature vector, and then clusters the reduced data with a clustering algorithm, finally realizing cluster analysis of the time series. This ensures that the feature spaces of all time series have the same dimension, imposes no requirement on the sampling step of the time series, and takes temporal information into account when the feature vectors are extracted.
In some embodiments, extracting the feature vector corresponding to each time series further includes:
constructing an encoding function and a decoding function;
training the encoding function and the decoding function;
and extracting the feature vector of each time series by using the trained encoding function.
In some embodiments, training the encoding function and the decoding function further comprises:
inputting the time series into the encoding function to obtain an abstract feature vector;
inputting the abstract feature vector into the decoding function to obtain an output vector;
calculating a loss value using the output vector and the time series input into the encoding function;
and adjusting the number of hidden layers in the encoding function and the input/output dimension of each hidden layer according to the loss value, and likewise adjusting the number of hidden layers in the decoding function and the input/output dimension of each hidden layer, until the loss value meets the preset requirement.
Specifically, when performing cluster analysis on the time series, an Autoencoder may be used to reduce their dimensionality.
As shown in fig. 2, the Autoencoder comprises two processes, an encoder and a decoder. In the encoder process, one hidden layer (or several hidden layers) is created, the input data x are reduced in dimension by the encoder function, and the input data are mapped to a feature space z, i.e., an abstract feature. Then, in the decoder process, the abstract feature z is mapped back to the original space by the decoder function to obtain a reconstructed sample x'. The parameters of the encoder function and the decoder function are trained by minimizing a defined loss function.
Therefore, building an autoencoder requires the following steps:
1. determining the dimension of the abstract feature space, which can be set to an arbitrary value, here assumed to be 3;
2. determining the number of hidden layers and the input/output dimension of each layer in the encoder, and building the encoder. The activation function of the encoder is generally a sigmoid function. The number of layers and the dimensions can be set to arbitrary values: an initial value is given when training starts, and dynamic adjustment is performed according to the size of the loss function during training. Here it is assumed that the number of hidden layers is 2, that the input and output dimensions of the first hidden layer are m (the dimension of the time series) and 24, respectively, and that the input and output dimensions of the second hidden layer are 24 and 3, respectively. Thus, for a time series x = (x1, x2, …, xm), the built encoder function can be:
h1*24 = σ(x1*m·Wm*24 + b1*24)
z1*3 = σ(h1*24·W24*3 + b1*3)
3. determining the number of hidden layers and the input/output dimension of each layer in the decoder, and building the decoder. The activation function of the decoder is likewise generally a sigmoid function, and the number of layers and the dimensions can again be given initial values and adjusted dynamically according to the size of the loss function during training. The number of hidden layers, however, is generally equal to that of the encoder, with the input/output dimensions reversed. Here it is assumed that the number of hidden layers is 2, that the input and output dimensions of the first hidden layer are 3 and 24, respectively, and that the input and output dimensions of the second hidden layer are 24 and m (the dimension of the time series), respectively. Thus, for the time series x = (x1, x2, …, xm), the built decoder function can be:
h'1*24 = σ(z1*3·W3*24 + b1*24)
x'1*m = σ(h'1*24·W24*m + b1*m)
4. setting a loss function, typically the mean squared error (MSE):
MSE(x', x) = (1/m)·Σ(x'i − xi)², where the sum runs over i = 1, …, m
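As an illustrative aside (not part of the patent text), a minimal sketch of this mean-squared-error loss in Python, assuming the series and its reconstruction are plain 1-D arrays:

```python
import numpy as np

def mse(x_rec, x):
    """Mean squared error between a reconstructed series x' and the original x."""
    x_rec = np.asarray(x_rec, dtype=float)
    x = np.asarray(x, dtype=float)
    return float(np.mean((x_rec - x) ** 2))

print(mse([0.0, 0.0], [1.0, 1.0]))  # prints 1.0: average squared deviation per point
```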
5. training the parameters of the encoder function and the decoder function by minimizing the loss function:
arg min MSE(x'1*m, x1*m)
6. during training, the number of hidden layers and the input/output dimensions are continually adjusted according to the size of the loss function, and steps 3 to 5 are repeated until a satisfactory MSE is obtained. Throughout this process, the dimension of the abstract feature space must stay fixed, i.e., the output dimension of the last hidden layer of the encoder, which equals the input dimension of the first hidden layer of the decoder. Finally, the abstract feature z1*3 = (z1, z2, z3) of the series x = (x1, x2, …, xm) is obtained.
In this way, the abstract features of all the time series can be obtained and used as the input to the clustering algorithm.
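To make the construction above concrete, here is a minimal NumPy sketch (an illustration under the stated assumptions, not the patent's implementation) of an m → 24 → 3 → 24 → m sigmoid autoencoder trained by gradient descent on the MSE; the class name and hyperparameters are ours:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class TinyAutoencoder:
    """m -> 24 -> 3 -> 24 -> m sigmoid autoencoder trained by gradient descent on MSE.

    Inputs are assumed scaled to [0, 1], since the output layer is a sigmoid.
    """

    def __init__(self, m, hidden=24, code=3, seed=0):
        rng = np.random.default_rng(seed)
        dims = [m, hidden, code, hidden, m]
        self.W = [rng.normal(0.0, 0.5, (dims[i], dims[i + 1])) for i in range(4)]
        self.b = [np.zeros(dims[i + 1]) for i in range(4)]

    def forward(self, x):
        # acts[2] is the 3-d abstract feature z, acts[4] the reconstruction x'
        acts = [np.asarray(x, dtype=float)]
        for W, b in zip(self.W, self.b):
            acts.append(sigmoid(acts[-1] @ W + b))
        return acts

    def encode(self, x):
        return self.forward(x)[2]

    def train_step(self, x, lr=0.5):
        acts = self.forward(x)
        x = acts[0]
        # gradient of MSE = mean((x' - x)^2) through the output sigmoid
        delta = 2.0 * (acts[-1] - x) / x.size * acts[-1] * (1.0 - acts[-1])
        for i in range(3, -1, -1):
            grad_W = np.outer(acts[i], delta)
            grad_b = delta
            if i > 0:  # propagate before this layer's weights change
                delta = (delta @ self.W[i].T) * acts[i] * (1.0 - acts[i])
            self.W[i] -= lr * grad_W
            self.b[i] -= lr * grad_b
        return float(np.mean((acts[-1] - x) ** 2))
```

Running a few hundred `train_step` calls on a series scaled to [0, 1] should steadily lower the reconstruction error, after which `encode` yields the 3-dimensional abstract feature fed to the clustering stage.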
As shown in fig. 3, the AutoEncoder does not need the labels of the samples during optimization; essentially, each sample serves as both the input and the target output of the neural network, and the abstract feature representation z of the sample is learned by minimizing the reconstruction error. This unsupervised optimization greatly improves the generality of the model. In a neural-network-based AutoEncoder, the encoder part compresses the data by reducing the number of neurons layer by layer, while the decoder part increases the number of neurons layer by layer starting from the abstract representation of the data, finally reconstructing the input sample.
In some embodiments, clustering a plurality of the feature vectors using a preset clustering algorithm and the clustering K value further includes:
randomly selecting K feature vectors from the plurality of feature vectors as initial centroids;
taking the K initial centroids as clustering centroids;
calculating the distance between each feature vector and each clustering centroid, and assigning each feature vector to the set of the clustering centroid with the minimum distance;
calculating the centroid of each current set, and calculating the distance between the centroid of each set and the corresponding clustering centroid;
and in response to the distance between the centroid of a set and the corresponding clustering centroid being larger than a threshold, taking the set centroid as the new clustering centroid, and returning to the step of calculating the distance between each feature vector and each clustering centroid and assigning each feature vector to the set of the clustering centroid with the minimum distance.
Specifically, the K-Means clustering algorithm can be used for the clustering.
When clustering, a value of k is first determined; assume k = 4. From the data set of abstract features {z(1), z(2), …}, k data points are randomly selected as centroids. For every point in the data set, the distance (e.g., the Euclidean distance) between the point and each centroid is calculated, and the point is assigned to the set of the nearest centroid; once all data have been assigned, there are k sets. The centroid of each set is then recalculated. If the distance between the newly calculated centroid and the original centroid is smaller than a set threshold (indicating that the recalculated centroid has hardly moved and tends to be stable, i.e., convergent), the clustering can be considered to have reached the expected result and the algorithm terminates. If the new centroid differs greatly from the original one, the iteration restarts from the assignment step: the distance between each point and each centroid is calculated again, and each point is assigned to the set of the nearest centroid.
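The iteration just described can be sketched as follows (a minimal NumPy illustration, assuming the feature vectors are stacked row-wise in an n × d array; the function name and defaults are ours, not the patent's):

```python
import numpy as np

def kmeans(Z, k, tol=1e-4, max_iter=100, seed=0):
    """Plain K-Means over feature vectors Z (one row per abstract feature)."""
    Z = np.asarray(Z, dtype=float)
    rng = np.random.default_rng(seed)
    # randomly pick k data points as the initial centroids
    centroids = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(max_iter):
        # Euclidean distance from every point to every centroid, then assign
        dist = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # recompute each set's centroid (keep the old one if a set went empty)
        new_centroids = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        moved = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if moved < tol:  # centroids barely moved: clustering has converged
            break
    return labels, centroids
```

On well-separated data this converges in a handful of iterations; the returned labels group the abstract features, and hence the original time series, into k clusters.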
In some embodiments, as shown in figs. 4 to 7, after clustering a number of time series with the proposed scheme using K = 1 to 4, 3 time series randomly selected from one class are plotted for each K. The similarity of the time series within each class is high, which demonstrates the feasibility of the method of the invention.
The scheme provided by the embodiment of the invention reduces the dimensionality of each time series by extracting its feature vector, and then clusters the reduced data with a clustering algorithm, finally realizing cluster analysis of the time series. This ensures that the feature spaces of all time series have the same dimension, imposes no requirement on the sampling step of the time series, and takes temporal information into account when the feature vectors are extracted.
Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a time series clustering system 400, as shown in fig. 8, including:
an extraction module 401 configured to acquire a plurality of time series and extract a feature vector corresponding to each time series;
an obtaining module 402 configured to obtain a clustering K value for the plurality of feature vectors;
a first clustering module 403 configured to cluster the plurality of feature vectors by using a preset clustering algorithm and the clustering K value;
a second clustering module 404 configured to cluster the plurality of time series according to the clustering result.
In some embodiments, the extraction module 401 is further configured to:
constructing an encoding function and a decoding function;
training the encoding function and the decoding function;
and extracting the characteristic vector of each time sequence by using the trained coding function.
In some embodiments, the extraction module 401 is further configured to:
inputting the time sequence into the coding function to obtain an abstract feature vector;
inputting the abstract feature vector into a decoding function to obtain an output vector;
calculating a loss value using the output vector and the time series input into the coding function;
and adjusting the number of hidden layers in the coding function and the input-output dimension of each hidden layer according to the loss value, and adjusting the number of hidden layers in the decoding function and the input-output dimension of each hidden layer until the loss value meets the preset requirement.
In some embodiments, the first clustering module 403 is further configured to:
randomly select K feature vectors from the plurality of feature vectors as initial centroids;
take the K initial centroids as clustering centroids;
calculate the distance between each feature vector and each clustering centroid, and assign each feature vector to the set of the clustering centroid with the minimum distance;
calculate the centroid of each current set, and calculate the distance between the centroid of each set and the corresponding clustering centroid;
and in response to the distance between the centroid of a set and the corresponding clustering centroid being larger than a threshold, take the set centroid as the new clustering centroid, and return to the step of calculating the distance between each feature vector and each clustering centroid and assigning each feature vector to the set of the clustering centroid with the minimum distance.
The scheme provided by the embodiment of the invention reduces the dimensionality of each time series by extracting its feature vector, and then clusters the reduced data with a clustering algorithm, finally realizing cluster analysis of the time series. This ensures that the feature spaces of all time series have the same dimension, imposes no requirement on the sampling step of the time series, and takes temporal information into account when the feature vectors are extracted.
Based on the same inventive concept, according to another aspect of the present invention, as shown in fig. 9, an embodiment of the present invention further provides a computer apparatus 501, including:
at least one processor 520; and
a memory 510, the memory 510 storing a computer program 511 executable on the processor, the processor 520 executing the program to perform the steps of:
S1, acquiring a plurality of time series and extracting a feature vector corresponding to each time series;
S2, acquiring a clustering K value for the plurality of feature vectors;
S3, clustering the plurality of feature vectors by using a preset clustering algorithm and the clustering K value;
and S4, clustering the plurality of time series according to the clustering result.
In some embodiments, extracting the feature vector corresponding to each time series further includes:
constructing an encoding function and a decoding function;
training the encoding function and the decoding function;
and extracting the feature vector of each time series by using the trained encoding function.
In some embodiments, training the encoding function and the decoding function further comprises:
inputting the time series into the encoding function to obtain an abstract feature vector;
inputting the abstract feature vector into the decoding function to obtain an output vector;
calculating a loss value using the output vector and the time series input into the encoding function;
and adjusting the number of hidden layers in the encoding function and the input/output dimension of each hidden layer according to the loss value, and likewise adjusting the number of hidden layers in the decoding function and the input/output dimension of each hidden layer, until the loss value meets the preset requirement.
In some embodiments, clustering a plurality of the feature vectors using a preset clustering algorithm and the clustering K value further includes:
randomly selecting K feature vectors from the plurality of feature vectors as initial centroids;
taking the K initial centroids as clustering centroids;
calculating the distance between each feature vector and each clustering centroid, and assigning each feature vector to the set of the clustering centroid with the minimum distance;
calculating the centroid of each current set, and calculating the distance between each set's centroid and the corresponding clustering centroid;
and in response to the distance between a set's centroid and the corresponding clustering centroid being greater than a threshold, taking the set centroids as the new clustering centroids and returning to the step of calculating the distance between each feature vector and each clustering centroid and assigning each feature vector to the set of the clustering centroid with the minimum distance.
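These clustering steps can be written out directly. The sketch below follows the sequence above (random initial centroids, nearest-centroid assignment, recomputation of each set's centroid, centroid-movement threshold); the `init` parameter is an addition not in the patent, included only so the example is reproducible:

```python
import numpy as np

def kmeans_patent_steps(feats, k, threshold=1e-6, max_iter=100, init=None, seed=0):
    """Cluster feature vectors following the steps above."""
    feats = np.asarray(feats, dtype=float)
    if init is None:
        # Randomly select K feature vectors as the initial clustering centroids.
        rng = np.random.default_rng(seed)
        centroids = feats[rng.choice(len(feats), size=k, replace=False)].copy()
    else:
        centroids = np.asarray(init, dtype=float).copy()
    for _ in range(max_iter):
        # Distance between each feature vector and each clustering centroid.
        dists = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=2)
        # Assign each feature vector to the set of the nearest centroid.
        labels = dists.argmin(axis=1)
        # Centroid of each current set (an empty set keeps its old centroid).
        new_centroids = np.array([
            feats[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        moved = np.linalg.norm(new_centroids - centroids, axis=1)
        centroids = new_centroids
        if np.all(moved <= threshold):
            break   # no set centroid moved beyond the threshold: stop
        # otherwise the set centroids become the clustering centroids; repeat
    return labels, centroids
```

The stopping rule mirrors the embodiment: iteration continues only while some set centroid lies farther than the threshold from its corresponding clustering centroid.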
The scheme provided by the embodiment of the invention reduces the dimensionality of the time series by extracting a feature vector from each series, then applies a clustering algorithm to the dimensionality-reduced data, and thereby achieves cluster analysis of the time series. In this way the feature space of all time series is guaranteed to have a consistent dimension, no requirement is imposed on the sampling step length of the time series, and temporal information is taken into account when the feature vectors are extracted.
Based on the same inventive concept, according to another aspect of the present invention, as shown in Fig. 10, an embodiment of the present invention further provides a computer-readable storage medium 601, where the computer-readable storage medium 601 stores computer program instructions 610 which, when executed by a processor, perform the following steps:
S1, acquiring a plurality of time series and extracting a feature vector corresponding to each time series;
S2, acquiring a clustering K value for the plurality of feature vectors;
S3, clustering the plurality of feature vectors by using a preset clustering algorithm and the clustering K value;
and S4, clustering the plurality of time series according to the clustering result.
In some embodiments, extracting the feature vector corresponding to each time series further includes:
constructing an encoding function and a decoding function;
training the encoding function and the decoding function;
and extracting the feature vector of each time series by using the trained encoding function.
In some embodiments, training the encoding function and the decoding function further comprises:
inputting the time series into the encoding function to obtain an abstract feature vector;
inputting the abstract feature vector into the decoding function to obtain an output vector;
calculating a loss value using the output vector and the time series input into the encoding function;
and adjusting, according to the loss value, the number of hidden layers and the input and output dimensions of each hidden layer in the encoding function and in the decoding function, until the loss value meets a preset requirement.
In some embodiments, clustering a plurality of the feature vectors using a preset clustering algorithm and the clustering K value further includes:
randomly selecting K feature vectors from the plurality of feature vectors as initial centroids;
taking the K initial centroids as clustering centroids;
calculating the distance between each feature vector and each clustering centroid, and assigning each feature vector to the set of the clustering centroid with the minimum distance;
calculating the centroid of each current set, and calculating the distance between each set's centroid and the corresponding clustering centroid;
and in response to the distance between a set's centroid and the corresponding clustering centroid being greater than a threshold, taking the set centroids as the new clustering centroids and returning to the step of calculating the distance between each feature vector and each clustering centroid and assigning each feature vector to the set of the clustering centroid with the minimum distance.
The scheme provided by the embodiment of the invention reduces the dimensionality of the time series by extracting a feature vector from each series, then applies a clustering algorithm to the dimensionality-reduced data, and thereby achieves cluster analysis of the time series. In this way the feature space of all time series is guaranteed to have a consistent dimension, no requirement is imposed on the sampling step length of the time series, and temporal information is taken into account when the feature vectors are extracted.
Finally, it should be noted that, as will be understood by those skilled in the art, all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above.
Further, it should be appreciated that the computer-readable storage media (e.g., memory) herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, of embodiments of the invention is limited to these examples; within the idea of an embodiment of the invention, also technical features in the above embodiment or in different embodiments may be combined and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (10)

1. A time series clustering method, characterized by comprising the following steps:
acquiring a plurality of time series and extracting a feature vector corresponding to each time series;
acquiring a clustering K value for the plurality of feature vectors;
clustering the plurality of feature vectors by using a preset clustering algorithm and the clustering K value;
and clustering the plurality of time series according to the clustering result.
2. The method of claim 1, wherein extracting the feature vector corresponding to each time series further comprises:
constructing an encoding function and a decoding function;
training the encoding function and the decoding function;
and extracting the feature vector of each time series by using the trained encoding function.
3. The method of claim 2, wherein training the encoding function and the decoding function further comprises:
inputting the time series into the encoding function to obtain an abstract feature vector;
inputting the abstract feature vector into the decoding function to obtain an output vector;
calculating a loss value using the output vector and the time series input into the encoding function;
and adjusting, according to the loss value, the number of hidden layers and the input and output dimensions of each hidden layer in the encoding function and in the decoding function, until the loss value meets a preset requirement.
4. The method of claim 1, wherein clustering a plurality of the feature vectors using a preset clustering algorithm and the clustering K value, further comprises:
randomly selecting K feature vectors from the plurality of feature vectors as initial centroids;
taking the K initial centroids as clustering centroids;
calculating the distance between each feature vector and each clustering centroid, and assigning each feature vector to the set of the clustering centroid with the minimum distance;
calculating the centroid of each current set, and calculating the distance between each set's centroid and the corresponding clustering centroid;
and in response to the distance between a set's centroid and the corresponding clustering centroid being greater than a threshold, taking the set centroids as the new clustering centroids and returning to the step of calculating the distance between each feature vector and each clustering centroid and assigning each feature vector to the set of the clustering centroid with the minimum distance.
5. A time series clustering system, comprising:
an extraction module configured to acquire a plurality of time series and extract a feature vector corresponding to each time series;
an obtaining module configured to obtain a clustering K value for the plurality of feature vectors;
a first clustering module configured to cluster the plurality of feature vectors by using a preset clustering algorithm and the clustering K value;
and a second clustering module configured to cluster the plurality of time series according to the clustering result.
6. The system of claim 5, wherein the extraction module is further configured to:
constructing an encoding function and a decoding function;
training the encoding function and the decoding function;
and extracting the feature vector of each time series by using the trained encoding function.
7. The system of claim 6, wherein the extraction module is further configured to:
inputting the time series into the encoding function to obtain an abstract feature vector;
inputting the abstract feature vector into the decoding function to obtain an output vector;
calculating a loss value using the output vector and the time series input into the encoding function;
and adjusting, according to the loss value, the number of hidden layers and the input and output dimensions of each hidden layer in the encoding function and in the decoding function, until the loss value meets a preset requirement.
8. The system of claim 5, wherein the first clustering module is further configured to:
randomly selecting K feature vectors from the plurality of feature vectors as initial centroids;
taking the K initial centroids as clustering centroids;
calculating the distance between each feature vector and each clustering centroid, and assigning each feature vector to the set of the clustering centroid with the minimum distance;
calculating the centroid of each current set, and calculating the distance between each set's centroid and the corresponding clustering centroid;
and in response to the distance between a set's centroid and the corresponding clustering centroid being greater than a threshold, taking the set centroids as the new clustering centroids and returning to the step of calculating the distance between each feature vector and each clustering centroid and assigning each feature vector to the set of the clustering centroid with the minimum distance.
9. A computer device, comprising:
at least one processor; and
a memory storing a computer program operable on the processor, wherein the processor executes the program to perform the steps of the method according to any one of claims 1-4.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, is adapted to carry out the steps of the method according to any one of claims 1-4.
CN202111157610.9A 2021-09-30 2021-09-30 Time series clustering method, system, equipment and medium Pending CN113988156A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111157610.9A CN113988156A (en) 2021-09-30 2021-09-30 Time series clustering method, system, equipment and medium


Publications (1)

Publication Number Publication Date
CN113988156A true CN113988156A (en) 2022-01-28

Family

ID=79737356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111157610.9A Pending CN113988156A (en) 2021-09-30 2021-09-30 Time series clustering method, system, equipment and medium

Country Status (1)

Country Link
CN (1) CN113988156A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115034337A (en) * 2022-08-10 2022-09-09 江西科骏实业有限公司 Method and device for identifying state of traction motor in rail transit vehicle and medium
WO2023169274A1 (en) * 2022-03-08 2023-09-14 阿里巴巴(中国)有限公司 Data processing method and device, and storage medium and processor



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination