CN113988156A - Time series clustering method, system, equipment and medium - Google Patents

Time series clustering method, system, equipment and medium

Info

Publication number
CN113988156A
CN113988156A (Application No. CN202111157610.9A)
Authority
CN
China
Prior art keywords
clustering
centroid
function
feature vectors
calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111157610.9A
Other languages
Chinese (zh)
Inventor
陈静静
吴睿振
王凛
黄萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Original Assignee
Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd filed Critical Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Priority to CN202111157610.9A priority Critical patent/CN113988156A/en
Publication of CN113988156A publication Critical patent/CN113988156A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions, with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods


Abstract

The invention discloses a time series clustering method, which comprises the following steps: acquiring a plurality of time series and extracting a feature vector corresponding to each time series; acquiring a clustering K value for the plurality of feature vectors; clustering the plurality of feature vectors by using a preset clustering algorithm and the clustering K value; and clustering the plurality of time series according to the clustering result. The invention also discloses a system, a computer device and a readable storage medium. In the scheme provided by the invention, the dimensionality of each time series is reduced by extracting its feature vector, and the reduced data are then clustered with a clustering algorithm, finally realizing cluster analysis of the time series. This ensures that the feature spaces of all time series have the same dimension, imposes no requirement on the sampling step of the time series, and takes temporal information into account when the feature vectors are extracted.

Description

Time series clustering method, system, equipment and medium
Technical Field
The invention relates to the field of clustering, and in particular to a time series clustering method, system, device, and storage medium.
Background
Time series are among the most common forms of data, and most current time series analysis focuses on prediction. For some problems, however, comparing the morphology of time series is also important. For example, the daily average prices of various commodities (or the daily closing prices of stocks) form time series, and evaluating how consistent the price trends of the commodities are can be cast as a time series morphological clustering problem.
To ensure the reliability and stability of systems and services, a monitoring system has become indispensable for every company and enterprise. As the number of services, machines and the like keeps growing, analyzing a large number of time series KPIs has become a problem that urgently needs to be solved in the field of intelligent operation and maintenance. Among the many time series, some have strong correlations; if the time series data can be clustered quickly and accurately, only the distinct types of data need to be analyzed, which greatly reduces the cost of subsequent data analysis and mining work.
However, clustering time series is difficult. First, the dimensionality of a time series (the number of sampling time points) is generally high, sometimes reaching thousands of dimensions. Second, a time series evolves over time and therefore carries temporal information, which is ignored if similarity is computed naively. Third, the sampling steps of different time series are not necessarily the same: some are sampled every second, others every minute, so their dimensions are inconsistent.
Therefore, the clustering problem of the time series needs to be solved.
Disclosure of Invention
In view of the above, in order to overcome at least one aspect of the above problems, an embodiment of the present invention provides a time series clustering method, including the following steps:
acquiring a plurality of time series and extracting a feature vector corresponding to each time series;
acquiring a clustering K value for the plurality of feature vectors;
clustering the plurality of feature vectors by using a preset clustering algorithm and the clustering K value;
and clustering the plurality of time series according to the clustering result.
In some embodiments, extracting the feature vector corresponding to each time series further includes:
constructing an encoding function and a decoding function;
training the encoding function and the decoding function;
and extracting the feature vector of each time series by using the trained encoding function.
In some embodiments, training the encoding function and the decoding function further comprises:
inputting the time series into the encoding function to obtain an abstract feature vector;
inputting the abstract feature vector into the decoding function to obtain an output vector;
calculating a loss value using the output vector and the time series input into the encoding function;
and adjusting the number of hidden layers in the encoding function and the input/output dimension of each hidden layer according to the loss value, and likewise adjusting the number of hidden layers in the decoding function and the input/output dimension of each hidden layer, until the loss value meets the preset requirement.
In some embodiments, clustering a plurality of the feature vectors using a preset clustering algorithm and the clustering K value further includes:
randomly selecting K feature vectors from the plurality of feature vectors as initial centroids;
taking the K initial centroids as clustering centroids;
calculating the distance between each feature vector and each clustering centroid, and assigning each feature vector to the set of the clustering centroid with the minimum distance;
calculating the centroid of each current set, and calculating the distance between the centroid of each set and the corresponding clustering centroid;
and in response to the distance between the centroid of a set and the corresponding clustering centroid being larger than a threshold, taking the set centroid as the new clustering centroid, and returning to the step of calculating the distance between each feature vector and each clustering centroid and assigning each feature vector to the set of the clustering centroid with the minimum distance.
Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a time series clustering system, including:
an extraction module configured to acquire a plurality of time series and extract a feature vector corresponding to each time series;
an obtaining module configured to obtain a clustering K value for the plurality of feature vectors;
a first clustering module configured to cluster the plurality of feature vectors by using a preset clustering algorithm and the clustering K value;
and a second clustering module configured to cluster the plurality of time series according to the clustering result.
In some embodiments, the extraction module is further configured to:
construct an encoding function and a decoding function;
train the encoding function and the decoding function;
and extract the feature vector of each time series by using the trained encoding function.
In some embodiments, the extraction module is further configured to:
input the time series into the encoding function to obtain an abstract feature vector;
input the abstract feature vector into the decoding function to obtain an output vector;
calculate a loss value using the output vector and the time series input into the encoding function;
and adjust the number of hidden layers in the encoding function and the input/output dimension of each hidden layer according to the loss value, and likewise adjust the number of hidden layers in the decoding function and the input/output dimension of each hidden layer, until the loss value meets the preset requirement.
In some embodiments, the first clustering module is further configured to:
randomly select K feature vectors from the plurality of feature vectors as initial centroids;
take the K initial centroids as clustering centroids;
calculate the distance between each feature vector and each clustering centroid, and assign each feature vector to the set of the clustering centroid with the minimum distance;
calculate the centroid of each current set, and calculate the distance between the centroid of each set and the corresponding clustering centroid;
and in response to the distance between the centroid of a set and the corresponding clustering centroid being larger than a threshold, take the set centroid as the new clustering centroid, and return to the step of calculating the distance between each feature vector and each clustering centroid and assigning each feature vector to the set of the clustering centroid with the minimum distance.
Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a computer apparatus, including:
at least one processor; and
a memory storing a computer program operable on the processor, wherein the processor executes the program to perform any of the steps of the time series clustering method as described above.
Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of any one of the time-series clustering methods described above.
The invention has one of the following beneficial technical effects: the scheme provided by the embodiment of the invention reduces the dimensionality of each time series by extracting its feature vector, and then clusters the reduced data with a clustering algorithm, finally realizing cluster analysis of the time series. This ensures that the feature spaces of all time series have the same dimension, imposes no requirement on the sampling step of the time series, and takes temporal information into account when the feature vectors are extracted.
Drawings
To more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can derive other embodiments from these drawings without creative effort.
Fig. 1 is a schematic flow chart of a time series clustering method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an Autoencoder dimension reduction process provided in an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an encoder function and a decoder function provided in an embodiment of the present invention;
fig. 4 is a schematic diagram of 3 randomly selected time series after clustering with K = 1;
fig. 5 is a schematic diagram of 3 randomly selected time series after clustering with K = 2;
fig. 6 is a schematic diagram of 3 randomly selected time series after clustering with K = 3;
fig. 7 is a schematic diagram of 3 randomly selected time series after clustering with K = 4;
fig. 8 is a schematic structural diagram of a time series clustering system according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a computer device provided in an embodiment of the present invention;
fig. 10 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two different entities or parameters that share the same name. "First" and "second" are merely for convenience of description and should not be construed as limiting the embodiments of the present invention, and the following embodiments will not repeat this note.
According to an aspect of the present invention, an embodiment of the present invention provides a time series clustering method, as shown in fig. 1, which may include the steps of:
S1, acquiring a plurality of time series and extracting a feature vector corresponding to each time series;
S2, acquiring a clustering K value for the plurality of feature vectors;
S3, clustering the plurality of feature vectors by using a preset clustering algorithm and the clustering K value;
and S4, clustering the plurality of time series according to the clustering result.
The scheme provided by the embodiment of the invention reduces the dimensionality of each time series by extracting its feature vector, and then clusters the reduced data with a clustering algorithm, finally realizing cluster analysis of the time series. This ensures that the feature spaces of all time series have the same dimension, imposes no requirement on the sampling step of the time series, and takes temporal information into account when the feature vectors are extracted.
In some embodiments, extracting the feature vector corresponding to each time series further includes:
constructing an encoding function and a decoding function;
training the encoding function and the decoding function;
and extracting the feature vector of each time series by using the trained encoding function.
In some embodiments, training the encoding function and the decoding function further comprises:
inputting the time series into the encoding function to obtain an abstract feature vector;
inputting the abstract feature vector into the decoding function to obtain an output vector;
calculating a loss value using the output vector and the time series input into the encoding function;
and adjusting the number of hidden layers in the encoding function and the input/output dimension of each hidden layer according to the loss value, and likewise adjusting the number of hidden layers in the decoding function and the input/output dimension of each hidden layer, until the loss value meets the preset requirement.
Specifically, when performing cluster analysis on the time series, an Autoencoder may be used to reduce their dimensionality.
As shown in fig. 2, the Autoencoder comprises two processes, an encoder and a decoder. In the encoder process, one hidden layer (or several hidden layers) is created, the input data x are reduced in dimension by the encoder function, and the input data are mapped to a feature space z, i.e., an abstract feature. Then, in the decoder process, the abstract feature z is mapped back to the original space by the decoder function to obtain a reconstructed sample x'. The parameters of the encoder function and the decoder function are trained by minimizing a defined loss function.
Therefore, building an autoencoder requires the following steps:
1. determining the dimension of the abstract feature space, which can be set to an arbitrary value, here assumed to be 3;
2. determining the number of hidden layers and the input/output dimension of each layer in the encoder, and building the encoder. The activation function of the encoder is generally a sigmoid function. The number of layers and the dimensions can be set to arbitrary values: an initial value is given when training starts, and dynamic adjustment is performed according to the size of the loss function during training. Here it is assumed that the number of hidden layers is 2, that the input and output dimensions of the first hidden layer are m (the dimension of the time series) and 24, respectively, and that the input and output dimensions of the second hidden layer are 24 and 3, respectively. Thus, for a time series x = (x1, x2, …, xm), the built encoder function can be:
h1*24 = σ(x1*m·Wm*24 + b1*24)
z1*3 = σ(h1*24·W24*3 + b1*3)
3. determining the number of hidden layers and the input/output dimension of each layer in the decoder, and building the decoder. The activation function of the decoder is likewise generally a sigmoid function, and the number of layers and the dimensions can again be given initial values and adjusted dynamically according to the size of the loss function during training. The number of hidden layers, however, is generally equal to that of the encoder, with the input/output dimensions reversed. Here it is assumed that the number of hidden layers is 2, that the input and output dimensions of the first hidden layer are 3 and 24, respectively, and that the input and output dimensions of the second hidden layer are 24 and m (the dimension of the time series), respectively. Thus, for the time series x = (x1, x2, …, xm), the built decoder function can be:
h'1*24 = σ(z1*3·W3*24 + b1*24)
x'1*m = σ(h'1*24·W24*m + b1*m)
4. setting a loss function, typically the mean squared error (MSE):
MSE(x', x) = (1/m)·Σ(x'i − xi)², where the sum runs over i = 1, …, m
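As an illustrative aside (not part of the patent text), a minimal sketch of this mean-squared-error loss in Python, assuming the series and its reconstruction are plain 1-D arrays:

```python
import numpy as np

def mse(x_rec, x):
    """Mean squared error between a reconstructed series x' and the original x."""
    x_rec = np.asarray(x_rec, dtype=float)
    x = np.asarray(x, dtype=float)
    return float(np.mean((x_rec - x) ** 2))

print(mse([0.0, 0.0], [1.0, 1.0]))  # prints 1.0: average squared deviation per point
```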
5. training the parameters of the encoder function and the decoder function by minimizing the loss function:
arg min MSE(x'1*m, x1*m)
6. during training, the number of hidden layers and the input/output dimensions are continually adjusted according to the size of the loss function, and steps 3 to 5 are repeated until a satisfactory MSE is obtained. Throughout this process, the dimension of the abstract feature space must stay fixed, i.e., the output dimension of the last hidden layer of the encoder, which equals the input dimension of the first hidden layer of the decoder. Finally, the abstract feature z1*3 = (z1, z2, z3) of the series x = (x1, x2, …, xm) is obtained.
In this way, the abstract features of all the time series can be obtained and used as the input to the clustering algorithm.
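To make the construction above concrete, here is a minimal NumPy sketch (an illustration under the stated assumptions, not the patent's implementation) of an m → 24 → 3 → 24 → m sigmoid autoencoder trained by gradient descent on the MSE; the class name and hyperparameters are ours:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class TinyAutoencoder:
    """m -> 24 -> 3 -> 24 -> m sigmoid autoencoder trained by gradient descent on MSE.

    Inputs are assumed scaled to [0, 1], since the output layer is a sigmoid.
    """

    def __init__(self, m, hidden=24, code=3, seed=0):
        rng = np.random.default_rng(seed)
        dims = [m, hidden, code, hidden, m]
        self.W = [rng.normal(0.0, 0.5, (dims[i], dims[i + 1])) for i in range(4)]
        self.b = [np.zeros(dims[i + 1]) for i in range(4)]

    def forward(self, x):
        # acts[2] is the 3-d abstract feature z, acts[4] the reconstruction x'
        acts = [np.asarray(x, dtype=float)]
        for W, b in zip(self.W, self.b):
            acts.append(sigmoid(acts[-1] @ W + b))
        return acts

    def encode(self, x):
        return self.forward(x)[2]

    def train_step(self, x, lr=0.5):
        acts = self.forward(x)
        x = acts[0]
        # gradient of MSE = mean((x' - x)^2) through the output sigmoid
        delta = 2.0 * (acts[-1] - x) / x.size * acts[-1] * (1.0 - acts[-1])
        for i in range(3, -1, -1):
            grad_W = np.outer(acts[i], delta)
            grad_b = delta
            if i > 0:  # propagate before this layer's weights change
                delta = (delta @ self.W[i].T) * acts[i] * (1.0 - acts[i])
            self.W[i] -= lr * grad_W
            self.b[i] -= lr * grad_b
        return float(np.mean((acts[-1] - x) ** 2))
```

Running a few hundred `train_step` calls on a series scaled to [0, 1] should steadily lower the reconstruction error, after which `encode` yields the 3-dimensional abstract feature fed to the clustering stage.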
As shown in fig. 3, the AutoEncoder does not need the labels of the samples during optimization; essentially, each sample serves as both the input and the target output of the neural network, and the abstract feature representation z of the sample is learned by minimizing the reconstruction error. This unsupervised optimization greatly improves the generality of the model. In a neural-network-based AutoEncoder, the encoder part compresses the data by reducing the number of neurons layer by layer, while the decoder part increases the number of neurons layer by layer starting from the abstract representation of the data, finally reconstructing the input sample.
In some embodiments, clustering a plurality of the feature vectors using a preset clustering algorithm and the clustering K value further includes:
randomly selecting K feature vectors from the plurality of feature vectors as initial centroids;
taking the K initial centroids as clustering centroids;
calculating the distance between each feature vector and each clustering centroid, and assigning each feature vector to the set of the clustering centroid with the minimum distance;
calculating the centroid of each current set, and calculating the distance between the centroid of each set and the corresponding clustering centroid;
and in response to the distance between the centroid of a set and the corresponding clustering centroid being larger than a threshold, taking the set centroid as the new clustering centroid, and returning to the step of calculating the distance between each feature vector and each clustering centroid and assigning each feature vector to the set of the clustering centroid with the minimum distance.
Specifically, the K-Means clustering algorithm can be used for the clustering.
When clustering, a value of k is first determined; assume k = 4. From the data set of abstract features {z(1), z(2), …}, k data points are randomly selected as centroids. For every point in the data set, the distance (e.g., the Euclidean distance) between the point and each centroid is calculated, and the point is assigned to the set of the nearest centroid; once all data have been assigned, there are k sets. The centroid of each set is then recalculated. If the distance between the newly calculated centroid and the original centroid is smaller than a set threshold (indicating that the recalculated centroid has hardly moved and tends to be stable, i.e., convergent), the clustering can be considered to have reached the expected result and the algorithm terminates. If the new centroid differs greatly from the original one, the iteration restarts from the assignment step: the distance between each point and each centroid is calculated again, and each point is assigned to the set of the nearest centroid.
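The iteration just described can be sketched as follows (a minimal NumPy illustration, assuming the feature vectors are stacked row-wise in an n × d array; the function name and defaults are ours, not the patent's):

```python
import numpy as np

def kmeans(Z, k, tol=1e-4, max_iter=100, seed=0):
    """Plain K-Means over feature vectors Z (one row per abstract feature)."""
    Z = np.asarray(Z, dtype=float)
    rng = np.random.default_rng(seed)
    # randomly pick k data points as the initial centroids
    centroids = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(max_iter):
        # Euclidean distance from every point to every centroid, then assign
        dist = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # recompute each set's centroid (keep the old one if a set went empty)
        new_centroids = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        moved = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if moved < tol:  # centroids barely moved: clustering has converged
            break
    return labels, centroids
```

On well-separated data this converges in a handful of iterations; the returned labels group the abstract features, and hence the original time series, into k clusters.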
In some embodiments, as shown in figs. 4 to 7, after clustering a number of time series with the proposed scheme using K = 1 to 4, 3 time series randomly selected from one class are plotted for each K. The similarity of the time series within each class is high, which demonstrates the feasibility of the method of the invention.
The scheme provided by the embodiment of the invention reduces the dimensionality of each time series by extracting its feature vector, and then clusters the reduced data with a clustering algorithm, finally realizing cluster analysis of the time series. This ensures that the feature spaces of all time series have the same dimension, imposes no requirement on the sampling step of the time series, and takes temporal information into account when the feature vectors are extracted.
Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a time series clustering system 400, as shown in fig. 8, including:
an extraction module 401 configured to acquire a plurality of time series and extract a feature vector corresponding to each time series;
an obtaining module 402 configured to obtain a clustering K value for the plurality of feature vectors;
a first clustering module 403 configured to cluster the plurality of feature vectors by using a preset clustering algorithm and the clustering K value;
a second clustering module 404 configured to cluster the plurality of time series according to the clustering result.
In some embodiments, the extraction module 401 is further configured to:
constructing an encoding function and a decoding function;
training the encoding function and the decoding function;
and extracting the characteristic vector of each time sequence by using the trained coding function.
In some embodiments, the extraction module 401 is further configured to:
inputting the time sequence into the coding function to obtain an abstract feature vector;
inputting the abstract feature vector into a decoding function to obtain an output vector;
calculating a loss value using the output vector and the time series input into the coding function;
and adjusting the number of hidden layers in the coding function and the input-output dimension of each hidden layer according to the loss value, and adjusting the number of hidden layers in the decoding function and the input-output dimension of each hidden layer until the loss value meets the preset requirement.
In some embodiments, the first clustering module 403 is further configured to:
randomly select K feature vectors from the plurality of feature vectors as initial centroids;
take the K initial centroids as clustering centroids;
calculate the distance between each feature vector and each clustering centroid, and assign each feature vector to the set of the clustering centroid with the minimum distance;
calculate the centroid of each current set, and calculate the distance between the centroid of each set and the corresponding clustering centroid;
and in response to the distance between the centroid of a set and the corresponding clustering centroid being larger than a threshold, take the set centroid as the new clustering centroid, and return to the step of calculating the distance between each feature vector and each clustering centroid and assigning each feature vector to the set of the clustering centroid with the minimum distance.
The scheme provided by the embodiment of the invention reduces the dimensionality of each time series by extracting its feature vector, and then clusters the reduced data with a clustering algorithm, finally realizing cluster analysis of the time series. This ensures that the feature spaces of all time series have the same dimension, imposes no requirement on the sampling step of the time series, and takes temporal information into account when the feature vectors are extracted.
Based on the same inventive concept, according to another aspect of the present invention, as shown in fig. 9, an embodiment of the present invention further provides a computer apparatus 501, including:
at least one processor 520; and
a memory 510, the memory 510 storing a computer program 511 executable on the processor, the processor 520 executing the program to perform the steps of:
S1, acquiring a plurality of time series and extracting a feature vector corresponding to each time series;
S2, acquiring a clustering K value for the plurality of feature vectors;
S3, clustering the plurality of feature vectors by using a preset clustering algorithm and the clustering K value;
and S4, clustering the plurality of time series according to the clustering result.
In some embodiments, extracting the feature vector corresponding to each time series further includes:
constructing an encoding function and a decoding function;
training the encoding function and the decoding function;
and extracting the feature vector of each time series by using the trained encoding function.
In some embodiments, training the encoding function and the decoding function further comprises:
inputting the time series into the encoding function to obtain an abstract feature vector;
inputting the abstract feature vector into the decoding function to obtain an output vector;
calculating a loss value using the output vector and the time series input into the encoding function;
and adjusting the number of hidden layers in the encoding function and the input/output dimension of each hidden layer according to the loss value, and likewise adjusting the number of hidden layers in the decoding function and the input/output dimension of each hidden layer, until the loss value meets the preset requirement.
In some embodiments, clustering a plurality of the feature vectors using a preset clustering algorithm and the clustering K value further includes:
randomly selecting K feature vectors from the plurality of feature vectors as initial centroids;
taking the K initial centroids as clustering centroids;
calculating the distance between each feature vector and each clustering centroid, and assigning each feature vector to the set of the clustering centroid with the minimum distance;
calculating the centroid of each current set, and calculating the distance between each set's centroid and the corresponding clustering centroid;
and in response to the distance between a set's centroid and the corresponding clustering centroid being greater than a threshold, taking the set centroids as the new clustering centroids and returning to the step of calculating the distance between each feature vector and each clustering centroid and assigning each feature vector to the set of the clustering centroid with the minimum distance.
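These clustering steps can be written out directly. The sketch below follows the sequence above (random initial centroids, nearest-centroid assignment, recomputation of each set's centroid, centroid-movement threshold); the `init` parameter is an addition not in the patent, included only so the example is reproducible:

```python
import numpy as np

def kmeans_patent_steps(feats, k, threshold=1e-6, max_iter=100, init=None, seed=0):
    """Cluster feature vectors following the steps above."""
    feats = np.asarray(feats, dtype=float)
    if init is None:
        # Randomly select K feature vectors as the initial clustering centroids.
        rng = np.random.default_rng(seed)
        centroids = feats[rng.choice(len(feats), size=k, replace=False)].copy()
    else:
        centroids = np.asarray(init, dtype=float).copy()
    for _ in range(max_iter):
        # Distance between each feature vector and each clustering centroid.
        dists = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=2)
        # Assign each feature vector to the set of the nearest centroid.
        labels = dists.argmin(axis=1)
        # Centroid of each current set (an empty set keeps its old centroid).
        new_centroids = np.array([
            feats[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        moved = np.linalg.norm(new_centroids - centroids, axis=1)
        centroids = new_centroids
        if np.all(moved <= threshold):
            break   # no set centroid moved beyond the threshold: stop
        # otherwise the set centroids become the clustering centroids; repeat
    return labels, centroids
```

The stopping rule mirrors the embodiment: iteration continues only while some set centroid lies farther than the threshold from its corresponding clustering centroid.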
The scheme provided by the embodiment of the invention reduces the dimensionality of the time series by extracting a feature vector from each series, then applies a clustering algorithm to the dimensionality-reduced data, and thereby achieves cluster analysis of the time series. In this way the feature space of all time series is guaranteed to have a consistent dimension, no requirement is imposed on the sampling step length of the time series, and temporal information is taken into account when the feature vectors are extracted.
Based on the same inventive concept, according to another aspect of the present invention, as shown in Fig. 10, an embodiment of the present invention further provides a computer-readable storage medium 601, where the computer-readable storage medium 601 stores computer program instructions 610 which, when executed by a processor, perform the following steps:
S1, acquiring a plurality of time series and extracting a feature vector corresponding to each time series;
S2, acquiring a clustering K value for the plurality of feature vectors;
S3, clustering the plurality of feature vectors by using a preset clustering algorithm and the clustering K value;
and S4, clustering the plurality of time series according to the clustering result.
In some embodiments, extracting the feature vector corresponding to each time series further includes:
constructing an encoding function and a decoding function;
training the encoding function and the decoding function;
and extracting the feature vector of each time series by using the trained encoding function.
In some embodiments, training the encoding function and the decoding function further comprises:
inputting the time series into the encoding function to obtain an abstract feature vector;
inputting the abstract feature vector into the decoding function to obtain an output vector;
calculating a loss value using the output vector and the time series input into the encoding function;
and adjusting, according to the loss value, the number of hidden layers and the input and output dimensions of each hidden layer in the encoding function and in the decoding function, until the loss value meets a preset requirement.
In some embodiments, clustering a plurality of the feature vectors using a preset clustering algorithm and the clustering K value further includes:
randomly selecting K feature vectors from the plurality of feature vectors as initial centroids;
taking the K initial centroids as clustering centroids;
calculating the distance between each feature vector and each clustering centroid, and assigning each feature vector to the set of the clustering centroid with the minimum distance;
calculating the centroid of each current set, and calculating the distance between each set's centroid and the corresponding clustering centroid;
and in response to the distance between a set's centroid and the corresponding clustering centroid being greater than a threshold, taking the set centroids as the new clustering centroids and returning to the step of calculating the distance between each feature vector and each clustering centroid and assigning each feature vector to the set of the clustering centroid with the minimum distance.
The scheme provided by the embodiment of the invention reduces the dimensionality of the time series by extracting a feature vector from each series, then applies a clustering algorithm to the dimensionality-reduced data, and thereby achieves cluster analysis of the time series. In this way the feature space of all time series is guaranteed to have a consistent dimension, no requirement is imposed on the sampling step length of the time series, and temporal information is taken into account when the feature vectors are extracted.
Finally, it should be noted that, as will be understood by those skilled in the art, all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above.
Further, it should be appreciated that the computer-readable storage media (e.g., memory) herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, of embodiments of the invention is limited to these examples; within the idea of an embodiment of the invention, also technical features in the above embodiment or in different embodiments may be combined and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (10)

1. A time series clustering method, characterized by comprising the following steps:
acquiring a plurality of time series and extracting a feature vector corresponding to each time series;
acquiring a clustering K value for the plurality of feature vectors;
clustering the plurality of feature vectors by using a preset clustering algorithm and the clustering K value;
and clustering the plurality of time series according to the clustering result.
2. The method of claim 1, wherein extracting the feature vector corresponding to each time series further comprises:
constructing an encoding function and a decoding function;
training the encoding function and the decoding function;
and extracting the feature vector of each time series by using the trained encoding function.
3. The method of claim 2, wherein training the encoding function and the decoding function further comprises:
inputting the time series into the encoding function to obtain an abstract feature vector;
inputting the abstract feature vector into the decoding function to obtain an output vector;
calculating a loss value using the output vector and the time series input into the encoding function;
and adjusting, according to the loss value, the number of hidden layers and the input and output dimensions of each hidden layer in the encoding function and in the decoding function, until the loss value meets a preset requirement.
4. The method of claim 1, wherein clustering a plurality of the feature vectors using a preset clustering algorithm and the clustering K value, further comprises:
randomly selecting K feature vectors from the plurality of feature vectors as initial centroids;
taking the K initial centroids as clustering centroids;
calculating the distance between each feature vector and each clustering centroid, and assigning each feature vector to the set of the clustering centroid with the minimum distance;
calculating the centroid of each current set, and calculating the distance between each set's centroid and the corresponding clustering centroid;
and in response to the distance between a set's centroid and the corresponding clustering centroid being greater than a threshold, taking the set centroids as the new clustering centroids and returning to the step of calculating the distance between each feature vector and each clustering centroid and assigning each feature vector to the set of the clustering centroid with the minimum distance.
5. A time series clustering system, comprising:
an extraction module configured to acquire a plurality of time series and extract a feature vector corresponding to each time series;
an obtaining module configured to obtain a clustering K value for the plurality of feature vectors;
a first clustering module configured to cluster the plurality of feature vectors by using a preset clustering algorithm and the clustering K value;
and a second clustering module configured to cluster the plurality of time series according to the clustering result.
6. The system of claim 5, wherein the extraction module is further configured to:
constructing an encoding function and a decoding function;
training the encoding function and the decoding function;
and extracting the feature vector of each time series by using the trained encoding function.
7. The system of claim 6, wherein the extraction module is further configured to:
inputting the time series into the encoding function to obtain an abstract feature vector;
inputting the abstract feature vector into the decoding function to obtain an output vector;
calculating a loss value using the output vector and the time series input into the encoding function;
and adjusting, according to the loss value, the number of hidden layers and the input and output dimensions of each hidden layer in the encoding function and in the decoding function, until the loss value meets a preset requirement.
8. The system of claim 5, wherein the first clustering module is further configured to:
randomly selecting K feature vectors from the plurality of feature vectors as initial centroids;
taking the K initial centroids as clustering centroids;
calculating the distance between each feature vector and each clustering centroid, and assigning each feature vector to the set of the clustering centroid with the minimum distance;
calculating the centroid of each current set, and calculating the distance between each set's centroid and the corresponding clustering centroid;
and in response to the distance between a set's centroid and the corresponding clustering centroid being greater than a threshold, taking the set centroids as the new clustering centroids and returning to the step of calculating the distance between each feature vector and each clustering centroid and assigning each feature vector to the set of the clustering centroid with the minimum distance.
9. A computer device, comprising:
at least one processor; and
a memory storing a computer program operable on the processor, wherein the processor executes the program to perform the steps of the method according to any one of claims 1-4.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, is adapted to carry out the steps of the method according to any one of claims 1-4.
CN202111157610.9A 2021-09-30 2021-09-30 Time series clustering method, system, equipment and medium Pending CN113988156A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111157610.9A CN113988156A (en) 2021-09-30 2021-09-30 Time series clustering method, system, equipment and medium


Publications (1)

Publication Number Publication Date
CN113988156A true CN113988156A (en) 2022-01-28

Family

ID=79737356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111157610.9A Pending CN113988156A (en) 2021-09-30 2021-09-30 Time series clustering method, system, equipment and medium

Country Status (1)

Country Link
CN (1) CN113988156A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115034337A (en) * 2022-08-10 2022-09-09 江西科骏实业有限公司 Method and device for identifying state of traction motor in rail transit vehicle and medium
WO2023169274A1 (en) * 2022-03-08 2023-09-14 阿里巴巴(中国)有限公司 Data processing method and device, and storage medium and processor



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination