CN111488924A

CN111488924A - Multivariate time sequence data clustering method

Info

Publication number: CN111488924A
Application number: CN202010265442.4A
Authority: CN
Inventors: 王婷; 崔运鹏; 刘娟
Original assignee: Agricultural Information Institute of CAAS
Current assignee: Agricultural Information Institute of CAAS
Priority date: 2020-04-07
Filing date: 2020-04-07
Publication date: 2020-08-04
Anticipated expiration: 2040-04-07
Also published as: CN111488924B

Abstract

The invention discloses a multivariate time series data clustering method comprising: normalizing and preprocessing the multivariate time series data; constructing a sparse autoencoder, an unsupervised deep learning model, and using it to extract features from the multivariate time series data to form a new feature sequence; obtaining the cluster number K for the new feature sequences of the sample data; calculating the distance between the new feature sequences of different samples based on the Euclidean distance; clustering the set of new feature sequences; and analyzing the latent patterns of the multivariate time series data according to the clustering result. Combining the sparse autoencoder with a clustering method improves the efficiency of processing large-scale data; the sparse autoencoder improves the extraction of new feature sequences from the multivariate time series data, and a multivariate distance calculation model built on the Euclidean distance enables the clustering of multivariate time series data.

Description

Multivariate time sequence data clustering method

Technical Field

The invention relates to the field of data clustering, in particular to a multivariate time series data clustering method.

Background

With the rapid development of the Internet of Things, research based on time series data is widely applied in fields such as finance and medicine. Clustering is an effective method of time series analysis: by mining the latent patterns of a time series, its characteristics can be analyzed, and application problems involving time series data can be studied further on that basis.

At present, time series clustering methods mainly fall into three groups. (1) Partition-based methods. The number of categories and the initial cluster centers are determined first, and the sample points are then assigned to categories by calculating the distance between each sample point and each cluster center, until convergence. (2) Density-based methods. The category radius and the number of samples within a category are determined first, and clustering then proceeds until the density of neighboring regions exceeds a set threshold. (3) Hierarchy-based methods. These are either top-down or bottom-up: the former treats all samples as a root node and splits recursively until single-sample classes appear; the latter starts from single samples and merges until a stop condition is met. These methods usually cannot mine the inherent characteristics of time series data accurately and comprehensively, and research on mining the latent patterns of multivariate time series data in particular is relatively limited, so a multivariate time series data clustering method is proposed here.

Disclosure of Invention

The invention aims to provide a multivariate time series clustering method that combines the sparse autoencoder, an unsupervised learning model, with the traditional K-means clustering method. A sparse autoencoder is constructed with the deep learning LSTM model as its basic unit and is used to extract a new feature sequence from each univariate time series; a multivariate distance calculation method built on the Euclidean distance computes the distance between the multivariate time series data of different samples; and the new feature sequences of all samples are then clustered with K-means, so that the latent patterns of the multivariate time series data can be mined effectively from the clustering result.

In order to achieve the purpose, the invention adopts the following technical scheme:

The invention comprises the following steps:

S10, preprocessing the multivariate time series data, the preprocessing comprising validating and normalizing the data;

S20, constructing a sparse autoencoder, an unsupervised deep learning model, with the deep learning LSTM model as its basic unit, and performing feature extraction on the multivariate time series data to construct a new feature sequence;

S30, obtaining the cluster number K for the new feature sequences of the sample data;

S40, calculating the distance between the new feature sequences of different sample data based on the Euclidean distance;

S50, clustering the set of new feature sequences of the sample data;

S60, analyzing the latent patterns of the multivariate time series data according to the clustering result: averaging all sample data in each category to obtain the mean value at each time point of the multivariate time series, thereby obtaining a new multivariate mean time series per category.

Further, the autoencoder model comprises an encoder and a decoder: the input data is encoded to extract new feature values, which are then decoded to obtain an output, and the optimization target is for the output to equal the input. The sparse autoencoder adds a sparsity term to the optimization function to constrain the model parameters during training. The training process is as follows:

(1) feeding the data points of the time series one by one into the LSTM unit of the encoder, and taking the sequence obtained after the last data point has been input as the new feature sequence of the sample data;

(2) taking the new feature sequence of the sample data as the input of the decoder, and iteratively training the sparse autoencoder with a mean squared error function plus a sparsity term as the optimization function. The calculation formula of the optimization function is as follows:

Further, the cluster number K of the multivariate time series data is obtained by the following steps:

(1) selecting a specific K value within the range 1 to 100, randomly generating, according to a uniform distribution, as many samples as there are original samples within the specific three-dimensional region occupied by the samples, and clustering them with the K-means method to obtain W_k. The calculation formula is as follows:

(2) obtaining the value s_k, where n takes the value 100. The calculation formula is as follows:

where W_kn denotes W_k for a specific value of n.

(3) repeating steps (1) and (2) for all K values in the range and selecting the K value at which W_k drops fastest as the optimal number of clusters. The calculation formula is as follows:

Further, the validation comprises deleting data whose proportion of missing values exceeds 80%, and the normalization comprises mapping the data of each medication type of a case to the interval (0,1) using min-max normalization. The specific formula is as follows:

Further, the new feature sequences of all sample data are clustered by the following steps:

(1) firstly, randomly dividing all sample points into K categories according to K values;

(2) calculating a new category center point for each category, and clustering all sample points again according to the distance between each sample point and each category center point;

(3) the second step is repeated until the value of K is satisfied.

The effectiveness of the proposed method is evaluated using the silhouette coefficient (SC) as the measure of clustering performance on multivariate time series data.

Here a(i) denotes the average distance from sample i to the other samples in the same cluster, and b(i) denotes the average distance from sample i to all samples in other clusters.
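The SC formula itself does not survive in this text. The following is a minimal sketch, assuming the standard silhouette definition s(i) = (b(i) - a(i)) / max(a(i), b(i)) averaged over all samples, with b(i) taken as the smallest mean distance to any other cluster (the usual convention, which may differ slightly from the wording above). The function names are illustrative only.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_coefficient(points, labels):
    """Mean silhouette value over all samples (standard definition, assumed)."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        # Mean distance from sample i to each cluster, excluding i itself.
        mean_dist = {}
        for c in clusters:
            dists = [euclidean(p, q) for j, q in enumerate(points)
                     if labels[j] == c and j != i]
            if dists:
                mean_dist[c] = sum(dists) / len(dists)
        own = labels[i]
        others = [d for c, d in mean_dist.items() if c != own]
        if own not in mean_dist or not others:
            # Singleton cluster or only one cluster: s(i) is set to 0.
            scores.append(0.0)
            continue
        a_i, b_i = mean_dist[own], min(others)
        denom = max(a_i, b_i)
        scores.append((b_i - a_i) / denom if denom else 0.0)
    return sum(scores) / len(scores)


points = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]
print(silhouette_coefficient(points, labels))
```

Values close to 1 indicate compact, well-separated clusters, which is why a higher SC is read as better clustering performance.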

Compared with the prior art, the invention has the beneficial effects that:

By combining a newer deep learning method with a traditional clustering method, the multivariate time series data clustering method provided by the invention makes full use of the high-performance feature extraction of the sparse autoencoder, an unsupervised deep learning model, on large-scale data, and of the strong sequence memory of the LSTM model on time series data. It effectively addresses the difficulty the traditional K-means clustering method has in handling large-scale data, and thereby better mines and analyzes the latent patterns of multivariate time series data.

Drawings

FIG. 1 is a flow chart of a multivariate time series data clustering method;

FIG. 2 is a block diagram of a sparse self-encoder model;

FIG. 3 is a result of performance evaluation of a multivariate time series data clustering method;

FIG. 4 is a latent pattern mining result of multivariate time series data.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.

As shown in FIG. 1, the multivariate time series data clustering method is applied to multivariate time series data acquired by a medical institution, namely medication data of multiple types for patients. The data is first preprocessed; a new feature sequence is then constructed through feature extraction; the cluster number K and the distance measure between new feature sequences are then obtained; finally, the new feature sequences are clustered with the K-means method, and the latent patterns of the original data are analyzed according to the clustering result.

Step S10: multivariate time series data preprocessing

The medical institution acquires medication data of multiple types for its patients, and the data are validated and normalized to construct the data set for clustering. Taking as an example the Medicare data set, which contains 90-day medication data for about 320,000 cases and two types of drugs (referred to as drug A and drug B), the following preprocessing is performed:

Validation. Cases whose proportion of missing values exceeds 80% are deleted, reducing the number of cases from about 320,000 to about 310,000.

Normalization. The data of each medication type of a case are mapped to the interval (0,1) using min-max normalization. The specific formula is as follows:
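The normalization formula does not survive in this text. A minimal sketch of both preprocessing operations follows, assuming the usual min-max form x' = (x - min) / (max - min). Whether the minimum and maximum are taken per case or per medication type across the data set is not stated; the sketch normalizes each series on its own, and the names `is_valid` and `min_max_normalize` are hypothetical.

```python
from typing import List, Optional

Series = List[Optional[float]]  # None marks a missing value


def is_valid(series: Series, max_missing: float = 0.8) -> bool:
    """Keep a case unless its share of missing values exceeds max_missing."""
    missing = sum(1 for v in series if v is None)
    return missing / len(series) <= max_missing


def min_max_normalize(series: Series) -> Series:
    """Map observed values to [0, 1] with (x - min) / (max - min)."""
    observed = [v for v in series if v is not None]
    if not observed:
        return list(series)
    lo, hi = min(observed), max(observed)
    span = hi - lo
    # A constant series has no spread; mapping it to 0.0 is an assumption.
    return [None if v is None else (0.0 if span == 0 else (v - lo) / span)
            for v in series]


cases = [
    [2.0, 4.0, None, 6.0],                 # 25% missing: kept
    [None, None, None, None, None, 5.0],   # about 83% missing: dropped
]
kept = [min_max_normalize(c) for c in cases if is_valid(c)]
print(kept)  # [[0.0, 0.5, None, 1.0]]
```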

Step S20: Constructing new feature sequences by feature extraction from the multivariate time series data

In this step, a sparse autoencoder, an unsupervised deep learning model, is constructed with the deep learning long short-term memory (LSTM) model as its basic unit, and feature extraction is performed on the multivariate time series data to construct a new feature sequence for the sample data.

LSTM is a special RNN model in deep learning. It can mitigate the vanishing gradient problem that arises when training on long time series and has better sequence memory than an ordinary RNN. An LSTM contains three gates: (1) the update gate, which controls the input; (2) the forget gate, which controls how much of the existing (past) content is forgotten; and (3) the output gate, which controls the output. The update formulas of the model are as follows:

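$$a^{\langle t \rangle} = \Gamma_o * \tanh(c^{\langle t \rangle})$$

$$\Gamma_f = \sigma(W_f [a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f)$$

$$\Gamma_u = \sigma(W_u [a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u)$$

$$\Gamma_o = \sigma(W_o [a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_o)$$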
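The gate formulas can be traced through a single time step. The sketch below is a minimal illustration for a scalar input and a scalar hidden state. The candidate value and the cell update are not given in this text and are taken from the standard LSTM formulation, so they are assumptions, as are the parameter layout and the function names.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, a_prev, c_prev, params):
    """One LSTM step for scalar input and scalar hidden state.

    params maps each of "f", "u", "o", "c" to a tuple (w_a, w_x, b),
    standing in for W [a_prev, x_t] + b in the gate formulas.
    """
    def affine(name):
        w_a, w_x, b = params[name]
        return w_a * a_prev + w_x * x_t + b

    gate_f = sigmoid(affine("f"))   # forget gate
    gate_u = sigmoid(affine("u"))   # update gate
    gate_o = sigmoid(affine("o"))   # output gate
    # Candidate value and cell update: standard LSTM form (assumed).
    c_cand = math.tanh(affine("c"))
    c_t = gate_u * c_cand + gate_f * c_prev
    # Hidden state: a<t> = gate_o * tanh(c<t>).
    a_t = gate_o * math.tanh(c_t)
    return a_t, c_t


def encode(series, params):
    """Feed the series point by point and return the final hidden state."""
    a, c = 0.0, 0.0
    for x in series:
        a, c = lstm_step(x, a, c, params)
    return a


params = {"f": (0.5, 0.1, 0.0), "u": (0.3, 0.8, 0.0),
          "o": (0.2, 0.4, 0.0), "c": (0.1, 0.9, 0.0)}
print(encode([0.1, 0.4, 0.9], params))
```

In the method described here the hidden state would be a vector (50-dimensional in the worked example) rather than a scalar; the scalar form only keeps the gating arithmetic visible.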

The sparse autoencoder model comprises an encoder part and a decoder part. As shown in FIG. 2, part A is the encoder of the sparse autoencoder model and part B is the decoder.

The input data is first encoded to extract new feature values, which are then decoded to obtain an output; the optimization target of the model is for the output to equal the input. The sparse autoencoder adds a sparsity term to the optimization function of the autoencoder to constrain the model parameters during training. The training process is as follows:

(1) feeding the data points of the time series one by one into the LSTM unit of the encoder, and taking the sequence obtained after the last data point has been input as the new feature sequence of the sample data;

(2) taking the new feature sequence of the time series data as the input of the decoder, and iteratively training the sparse autoencoder with a mean squared error function plus a sparsity term as the optimization function. The calculation formula of the optimization function is as follows:
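The optimization function does not survive in this text. A minimal sketch follows, assuming one common form: the mean squared reconstruction error plus an L1 penalty weighted by a coefficient `lam`. The choice of L1, the coefficient, and whether the penalty applies to model parameters or to the encoded features are all assumptions; the function simply penalizes whatever vector is passed as `penalized`.

```python
def sparse_mse_loss(x, x_hat, penalized, lam=1e-3):
    """Mean squared reconstruction error plus an L1 sparsity term (assumed form)."""
    if len(x) != len(x_hat):
        raise ValueError("input and reconstruction must have the same length")
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    sparsity = sum(abs(v) for v in penalized)
    return mse + lam * sparsity


x = [0.0, 1.0, 0.5]        # normalized input series
x_hat = [0.0, 0.5, 0.5]    # decoder output
weights = [0.2, -0.3]      # values the sparsity term constrains
print(sparse_mse_loss(x, x_hat, weights, lam=0.1))
```

The first term drives the output toward the input; the second pushes the penalized values toward zero, which is what makes the autoencoder sparse.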

In this example, each case's 90-day medication data for drug A and for drug B are each converted by feature extraction into a 50-dimensional new feature sequence.

Step S30: Obtaining the cluster number K for the set of new feature sequences of the sample data

In this step, the Gap statistic method is used to obtain the cluster number K of the multivariate time series data. The specific process is as follows:

(1) setting a value range for K;

(2) selecting a specific K value in the range, randomly generating, according to a uniform distribution, as many samples as there are original samples within the specific three-dimensional region occupied by the samples, and clustering them with K-means to obtain W_k. The calculation formula is as follows:

(3) obtaining the value s_k by repeating step (2) 2 to 5 times. The calculation formula is as follows:

(4) repeating steps (2) and (3) for all K values in the range and selecting the K value at which W_k drops fastest as the optimal number of clusters. The calculation formula is as follows:
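The formulas for W_k, s_k and the selection rule do not survive in this text. The sketch below assumes the standard Gap statistic definitions: W_k is the within-cluster sum of squared distances to the cluster mean, Gap(k) is the mean of log W_k over B uniform reference data sets minus log W_k of the real data, and s_k = sd_k * sqrt(1 + 1/B). These may differ from the patent's own expressions, and all function names are hypothetical. Cluster labels are assumed to come from a separate K-means run.

```python
import math
import random


def within_dispersion(points, labels):
    """W_k: sum over clusters of squared distances to the cluster mean."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        center = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum((m[d] - center[d]) ** 2 for m in members for d in range(dim))
    return total


def uniform_reference(points, rng):
    """Draw as many points as the data, uniformly over its bounding box."""
    dim = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dim)]
    highs = [max(p[d] for p in points) for d in range(dim)]
    return [[rng.uniform(lows[d], highs[d]) for d in range(dim)] for _ in points]


def gap_and_s(w_observed, w_references):
    """Gap(k) and s_k from the observed W_k and B reference values of W_k."""
    logs = [math.log(w) for w in w_references]
    b = len(logs)
    mean_log = sum(logs) / b
    sd = math.sqrt(sum((v - mean_log) ** 2 for v in logs) / b)
    return mean_log - math.log(w_observed), sd * math.sqrt(1.0 + 1.0 / b)


points = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]
w_k = within_dispersion(points, [0, 0, 1, 1])
reference = uniform_reference(points, random.Random(0))
print(w_k, len(reference))
```

In a full run, each reference set would be clustered with the same K to give one entry of `w_references`, and the resulting gap and s_k values would be compared across the candidate K values.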

In this example, based on the above Gap statistic calculation, 4 is selected as the cluster number K, as shown in FIG. 3.

Step S40: Calculating the distance between the new feature sequences of different sample data

In this step, a multivariate distance calculation model is constructed from the Euclidean distance and used to calculate the distance between the new feature sequences of different sample data. The calculation formula is as follows:
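The distance formula does not survive in this text. The sketch below shows one plausible form, stated as an assumption: each sample holds one feature sequence per variable (for instance one for drug A and one for drug B), and the multivariate distance is the Euclidean distance over all variables taken together, that is, the square root of the summed squared differences. The patent's actual combination rule may differ.

```python
import math


def multivariate_distance(sample_a, sample_b):
    """Euclidean distance over all per-variable feature sequences (assumed form).

    Each sample is a list with one feature vector per variable.
    """
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must have the same number of variables")
    total = 0.0
    for feat_a, feat_b in zip(sample_a, sample_b):
        if len(feat_a) != len(feat_b):
            raise ValueError("feature sequences must have the same length")
        total += sum((x - y) ** 2 for x, y in zip(feat_a, feat_b))
    return math.sqrt(total)


case_1 = [[0.0, 0.0], [0.0, 0.0]]   # features for variable 1 and variable 2
case_2 = [[3.0, 0.0], [0.0, 4.0]]
print(multivariate_distance(case_1, case_2))  # 5.0
```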

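Step S50: Clustering the set of new feature sequences of the sample data with the K-means method

In this step, the new feature sequences of all sample data are clustered. The specific process is as follows:

(1) firstly, randomly dividing all sample points into K categories according to the K value;

(2) calculating a new category center point for each category, and clustering all sample points again according to the distance between each sample point and each category center point;

(3) the second step is repeated until the value of K is satisfied.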
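The clustering procedure of this step can be sketched as follows. This is a minimal standard-library illustration rather than the patented implementation: the stopping rule (assignments no longer change), the iteration cap, and the handling of an emptied category are assumptions, and plain squared Euclidean distance stands in for the distance of step S40.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(members) for col in zip(*members)]


def kmeans(points, k, rng, max_iter=100):
    """K-means starting from a random partition into k categories."""
    # Random partition; dealing out shuffled indices keeps every category non-empty.
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for rank, idx in enumerate(order):
        labels[idx] = rank % k
    centers = []
    for _ in range(max_iter):
        # Recompute the center point of each category.
        centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # An emptied category is re-seeded with a random sample (assumption).
            centers.append(mean_point(members) if members else list(rng.choice(points)))
        # Reassign every sample to its nearest center.
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                      for p in points]
        # Stop once assignments no longer change (assumed criterion).
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers = kmeans(points, 2, random.Random(0))
print(labels, centers)
```

The sketch assumes there are at least K samples. On real data the result depends on the random initial partition, so several runs with different seeds are commonly compared.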

As shown in Table 1, the SC value of the proposed method is higher than that of the other existing methods under every distance measure, and it is highest, giving the best clustering performance, when the Euclidean distance is used as the clustering measure.

TABLE 1 Clustering performance (SC values) of different clustering methods under different distance measures

Distance measure	Hierarchical clustering	k-means	bi-kmeans	k-medoids	The invention
Euclidean	0.65	0.56	0.69	0.63	0.88
Pearson	0.41	0.49	0.65	0.59	0.72
LCSS	0.55	0.52	0.67	0.53	0.70
DTW	0.63	0.54	0.61	0.47	0.67
EDR	0.57	0.58	0.59	0.51	0.66

Step S60: Analyzing the latent patterns of the multivariate time series data according to the clustering result

All sample data in each category are averaged to obtain the mean value at each time point of the multivariate time series, giving a new multivariate mean time series per category. The latent patterns of the multivariate time series data are then studied on the basis of these mean time series.
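The per-category averaging can be sketched as below. The data layout, `samples[i][v][t]` for sample i, variable v and time point t, and the function name are assumptions made for illustration.

```python
def cluster_mean_series(samples, labels):
    """Mean multivariate time series per category.

    samples[i][v][t] is the value of variable v at time point t for sample i.
    Returns a dict mapping each label to a list of per-variable mean series.
    """
    grouped = {}
    for sample, lab in zip(samples, labels):
        grouped.setdefault(lab, []).append(sample)
    means = {}
    for lab, members in grouped.items():
        n_vars, n_steps = len(members[0]), len(members[0][0])
        means[lab] = [[sum(m[v][t] for m in members) / len(members)
                       for t in range(n_steps)]
                      for v in range(n_vars)]
    return means


samples = [
    [[0.0, 2.0], [1.0, 1.0]],    # sample 0: two variables, two time points
    [[2.0, 4.0], [3.0, 1.0]],    # sample 1
    [[10.0, 10.0], [0.0, 0.0]],  # sample 2
]
print(cluster_mean_series(samples, [0, 0, 1]))
```

Plotting each category's mean series per variable gives the kind of per-type dosage curve that the pattern analysis relies on.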

According to the above analysis method, the latent patterns of the two-drug medication data of the patients in this example are shown in FIG. 4 and fall into four types: (1) Type A, ultra-low-dose administration: the dosage of both drugs is about 0, and these cases account for 32.3% of all cases; (2) Type B, low-dose administration: the dosage of OPI is less than 30 percent and the dosage of BZD is less than 2 percent, and these cases account for 57.5% of all cases; (3) Type C, low-dose BZD with ultra-high-dose OPI administration: the dosage range of OPI is (30, 50) and that of BZD is (13, 19), and these cases account for 5.0% of all cases; (4) Type D, high-dose administration: the dosage of OPI is more than 220 percent and the dosage of BZD is more than 5 percent, and these cases account for 5.2% of all cases.

The above describes only preferred embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A multivariate time series data clustering method is characterized by comprising the following steps:

S50, clustering the set of new feature sequences of the sample data;

2. The method of claim 1, wherein the autoencoder model comprises an encoder and a decoder, the input data is encoded to extract new feature values, which are then decoded to obtain an output, and the optimization target of the model is for the output to equal the input. The sparse autoencoder adds a sparsity term to the optimization function to constrain the model parameters during training, and the training process is as follows:

3. The method as claimed in claim 1, wherein the cluster number K of the multivariate time series data is obtained by the following steps:

where W_kn denotes W_k for a specific value of n.

(3) repeating the second step and the third step for all K values in the range and selecting the K value at which W_k drops fastest as the optimal number of clusters. The calculation formula is as follows:

4. The multivariate time series data clustering method as claimed in claim 1, wherein the validation comprises deleting data whose proportion of missing values exceeds 80%, and the normalization comprises mapping the data of each medication type of a case to the interval (0,1) using min-max normalization, the specific formula being as follows:

5. The method according to claim 1, wherein the new feature sequences of all sample data are clustered by the following steps:

(3) the second step is repeated until the value of K is satisfied.