CN114139624A - Method for mining time series data similarity information based on integrated model - Google Patents

Method for mining time series data similarity information based on integrated model

Info

Publication number
CN114139624A
CN114139624A (application CN202111438131.4A)
Authority
CN
China
Prior art keywords
data
time sequence
series data
time
time series
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111438131.4A
Other languages
Chinese (zh)
Inventor
杨旭 (Yang Xu)
王淼 (Wang Miao)
雷云霖 (Lei Yunlin)
蔡建 (Cai Jian)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202111438131.4A priority Critical patent/CN114139624A/en
Publication of CN114139624A publication Critical patent/CN114139624A/en
Pending legal-status Critical Current

Classifications

    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06F 18/295 Markov models or related models, e.g. semi-Markov models; Markov random fields; networks embedding Markov models
    • G06N 20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G06N 3/04 Neural networks; architecture, e.g. interconnection topology
    • G06N 3/08 Neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A method for mining time series data similarity information based on an integrated model combines a hidden Markov model with a Wasserstein distance-based conditional variational autoencoder model. The method establishes an input layer that performs preliminary processing on the input time series; a hidden Markov classification layer and a conditional variational autoencoder layer then learn and classify the input data independently, so the two can be trained in parallel. After learning and further optimization, the two classification layers are fused through a Stacking algorithm. The Wasserstein distance is used in place of the KL divergence to measure the distance between two time series, giving the classifier wider applicability. The method mines similarity information from both the hidden states and the distribution of a time series, and fuses all of the mined information, making model learning more effective and computation more efficient.

Description

Method for mining time series data similarity information based on integrated model
Technical Field
The invention belongs to the technical field of data mining and machine learning, and particularly relates to a method for mining time series data similarity information based on an integrated model.
Background
In time series data mining, similarity information is among the most critical information and is one of the starting points of data mining. However, many current time series mining algorithms discard the similarity information carried by the data distribution and compute similarity from the raw values alone. Similarity mining that relies only on the raw values loses information: features implicitly contained in the time series are dropped, the learning effect suffers, and the learned distribution can differ substantially from the true distribution. Algorithms that exploit time series distribution information are currently lacking; distribution similarity is a well-studied problem in statistics, but its use in mining time series data has not been widely discussed.
Disclosure of Invention
In order to overcome the disadvantages of the prior art, the present invention provides a method for mining time series data similarity information based on an integrated model. The method integrates a hidden Markov classifier, which mines the hidden-state information of time series data, with a Wasserstein distance-based conditional variational autoencoder classifier, which mines the distribution similarity information of time series data, and classifies the time series data by learning the mined information with the integrated model. The invention not only classifies time series data effectively, but also fuses the discrete and continuous information of the time series, making learning more effective; the base learners can be trained in parallel, so computation is also more efficient.
In order to achieve this purpose, the invention adopts the following technical scheme:
A method for mining time series distribution similarity information that integrates a hidden Markov model and a Wasserstein distance-based conditional variational autoencoder, comprising the following steps:
Step 1: Process the original time series data to obtain processed time series data. The original time series data refer to directly acquired, unclassified time series data, which may fall into one or more categories. Specifically:
Step 1.1: Classify the original time series data: measure the original time series with the Jaccard distance and cluster time series whose distances are close, obtaining the classified time series data;
Step 1.2: Convert a time series A in one class of the classified time series data obtained in step 1.1 into a signature vector sig(A) using MinHash functions;
Step 1.3: After sig(A) of step 1.2 is obtained, divide sig(A) into different segments, each segment carrying a segment signature;
Step 1.4: Repeat steps 1.1 to 1.3 for all the classified time series data of step 1.1 to obtain the segment signatures of all classified time series data; determine the similarity of the classified time series data according to identical segment signatures, delete data whose segment signatures differ, and complete the data preprocessing, obtaining the processed time series data.
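A minimal Python sketch of this preprocessing (steps 1.2 to 1.4) follows; the tokenization, hash construction, and parameters (num_hashes, num_bands) are illustrative assumptions, not part of the patent.

```python
# MinHash signatures plus LSH banding, sketching steps 1.2 to 1.4.
import hashlib
from collections import defaultdict

def minhash_signature(series, num_hashes=12):
    """Build a MinHash signature sig(A), treating the series values as a
    set of discretized tokens (one simple choice among many)."""
    tokens = {round(float(v), 2) for v in series}
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{tok}".encode()).hexdigest(), 16)
            for tok in tokens))
    return sig

def lsh_buckets(signatures, num_bands=6):
    """Split each signature into segments (bands); series sharing a band
    value (the 'segment signature') are kept as similar candidates."""
    buckets = defaultdict(list)
    rows = len(next(iter(signatures.values()))) // num_bands
    for sid, sig in signatures.items():
        for b in range(num_bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].append(sid)
    return buckets

# Series that never share a bucket with another series correspond to the
# data with differing segment signatures that step 1.4 deletes.
```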
Step 2: Establish a basic classification layer and input the processed time series data into the plurality of weak classifiers in the basic classification layer for preliminary classification. The basic classification layer comprises two models, namely a hidden Markov weak classifier obtained from a hidden Markov model and a conditional variational autoencoder weak classifier obtained from a Wasserstein distance-based conditional variational autoencoder; the basic classification layer outputs a new data set of the same size as the input data. Specifically:
Step 2.1: Input the processed time series data obtained in step 1 into the hidden Markov classification model of the basic classification layer, solve for the parameters using the forward-backward algorithm and the Baum-Welch algorithm, and decode using the Viterbi algorithm to obtain the hidden Markov weak classifier. Solving for the parameters specifically comprises the following steps:
Step 2.1.1: For one processed time series O = {o_1, o_2, o_3, …, o_T}, use the forward-backward algorithm to calculate its occurrence probability P(O | λ) under the hidden Markov classifier λ = (A, B, Π), where o_1, o_2, o_3, …, o_T are the values of the processed time series from time 1 to time T, A is the hidden state transition probability matrix, B is the observation generation probability matrix (the observations being the values of the processed time series), and Π is the initial probability distribution of the hidden states;
Step 2.1.2: For D processed time series {(O_1), (O_2), …, (O_D)}, calculate the parameters A, B, and Π of the hidden Markov classifier using the Baum-Welch algorithm, where (O_i), i = 1, 2, …, D, denotes the i-th processed time series;
Step 2.1.3: For the hidden Markov classifier λ = (A, B, Π), use the Viterbi algorithm to compute the most likely hidden state sequence I* = {i*_1, i*_2, …, i*_T} of the processed time series O = {o_1, o_2, o_3, …, o_T}, where i*_t denotes the hidden state of the value o_t of the processed time series O at time t.
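The following sketch illustrates the forward-algorithm likelihood of step 2.1.1 and the Viterbi decoding of step 2.1.3 for a discrete-observation HMM; the Baum-Welch estimation of step 2.1.2 is omitted, and all shapes and names are assumptions for illustration (A is (S, S), B is (S, V), Pi is (S,), obs holds symbol indices).

```python
import numpy as np

def forward_log_likelihood(obs, A, B, Pi):
    """Scaled forward algorithm: returns log P(O | lambda)."""
    alpha = Pi * B[:, obs[0]]
    c = alpha.sum()
    log_p, alpha = np.log(c), alpha / c
    for t in range(1, len(obs)):
        alpha = (alpha @ A) * B[:, obs[t]]   # induction step
        c = alpha.sum()                      # rescale to avoid underflow
        log_p, alpha = log_p + np.log(c), alpha / c
    return log_p

def viterbi(obs, A, B, Pi):
    """Most likely hidden state sequence I* (step 2.1.3)."""
    S, T = A.shape[0], len(obs)
    delta = np.log(Pi) + np.log(B[:, obs[0]])
    psi = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        trans = delta[:, None] + np.log(A)   # trans[i, j]: best path ending at i, moving to j
        psi[t] = trans.argmax(axis=0)
        delta = trans.max(axis=0) + np.log(B[:, obs[t]])
    states = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):            # backtrack
        states.append(int(psi[t, states[-1]]))
    return states[::-1]
```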
Step 2.2: Input the processed time series data obtained in step 1 into the conditional variational autoencoder of the basic classification layer, and calculate the Wasserstein distance between time series using the Sinkhorn approximation algorithm to obtain the conditional variational autoencoder weak classifier. Constructing the conditional variational autoencoder weak classifier comprises the following steps:
Step 2.2.1: Sample one class of the processed time series data to obtain a time series sample O, and output the statistics μ and σ² of a normal distribution through a neural network encoder, where μ denotes the mean and σ² the variance of the normal distribution;
Step 2.2.2: Sample the standard normal distribution N(0, 1) to obtain a sample ε. Apply formula 1 to the encoder outputs μ and σ² of step 2.2.1 and the sample ε to obtain z:

z = μ + σ · ε    (formula 1)

where z obeys the normal distribution N(μ, σ²);
Step 2.2.3: Pass z through a neural network decoder, which outputs data Ô with the same dimensions as the processed time series data.
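A one-function sketch of the reparameterization in formula 1 follows; taking log(σ²) as the encoder output is a common convention assumed here, not specified by the patent.

```python
import numpy as np

def reparameterize(mu, log_var, rng=np.random.default_rng(0)):
    eps = rng.standard_normal(np.shape(mu))   # sample eps from N(0, 1)
    return mu + np.exp(0.5 * log_var) * eps   # z = mu + sigma * eps, so z ~ N(mu, sigma^2)
```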
Step 2.2.4: Use the Wasserstein distance between the processed time series sample O and the neural network decoder output Ô as part of the optimization objective L. Optimize this objective over multiple iterations to obtain the trained neural network decoder, where calculating the Wasserstein distance comprises the following steps:
step 2.2.4.1: pair of processed time series data samples O and neural network decoder output data by introducing entropy regularization term
Figure BDA0003382081730000041
And performing dimension reduction smoothing treatment, wherein an entropy regular function is as follows:
Figure BDA0003382081730000042
Figure BDA0003382081730000043
wherein p (x) represents a distribution function of the processed time series data, p (x)k) X represents the time-series data after processing at time kkThe probability of (d);
step 2.2.4.2: the Wasserstein distance was calculated using the Sinkhorn approximation algorithm to simplify the amount of computation. The calculation formula of the Wasserstein distance obtained by combining the entropy regular function of the step 2.2.4.1 is formula 2:
Figure BDA0003382081730000044
wherein the content of the first and second substances,
Figure BDA0003382081730000045
representing processed time series data O and neural network decoder output data
Figure BDA0003382081730000046
The distance of the Wasserstein of (1),
Figure BDA0003382081730000047
indicates at time n, from
Figure BDA0003382081730000048
Transfer to onThe cost function of (2);
step 2.2.4.3: integrating the Wasserstein distance into the optimization target error epsilon to obtain the expression of an optimization target, namely a formula 3;
Figure BDA0003382081730000049
wherein the content of the first and second substances,
Figure BDA00033820817300000410
is represented by
Figure BDA00033820817300000411
The reconstruction error of the reconstruction O is calculated in the formula 4.
Figure BDA00033820817300000412
Wherein the content of the first and second substances,
Figure BDA00033820817300000413
the result of the representation of the reconstructed O is
Figure BDA00033820817300000414
The probability of (d);
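A minimal Sinkhorn sketch corresponding to steps 2.2.4.1 and 2.2.4.2 follows, computing an entropy-regularized approximation of W(O, Ô) between two one-dimensional series viewed as uniform empirical distributions; the absolute-difference cost, γ, and the iteration count are illustrative assumptions.

```python
import numpy as np

def sinkhorn_wasserstein(o, o_hat, gamma=0.1, n_iter=200):
    a = np.full(len(o), 1.0 / len(o))           # uniform weights on O
    b = np.full(len(o_hat), 1.0 / len(o_hat))   # uniform weights on O_hat
    C = np.abs(o[:, None] - o_hat[None, :])     # pairwise cost |o_m - o_hat_n|
    K = np.exp(-C / gamma)                      # Gibbs kernel from entropy term
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):                     # Sinkhorn fixed-point updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]             # approximate transport plan
    return float((P * C).sum())                 # approximate W(O, O_hat)
```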
step 2.2.5: and inputting the processed time series data into the neural network encoder and the trained neural network decoder in the step 2.2.4, and outputting the generated time series data which is approximately consistent with the distribution of the processed time series data.
Step 3: Establish an integrated fusion layer, use the output of the basic classification layer as the input of the integrated fusion layer, and obtain a secondary learner through ensemble learning training that fuses the plurality of weak classifiers in the basic classification layer, so as to obtain the final integrated model;
Step 3.1: Input the processed time series data into the two weak classifiers in the basic classification layer and collect the output data as an integrated training data set; the number of data categories of the integrated training data set is consistent with that of the processed time series data.
Step 3.2: Construct a secondary learner and use the integrated training data set collected in step 3.1 as its training data, so that the secondary learner learns the output of the basic classification layer. Constructing the secondary learner comprises the following steps:
Step 3.2.1: Use a support vector machine classifier as the secondary learner and construct one support vector machine between every pair of sample classes for classification. Thus, if the number of data categories of the integrated training data set is k, k(k-1)/2 support vector machine classifiers need to be constructed;
Step 3.2.2: Input the integrated training data set to train the support vector machine classifiers. After training, for a sample of unknown class, tally the class predicted by each support vector machine classifier and output the class with the most votes as the class of the unknown sample;
Step 3.3: Take the output of the secondary learner as the final output of the integrated fusion layer.
Step 4: Mine the similarity information of the time series data using the obtained integrated model.
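A compact sketch of the step-3 fusion follows: the base-layer outputs become features for a one-vs-one SVM secondary learner. The feature layout is an assumption; scikit-learn's SVC is used here because it already implements the k(k-1)/2 one-vs-one scheme with majority voting internally, matching step 3.2.

```python
import numpy as np
from sklearn.svm import SVC

def train_fusion_layer(hmm_out, cvae_out, labels):
    meta_X = np.column_stack([hmm_out, cvae_out])  # integrated training set
    meta_clf = SVC(kernel="rbf", decision_function_shape="ovo")
    meta_clf.fit(meta_X, labels)
    return meta_clf

# Classification of unknown samples then reads:
#   meta_clf.predict(np.column_stack([hmm_out_new, cvae_out_new]))
```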
Compared with the prior art, the beneficial effects of the invention are:
1) The method extracts hidden-variable-based similarity features of time series data, fuses the discrete and continuous information of the time series for data mining, and fills a gap in existing time series data mining methods.
2) The invention introduces a Wasserstein distance-based conditional variational autoencoder, replacing the KL divergence used for measurement in the original model with the Wasserstein distance and computing it approximately with the Sinkhorn algorithm, so that the latent variable fits a wider range of data distributions while saving computational resources.
3) The invention introduces ensemble learning: using a Stacking fusion optimization algorithm, it integrates the hidden Markov model that mines the hidden-state information of the time series with the Wasserstein distance-based conditional variational autoencoder model that mines the distribution similarity information of the time series. This reduces redundancy, lets the base learners compensate for each other's weaknesses, and improves classification accuracy and computational efficiency.
4) The method can be used for time series anomaly detection and traffic flow prediction. For anomaly detection, time series from normal conditions are input into the integrated model as training data; the data to be examined are then input into the model for classification, which indicates whether they are anomalous. For traffic flow prediction, labeled traffic flow data are input into the integrated model for learning; after learning, the data to be examined are input to judge whether the traffic is congested. This remedies the slow response of other methods to sudden events and improves the robustness of the model.
Drawings
Fig. 1 is an overall structural view of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the drawings and examples.
As shown in FIG. 1, the method for mining time series data similarity information based on an integrated model constructs an input layer, a basic classification layer, and an integrated fusion layer. The original data are input into the input layer and preprocessed to obtain sample data. The basic classification layer comprises two weak classifiers: a hidden Markov weak classifier and a Wasserstein distance-based conditional variational autoencoder weak classifier. The processed time series data are input into the two weak classifiers in parallel for learning and classification. One hidden Markov weak classifier is trained per class of data, giving n classifiers λ_1, λ_2, …, λ_n; a processed time series is input into all n classifiers to obtain n probabilities p_1, p_2, …, p_n, and the class label of the maximum value among all results is taken as the final classification result of the hidden Markov weak classifier. While the hidden Markov weak classifier is trained, the processed time series data are also input into the Wasserstein distance-based conditional variational autoencoder weak classifier: a sample x is drawn from the input, and the neural network encoder maps x to the sufficient statistics of the normal distribution N(μ, σ²), namely the mean μ and the variance σ²; z is then sampled from N(μ, σ²), and the neural network decoder maps z to an output x̂, so that the trained decoder generates new samples x̂ under the same distribution. Finally, the outputs of the two weak classifiers serve as the input of the integrated fusion layer. The ensemble algorithm combines information from both directions during data mining, so the model mines more information and has stronger learning ability; meanwhile, the base learners can be trained in parallel, giving higher computational efficiency.
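A sketch of the per-class hidden Markov voting just described, reusing the forward_log_likelihood sketch from step 2.1; the models dictionary is a hypothetical container mapping each class label to its (A, B, Pi) parameter triple.

```python
def hmm_classify(obs, models):
    # Score the series under every per-class HMM and pick the best class.
    scores = {label: forward_log_likelihood(obs, A, B, Pi)
              for label, (A, B, Pi) in models.items()}
    return max(scores, key=scores.get)  # class label of the largest p_i
```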
Referring to FIG. 1, taking the mining of time series data as an example, the method comprises the following steps:
Step 1: Preprocess the original time series data to obtain processed time series data. The original time series data refer to directly acquired, unclassified time series data, which may fall into one or more categories. Specifically:
Step 1.1: Classify the original time series data: measure the original time series with the Jaccard distance and cluster time series whose distances are close, obtaining the classified time series data;
Step 1.2: Convert a time series A in one class of the classified time series data obtained in step 1.1 into a signature vector sig(A) using MinHash functions;
Step 1.3: After sig(A) of step 1.2 is obtained, divide sig(A) into different segments, each segment carrying a segment signature.
Step 1.4: Repeat steps 1.1 to 1.3 for all the classified time series data of step 1.1 to obtain the segment signatures of all classified time series data; determine the similarity of the classified time series data according to identical segment signatures, delete data whose segment signatures differ, and complete the data preprocessing, obtaining the processed time series data.
Step 2: Establish a basic classification layer and input the processed time series data into the plurality of weak classifiers in the basic classification layer for preliminary classification. The basic classification layer comprises two models, namely a hidden Markov weak classifier obtained from a hidden Markov model and a conditional variational autoencoder weak classifier obtained from a Wasserstein distance-based conditional variational autoencoder; the basic classification layer outputs a new data set of the same size as the input data. Specifically:
Step 2.1: Input the processed time series data obtained in step 1 into the hidden Markov classification model of the basic classification layer, solve for the parameters using the forward-backward algorithm and the Baum-Welch algorithm, and decode using the Viterbi algorithm to obtain the hidden Markov weak classifier. Solving for the parameters specifically comprises the following steps:
Step 2.1.1: For one processed time series O = {o_1, o_2, o_3, …, o_T}, use the forward-backward algorithm to calculate its occurrence probability P(O | λ) under the hidden Markov classifier λ = (A, B, Π), where o_1, o_2, o_3, …, o_T are the values of the processed time series from time 1 to time T, A is the hidden state transition probability matrix, B is the observation generation probability matrix (the observations being the values of the processed time series), and Π is the initial probability distribution of the hidden states;
Step 2.1.2: For D processed time series {(O_1), (O_2), …, (O_D)}, calculate the parameters A, B, and Π of the hidden Markov classifier using the Baum-Welch algorithm, where (O_i), i = 1, 2, …, D, denotes the i-th processed time series;
Step 2.1.3: For the hidden Markov classifier λ = (A, B, Π), use the Viterbi algorithm to compute the most likely hidden state sequence I* = {i*_1, i*_2, …, i*_T} of the processed time series O = {o_1, o_2, o_3, …, o_T}, where i*_t denotes the hidden state of the value o_t of the processed time series O at time t.
Step 2.2: Input the processed time series data obtained in step 1 into the conditional variational autoencoder of the basic classification layer, and calculate the Wasserstein distance between time series using the Sinkhorn approximation algorithm to obtain the conditional variational autoencoder weak classifier. Constructing the conditional variational autoencoder weak classifier comprises the following steps:
Step 2.2.1: Sample one class of the processed time series data to obtain a time series sample O, and output the statistics μ and σ² of a normal distribution through a neural network encoder, where μ denotes the mean and σ² the variance of the normal distribution;
Step 2.2.2: Sample the standard normal distribution N(0, 1) to obtain a sample ε. Apply formula 1 to the encoder outputs μ and σ² of step 2.2.1 and the sample ε to obtain z:

z = μ + σ · ε    (formula 1)

where z obeys the normal distribution N(μ, σ²);
Step 2.2.3: Pass z through a neural network decoder, which outputs data Ô with the same dimensions as the processed time series data.
Step 2.2.4: Use the Wasserstein distance between the processed time series sample O and the neural network decoder output Ô as part of the optimization objective L. Optimize this objective over multiple iterations to obtain the trained neural network decoder, where calculating the Wasserstein distance comprises the following steps:
step 2.2.4.1: pair of processed time series data samples O and neural network decoder output data by introducing entropy regularization term
Figure BDA0003382081730000091
And performing dimension reduction smoothing treatment, wherein an entropy regular function is as follows:
Figure BDA0003382081730000092
Figure BDA0003382081730000093
wherein p (x) represents a distribution function of the processed time series data, p (x)k) X represents the time-series data after processing at time kkThe probability of (d);
step 2.2.4.2: the Wasserstein distance was calculated using the Sinkhorn approximation algorithm to simplify the amount of computation. The calculation formula of the Wasserstein distance obtained by combining the entropy regular function of the step 2.2.4.1 is formula 2:
Figure BDA0003382081730000094
wherein the content of the first and second substances,
Figure BDA0003382081730000095
representing processed time series data O and neural network decoder output data
Figure BDA0003382081730000096
The distance of the Wasserstein of (1),
Figure BDA0003382081730000097
indicates at time n, from
Figure BDA0003382081730000098
Transfer to onThe cost function of (2);
step 2.2.4.3: integrating the Wasserstein distance into the optimization target error epsilon to obtain the expression of an optimization target, namely a formula 3;
Figure BDA0003382081730000099
wherein the content of the first and second substances,
Figure BDA00033820817300000910
is represented by
Figure BDA00033820817300000911
The reconstruction error of the reconstruction O is calculated in the formula 4.
Figure BDA00033820817300000912
Wherein the content of the first and second substances,
Figure BDA00033820817300000913
the result of the representation of the reconstructed O is
Figure BDA00033820817300000914
The probability of (d);
step 2.2.5: and inputting the processed time series data into the neural network encoder and the trained neural network decoder in the step 2.2.4, and outputting the generated time series data which is approximately consistent with the distribution of the processed time series data.
Step 3: Establish an integrated fusion layer, use the output of the basic classification layer as the input of the integrated fusion layer, and obtain a secondary learner through ensemble learning training that fuses the plurality of weak classifiers in the basic classification layer, so as to obtain the final integrated model;
Step 3.1: Input the processed time series data into the two weak classifiers in the basic classification layer and collect the output data as an integrated training data set; the number of data categories of the integrated training data set is consistent with that of the processed time series data.
Step 3.2: Construct a secondary learner and use the integrated training data set collected in step 3.1 as its training data, so that the secondary learner learns the output of the basic classification layer. Constructing the secondary learner comprises the following steps:
Step 3.2.1: Use a support vector machine classifier as the secondary learner and construct one support vector machine between every pair of sample classes for classification. Therefore, if the number of data classes of the integrated training data set is k, k(k-1)/2 support vector machine classifiers need to be constructed;
Step 3.2.2: Input the integrated training data set to train the support vector machine classifiers. After training, for a sample of unknown class, tally the class predicted by each support vector machine classifier and output the class with the most votes as the class of the unknown sample;
Step 3.3: Take the output of the secondary learner as the final output of the integrated fusion layer.
Step 4: Mine the similarity information of the time series data using the obtained integrated model.
The following is a specific application case of the invention in website identification. Traffic time series from five websites serve as the experimental data; the integrated model mines the similarity of the time series to perform website identification and confirms the category of website traffic time series of unknown category.
In this example, the original time series data form a 2525 × 1024 array: 2525 time series, each covering 1024 acquisition times, the value at each time being the corresponding website traffic volume; the original data comprise the traffic time series of 5 websites. First, the original time series data are preprocessed; in this example a locality-sensitive hashing algorithm is chosen to delete part of the data: on top of the MinHash-processed data, each signature vector is divided into six segments, the data with the highest similarity are retained, and data with low similarity are deleted, reducing noise interference. Then, 200 time series per website class are selected as the original time series data to obtain the processed time series data. Next, the processed time series data are input by category into step 2 to construct the basic classification layer. The processed time series data are input into the hidden Markov weak classifier of step 2.1 for hidden-information mining; in this example the input size is 1024 × 1. In parallel, the processed time series data are input into the Wasserstein distance-based conditional variational autoencoder weak classifier of step 2.2 to learn the distribution of the input time series; the input size in this example is likewise 1024 × 1. Note that, as described in step 2.2.1, sampling must keep the samples uniformly distributed, so the same number of processed time series is collected for each class. The sampled time series data are then learned according to steps 2.2.2 through 2.2.5. After learning, the output of the basic classification layer serves as the input of the integrated fusion layer, which is trained according to step 3; once the integrated fusion layer has finished learning, it is used for website identification. The traffic time series of an unknown website is input into the integrated fusion layer, and the resulting output classification is the category of the unknown website.
The invention may also be used in other industrial scenarios, such as traffic flow prediction. Note that traffic flow prediction is a regression problem rather than a classification problem, so the output of the integrated model needs slight modification: the distribution of the traffic flow data is learned by the Wasserstein distance-based conditional variational autoencoder, and the predicted flow for a given time is output. The specific steps are as follows. First, the original time series data are traffic flow time series for a given location; the length of the original time series data indicates the number of times included in the series, and the width indicates the traffic flow at each time. Second, the original time series data are preprocessed to obtain processed time series data. Next, the processed time series data are input into the basic classification layer for classification information mining to obtain the output data. Then, the times output by the basic classification layer serve as the input of the secondary learner in the integrated fusion layer, which learns the traffic flow output by the basic classification layer; the output of the secondary learner is the traffic flow, and it serves as the output of the integrated fusion layer. Finally, the time to be predicted is input into the integrated fusion layer, and the output is the predicted traffic flow. Prediction is thus performed by mining the similarity information of the traffic flow time series.
Meanwhile, the invention can also be used for time series of other dimensionalities, for example identifying a voice source. Voice time series data comprise three dimensions: pronunciation time, pronunciation duration, and pronunciation interval, one dimension more than the website traffic and traffic flow time series in the cases above. The voice time series data can be input directly into the basic classification layer for classification learning; the output of the basic classification layer then serves as the input of the integrated fusion layer for classification learning; finally, voice time series of unknown class are input into the integrated fusion layer, and the output class is the class of the unknown voice time series. In this way the similarity of voice time series from the same source is used to identify the voice source.

Claims (8)

1. A method for mining time series data similarity information based on an integrated model, characterized by comprising the following steps:
step 1: processing original time series data and dividing it into one or more categories, wherein the original time series data refer to directly acquired, unclassified time series data;
step 2: establishing a basic classification layer and inputting the processed time series data into a plurality of weak classifiers in the basic classification layer for preliminary classification; the basic classification layer comprises two models, namely a hidden Markov weak classifier obtained by utilizing a hidden Markov model and a conditional variational autoencoder weak classifier obtained by utilizing a Wasserstein distance-based conditional variational autoencoder; the basic classification layer outputs a new data set with the same size as the input data;
step 3: establishing an integrated fusion layer, using the output of the basic classification layer as the input of the integrated fusion layer, and obtaining a secondary learner through ensemble learning training that fuses the plurality of weak classifiers in the basic classification layer, so as to obtain the final integrated model;
step 4: mining the similarity information of the time series data by using the obtained integrated model.
2. The method for mining time series data similarity information based on an integrated model according to claim 1, wherein step 1 specifically comprises the following steps:
step 1.1: classifying the original time series data, measuring the original time series by using the Jaccard distance, and clustering the time series whose distances are close to obtain the classified time series data;
step 1.2: converting a time series A in a class of the classified time series data into a signature vector sig(A) by using MinHash functions;
step 1.3: dividing sig(A) into different segments, each segment carrying a segment signature;
step 1.4: repeating steps 1.2 to 1.3 for all the classified time series data to obtain the segment signatures of all the classified time series data, determining the similarity of the classified time series data according to identical segment signatures, deleting data whose segment signatures differ, and completing the data processing.
3. The method for mining time series data similarity information based on an integrated model according to claim 1, wherein establishing the basic classification layer in step 2 specifically comprises the following steps:
step 2.1: inputting the processed time series data into the hidden Markov classification model of the basic classification layer, solving for the parameters by using the forward-backward algorithm and the Baum-Welch algorithm, and decoding by using the Viterbi algorithm to obtain the hidden Markov weak classifier;
step 2.2: inputting the processed time series data into the conditional variational autoencoder of the basic classification layer, and calculating the Wasserstein distance between time series by using the Sinkhorn approximation algorithm to obtain the conditional variational autoencoder weak classifier.
4. The method for mining time series data similarity information based on an integrated model according to claim 3, wherein constructing the hidden Markov weak classifier in step 2.1 specifically comprises the following steps:
step 2.1.1: for one processed time series O = {o_1, o_2, o_3, …, o_T}, calculating its occurrence probability P(O | λ) under the hidden Markov classifier λ = (A, B, Π) by using the forward-backward algorithm, where o_1, o_2, o_3, …, o_T are the values of the processed time series from time 1 to time T, A is the hidden state transition probability matrix, B is the observation generation probability matrix, the observations being the values of the processed time series, and Π is the initial probability distribution of the hidden states;
step 2.1.2: for D processed time series {(O_1), (O_2), …, (O_D)}, calculating the parameters A, B, and Π of the hidden Markov classifier λ by using the Baum-Welch algorithm, where (O_i), i = 1, 2, …, D, denotes the i-th processed time series;
step 2.1.3: for the hidden Markov classifier λ = (A, B, Π), computing the most likely hidden state sequence I* = {i*_1, i*_2, …, i*_T} of the processed time series O = {o_1, o_2, o_3, …, o_T} by using the Viterbi algorithm, where i*_t denotes the hidden state of the value o_t of the processed time series O at time t.
5. The method for mining time series data similarity information based on an integrated model according to claim 4, wherein constructing the conditional variational autoencoder weak classifier in step 2.2 specifically comprises the following steps:
step 2.2.1: sampling one class of the processed time series data to obtain a time series sample O, and outputting the statistics μ and σ² of a normal distribution through a neural network encoder, where μ denotes the mean and σ² the variance of the normal distribution;
step 2.2.2: sampling the standard normal distribution N(0, 1) to obtain a sample ε, and applying formula 1 to the encoder outputs μ and σ² of step 2.2.1 and the sample ε to obtain z:

z = μ + σ · ε    (formula 1)

where z obeys the normal distribution N(μ, σ²);
step 2.2.3: passing z through a neural network decoder to output data Ô with the same dimensions as the processed time series data;
step 2.2.4: using the Wasserstein distance between the processed time series sample O and the neural network decoder output Ô as part of the optimization objective L, and optimizing the objective over multiple iterations to obtain the trained neural network decoder;
step 2.2.5: inputting the processed time series data into the neural network encoder and the trained neural network decoder of step 2.2.4, and outputting generated time series data whose distribution is approximately consistent with that of the processed time series data.
6. The method for mining time series data similarity information based on an integrated model according to claim 5, wherein calculating the Wasserstein distance in step 2.2.4 specifically comprises the following steps:
step 2.2.4.1: applying a dimension-reducing smoothing treatment to the processed time series sample O and the decoder output Ô by introducing an entropy regularization term, the entropy regularization function being

H(p) = -Σ_k p(x_k) · log p(x_k)

where p(x) denotes the distribution function of the processed time series and p(x_k) denotes the probability of the processed time series value x_k at time k;
step 2.2.4.2: calculating the Wasserstein distance with the Sinkhorn approximation algorithm to reduce the amount of computation; combining the entropy regularization function of step 2.2.4.1, the Wasserstein distance is calculated by formula 2:

W(O, Ô) = min_P Σ_{m,n} P_{m,n} · c(ô_m, o_n) - γ · H(P)    (formula 2)

where W(O, Ô) denotes the Wasserstein distance between the processed time series O and the decoder output Ô, P denotes the transport plan, c(ô_m, o_n) denotes the cost function of transferring the reconstructed value ô_m to the value o_n at time n, and γ weights the entropy term;
step 2.2.4.3: integrating the Wasserstein distance into the optimization objective L to obtain the expression of the optimization objective, formula 3:

L = E_rec(Ô, O) + W(O, Ô)    (formula 3)

where E_rec(Ô, O) denotes the reconstruction error of reconstructing O as Ô, calculated by formula 4:

E_rec(Ô, O) = -log p(Ô | O)    (formula 4)

where p(Ô | O) denotes the probability that the result of reconstructing O is Ô.
7. The method for mining time series data similarity information based on an integrated model according to claim 1, wherein constructing the integrated fusion layer in step 3 comprises the following steps:
step 3.1: inputting the processed time series data into the two weak classifiers in the basic classification layer and collecting the output data as an integrated training data set, wherein the number of data categories of the integrated training data set is consistent with that of the processed time series data;
step 3.2: constructing a secondary learner and using the collected integrated training data set as its training data, so that the secondary learner learns the output of the basic classification layer;
step 3.3: taking the output of the secondary learner as the final output of the integrated fusion layer.
8. The method for mining time series data similarity information based on an integrated model according to claim 7, wherein constructing the secondary learner in step 3.2 comprises the following steps:
step 3.2.1: using a support vector machine classifier as the secondary learner and constructing one support vector machine between every pair of sample classes for classification; with k data categories in the integrated training data set, k(k-1)/2 support vector machine classifiers need to be constructed;
step 3.2.2: inputting the integrated training data set to train the support vector machine classifiers; after training, for a sample of unknown class, tallying the class predicted by each support vector machine classifier and outputting the class with the most votes as the class of the unknown sample.
CN202111438131.4A 2021-11-29 2021-11-29 Method for mining time series data similarity information based on integrated model Pending CN114139624A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111438131.4A CN114139624A (en) 2021-11-29 2021-11-29 Method for mining time series data similarity information based on integrated model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111438131.4A CN114139624A (en) 2021-11-29 2021-11-29 Method for mining time series data similarity information based on integrated model

Publications (1)

Publication Number Publication Date
CN114139624A 2022-03-04

Family

ID=80389582

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111438131.4A Pending CN114139624A (en) 2021-11-29 2021-11-29 Method for mining time series data similarity information based on integrated model

Country Status (1)

Country Link
CN (1) CN114139624A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115599984A * 2022-09-09 2023-01-13 Beijing Institute of Technology (CN) Retrieval method
CN115599984B * 2022-09-09 2023-06-09 Beijing Institute of Technology Retrieval method
CN116304358A * 2023-05-17 2023-06-23 Jinan Anxun Technology Co., Ltd. User data acquisition method
CN116304358B * 2023-05-17 2023-08-08 Jinan Anxun Technology Co., Ltd. User data acquisition method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination