CN114139624A - Method for mining time series data similarity information based on integrated model - Google Patents

Method for mining time series data similarity information based on integrated model

Info

Publication number
CN114139624A
CN114139624A (application CN202111438131.4A)
Authority
CN
China
Prior art keywords
data
time sequence
series data
time
time series
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111438131.4A
Other languages
Chinese (zh)
Inventor
杨旭 (Yang Xu)
王淼 (Wang Miao)
雷云霖 (Lei Yunlin)
蔡建 (Cai Jian)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202111438131.4A priority Critical patent/CN114139624A/en
Publication of CN114139624A publication Critical patent/CN114139624A/en
Pending legal-status Critical Current

Classifications

    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06F 18/295 Markov models or related models, e.g. semi-Markov models; Markov random fields; networks embedding Markov models
    • G06N 20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G06N 3/04 Neural networks; architecture, e.g. interconnection topology
    • G06N 3/08 Neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A method for mining time series data similarity information based on an integrated model combines a hidden Markov model with a Wasserstein distance-based conditional variational autoencoder model. The method establishes an input layer that performs preliminary processing on the input time series; a hidden Markov classification layer and a conditional variational autoencoder layer then learn and classify the input data independently, so the two can be trained in parallel. After learning and further optimization, the two classification layers are fused through a Stacking algorithm. The Wasserstein distance is used in place of the KL divergence to measure the distance between two time series, giving the classifier wider applicability. The method mines similarity information from both the hidden states and the distribution of a time series, and fuses all of the mined information, making model learning more effective and computation more efficient.

Description

Method for mining time series data similarity information based on integrated model
Technical Field
The invention belongs to the technical field of data mining and machine learning, and particularly relates to a method for mining time series data similarity information based on an integrated model.
Background
In time series data mining, similarity information is among the most critical information and is one of the starting points of data mining. However, many current time series mining algorithms discard the similarity information carried by the data distribution and compute similarity from the raw values alone. Similarity mining that relies only on the raw values loses information: features implicitly contained in the time series are dropped, the learning effect suffers, and the learned distribution can differ substantially from the true distribution. Algorithms that exploit time series distribution information are currently lacking; distribution similarity is a well-studied problem in statistics, but its use in mining time series data has not been widely discussed.
Disclosure of Invention
In order to overcome the disadvantages of the prior art, the present invention provides a method for mining time series data similarity information based on an integrated model. The method integrates a hidden Markov classifier, which mines the hidden-state information of time series data, with a Wasserstein distance-based conditional variational autoencoder classifier, which mines the distribution similarity information of time series data, and classifies the time series data by learning the mined information with the integrated model. The invention not only classifies time series data effectively, but also fuses the discrete and continuous information of the time series, making learning more effective; the base learners can be trained in parallel, so computation is also more efficient.
In order to achieve this purpose, the invention adopts the following technical scheme:
A method for mining time series distribution similarity information that integrates a hidden Markov model and a Wasserstein distance-based conditional variational autoencoder, comprising the following steps:
Step 1: Process the original time series data to obtain processed time series data. The original time series data refer to directly acquired, unclassified time series data, which may fall into one or more categories. Specifically:
Step 1.1: Classify the original time series data: measure the original time series with the Jaccard distance and cluster time series whose distances are close, obtaining the classified time series data;
Step 1.2: Convert a time series A in one class of the classified time series data obtained in step 1.1 into a signature vector sig(A) using MinHash functions;
Step 1.3: After sig(A) of step 1.2 is obtained, divide sig(A) into different segments, each segment carrying a segment signature;
Step 1.4: Repeat steps 1.1 to 1.3 for all the classified time series data of step 1.1 to obtain the segment signatures of all classified time series data; determine the similarity of the classified time series data according to identical segment signatures, delete data whose segment signatures differ, and complete the data preprocessing, obtaining the processed time series data.
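A minimal Python sketch of this preprocessing (steps 1.2 to 1.4) follows; the tokenization, hash construction, and parameters (num_hashes, num_bands) are illustrative assumptions, not part of the patent.

```python
# MinHash signatures plus LSH banding, sketching steps 1.2 to 1.4.
import hashlib
from collections import defaultdict

def minhash_signature(series, num_hashes=12):
    """Build a MinHash signature sig(A), treating the series values as a
    set of discretized tokens (one simple choice among many)."""
    tokens = {round(float(v), 2) for v in series}
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{tok}".encode()).hexdigest(), 16)
            for tok in tokens))
    return sig

def lsh_buckets(signatures, num_bands=6):
    """Split each signature into segments (bands); series sharing a band
    value (the 'segment signature') are kept as similar candidates."""
    buckets = defaultdict(list)
    rows = len(next(iter(signatures.values()))) // num_bands
    for sid, sig in signatures.items():
        for b in range(num_bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].append(sid)
    return buckets

# Series that never share a bucket with another series correspond to the
# data with differing segment signatures that step 1.4 deletes.
```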
Step 2: Establish a basic classification layer and input the processed time series data into the plurality of weak classifiers in the basic classification layer for preliminary classification. The basic classification layer comprises two models, namely a hidden Markov weak classifier obtained from a hidden Markov model and a conditional variational autoencoder weak classifier obtained from a Wasserstein distance-based conditional variational autoencoder; the basic classification layer outputs a new data set of the same size as the input data. Specifically:
Step 2.1: Input the processed time series data obtained in step 1 into the hidden Markov classification model of the basic classification layer, solve for the parameters using the forward-backward algorithm and the Baum-Welch algorithm, and decode using the Viterbi algorithm to obtain the hidden Markov weak classifier. Solving for the parameters specifically comprises the following steps:
Step 2.1.1: For one processed time series O = {o_1, o_2, o_3, …, o_T}, use the forward-backward algorithm to calculate its occurrence probability P(O | λ) under the hidden Markov classifier λ = (A, B, Π), where o_1, o_2, o_3, …, o_T are the values of the processed time series from time 1 to time T, A is the hidden state transition probability matrix, B is the observation generation probability matrix (the observations being the values of the processed time series), and Π is the initial probability distribution of the hidden states;
Step 2.1.2: For D processed time series {(O_1), (O_2), …, (O_D)}, calculate the parameters A, B, and Π of the hidden Markov classifier using the Baum-Welch algorithm, where (O_i), i = 1, 2, …, D, denotes the i-th processed time series;
Step 2.1.3: For the hidden Markov classifier λ = (A, B, Π), use the Viterbi algorithm to compute the most likely hidden state sequence I* = {i*_1, i*_2, …, i*_T} of the processed time series O = {o_1, o_2, o_3, …, o_T}, where i*_t denotes the hidden state of the value o_t of the processed time series O at time t.
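The following sketch illustrates the forward-algorithm likelihood of step 2.1.1 and the Viterbi decoding of step 2.1.3 for a discrete-observation HMM; the Baum-Welch estimation of step 2.1.2 is omitted, and all shapes and names are assumptions for illustration (A is (S, S), B is (S, V), Pi is (S,), obs holds symbol indices).

```python
import numpy as np

def forward_log_likelihood(obs, A, B, Pi):
    """Scaled forward algorithm: returns log P(O | lambda)."""
    alpha = Pi * B[:, obs[0]]
    c = alpha.sum()
    log_p, alpha = np.log(c), alpha / c
    for t in range(1, len(obs)):
        alpha = (alpha @ A) * B[:, obs[t]]   # induction step
        c = alpha.sum()                      # rescale to avoid underflow
        log_p, alpha = log_p + np.log(c), alpha / c
    return log_p

def viterbi(obs, A, B, Pi):
    """Most likely hidden state sequence I* (step 2.1.3)."""
    S, T = A.shape[0], len(obs)
    delta = np.log(Pi) + np.log(B[:, obs[0]])
    psi = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        trans = delta[:, None] + np.log(A)   # trans[i, j]: best path ending at i, moving to j
        psi[t] = trans.argmax(axis=0)
        delta = trans.max(axis=0) + np.log(B[:, obs[t]])
    states = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):            # backtrack
        states.append(int(psi[t, states[-1]]))
    return states[::-1]
```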
Step 2.2: Input the processed time series data obtained in step 1 into the conditional variational autoencoder of the basic classification layer, and calculate the Wasserstein distance between time series using the Sinkhorn approximation algorithm to obtain the conditional variational autoencoder weak classifier. Constructing the conditional variational autoencoder weak classifier comprises the following steps:
Step 2.2.1: Sample one class of the processed time series data to obtain a time series sample O, and output the statistics μ and σ² of a normal distribution through a neural network encoder, where μ denotes the mean and σ² the variance of the normal distribution;
Step 2.2.2: Sample the standard normal distribution N(0, 1) to obtain a sample ε. Apply formula 1 to the encoder outputs μ and σ² of step 2.2.1 and the sample ε to obtain z:

z = μ + σ · ε    (formula 1)

where z obeys the normal distribution N(μ, σ²);
Step 2.2.3: Pass z through a neural network decoder, which outputs data Ô with the same dimensions as the processed time series data.
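A one-function sketch of the reparameterization in formula 1 follows; taking log(σ²) as the encoder output is a common convention assumed here, not specified by the patent.

```python
import numpy as np

def reparameterize(mu, log_var, rng=np.random.default_rng(0)):
    eps = rng.standard_normal(np.shape(mu))   # sample eps from N(0, 1)
    return mu + np.exp(0.5 * log_var) * eps   # z = mu + sigma * eps, so z ~ N(mu, sigma^2)
```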
Step 2.2.4: Use the Wasserstein distance between the processed time series sample O and the neural network decoder output Ô as part of the optimization objective L. Optimize this objective over multiple iterations to obtain the trained neural network decoder, where calculating the Wasserstein distance comprises the following steps:
step 2.2.4.1: pair of processed time series data samples O and neural network decoder output data by introducing entropy regularization term
Figure BDA0003382081730000041
And performing dimension reduction smoothing treatment, wherein an entropy regular function is as follows:
Figure BDA0003382081730000042
Figure BDA0003382081730000043
wherein p (x) represents a distribution function of the processed time series data, p (x)k) X represents the time-series data after processing at time kkThe probability of (d);
step 2.2.4.2: the Wasserstein distance was calculated using the Sinkhorn approximation algorithm to simplify the amount of computation. The calculation formula of the Wasserstein distance obtained by combining the entropy regular function of the step 2.2.4.1 is formula 2:
Figure BDA0003382081730000044
wherein the content of the first and second substances,
Figure BDA0003382081730000045
representing processed time series data O and neural network decoder output data
Figure BDA0003382081730000046
The distance of the Wasserstein of (1),
Figure BDA0003382081730000047
indicates at time n, from
Figure BDA0003382081730000048
Transfer to onThe cost function of (2);
step 2.2.4.3: integrating the Wasserstein distance into the optimization target error epsilon to obtain the expression of an optimization target, namely a formula 3;
Figure BDA0003382081730000049
wherein the content of the first and second substances,
Figure BDA00033820817300000410
is represented by
Figure BDA00033820817300000411
The reconstruction error of the reconstruction O is calculated in the formula 4.
Figure BDA00033820817300000412
Wherein the content of the first and second substances,
Figure BDA00033820817300000413
the result of the representation of the reconstructed O is
Figure BDA00033820817300000414
The probability of (d);
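A minimal Sinkhorn sketch corresponding to steps 2.2.4.1 and 2.2.4.2 follows, computing an entropy-regularized approximation of W(O, Ô) between two one-dimensional series viewed as uniform empirical distributions; the absolute-difference cost, γ, and the iteration count are illustrative assumptions.

```python
import numpy as np

def sinkhorn_wasserstein(o, o_hat, gamma=0.1, n_iter=200):
    a = np.full(len(o), 1.0 / len(o))           # uniform weights on O
    b = np.full(len(o_hat), 1.0 / len(o_hat))   # uniform weights on O_hat
    C = np.abs(o[:, None] - o_hat[None, :])     # pairwise cost |o_m - o_hat_n|
    K = np.exp(-C / gamma)                      # Gibbs kernel from entropy term
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):                     # Sinkhorn fixed-point updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]             # approximate transport plan
    return float((P * C).sum())                 # approximate W(O, O_hat)
```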
step 2.2.5: and inputting the processed time series data into the neural network encoder and the trained neural network decoder in the step 2.2.4, and outputting the generated time series data which is approximately consistent with the distribution of the processed time series data.
Step 3: Establish an integrated fusion layer, use the output of the basic classification layer as the input of the integrated fusion layer, and obtain a secondary learner through ensemble learning training that fuses the plurality of weak classifiers in the basic classification layer, so as to obtain the final integrated model;
Step 3.1: Input the processed time series data into the two weak classifiers in the basic classification layer and collect the output data as an integrated training data set; the number of data categories of the integrated training data set is consistent with that of the processed time series data.
Step 3.2: Construct a secondary learner and use the integrated training data set collected in step 3.1 as its training data, so that the secondary learner learns the output of the basic classification layer. Constructing the secondary learner comprises the following steps:
Step 3.2.1: Use a support vector machine classifier as the secondary learner and construct one support vector machine between every pair of sample classes for classification. Thus, if the number of data categories of the integrated training data set is k, k(k-1)/2 support vector machine classifiers need to be constructed;
Step 3.2.2: Input the integrated training data set to train the support vector machine classifiers. After training, for a sample of unknown class, tally the class predicted by each support vector machine classifier and output the class with the most votes as the class of the unknown sample;
Step 3.3: Take the output of the secondary learner as the final output of the integrated fusion layer.
Step 4: Mine the similarity information of the time series data using the obtained integrated model.
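A compact sketch of the step-3 fusion follows: the base-layer outputs become features for a one-vs-one SVM secondary learner. The feature layout is an assumption; scikit-learn's SVC is used here because it already implements the k(k-1)/2 one-vs-one scheme with majority voting internally, matching step 3.2.

```python
import numpy as np
from sklearn.svm import SVC

def train_fusion_layer(hmm_out, cvae_out, labels):
    meta_X = np.column_stack([hmm_out, cvae_out])  # integrated training set
    meta_clf = SVC(kernel="rbf", decision_function_shape="ovo")
    meta_clf.fit(meta_X, labels)
    return meta_clf

# Classification of unknown samples then reads:
#   meta_clf.predict(np.column_stack([hmm_out_new, cvae_out_new]))
```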
Compared with the prior art, the beneficial effects of the invention are:
1) The method extracts hidden-variable-based similarity features of time series data, fuses the discrete and continuous information of the time series for data mining, and fills a gap in existing time series data mining methods.
2) The invention introduces a Wasserstein distance-based conditional variational autoencoder, replacing the KL divergence used for measurement in the original model with the Wasserstein distance and computing it approximately with the Sinkhorn algorithm, so that the latent variable fits a wider range of data distributions while saving computational resources.
3) The invention introduces ensemble learning: using a Stacking fusion optimization algorithm, it integrates the hidden Markov model that mines the hidden-state information of the time series with the Wasserstein distance-based conditional variational autoencoder model that mines the distribution similarity information of the time series. This reduces redundancy, lets the base learners compensate for each other's weaknesses, and improves classification accuracy and computational efficiency.
4) The method can be used for time series anomaly detection and traffic flow prediction. For anomaly detection, time series from normal conditions are input into the integrated model as training data; the data to be examined are then input into the model for classification, which indicates whether they are anomalous. For traffic flow prediction, labeled traffic flow data are input into the integrated model for learning; after learning, the data to be examined are input to judge whether the traffic is congested. This remedies the slow response of other methods to sudden events and improves the robustness of the model.
Drawings
Fig. 1 is an overall structural view of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the drawings and examples.
As shown in FIG. 1, the method for mining time series data similarity information based on an integrated model constructs an input layer, a basic classification layer, and an integrated fusion layer. The original data are input into the input layer and preprocessed to obtain sample data. The basic classification layer comprises two weak classifiers: a hidden Markov weak classifier and a Wasserstein distance-based conditional variational autoencoder weak classifier. The processed time series data are input into the two weak classifiers in parallel for learning and classification. One hidden Markov weak classifier is trained per class of data, giving n classifiers λ_1, λ_2, …, λ_n; a processed time series is input into all n classifiers to obtain n probabilities p_1, p_2, …, p_n, and the class label of the maximum value among all results is taken as the final classification result of the hidden Markov weak classifier. While the hidden Markov weak classifier is trained, the processed time series data are also input into the Wasserstein distance-based conditional variational autoencoder weak classifier: a sample x is drawn from the input, and the neural network encoder maps x to the sufficient statistics of the normal distribution N(μ, σ²), namely the mean μ and the variance σ²; z is then sampled from N(μ, σ²), and the neural network decoder maps z to an output x̂, so that the trained decoder generates new samples x̂ under the same distribution. Finally, the outputs of the two weak classifiers serve as the input of the integrated fusion layer. The ensemble algorithm combines information from both directions during data mining, so the model mines more information and has stronger learning ability; meanwhile, the base learners can be trained in parallel, giving higher computational efficiency.
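A sketch of the per-class hidden Markov voting just described, reusing the forward_log_likelihood sketch from step 2.1; the models dictionary is a hypothetical container mapping each class label to its (A, B, Pi) parameter triple.

```python
def hmm_classify(obs, models):
    # Score the series under every per-class HMM and pick the best class.
    scores = {label: forward_log_likelihood(obs, A, B, Pi)
              for label, (A, B, Pi) in models.items()}
    return max(scores, key=scores.get)  # class label of the largest p_i
```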
Referring to FIG. 1, taking the mining of time series data as an example, the method comprises the following steps:
Step 1: Preprocess the original time series data to obtain processed time series data. The original time series data refer to directly acquired, unclassified time series data, which may fall into one or more categories. Specifically:
Step 1.1: Classify the original time series data: measure the original time series with the Jaccard distance and cluster time series whose distances are close, obtaining the classified time series data;
Step 1.2: Convert a time series A in one class of the classified time series data obtained in step 1.1 into a signature vector sig(A) using MinHash functions;
Step 1.3: After sig(A) of step 1.2 is obtained, divide sig(A) into different segments, each segment carrying a segment signature.
Step 1.4: Repeat steps 1.1 to 1.3 for all the classified time series data of step 1.1 to obtain the segment signatures of all classified time series data; determine the similarity of the classified time series data according to identical segment signatures, delete data whose segment signatures differ, and complete the data preprocessing, obtaining the processed time series data.
Step 2: Establish a basic classification layer and input the processed time series data into the plurality of weak classifiers in the basic classification layer for preliminary classification. The basic classification layer comprises two models, namely a hidden Markov weak classifier obtained from a hidden Markov model and a conditional variational autoencoder weak classifier obtained from a Wasserstein distance-based conditional variational autoencoder; the basic classification layer outputs a new data set of the same size as the input data. Specifically:
Step 2.1: Input the processed time series data obtained in step 1 into the hidden Markov classification model of the basic classification layer, solve for the parameters using the forward-backward algorithm and the Baum-Welch algorithm, and decode using the Viterbi algorithm to obtain the hidden Markov weak classifier. Solving for the parameters specifically comprises the following steps:
Step 2.1.1: For one processed time series O = {o_1, o_2, o_3, …, o_T}, use the forward-backward algorithm to calculate its occurrence probability P(O | λ) under the hidden Markov classifier λ = (A, B, Π), where o_1, o_2, o_3, …, o_T are the values of the processed time series from time 1 to time T, A is the hidden state transition probability matrix, B is the observation generation probability matrix (the observations being the values of the processed time series), and Π is the initial probability distribution of the hidden states;
Step 2.1.2: For D processed time series {(O_1), (O_2), …, (O_D)}, calculate the parameters A, B, and Π of the hidden Markov classifier using the Baum-Welch algorithm, where (O_i), i = 1, 2, …, D, denotes the i-th processed time series;
Step 2.1.3: For the hidden Markov classifier λ = (A, B, Π), use the Viterbi algorithm to compute the most likely hidden state sequence I* = {i*_1, i*_2, …, i*_T} of the processed time series O = {o_1, o_2, o_3, …, o_T}, where i*_t denotes the hidden state of the value o_t of the processed time series O at time t.
Step 2.2: Input the processed time series data obtained in step 1 into the conditional variational autoencoder of the basic classification layer, and calculate the Wasserstein distance between time series using the Sinkhorn approximation algorithm to obtain the conditional variational autoencoder weak classifier. Constructing the conditional variational autoencoder weak classifier comprises the following steps:
Step 2.2.1: Sample one class of the processed time series data to obtain a time series sample O, and output the statistics μ and σ² of a normal distribution through a neural network encoder, where μ denotes the mean and σ² the variance of the normal distribution;
Step 2.2.2: Sample the standard normal distribution N(0, 1) to obtain a sample ε. Apply formula 1 to the encoder outputs μ and σ² of step 2.2.1 and the sample ε to obtain z:

z = μ + σ · ε    (formula 1)

where z obeys the normal distribution N(μ, σ²);
Step 2.2.3: Pass z through a neural network decoder, which outputs data Ô with the same dimensions as the processed time series data.
Step 2.2.4: Use the Wasserstein distance between the processed time series sample O and the neural network decoder output Ô as part of the optimization objective L. Optimize this objective over multiple iterations to obtain the trained neural network decoder, where calculating the Wasserstein distance comprises the following steps:
step 2.2.4.1: pair of processed time series data samples O and neural network decoder output data by introducing entropy regularization term
Figure BDA0003382081730000091
And performing dimension reduction smoothing treatment, wherein an entropy regular function is as follows:
Figure BDA0003382081730000092
Figure BDA0003382081730000093
wherein p (x) represents a distribution function of the processed time series data, p (x)k) X represents the time-series data after processing at time kkThe probability of (d);
step 2.2.4.2: the Wasserstein distance was calculated using the Sinkhorn approximation algorithm to simplify the amount of computation. The calculation formula of the Wasserstein distance obtained by combining the entropy regular function of the step 2.2.4.1 is formula 2:
Figure BDA0003382081730000094
wherein the content of the first and second substances,
Figure BDA0003382081730000095
representing processed time series data O and neural network decoder output data
Figure BDA0003382081730000096
The distance of the Wasserstein of (1),
Figure BDA0003382081730000097
indicates at time n, from
Figure BDA0003382081730000098
Transfer to onThe cost function of (2);
step 2.2.4.3: integrating the Wasserstein distance into the optimization target error epsilon to obtain the expression of an optimization target, namely a formula 3;
Figure BDA0003382081730000099
wherein the content of the first and second substances,
Figure BDA00033820817300000910
is represented by
Figure BDA00033820817300000911
The reconstruction error of the reconstruction O is calculated in the formula 4.
Figure BDA00033820817300000912
Wherein the content of the first and second substances,
Figure BDA00033820817300000913
the result of the representation of the reconstructed O is
Figure BDA00033820817300000914
The probability of (d);
step 2.2.5: and inputting the processed time series data into the neural network encoder and the trained neural network decoder in the step 2.2.4, and outputting the generated time series data which is approximately consistent with the distribution of the processed time series data.
Step 3: Establish an integrated fusion layer, use the output of the basic classification layer as the input of the integrated fusion layer, and obtain a secondary learner through ensemble learning training that fuses the plurality of weak classifiers in the basic classification layer, so as to obtain the final integrated model;
Step 3.1: Input the processed time series data into the two weak classifiers in the basic classification layer and collect the output data as an integrated training data set; the number of data categories of the integrated training data set is consistent with that of the processed time series data.
Step 3.2: Construct a secondary learner and use the integrated training data set collected in step 3.1 as its training data, so that the secondary learner learns the output of the basic classification layer. Constructing the secondary learner comprises the following steps:
Step 3.2.1: Use a support vector machine classifier as the secondary learner and construct one support vector machine between every pair of sample classes for classification. Therefore, if the number of data classes of the integrated training data set is k, k(k-1)/2 support vector machine classifiers need to be constructed;
Step 3.2.2: Input the integrated training data set to train the support vector machine classifiers. After training, for a sample of unknown class, tally the class predicted by each support vector machine classifier and output the class with the most votes as the class of the unknown sample;
Step 3.3: Take the output of the secondary learner as the final output of the integrated fusion layer.
Step 4: Mine the similarity information of the time series data using the obtained integrated model.
The following is a specific application case of the invention in website identification. Traffic time series from five websites serve as the experimental data; the integrated model mines the similarity of the time series to perform website identification and confirms the category of website traffic time series of unknown category.
In this example, the original time series data form a 2525 × 1024 array: 2525 time series, each covering 1024 acquisition times, the value at each time being the corresponding website traffic volume; the original data comprise the traffic time series of 5 websites. First, the original time series data are preprocessed; in this example a locality-sensitive hashing algorithm is chosen to delete part of the data: on top of the MinHash-processed data, each signature vector is divided into six segments, the data with the highest similarity are retained, and data with low similarity are deleted, reducing noise interference. Then, 200 time series per website class are selected as the original time series data to obtain the processed time series data. Next, the processed time series data are input by category into step 2 to construct the basic classification layer. The processed time series data are input into the hidden Markov weak classifier of step 2.1 for hidden-information mining; in this example the input size is 1024 × 1. In parallel, the processed time series data are input into the Wasserstein distance-based conditional variational autoencoder weak classifier of step 2.2 to learn the distribution of the input time series; the input size in this example is likewise 1024 × 1. Note that, as described in step 2.2.1, sampling must keep the samples uniformly distributed, so the same number of processed time series is collected for each class. The sampled time series data are then learned according to steps 2.2.2 through 2.2.5. After learning, the output of the basic classification layer serves as the input of the integrated fusion layer, which is trained according to step 3; once the integrated fusion layer has finished learning, it is used for website identification. The traffic time series of an unknown website is input into the integrated fusion layer, and the resulting output classification is the category of the unknown website.
The invention may also be used in other industrial scenarios, such as traffic flow prediction. Note that traffic flow prediction is a regression problem rather than a classification problem, so the output of the integrated model needs slight modification: the distribution of the traffic flow data is learned by the Wasserstein distance-based conditional variational autoencoder, and the predicted flow for a given time is output. The specific steps are as follows. First, the original time series data are traffic flow time series for a given location; the length of the original time series data indicates the number of times included in the series, and the width indicates the traffic flow at each time. Second, the original time series data are preprocessed to obtain processed time series data. Next, the processed time series data are input into the basic classification layer for classification information mining to obtain the output data. Then, the times output by the basic classification layer serve as the input of the secondary learner in the integrated fusion layer, which learns the traffic flow output by the basic classification layer; the output of the secondary learner is the traffic flow, and it serves as the output of the integrated fusion layer. Finally, the time to be predicted is input into the integrated fusion layer, and the output is the predicted traffic flow. Prediction is thus performed by mining the similarity information of the traffic flow time series.
Meanwhile, the invention can also be used for time series of other dimensionalities, for example identifying a voice source. Voice time series data comprise three dimensions: pronunciation time, pronunciation duration, and pronunciation interval, one dimension more than the website traffic and traffic flow time series in the cases above. The voice time series data can be input directly into the basic classification layer for classification learning; the output of the basic classification layer then serves as the input of the integrated fusion layer for classification learning; finally, voice time series of unknown class are input into the integrated fusion layer, and the output class is the class of the unknown voice time series. In this way the similarity of voice time series from the same source is used to identify the voice source.

Claims (8)

1. A method for mining time series data similarity information based on an integrated model, characterized by comprising the following steps:
step 1: processing original time series data and dividing it into one or more categories, wherein the original time series data refer to directly acquired, unclassified time series data;
step 2: establishing a basic classification layer and inputting the processed time series data into a plurality of weak classifiers in the basic classification layer for preliminary classification; the basic classification layer comprises two models, namely a hidden Markov weak classifier obtained by utilizing a hidden Markov model and a conditional variational autoencoder weak classifier obtained by utilizing a Wasserstein distance-based conditional variational autoencoder; the basic classification layer outputs a new data set with the same size as the input data;
step 3: establishing an integrated fusion layer, using the output of the basic classification layer as the input of the integrated fusion layer, and obtaining a secondary learner through ensemble learning training that fuses the plurality of weak classifiers in the basic classification layer, so as to obtain the final integrated model;
step 4: mining the similarity information of the time series data by using the obtained integrated model.
2. The method for mining time series data similarity information based on an integrated model according to claim 1, wherein step 1 specifically comprises the following steps:
step 1.1: classifying the original time series data, measuring the original time series by using the Jaccard distance, and clustering the time series whose distances are close to obtain the classified time series data;
step 1.2: converting a time series A in a class of the classified time series data into a signature vector sig(A) by using MinHash functions;
step 1.3: dividing sig(A) into different segments, each segment carrying a segment signature;
step 1.4: repeating steps 1.2 to 1.3 for all the classified time series data to obtain the segment signatures of all the classified time series data, determining the similarity of the classified time series data according to identical segment signatures, deleting data whose segment signatures differ, and completing the data processing.
3. The method for mining time series data similarity information based on an integrated model according to claim 1, wherein establishing the basic classification layer in step 2 specifically comprises the following steps:
step 2.1: inputting the processed time series data into the hidden Markov classification model of the basic classification layer, solving for the parameters by using the forward-backward algorithm and the Baum-Welch algorithm, and decoding by using the Viterbi algorithm to obtain the hidden Markov weak classifier;
step 2.2: inputting the processed time series data into the conditional variational autoencoder of the basic classification layer, and calculating the Wasserstein distance between time series by using the Sinkhorn approximation algorithm to obtain the conditional variational autoencoder weak classifier.
4. The method for mining time series data similarity information based on an integrated model according to claim 3, wherein constructing the hidden Markov weak classifier in step 2.1 specifically comprises the following steps:
step 2.1.1: for one processed time series O = {o_1, o_2, o_3, …, o_T}, calculating its occurrence probability P(O | λ) under the hidden Markov classifier λ = (A, B, Π) by using the forward-backward algorithm, where o_1, o_2, o_3, …, o_T are the values of the processed time series from time 1 to time T, A is the hidden state transition probability matrix, B is the observation generation probability matrix, the observations being the values of the processed time series, and Π is the initial probability distribution of the hidden states;
step 2.1.2: for D processed time series {(O_1), (O_2), …, (O_D)}, calculating the parameters A, B, and Π of the hidden Markov classifier λ by using the Baum-Welch algorithm, where (O_i), i = 1, 2, …, D, denotes the i-th processed time series;
step 2.1.3: for the hidden Markov classifier λ = (A, B, Π), computing the most likely hidden state sequence I* = {i*_1, i*_2, …, i*_T} of the processed time series O = {o_1, o_2, o_3, …, o_T} by using the Viterbi algorithm, where i*_t denotes the hidden state of the value o_t of the processed time series O at time t.
5. The method for mining time series data similarity information based on an integrated model according to claim 4, wherein constructing the conditional variational autoencoder weak classifier in step 2.2 specifically comprises the following steps:
step 2.2.1: sampling one class of the processed time series data to obtain a time series sample O, and outputting the statistics μ and σ² of a normal distribution through a neural network encoder, where μ denotes the mean and σ² the variance of the normal distribution;
step 2.2.2: sampling the standard normal distribution N(0, 1) to obtain a sample ε, and applying formula 1 to the encoder outputs μ and σ² of step 2.2.1 and the sample ε to obtain z:

z = μ + σ · ε    (formula 1)

where z obeys the normal distribution N(μ, σ²);
step 2.2.3: passing z through a neural network decoder to output data Ô with the same dimensions as the processed time series data;
step 2.2.4: using the Wasserstein distance between the processed time series sample O and the neural network decoder output Ô as part of the optimization objective L, and optimizing the objective over multiple iterations to obtain the trained neural network decoder;
step 2.2.5: inputting the processed time series data into the neural network encoder and the trained neural network decoder of step 2.2.4, and outputting generated time series data whose distribution is approximately consistent with that of the processed time series data.
6. The method for mining time series data similarity information based on an integrated model according to claim 5, wherein calculating the Wasserstein distance in step 2.2.4 specifically comprises the following steps:
step 2.2.4.1: applying a dimension-reducing smoothing treatment to the processed time series sample O and the decoder output Ô by introducing an entropy regularization term, the entropy regularization function being

H(p) = -Σ_k p(x_k) · log p(x_k)

where p(x) denotes the distribution function of the processed time series and p(x_k) denotes the probability of the processed time series value x_k at time k;
step 2.2.4.2: calculating the Wasserstein distance with the Sinkhorn approximation algorithm to reduce the amount of computation; combining the entropy regularization function of step 2.2.4.1, the Wasserstein distance is calculated by formula 2:

W(O, Ô) = min_P Σ_{m,n} P_{m,n} · c(ô_m, o_n) - γ · H(P)    (formula 2)

where W(O, Ô) denotes the Wasserstein distance between the processed time series O and the decoder output Ô, P denotes the transport plan, c(ô_m, o_n) denotes the cost function of transferring the reconstructed value ô_m to the value o_n at time n, and γ weights the entropy term;
step 2.2.4.3: integrating the Wasserstein distance into the optimization objective L to obtain the expression of the optimization objective, formula 3:

L = E_rec(Ô, O) + W(O, Ô)    (formula 3)

where E_rec(Ô, O) denotes the reconstruction error of reconstructing O as Ô, calculated by formula 4:

E_rec(Ô, O) = -log p(Ô | O)    (formula 4)

where p(Ô | O) denotes the probability that the result of reconstructing O is Ô.
7. The method for mining time series data similarity information based on an integrated model according to claim 1, wherein constructing the integrated fusion layer in step 3 comprises the following steps:
step 3.1: inputting the processed time series data into the two weak classifiers in the basic classification layer and collecting the output data as an integrated training data set, wherein the number of data categories of the integrated training data set is consistent with that of the processed time series data;
step 3.2: constructing a secondary learner and using the collected integrated training data set as its training data, so that the secondary learner learns the output of the basic classification layer;
step 3.3: taking the output of the secondary learner as the final output of the integrated fusion layer.
8. The method for mining time series data similarity information based on an integrated model according to claim 7, wherein constructing the secondary learner in step 3.2 comprises the following steps:
step 3.2.1: using a support vector machine classifier as the secondary learner and constructing one support vector machine between every pair of sample classes for classification; with k data categories in the integrated training data set, k(k-1)/2 support vector machine classifiers need to be constructed;
step 3.2.2: inputting the integrated training data set to train the support vector machine classifiers; after training, for a sample of unknown class, tallying the class predicted by each support vector machine classifier and outputting the class with the most votes as the class of the unknown sample.
CN202111438131.4A 2021-11-29 2021-11-29 Method for mining time series data similarity information based on integrated model Pending CN114139624A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111438131.4A CN114139624A (en) 2021-11-29 2021-11-29 Method for mining time series data similarity information based on integrated model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111438131.4A CN114139624A (en) 2021-11-29 2021-11-29 Method for mining time series data similarity information based on integrated model

Publications (1)

Publication Number Publication Date
CN114139624A 2022-03-04

Family

ID=80389582

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111438131.4A Pending CN114139624A (en) 2021-11-29 2021-11-29 Method for mining time series data similarity information based on integrated model

Country Status (1)

Country Link
CN (1) CN114139624A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115599984A * 2022-09-09 2023-01-13 Beijing Institute of Technology (CN) Retrieval method
CN115599984B * 2022-09-09 2023-06-09 Beijing Institute of Technology Retrieval method
CN116304358A * 2023-05-17 2023-06-23 Jinan Anxun Technology Co., Ltd. User data acquisition method
CN116304358B * 2023-05-17 2023-08-08 Jinan Anxun Technology Co., Ltd. User data acquisition method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination