CN116401528A - Multi-element time sequence unsupervised dimension reduction method based on global-local divergence - Google Patents

Multi-element time sequence unsupervised dimension reduction method based on global-local divergence Download PDF

Info

Publication number
CN116401528A
Authority
CN
China
Prior art keywords
neighborhood
sequence
global
feature
dimension reduction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310278160.1A
Other languages
Chinese (zh)
Inventor
李正欣
胡钢
刘嘉
吴虎胜
吴丹阳
刘斌
周漩
杨波
吴诗辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Air Force Engineering University of PLA
Original Assignee
Air Force Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Air Force Engineering University of PLA
Priority to CN202310278160.1A
Publication of CN116401528A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods

Abstract

The invention discloses a multi-element time sequence unsupervised dimension reduction method based on global-local divergence, which comprises the following steps: S1, calculate the covariance matrix of each multivariate time series, extract its upper-triangular elements and combine them into a feature sequence, obtaining the feature set Fea = {f_i | i = 1, 2, …, n} of the multivariate time series; S2, establish a neighborhood set N_k(f_i) = {f_j | j = 1, 2, …, k} for each feature sequence using k nearest neighbors under the Euclidean distance (ED) measure; S3, after finding the neighborhood of each sample point, calculate the neighborhood center sequence m_i of each feature sequence f_i in the feature set; S4, represent the local divergence by the neighborhood variance of the projected sample points: first calculate the variance of each sample point's projected neighborhood set, then sum these variances to obtain the local divergence; S5, calculate the variance of the neighborhood center points obtained in step S3 to obtain the global divergence. Experimental results show that the low-dimensional projection sequences obtained by the method can represent the original MTS and achieve a marked dimension reduction effect.

Description

Multi-element time sequence unsupervised dimension reduction method based on global-local divergence
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a multi-element time sequence unsupervised dimension reduction method based on global-local divergence.
Background
Time series generally refers to time-varying data acquired by various sensors, and is widely used in fields such as environmental science, medicine and finance. According to the number of variables, time series can be classified into univariate time series (UTS) and multivariate time series (MTS). An MTS can be seen as a combination of the UTS generated by different factors in the same system. Compared with a UTS, an MTS is high-dimensional not only in the time dimension but also in the feature dimension, and correlations exist between features. Data mining for MTS is therefore more complex than for UTS.
At present, data mining of MTS is applied to industrial fault monitoring. The data in fault monitoring are mostly obtained through sensors, so they contain a large amount of noise; the monitoring data collected by the sensors are high-dimensional in both the feature dimension and the time dimension, which makes data processing during fault monitoring difficult and inefficient. How to effectively reduce the dimensionality of multivariate time series data in fault monitoring is therefore the key to solving this problem.
Time series data mining generally includes clustering, classification, prediction, anomaly detection, correlation analysis, etc. These mining tasks are closely related to the size and complexity of the data, and MTS is high-dimensional in two dimensions at once, so dimension reduction or feature representation is usually required before mining to reduce data complexity and mitigate the interference caused by redundant information. The prior art is mainly divided into dimension reduction of the feature dimension, dimension reduction of the time dimension, and dimension reduction of both dimensions simultaneously. Dimension reduction of the feature dimension alone cannot solve the problem that MTS of different lengths have unequal time lengths, which complicates the similarity measurement in subsequent data mining; dimension reduction of the time dimension alone ignores the variable correlations in the feature dimension, so information redundancy may remain after dimension reduction; and reducing both dimensions simultaneously generally requires a bidirectional dimension reduction technique with a high computational cost.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a multi-element time sequence unsupervised dimension reduction method based on global-local divergence.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the multi-element time sequence unsupervised dimension reduction method based on global-local divergence is characterized by comprising the following steps:
s1: acquiring fault monitoring data through a sensor, forming a multi-element time sequence of the acquired fault monitoring data into a multi-element time sequence original data set D, respectively calculating covariance matrixes of each multi-element time sequence in the data set, extracting upper triangle elements of the covariance matrixes, and combining the upper triangle elements into a characteristic sequence; all the feature sequences form a feature sequence set, and the length of each feature sequence is the same;
s2: based on the obtained feature sequence set, measuring by using k neighbor numbers and Euclidean distances to establish each sample neighbor set in the feature sequence set;
s3: calculating a neighborhood center sequence of each neighborhood set in the feature set according to the neighborhood sets obtained in the step S2;
s4: calculating the variance of each neighborhood set of the sample point to be projected according to the neighborhood sets obtained in the step S2, and then accumulating and summing the variances to calculate the local divergence;
s5: calculating a neighborhood global variance according to the neighborhood central point obtained in the step S3 to obtain a global divergence;
s6: according to the local divergence and the global divergence obtained in the steps S4 and S5, solving a projection matrix;
s7: according to the projection matrix obtained in the step S6, the feature sequence set obtained in the step S1 is projected, so that a dimension-reduced feature sequence is obtained:
y_i = W^T f_i (6);

s8: according to the dimension-reduced feature sequences, obtain the dimension-reduced feature set D' = {y_i | i = 1, 2, …, n} of the fault monitoring data, where y_i ∈ ℝ^d;
s9: and processing the fault monitoring data after the dimension reduction to obtain a fault monitoring result.
Further, the specific operation steps of step S1 include:
s11: form the multivariate time series of the acquired fault monitoring data into the multivariate time series original data set D = {X_i | i = 1, 2, …, n}, where n is the number of samples and

X_i = (x_1, x_2, …, x_m)^T ∈ ℝ^{m×t_i} (7)

represents the ith MTS sample in the original data set; x_j (j = 1, 2, …, m) is the series of observations of the jth variable, m is the number of variables, and t_i is the time length of the ith multivariate time series. Zero-mean each multivariate time series: X_i = X_i − E(X_i);
S12: compute the covariance matrix of each zero-meaned multivariate time series; the covariance matrix of the ith MTS is

Σ_i = (1/t_i) X_i X_i^T ∈ ℝ^{m×m}

s13: extract the upper-triangular elements of the covariance matrix Σ_i and form them row by row into the row vector:

f_i = (a_11, a_12, …, a_1m, a_22, …, a_2m, …, a_mm) (8)

Taking f_i as the feature sequence of the ith MTS yields the feature set Fea = {f_i | i = 1, 2, …, n} of the multivariate time series.
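Steps S11 to S13 can be sketched in Python as follows. This is an illustrative sketch, not the patent's own code: the function name `mts_feature_sequence` and the 1/t covariance normalization are assumptions (the patent renders its covariance formula only as an image).

```python
import numpy as np

def mts_feature_sequence(X):
    """Turn one MTS sample X (m variables x t time steps) into an
    equal-length feature sequence: the upper-triangular elements of its
    covariance matrix, read row by row (steps S11-S13)."""
    X = X - X.mean(axis=1, keepdims=True)   # zero-mean each variable (S11)
    t = X.shape[1]
    cov = (X @ X.T) / t                     # m x m covariance matrix (S12, 1/t assumed)
    iu = np.triu_indices(cov.shape[0])      # row-wise upper-triangle indices
    return cov[iu]                          # feature sequence of length m(m+1)/2
```

Because the covariance matrix is symmetric, the upper triangle loses no information, and MTS samples of different time lengths all map to feature sequences of the same length m(m+1)/2.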
Further, in step S3 the neighborhood center sequence m_i is calculated as:

m_i = (1/k) Σ_{f_j ∈ N_k(f_i)} f_j (1)

where k is the number of neighbors and m_i is the neighborhood center sequence of the k neighbors of f_i.
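Steps S2 and S3 (k-nearest-neighbor search under the Euclidean distance, followed by the neighborhood centers) can be sketched as below. The function name and the choice to exclude each point from its own neighborhood are assumptions for illustration:

```python
import numpy as np

def neighborhoods(Fea, k):
    """For each feature sequence f_i (row of the n x p array Fea), find its
    k nearest neighbours under the Euclidean distance (step S2) and the
    neighbourhood centre m_i as their mean (step S3)."""
    d2 = ((Fea[:, None, :] - Fea[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)         # a point is not its own neighbour (assumption)
    idx = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest neighbours
    centers = Fea[idx].mean(axis=1)      # n x p neighbourhood centre sequences m_i
    return idx, centers
```

The brute-force distance matrix keeps the sketch short; for large n a spatial index would be the usual substitute.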
Further, the specific steps of S4 include:
s41: based on the neighborhood sets, calculate the variance of each projected neighborhood set and sum these variances to obtain the local divergence:

J_L = Σ_{i=1}^{n} Σ_{f_j ∈ N_k(f_i)} ||y_j − p_i||² (2)

where p_i = W^T m_i is the low-dimensional projection of the neighborhood center point, y_j is the low-dimensional projection of the sample point, W is the projection matrix, and the subscript L stands for Local;
s42: transforming formula (2) gives:

J_L = tr(W^T S_L W) (9)

where S_L = Σ_{i=1}^{n} S_Li, and S_Li, the local divergence matrix of the ith feature sequence f_i, is:

S_Li = Σ_{f_j ∈ N_k(f_i)} (f_j − m_i)(f_j − m_i)^T
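A minimal sketch of accumulating the local divergence matrix S_L of step S4 from precomputed neighborhoods; the function name and the array layout (rows are feature sequences) are assumptions:

```python
import numpy as np

def local_scatter(Fea, idx, centers):
    """Accumulate the local divergence matrix of step S4:
    S_L = sum_i sum_{f_j in N_k(f_i)} (f_j - m_i)(f_j - m_i)^T,
    where idx holds each sample's neighbour indices and centers the m_i."""
    p = Fea.shape[1]
    S_L = np.zeros((p, p))
    for nbrs, m_i in zip(idx, centers):
        diff = Fea[nbrs] - m_i          # k x p deviations from the centre
        S_L += diff.T @ diff            # per-neighbourhood scatter, summed over i
    return S_L
```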
further, the specific step of S5 includes:
s51: calculating a neighborhood global variance according to a neighborhood center point in the neighborhood center sequence to obtain a global divergence calculation formula:
Figure BDA0004137076730000045
wherein p is i =W T m i A low-dimensional projection for a neighborhood center point; subscript G is an abbreviation for Global;
s52: the transformation of formula (3) can be obtained:
Figure BDA0004137076730000046
wherein S is G Is a global divergence matrix:
Figure BDA0004137076730000047
s53: the formula (12) is simplified to obtain:
Figure BDA0004137076730000051
wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure BDA0004137076730000052
is the center point of all neighborhood centers, namely the global neighborhood center.
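Given the neighborhood centers, the global divergence matrix S_G of step S5 reduces to a few lines. A sketch under the same array-layout assumption as above (rows are center sequences m_i):

```python
import numpy as np

def global_scatter(centers):
    """Global divergence matrix of step S5:
    S_G = sum_i (m_i - m_bar)(m_i - m_bar)^T, with m_bar the mean of
    all neighbourhood centres (the global neighbourhood centre)."""
    dev = centers - centers.mean(axis=0)   # deviations from the global centre
    return dev.T @ dev                     # p x p scatter of the centres
```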
Further, the specific steps of S6 include:
s61: combining the local divergence of formula (9) and the global divergence of formula (11) yields formula (4):

max_W tr(W^T S_G W) / tr(W^T S_L W) (4)

s62: formula (4) is converted into a generalized eigenvalue problem, and the projection matrix is obtained by solving it:

S_G ω = λ S_L ω (5)

where ω is a generalized eigenvector and λ is a generalized eigenvalue;
s63: solving formula (5) yields the first d largest eigenvalues λ_1, λ_2, …, λ_d (d < p) and the corresponding eigenvectors ω_1, ω_2, …, ω_d; the projection matrix is W = (ω_1, ω_2, …, ω_d).
Compared with the prior art, the invention has the beneficial effects that:
the invention firstly provides a feature sequence extraction method, which extracts the upper triangle element of a multi-element time sequence covariance matrix and combines the upper triangle element into a feature sequence. Then, an unsupervised dimension reduction model is provided by taking the basic ideas of minimum local divergence and maximum global divergence, and global information is reserved as much as possible while the local neighbor relation is maintained. And taking the characteristic sequence as input, minimizing the sum of all sample point neighborhood variances, and maximizing the neighborhood center point variance. The projection matrix obtained by solving the model can realize dimension reduction of the multi-element time sequence; finally, the dimension reduction method and the related comparison method provided by the invention are subjected to experimental verification through 20 groups of public data sets, and experimental results show that the dimension reduction method provided by the invention can effectively reduce the dimension of the MTS data set, thereby improving the fault monitoring accuracy.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention discloses a multi-element time sequence unsupervised dimension reduction method based on global-local divergence, which specifically comprises the following steps:
S1, acquire fault monitoring data through sensors, obtain the corresponding multivariate time series, form the multivariate time series original data set D, and convert it into the equal-length feature sequence set Fea = {f_i | i = 1, 2, …, n};
Specifically, the specific operation steps of S1 include:
S11, form the multivariate time series into the original data set D = {X_i | i = 1, 2, …, n}, where n is the number of samples and

X_i = (x_1, x_2, …, x_m)^T ∈ ℝ^{m×t_i} (7)

represents the ith MTS sample in the data set; x_j (j = 1, 2, …, m) is the series of observations of the jth variable, m is the number of variables, and t_i is the time length of the ith MTS. Zero-mean each MTS, i.e. X_i = X_i − E(X_i);
S12, compute the covariance matrix of each zero-meaned multivariate time series; the covariance matrix of the ith MTS is

Σ_i = (1/t_i) X_i X_i^T ∈ ℝ^{m×m}

which is a symmetric matrix.
Extract the upper-triangular elements of Σ_i and form them row by row into the row vector:

f_i = (a_11, a_12, …, a_1m, a_22, …, a_2m, …, a_mm) (8)

Taking f_i as the feature sequence of the ith MTS yields the feature set Fea = {f_i | i = 1, 2, …, n} of the MTS.
S2, based on the obtained feature set Fea, use k nearest neighbors under the Euclidean distance (ED) measure to obtain the neighborhood set N_k(f_i) = {f_j | j = 1, 2, …, k} of each sample in the feature set Fea;
S3, according to the neighborhood sets obtained in step S2, calculate the neighborhood center sequence m_i of each neighborhood set in the feature set Fea:

m_i = (1/k) Σ_{f_j ∈ N_k(f_i)} f_j (1)

where k is the number of neighbors and m_i is the neighborhood center sequence of the k neighbors of f_i;
S4, according to the neighborhood sets obtained in step S2, compute the variance of each projected neighborhood set and sum these variances to obtain the local divergence:

J_L = Σ_{i=1}^{n} Σ_{f_j ∈ N_k(f_i)} ||y_j − p_i||² (2)

where p_i = W^T m_i is the low-dimensional projection of the neighborhood center point, y_j is the low-dimensional projection of the sample point, W is the projection matrix, and the subscript L stands for Local;
Specifically, S4 further includes:
S41, transforming formula (2) gives:

J_L = tr(W^T S_L W) (9)

where S_L = Σ_{i=1}^{n} S_Li, and S_Li is the local divergence matrix of the ith feature sequence f_i, namely:

S_Li = Σ_{f_j ∈ N_k(f_i)} (f_j − m_i)(f_j − m_i)^T
S5, according to the neighborhood center points obtained in step S3, calculate the neighborhood global variance to obtain the global divergence:

J_G = Σ_{i=1}^{n} ||p_i − p̄||² (3)

where p_i = W^T m_i is the low-dimensional projection of the neighborhood center point, p̄ = W^T m̄ is the low-dimensional projection of the global neighborhood center, and the subscript G stands for Global;
Specifically, S5 further includes:
S51, transforming formula (3) gives:

J_G = tr(W^T S_G W) (11)

where S_G is the global divergence matrix:

S_G = Σ_{i=1}^{n} (m_i − m̄)(m_i − m̄)^T (12)

S52, simplifying formula (12) gives:

S_G = Σ_{i=1}^{n} m_i m_i^T − n·m̄ m̄^T

where m̄ = (1/n) Σ_{i=1}^{n} m_i is the center point of all neighborhood centers, called the global neighborhood center;
S6, combining the local divergence (formula (9)) and the global divergence (formula (11)) obtained in steps S4 and S5 yields formula (4), and solving formula (4) gives the projection matrix W = (ω_1, ω_2, …, ω_d):

max_W tr(W^T S_G W) / tr(W^T S_L W) (4)

Formula (4) is converted into the generalized eigenvalue problem:

S_G ω = λ S_L ω (5)

where ω is a generalized eigenvector and λ is a generalized eigenvalue.
Solving formula (5) yields the first d largest eigenvalues λ_1, λ_2, …, λ_d (d < p), the corresponding eigenvectors ω_1, ω_2, …, ω_d, and the projection matrix W = (ω_1, ω_2, …, ω_d);
S7, according to the projection matrix obtained in step S6, project the sample points in the feature sequence set obtained in step S1 (i.e. the elements of Fea) to obtain the dimension-reduced feature sequences:

y_i = W^T f_i (6)

finally obtaining the dimension-reduced feature set D' = {y_i | i = 1, 2, …, n}, where y_i ∈ ℝ^d.
examples
In order to verify the dimension reduction method (hereinafter referred to as GLSUP) of the present invention, a correlation experiment was performed.
1. Data set selection
Let the MTS data set be D = {X_i | i = 1, 2, …, n}, where n is the number of samples and X_i = (x_1, x_2, …, x_m)^T ∈ ℝ^{m×t_i} is the ith MTS in the data set; x_j (j = 1, 2, …, m) is the series of observations of the jth variable and m is the number of variables, so the feature sequence length is m(m+1)/2. Assume the average length of the MTS in the data set is t and the number of selected low-dimensional features is d (d < m²).
2. Algorithm complexity
The training process of the invention mainly comprises three parts: feature extraction, model solving and projection. In the feature extraction stage, the computational cost is dominated by computing the MTS covariance matrices, with complexity O(nm²t). In the model solving stage, the cost is dominated by computing the local divergence matrix and solving the generalized eigenvalue problem, with complexities O(nm⁴) and O(m⁶) respectively. After the projection matrix W is obtained, the GLSUP method projects the feature sequences in the feature set Fea with complexity O(dm²). The overall complexity of the GLSUP method is therefore O(nm²t + nm⁴ + m⁶).
3. MTS data dimension reduction experimental process
In this example, 20 sets of multivariate time series data from different fields were selected, see Table 1: LP1, LP2, LP3, LP4, LP5, Daily and Sports Activities (DSA), FingerMovements (FM), HandMovementDirection (HMD), NATOPS, Cricket, RacketSports (RS), Epilepsy, BasicMotions (BM), LSST, ArticularyWordRecognition (AWR), EEGeye, Wafer, WalkvsRun (WR), KickvsPunch (KP) and Australian Sign Language (ASL). The first 15 are equal-length data sets and the last 5 are unequal-length data sets.
Table 1 MTS data set information
Dimension reduction effectiveness refers to the degree to which the information retained after dimension reduction characterizes the MTS features. The experiment feeds the dimension reduction result into a KNN (k = 1) classifier and evaluates effectiveness through the classification accuracy. The experimental procedure is as follows: reduce the original MTS data set with a dimension reduction algorithm to obtain a reduced data set; select samples from the reduced data set in turn and input them into the classifier; a nearest-neighbor query returns the single most similar sample, whose label is taken as the predicted category of the queried sample; if this label matches the queried sample's label the classification is correct, otherwise it is incorrect. After performing this operation on all samples, the classification accuracy is obtained:
ε = n_true / n

where ε is the classification accuracy, n_true is the number of correctly classified samples, and n is the number of samples.
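The leave-one-out 1-NN evaluation protocol described above can be sketched as follows (a hypothetical helper for illustration, not the experiment's actual code):

```python
import numpy as np

def one_nn_accuracy(Y, labels):
    """Leave-one-out 1-NN classification accuracy epsilon = n_true / n on a
    dimension-reduced sample matrix Y (n x d), following the evaluation
    protocol described above."""
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)        # a sample cannot match itself
    nearest = d2.argmin(axis=1)         # index of the most similar sample
    return float((labels[nearest] == labels).mean())
```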
Five unsupervised dimension reduction methods (PCA, CPCA, LPP, PBLDA and VPCA) are selected as comparison methods. Since the VPCA method is only applicable to equal-length data sets, it was tested only on the 15 equal-length data sets. On the unequal-length MTS data sets, the dimension reduction results of the PCA and CPCA methods are still sequences of unequal length, and the Euclidean distance can only measure sequences of equal length; therefore, on the 5 unequal-length data sets, the dimension reduction results of the PCA and CPCA methods are measured with the dynamic time warping (DTW) distance to realize KNN classification.
The PCA, CPCA and VPCA methods involve the variance contribution ratio σ; the LPP method involves the neighbor number k, the heat-kernel parameter t and the low-dimensional feature number d; the PBLDA method involves the feature dimension p_c and the time dimension p_r; the GLSUP method involves the neighbor number k and the low-dimensional feature number d. In the experiments, the variance contribution ratio σ of the PCA, CPCA and VPCA methods is set to 80%, and the neighbor number k and the heat-kernel parameter t are both set to 1. The time dimension p_r is kept at the original time series length, while the feature dimension p_c and the low-dimensional feature number d are tuned to obtain the best matching accuracy for the PBLDA, GLSUP and LPP dimension reduction algorithms.
The results of the dimension reduction effectiveness experiment are shown in Table 2, with the highest classification accuracy of each row shown in bold. The results show that the GLSUP method achieves good classification accuracy on all 20 data sets. The GLSUP method converts MTS into equal-length feature sequences, retains the correlation information between different variables, considers both the global and local information of the data set, and realizes dimension reduction on the feature sequences.
TABLE 2 dimension reduction effectiveness test results
4. Conclusion of the experiment
Among the other methods, the CPCA method improves dimension reduction effectiveness considerably compared with the PCA method, because the former projects the MTS into a common low-dimensional subspace while the latter projects each MTS into a different low-dimensional subspace. However, both methods reduce only the variable dimension without reducing the sequence length. The VPCA method achieves high classification accuracy but can only be used on equal-length data sets. The PBLDA method reduces both the variable dimension and the time dimension, but its effect is poor on part of the data sets: when facing unequal-length data sets, PBLDA truncates sequences of different lengths into equal-length sequences, which causes information loss. The LPP method extracts feature sequences from the MTS and performs LPP dimension reduction, solving the unequal-length problem, but it considers only the local information of the data set and ignores the global information; in addition, it requires a singular value decomposition for each MTS and thus has a higher computational complexity.
In addition, the GLSUP method has significant advantages over other methods in the multi-category dataset ASL. The reason is that the GLSUP method takes global and local information of the data set into account, and samples can be projected into clusters with better separability in the unlabeled data set.
In summary, the dimension reduction method uses the covariance matrix of the MTS as the feature sequence of the MTS, and can convert MTS with unequal time length into the feature sequence with equal length; and the equal-length characteristic sequences are projected to the same common low-dimensional space, and the obtained low-dimensional projection sequences can represent the original MTS, so that a relatively obvious dimension reduction effect is realized.
The above embodiments are merely preferred embodiments for fully explaining the present invention, and the scope of the present invention is not limited thereto. Equivalent substitutions and changes are intended by those skilled in the art on the basis of the present invention, and are within the scope of the present invention. The protection scope of the invention is subject to the claims.

Claims (6)

1. The multi-element time sequence unsupervised dimension reduction method based on global-local divergence is characterized by comprising the following steps:
s1: acquiring fault monitoring data through a sensor, forming a multi-element time sequence of the acquired fault monitoring data into a multi-element time sequence original data set D, respectively calculating covariance matrixes of each multi-element time sequence in the data set, extracting upper triangle elements of the covariance matrixes, and combining the upper triangle elements into a characteristic sequence; all the feature sequences form a feature sequence set, and the length of each feature sequence is the same;
s2: based on the obtained feature sequence set, measuring by using k neighbor numbers and Euclidean distances to establish each sample neighbor set in the feature sequence set;
s3: calculating a neighborhood center sequence of each neighborhood set in the feature set according to the neighborhood sets obtained in the step S2;
s4: calculating the variance of each neighborhood set of the sample point to be projected according to the neighborhood sets obtained in the step S2, and then accumulating and summing the variances to calculate the local divergence;
s5: calculating a neighborhood global variance according to the neighborhood central point obtained in the step S3 to obtain a global divergence;
s6: according to the local divergence and the global divergence obtained in the steps S4 and S5, solving a projection matrix;
s7: according to the projection matrix obtained in the step S6, the feature sequence set obtained in the step S1 is projected, so that a dimension-reduced feature sequence is obtained:
y_i = W^T f_i (6);

s8: according to the dimension-reduced feature sequences, obtain the dimension-reduced feature set D' = {y_i | i = 1, 2, …, n} of the fault monitoring data, where y_i ∈ ℝ^d;
s9: and processing the fault monitoring data after the dimension reduction to obtain a fault monitoring result.
2. The method for unsupervised dimension reduction of a multivariate time series based on global-local divergence as set forth in claim 1, wherein the specific operation steps of step S1 include:
s11: form the multivariate time series of the acquired fault monitoring data into the multivariate time series original data set D = {X_i | i = 1, 2, …, n}, where n is the number of samples and

X_i = (x_1, x_2, …, x_m)^T ∈ ℝ^{m×t_i} (7)

represents the ith MTS sample in the original data set; x_j (j = 1, 2, …, m) is the series of observations of the jth variable, m is the number of variables, and t_i is the time length of the ith multivariate time series; zero-mean each multivariate time series: X_i = X_i − E(X_i);
S12: compute the covariance matrix of each zero-meaned multivariate time series; the covariance matrix of the ith MTS is

Σ_i = (1/t_i) X_i X_i^T ∈ ℝ^{m×m}

s13: extract the upper-triangular elements of the covariance matrix Σ_i and form them row by row into the row vector:

f_i = (a_11, a_12, …, a_1m, a_22, …, a_2m, …, a_mm) (8)

Taking f_i as the feature sequence of the ith MTS yields the feature set Fea = {f_i | i = 1, 2, …, n} of the multivariate time series.
3. The multi-element time series unsupervised dimension reduction method based on global-local divergence according to claim 2, wherein in step S3 the neighborhood center sequence m_i is calculated as:

m_i = (1/k) Σ_{f_j ∈ N_k(f_i)} f_j (1)

where k is the number of neighbors and m_i is the neighborhood center sequence of the k neighbors of f_i.
4. A multi-element time series unsupervised dimension reduction method based on global-local divergence as claimed in claim 3, wherein the specific steps of S4 include:
s41: based on the neighborhood sets, calculate the variance of each projected neighborhood set and sum these variances to obtain the local divergence:

J_L = Σ_{i=1}^{n} Σ_{f_j ∈ N_k(f_i)} ||y_j − p_i||² (2)

where p_i = W^T m_i is the low-dimensional projection of the neighborhood center point, y_j is the low-dimensional projection of the sample point, W is the projection matrix, and the subscript L stands for Local;
s42: transforming formula (2) gives:

J_L = tr(W^T S_L W) (9)

where S_L = Σ_{i=1}^{n} S_Li, and S_Li, the local divergence matrix of the ith feature sequence f_i, is:

S_Li = Σ_{f_j ∈ N_k(f_i)} (f_j − m_i)(f_j − m_i)^T
5. the method for unsupervised dimension reduction of a multivariate time series based on global-local divergence as set forth in claim 4, wherein the specific step of S5 comprises:
s51: calculating a neighborhood global variance according to a neighborhood center point in the neighborhood center sequence to obtain a global divergence calculation formula:
Figure FDA0004137076720000035
wherein p is i =W T m i A low-dimensional projection for a neighborhood center point; subscript G is an abbreviation for Global;
s52: the transformation of formula (3) can be obtained:
Figure FDA0004137076720000036
wherein S is G Is a global divergence matrix:
Figure FDA0004137076720000037
s53: the formula (12) is simplified to obtain:
Figure FDA0004137076720000041
wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure FDA0004137076720000042
is the center point of all neighborhood centers, namely the global neighborhood center.
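The simplified global divergence matrix of S52–S53 is the scatter of the neighborhood centers around their mean. A minimal sketch (the function name `global_scatter` is an illustrative choice):

```python
import numpy as np

def global_scatter(M):
    """Global divergence matrix S_G = sum_i (m_i - m_bar)(m_i - m_bar)^T,
    where the rows of M are the neighborhood centers m_i and m_bar is
    their mean (the global neighborhood center)."""
    m_bar = M.mean(axis=0)              # global neighborhood center
    diff = M - m_bar                    # each center minus the global center
    return diff.T @ diff                # sum of outer products in one product
```

The vectorized form `diff.T @ diff` equals the sum of the n rank-one outer products written in the claim.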
6. The multivariate time series unsupervised dimension reduction method based on global-local divergence according to claim 5, wherein step S6 specifically comprises:
s61: combining the local divergence obtained in the formula (9) and the global divergence obtained in the formula (11) to obtain the formula (4):
Figure FDA0004137076720000043
s62: converting the formula (4) into a generalized eigenvalue solution problem, and obtaining a projection matrix by solving the formula (4):
S G ω=λS L ω (5)
wherein: omega is a generalized eigenvector, lambda is a generalized eigenvalue;
s63: solving (5) to obtain the first d maximum eigenvalues lambda 12 ,…,λ d (d < p), corresponding feature vector ω 12 ,…,ω d Projection matrix w= (ω) 12 ,…,ω d )。
CN202310278160.1A 2023-03-21 2023-03-21 Multi-element time sequence unsupervised dimension reduction method based on global-local divergence Pending CN116401528A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310278160.1A CN116401528A (en) 2023-03-21 2023-03-21 Multi-element time sequence unsupervised dimension reduction method based on global-local divergence


Publications (1)

Publication Number Publication Date
CN116401528A true CN116401528A (en) 2023-07-07

Family

ID=87009515

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310278160.1A Pending CN116401528A (en) 2023-03-21 2023-03-21 Multi-element time sequence unsupervised dimension reduction method based on global-local divergence

Country Status (1)

Country Link
CN (1) CN116401528A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116738866A (en) * 2023-08-11 2023-09-12 中国石油大学(华东) Instant learning soft measurement modeling method based on time sequence feature extraction
CN116738866B (en) * 2023-08-11 2023-10-27 中国石油大学(华东) Instant learning soft measurement modeling method based on time sequence feature extraction


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination