CN110852366A - Data distance-preserving dimension reduction method containing missing data - Google Patents
- Publication number
- CN110852366A CN110852366A CN201911059239.5A CN201911059239A CN110852366A CN 110852366 A CN110852366 A CN 110852366A CN 201911059239 A CN201911059239 A CN 201911059239A CN 110852366 A CN110852366 A CN 110852366A
- Authority
- CN
- China
- Prior art keywords
- encoder
- data
- matrix
- sample
- loss function
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06N3/045—Combinations of networks
- G06N3/048—Activation functions
- G06N3/08—Learning methods
Abstract
The invention discloses a distance-preserving dimension reduction method for data containing missing values, in the technical field of data processing. In this method, a missing-data matrix keeps the missing part of the original data out of the calculation of the auto-encoder's loss function, so the auto-encoder can reduce the dimension of data containing missing values and the influence of the missing data on the auto-encoder is avoided. At the same time, the strong automatic learning capacity of the auto-encoder effectively captures complex nonlinear relations in the original data, and constraining the update of the encoder weight matrix in the loss function gives the dimension reduction a distance-preserving property: the low-dimensional data retain the distribution information of the original high-dimensional data to the greatest extent, which facilitates subsequent data processing and saves processing time and space.
Description
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a data distance-preserving dimension reduction method containing missing data based on an automatic encoder.
Background
With the advent of the big-data era and the spread of electronic devices, massive high-dimensional data are generated. Directly analysing and processing such data generally incurs large time and space costs, so dimension reduction, which maps high-dimensional data to a low-dimensional space while retaining the original information, is increasingly popular. Algorithms such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are applied to reduce the dimension of high-dimensional data and bring great convenience to subsequent data processing. However, most data generated in practice contain missing values, and traditional dimension reduction methods cannot process data with missing values.
Distance-preserving dimension reduction means that the low-dimensional data after reduction maintain, to some extent, the Euclidean distances between the high-dimensional original data. Existing dimension reduction algorithms can retain high-dimensional information, but they do not explicitly preserve distances during dimension reduction. It is generally considered that maintaining the Euclidean distances of the original data preserves its distribution, so that the reduced data retain as much of the information between the original sample points as possible. Although traditional distance-preserving dimension reduction algorithms are widely used in data processing, they adopt only linear models and cannot capture complex nonlinear information in high-dimensional data. Moreover, real high-dimensional data not only exhibit complex nonlinear relationships but often also have missing values in some dimensions, and traditional distance-preserving methods cannot handle such data effectively.
In 2006, Hinton and Salakhutdinov published "Reducing the Dimensionality of Data with Neural Networks" and applied the auto-encoder to data dimensionality reduction. Such an auto-encoder is a special fully connected neural network with a symmetric structure, trained so that its output reproduces its input. For a three-layer auto-encoder with input x_t, the output is x̂_t = f(w_2 f(w_1 x_t + b_1) + b_2), where f is an activation function, w_1, w_2 are the weight matrices of the auto-encoder, b_1, b_2 are its biases, and s denotes the batch size. The loss function of the auto-encoder is

L = (1/2s) Σ_{t=1}^{s} ||x̂_t − x_t||².

The auto-encoder is trained by the back-propagation algorithm: the weights and biases are updated during training so that the loss reaches its minimum, and by continually learning from sample data the network can capture complex nonlinear information in high-dimensional data. To prevent the auto-encoder from learning an identity function instead of structural information between data, the number of hidden-layer nodes is usually restricted to be much smaller than the number of input-layer nodes. After training, the encoder part of the auto-encoder performs the dimension reduction, and the encoder's output is the reduced data.
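As an illustration of the structure just described, the following is a minimal NumPy sketch of a three-layer auto-encoder forward pass and its reconstruction loss. The layer sizes and random data are arbitrary placeholders, not the patent's actual network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, h, s = 8, 3, 4                 # input dim, hidden dim, batch size (toy values)
w1 = rng.normal(0, 0.1, (h, m))   # encoder weight matrix
w2 = rng.normal(0, 0.1, (m, h))   # decoder weight matrix
b1 = np.zeros(h)
b2 = np.zeros(m)

X = rng.random((s, m))            # one batch of s input vectors
C = sigmoid(X @ w1.T + b1)        # hidden code: the reduced representation
X_hat = sigmoid(C @ w2.T + b2)    # reconstruction of the input

# reconstruction loss: L = 1/(2s) * sum_t ||x_hat_t - x_t||^2
loss = np.sum((X_hat - X) ** 2) / (2 * s)
print(C.shape, round(loss, 4))
```

After training drives the loss down, only the encoder half (`X @ w1.T` plus activation) would be used to produce the reduced data.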
Although the auto-encoder can learn the complex nonlinear structure of high-dimensional data, it cannot directly process missing data. A common workaround is data imputation: the mean or mode of the affected feature is filled into the missing positions. But such filling lacks rationality and harms both the dimension reduction and subsequent data analysis. Moreover, the auto-encoder does not explicitly constrain the reduced data to preserve distances.
Random Projection (RP) is a linear dimension reduction approach with a distance-preserving property. Let x ∈ R^m be a data vector to be reduced and R ∈ R^{d×m} a random matrix; the reduced vector is c = Rx, and the data distances before and after reduction satisfy:
(1-α)||xl-xt||≤||cl-ct||≤(1+α)||xl-xt||
where α is a constant between 0 and 1 (the smaller α, the better the reduction), and c_l, c_t are the reduced images of the vectors x_l, x_t. Research on random projection has produced several ways of constructing the random matrix R. As early as 1984, Johnson and Lindenstrauss showed that if R is a random unit-orthogonal matrix, i.e. its columns are mutually orthogonal and of length 1, then the random projection realised by R is distance preserving. Arriaga and Vempala, in "An algorithmic theory of learning: Robust concepts and random projection", proposed a "neuron-friendly" random projection; their main contribution was to prove that a random matrix R whose elements have mean 0 and variance 1 is distance preserving. They also noted that the only necessary condition for distance preservation is that the elements of R have mean 0, although their variance strongly influences the quality of the distance preservation. Li et al., in "Very sparse random projections", proposed a sparse random matrix R whose elements randomly take values in {−1, 0, 1}, and verified that the random projection realised by this sparse matrix is distance preserving. None of this work considers that adding an activation function may affect the distance preservation of the projection; in 2014, Bruna et al., in "Signal recovery from pooling representations", proved that a single-layer neural network using the ReLU or sigmoid function realises a distance-preserving embedding, namely:
A||xl-xt||≤||cl-ct||≤B||xl-xt||,0<A≤B
a and B are constants, | | ★ | | | represents a two-norm of a vector, the above formula can be regarded as a relaxation version of distance keeping property proved by Johnson and Lindenstaus, and the neural network which uses ReLU or sigmood as an activation function and satisfies random projection of a weight matrix has certain distance keeping property.
Disclosure of Invention
Aiming at the problems in the prior art that traditional dimension reduction algorithms cannot reduce the dimension of missing data, cannot effectively learn the complex nonlinear structure of high-dimensional data, and do not preserve distances during dimension reduction, the invention provides a distance-preserving dimension reduction method for data containing missing values.
The invention solves the technical problems through the following technical scheme: a data distance-keeping dimension reduction method containing missing data comprises the following steps:
Step 1: acquiring a sample data set, dividing it into a training sample set and a testing sample set, and vectorizing all samples in the data set to form a sample matrix; generating a missing-data matrix in one-to-one correspondence with the sample matrix, in which each element takes the value 1 or 0: 1 indicates that the corresponding entry of the sample matrix is present, and 0 that it is missing;
Step 2: constructing an auto-encoder and selecting its activation function and initialization method;
Step 3: designing the loss function of the auto-encoder of step 2 according to the missing-data matrix of step 1;
Step 4: selecting a sample vector from the training sample set of step 1 as the input of the auto-encoder, and calculating the value of the loss function of step 3;
Step 5: updating the weight matrix of the encoder inside the auto-encoder so that the update has the random projection property; judging whether the preset number of training iterations has been reached: if not, returning to step 4, otherwise proceeding to step 6;
Step 6: performing dimension reduction on the sample matrix of the test sample set with the trained auto-encoder.
In the distance-preserving dimension reduction method for missing data of the invention, missing-data matrices in one-to-one correspondence with the sample matrices are generated from the sample matrices, and the loss function of the auto-encoder is designed through them, so that the missing data do not participate in the calculation of the loss function and their influence on the auto-encoder is avoided. The auto-encoder is trained on the training sample set and continually learns the feature information of its different samples, learning both the features of the missing data and the nonlinear information between samples. During training, the encoder weight matrix in the loss function is updated so that its update has the random projection property, giving the auto-encoder a distance-preserving mapping during dimension reduction: the Euclidean distances between data sample points after reduction remain, to a certain degree, proportional to the Euclidean distances between the corresponding original sample points. The method therefore achieves distance-preserving dimension reduction of high-dimensional data with missing values, effectively learns the nonlinear information between high-dimensional data, makes the reduced data keep as much as possible of the characteristic information of and between the original data, and saves time and space for processing the low-dimensional data.
Further, in step 2, the network structure of the auto-encoder is input layer-first hidden layer-second hidden layer-third hidden layer-output layer; the encoder of the auto-encoder is input layer-first hidden layer-second hidden layer, and the decoder of the auto-encoder is second hidden layer-third hidden layer-output layer.
Further, in step 3, the loss function L(W_e, W_d, b_d) of the auto-encoder is:

L(W_e, W_d, b_d) = (1/2s) Σ_{t=1}^{s} || r_t ⊙ (x̂_t − x_t) ||²

where ||·|| denotes the two-norm of a vector, ⊙ denotes element-wise multiplication of two vectors, x_t is the t-th input vector of the auto-encoder, x̂_t is the output vector when x_t is the input, and r_t is the missing-data vector corresponding to x_t, with x_t ∈ X and r_t ∈ R; X ∈ R^{k×m} is the sample matrix of k sample vectors, each of length m, R is the missing-data matrix corresponding to X, s is the batch size, D(·; W_d, b_d) denotes the decoder, E(·; W_e) the encoder (so that x̂_t = D(E(x_t; W_e); W_d, b_d)), W_e is the weight matrix of the encoder, W_d the weight matrix of the decoder, and b_d the bias term of the decoder.
According to this calculation formula, whatever the missing entries of the input vector x_t are, the element-wise multiplication by the missing-data vector r_t keeps the missing part out of the loss calculation. The influence of the missing data on the auto-encoder is therefore eliminated, and high-dimensional data containing missing values can be reduced in dimension without losing rationality.
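A minimal NumPy sketch of this masked loss (the function name and toy data are illustrative only) shows how entries flagged as missing drop out of the calculation entirely:

```python
import numpy as np

def masked_loss(x_hat, x, r):
    """Loss in which missing entries (r == 0) do not contribute:
    L = 1/(2s) * sum_t || r_t * (x_hat_t - x_t) ||^2"""
    s = x.shape[0]
    return np.sum((r * (x_hat - x)) ** 2) / (2 * s)

x = np.array([[1.0, 2.0, 0.0],      # third entry missing (placeholder 0)
              [0.5, 0.0, 1.5]])     # second entry missing
r = np.array([[1, 1, 0],
              [1, 0, 1]])           # mask: 1 = observed, 0 = missing
x_hat = np.array([[1.0, 2.0, 9.9],  # reconstruction wrong only at masked spots
                  [0.5, 9.9, 1.5]])

print(masked_loss(x_hat, x, r))     # 0.0: errors at masked positions are ignored
```

The unmasked squared error of this example would be large, yet the masked loss is exactly zero, which is the mechanism that shields the auto-encoder from the missing entries.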
Further, in step 5, the weight matrix of the encoder in the loss function is updated by weight normalization, so that the updated encoder weight matrix has mean 0 and variance 1; the specific weight normalization formula is:

w'_ei = (w_ei − μ_ei E_ei) / σ_ei

where w_ei is the i-th encoder weight matrix after the BP-algorithm update and before weight normalization, w'_ei is the weight-normalized version of w_ei, μ_ei is the mean of w_ei, σ_ei is the standard deviation of w_ei (so that the variance of w'_ei is 1), and E_ei is a matrix whose elements are all 1.
After this update, the encoder weight matrix w'_ei has mean 0 and variance 1, and therefore satisfies the condition under which random projection preserves distances, as stated in the background: the elements of the random matrix R have mean 0 and variance 1. The auto-encoder of the invention can thus reduce the dimension of data containing missing values while preserving distances, maintaining the original distribution of the high-dimensional data to the greatest extent.
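The weight normalization step can be sketched as follows, assuming division by the standard deviation so that the result has variance 1 (the helper name is illustrative):

```python
import numpy as np

def standardize_weights(w):
    """Shift and scale w so its elements have mean 0 and variance 1:
    w' = (w - mu * E) / sigma, with E the all-ones matrix."""
    mu = w.mean()
    sigma = w.std()
    return (w - mu) / sigma

rng = np.random.default_rng(2)
w = rng.uniform(-0.5, 0.5, size=(10, 784))   # an updated encoder weight matrix
w_prime = standardize_weights(w)
print(round(w_prime.mean(), 6), round(w_prime.std(), 6))  # ~0.0 and ~1.0
```

Subtracting the scalar mean implements the μ_ei E_ei term, since NumPy broadcasts the scalar across the whole matrix.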
Further, a regularization term is added on the basis of the loss function in the step 3, and the weight matrix of the encoder is updated according to the value of the loss function after the regularization term is added.
To avoid the instability that updating the encoder weight matrix by direct weight normalization can introduce into the training process, the encoder weight matrix can instead be updated by adding a regularization term, so that the updated matrix still has mean 0 and variance 1 and the distance-preserving condition is met; the dimension reduction of the auto-encoder then preserves distances, and the regularization term can also remove redundant information.
Further, the regularization term L_r has the expression:

L_r = Σ_{i=1}^{c} ( μ_ei² + (σ_ei − 1)² )
the expression of the loss function after adding the regularization term is:
L_C = L + α L_r
where w_ei is the i-th encoder weight matrix, σ_ei its standard deviation, μ_ei its mean, c the number of encoder weight matrices, L the loss function without the regularization term, L_C the loss function with it, and α a hyperparameter.
Further, the regularization term L_r has the expression:

L_r = Σ_{i=1}^{c} || w_ei^T w_ei − I ||_F²
the expression of the loss function after adding the regularization term is:
L_C = L + α L_r
where ||·||_F is the Frobenius norm of a matrix, w_ei is the i-th encoder weight matrix, w_ei^T is its transpose, I is the identity matrix, c is the number of encoder weight matrices, L is the loss function without the regularization term, L_C is the loss function with it, and α is a hyperparameter.
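The unit-orthogonal regularization term can be sketched as follows (the helper name is illustrative; it applies the Frobenius-norm penalty to each encoder weight matrix and vanishes exactly when the columns are orthonormal):

```python
import numpy as np

def orthogonal_penalty(weights):
    """L_r = sum_i || w_i^T w_i - I ||_F^2 over the encoder weight matrices."""
    total = 0.0
    for w in weights:
        g = w.T @ w                          # Gram matrix of the columns
        total += np.sum((g - np.eye(g.shape[0])) ** 2)
    return total

# A matrix with exactly orthonormal columns incurs (numerically) zero penalty.
q, _ = np.linalg.qr(np.random.default_rng(3).normal(size=(20, 5)))
print(orthogonal_penalty([q]))               # ~0.0
```

During training this scalar would be multiplied by the hyperparameter α and added to the reconstruction loss, so gradient descent pushes the encoder weights toward orthonormality.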
Advantageous effects
Compared with the prior art, in the distance-preserving dimension reduction method for data containing missing values provided by the invention, the missing-data matrix keeps the missing part of the original data out of the calculation of the auto-encoder's loss function, so the auto-encoder can reduce the dimension of data containing missing values and the influence of the missing data on the auto-encoder is avoided. At the same time, the strong automatic learning capacity of the auto-encoder effectively captures both the complex nonlinear information among the original data and the feature information of the original data. The encoder weight matrix in the loss function is then updated so that its mean is 0 and its variance is 1 (satisfying the condition under which random projection preserves distances), which gives the dimension reduction a distance-preserving property: the reduced low-dimensional data retain the distribution information and specific characteristics of the original high-dimensional data to the greatest extent, facilitating subsequent data processing and saving processing time and space.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only one embodiment of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
FIG. 1 is an exemplary diagram of a handwritten digital picture in an embodiment of the invention;
FIG. 2 is a flow chart of a distance preserving dimension reduction method in an embodiment of the invention;
FIG. 3 is a network architecture diagram of an autoencoder in an embodiment of the present invention;
FIG. 4 shows test results on the MNIST data set without missing data in an embodiment of the present invention;
FIG. 5 shows test results on the MNIST data set with missing data in an embodiment of the present invention;
FIG. 6 is a distribution diagram of the distance ratios before and after dimension reduction for the three methods on the MNIST data set without missing data in an embodiment of the present invention;
wherein 1-input layer, 2-first hidden layer, 3-second hidden layer, 4-third hidden layer, 5-output layer, 6-encoder, 7-decoder.
Detailed Description
The technical solutions in the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The MNIST data set is a large handwritten-digit database collected and organized by the American National Institute of Standards and Technology. It contains 70,000 pictures and is divided into a training set of 60,000 pictures and a test set of 10,000 pictures. Each picture is a 28 × 28-pixel handwritten digit from 0 to 9 (shown in fig. 1), rendered as white strokes on a black background: the black background is represented by 0 and the white strokes by floating-point numbers between 0 and 1, where values closer to 1 are whiter. As shown in fig. 2, the specific implementation steps of the invention are as follows:
1. loading a data set, dividing the data set into a training sample set and a testing sample set, vectorizing all samples (each sample is a digital picture consisting of 784 pixel points) in the data set to form a sample matrix X consisting of a plurality of sample vectors, wherein each sample vector in the sample matrix X is a one-dimensional array with the length of 784, and each element in the one-dimensional array is a floating point number between 0 and 1.
A missing-data matrix R in one-to-one correspondence with the sample matrix X is generated from X; each element of R takes the value 1 or 0, where 1 indicates that the corresponding entry of X is present and 0 that it is missing. Let x_t be the t-th sample vector, x̂_t the output vector when x_t is the input of the auto-encoder, and r_t the missing-data vector corresponding to x_t, with x_t ∈ X, r_t ∈ R, and X ∈ R^{k×m} representing k sample vectors, each of length m. In this example, m = 784.
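A possible way to construct such a missing-data matrix, assuming missing pixels are marked as NaN (the patent does not specify how missing values are encoded on disk), is:

```python
import numpy as np

# A hypothetical sample matrix with NaN marking missing entries.
X = np.array([[0.1, np.nan, 0.7],
              [np.nan, 0.3, 0.9]])

# Missing-data matrix R: 1 where the value is present, 0 where it is missing.
R = (~np.isnan(X)).astype(int)

# Replace missing entries with a placeholder (0) before feeding the network;
# R keeps these placeholders out of the loss later on.
X_filled = np.nan_to_num(X, nan=0.0)
print(R.tolist())   # [[1, 0, 1], [0, 1, 1]]
```

The placeholder value is irrelevant to training because the mask zeroes out its contribution to the loss; 0 is simply a convenient choice.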
2. And constructing an automatic encoder, and selecting an activation function and an initialization method of the automatic encoder.
As shown in fig. 3, the network structure of the auto-encoder is input layer 1-first hidden layer 2-second hidden layer 3-third hidden layer 4-output layer 5; the encoder 6 of the auto-encoder is input layer 1-first hidden layer 2-second hidden layer 3, and the decoder 7 is second hidden layer 3-third hidden layer 4-output layer 5. The number of nodes is 784 for the input layer 1, 200 for the first hidden layer 2, 10 for the second hidden layer 3, 200 for the third hidden layer 4, and 784 for the output layer 5, so the dimension of the reduced sample vector is 10. In this embodiment, the activation function of the auto-encoder is the sigmoid function and the initialization method is uniform-distribution initialization.
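The layer shapes above can be set up as follows. Since the exact range of the uniform-distribution initialization is not specified, a Glorot-style limit is assumed here as an illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
sizes = [784, 200, 10, 200, 784]      # input, h1, h2 (the code layer), h3, output

# Uniform initialization in [-limit, limit]; the Glorot-style limit below is
# an assumption, as the patent only states "uniform distribution initialization".
weights = []
for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    weights.append(rng.uniform(-limit, limit, size=(fan_out, fan_in)))

print([w.shape for w in weights])
# [(200, 784), (10, 200), (200, 10), (784, 200)]
```

The first two matrices belong to the encoder (784 → 200 → 10) and the last two to the decoder (10 → 200 → 784), giving a 10-dimensional code.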
3. Designing a loss function for an auto-encoder
The loss function expression of the auto-encoder is:

L = (1/2s) Σ_{t=1}^{s} || r_t ⊙ (x̂_t − x_t) ||²   (1)

In formula (1), ||·|| denotes the two-norm of a vector, ⊙ denotes element-wise multiplication of two vectors, x_t is the t-th input vector of the auto-encoder, x̂_t is the output vector when x_t is the input, r_t is the missing-data vector corresponding to x_t, s is the batch size, w_e1, w_e2 are the weight matrices of the encoder, w_d1, w_d2 are the weight matrices of the decoder, b_d1, b_d2 are the bias terms of the decoder, and f is the activation function, so that x̂_t = f(w_d2 f(w_d1 f(w_e2 f(w_e1 x_t)) + b_d1) + b_d2). To maintain the proportional (distance-preserving) property, the encoder of the auto-encoder has no bias terms.
As can be seen from equation (1), whatever the missing entries of the input vector x_t are, the element-wise multiplication by the missing-data vector r_t keeps the missing part out of the loss calculation; the influence of the missing data on the auto-encoder is thus eliminated, and the auto-encoder can reduce the dimension of high-dimensional data containing missing values without losing rationality.
4. Sample vectors are randomly selected from the training sample set as inputs of the auto-encoder; the training set is divided into n batches, each containing s sample vectors, and the value of the loss function of formula (1) is calculated.
5. The weight matrix of the encoder inside the auto-encoder is updated so that the update has the random projection property. After all batches have been trained, whether the preset number of training iterations has been reached is checked; if not, the procedure returns to step 4, otherwise it proceeds to step 6.
The invention provides three different methods for restricting the updating of the weight matrix of the encoder in the automatic encoder, so that the updating of the weight matrix of the encoder has the random projection characteristic.
The first method is weight normalization. Let w_e1 be the input-layer-to-first-hidden-layer weight matrix after the BP-algorithm update and before weight normalization, and w'_e1 its weight-normalized version; let w_e2 be the first-hidden-layer-to-second-hidden-layer weight matrix after the BP-algorithm update and before weight normalization, and w'_e2 its weight-normalized version. Let μ_e1, μ_e2 be the means of w_e1, w_e2 respectively, σ_e1, σ_e2 their standard deviations, and E_e1, E_e2 matrices whose elements are all 1. The weight normalization formulas are:

w'_e1 = (w_e1 − μ_e1 E_e1) / σ_e1
w'_e2 = (w_e2 − μ_e2 E_e2) / σ_e2
constraining encoder weight matrix w by weight normalization processing methode1、we2Updating of (2) to make the updated weight matrix w'e1、w′e2Has a mean value of 0 and a variance of 1, and satisfies the condition (random matrix w ') that the random projection in the background art has distance keeping property'e1、w′e2The mean value of the inner elements is 0, and the variance is 1), so that the dimension reduction processing of the automatic encoder has distance keeping property, the BP algorithm is firstly updated, and then the weight standardization processing is carried out, so as to reduce the copy error.
However, normalizing the weights directly in this way can affect the stability of training. This embodiment therefore proposes another method of constraining the update of the encoder weight matrices: adding a regularization term to the original auto-encoder loss function (1). One regularization term drives the mean of the encoder weight elements to 0 and their variance to 1; the other is a unit-orthogonal regularization term. The regularization term for encoder weight normalization (the second constraint method) is:

L_r = Σ_{i=1}^{c} ( μ_ei² + (σ_ei − 1)² )   (3)
The expression of the loss function after adding the regularization term is:
L_C = L + α·L_r    (4)
where w_ei is the i-th encoder weight matrix, σ_ei is the variance of the elements of w_ei, μ_ei is their mean, the number c of encoder weight matrices is 2, L is the loss function without the regularization term, L_C is the loss function after adding it, and α is a hyperparameter. The value of α is determined by testing: an initial value is set, and whether α is satisfactory is checked through the CV value or the TopK score; if not, testing continues until the CV value is lower or the TopK score is higher.
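A minimal sketch of this second constraint method (illustrative only; NumPy, the function names, and the matrix shapes are assumptions) computes L_r and the combined loss L_C = L + α·L_r:

```python
import numpy as np

def meanstd_regularizer(weight_matrices):
    """L_r = sum_i (mu_ei**2 + (sigma_ei - 1)**2): penalizes encoder weight
    matrices whose element mean deviates from 0 or whose element variance
    deviates from 1."""
    return sum(w.mean() ** 2 + (w.var() - 1.0) ** 2 for w in weight_matrices)

def total_loss(reconstruction_loss, weight_matrices, alpha):
    # L_C = L + alpha * L_r, as in formula (4)
    return reconstruction_loss + alpha * meanstd_regularizer(weight_matrices)

rng = np.random.default_rng(1)
w_e1, w_e2 = rng.normal(size=(784, 128)), rng.normal(size=(128, 10))
# The regularizer is non-negative, so L_C can never fall below L:
print(total_loss(0.5, [w_e1, w_e2], alpha=0.1) >= 0.5)  # -> True
```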
The second, unit-orthogonality regularization term (the third constraint method) is:

L_r = Σ_{i=1}^{c} ||w_ei^T·w_ei − I||_F²    (5)
The expression of the loss function after adding the regularization term is:
L_C = L + α·L_r    (6)
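A corresponding sketch of the third constraint method (again illustrative; NumPy and the shapes are assumptions) computes the unit-orthogonality penalty from the Gram matrix of each encoder weight matrix:

```python
import numpy as np

def ortho_regularizer(weight_matrices):
    """L_r = sum_i ||w_ei^T w_ei - I||_F^2: drives each encoder weight matrix
    toward having orthonormal columns."""
    total = 0.0
    for w in weight_matrices:
        gram = w.T @ w                      # (d_out, d_out) Gram matrix
        eye = np.eye(gram.shape[0])
        total += np.linalg.norm(gram - eye, "fro") ** 2
    return total

# A matrix with orthonormal columns incurs a (numerically) zero penalty:
q, _ = np.linalg.qr(np.random.default_rng(2).normal(size=(784, 128)))
print(ortho_regularizer([q]) < 1e-12)  # -> True
```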
After the regularization term is added, the value of the loss function is computed, and the encoder weight matrices are updated according to that value, driving their element mean toward 0 and variance toward 1. In this embodiment the autoencoder is updated with the BP algorithm. After each weight update, if the preset number of training iterations has not been reached, the procedure returns to step 4 to continue training and further reduce the loss; once the preset number of iterations is reached (the autoencoder has converged), the procedure moves to step 6, the autoencoder having been trained.
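Steps 4-5 repeatedly evaluate the loss and update the weights. The reconstruction part of that loss (formula (1), detailed in claim 3) masks out missing entries through the missing data vector r_t; a minimal sketch of the masked loss, assuming NumPy arrays and hypothetical toy values:

```python
import numpy as np

def masked_loss(x_batch, xhat_batch, r_batch):
    """Batch mean of ||r_t * (x_t - xhat_t)||^2: positions where r_t = 0
    (missing data) contribute no reconstruction error."""
    diff = r_batch * (x_batch - xhat_batch)   # elementwise mask (the circled-dot product)
    return np.mean(np.sum(diff ** 2, axis=1))

x    = np.array([[1.0, 2.0, 3.0]])
xhat = np.array([[1.0, 0.0, 3.0]])   # poor reconstruction at position 1
r    = np.array([[1.0, 0.0, 1.0]])   # position 1 is missing, so it is masked out
print(masked_loss(x, xhat, r))        # -> 0.0 (the only error falls on a missing entry)
```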
6. The trained autoencoder is used to perform dimensionality reduction on the sample matrices of the test sample set; the dimension of the output vectors after reduction is 10.
To evaluate the stability of the dimensionality reduction, this example uses two evaluation methods. The first is the TopK algorithm (prior art): using the distances between samples, it computes the K nearest neighbours of each sample (K = 10 in this example), compares how many of the K neighbours are shared before and after dimensionality reduction, and computes the TopK score. Let N_i^b and N_i^a denote the sets of K nearest neighbours of the i-th sample before and after dimensionality reduction respectively, let |N| denote the number of elements in a set N, and let n be the number of samples. The TopK score is:

TopK = (1/n) Σ_{i=1}^{n} |N_i^b ∩ N_i^a| / K
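The TopK evaluation described above can be sketched as follows (an illustration assuming NumPy and Euclidean distances; the helper names are hypothetical):

```python
import numpy as np

def topk_score(X_high, X_low, k=10):
    """Average fraction of each sample's k nearest neighbours (by Euclidean
    distance) that is preserved after dimensionality reduction."""
    def knn_sets(X):
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)               # a sample is not its own neighbour
        return [set(np.argsort(row)[:k]) for row in d]
    before, after = knn_sets(X_high), knn_sets(X_low)
    return np.mean([len(b & a) / k for b, a in zip(before, after)])

# An exact isometry (here: the identity map) keeps every neighbour set, so the score is 1.0
X = np.random.default_rng(3).normal(size=(50, 20))
print(topk_score(X, X.copy(), k=10))  # -> 1.0
```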
the dimension reduction method with higher TopK score can better save the distance between high-dimensional data before dimension reduction.
The second evaluation method is the CV method, which represents the stability of the dimensionality reduction more intuitively through the distribution of before/after distance ratios. Several pairs of sample points are selected at random; for each pair, the ratio of the distance before reduction to the distance after reduction is computed, and the distribution of these ratios over the batch is plotted. The coefficient of variation (CV), also called the coefficient of dispersion, is a normalized measure of the dispersion of a probability distribution, defined as the ratio of the standard deviation to the mean. In this embodiment the CV describes the dispersion of the distribution of before/after distance ratios; a dimensionality reduction method with a small CV value has good stability. Let x_t denote the t-th sample vector before reduction and c_t its reduced counterpart, c_t = E(x_t; W_e). The CV value and the TopK score of the reduced samples are computed, and the distribution of the before/after distance ratios is plotted.
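The CV evaluation can be sketched in the same spirit (illustrative; the pair-sampling scheme and names are assumptions, not the patent's exact procedure):

```python
import numpy as np

def cv_of_distance_ratios(X_high, X_low, n_pairs=1000, seed=0):
    """Coefficient of variation (std/mean) of before/after distance ratios over
    randomly chosen sample pairs; a smaller CV indicates a more stable,
    distance-preserving dimensionality reduction."""
    rng = np.random.default_rng(seed)
    n = X_high.shape[0]
    i = rng.integers(0, n, n_pairs)
    j = rng.integers(0, n, n_pairs)
    keep = i != j                                  # discard degenerate (zero-distance) pairs
    d_hi = np.linalg.norm(X_high[i[keep]] - X_high[j[keep]], axis=1)
    d_lo = np.linalg.norm(X_low[i[keep]] - X_low[j[keep]], axis=1)
    ratios = d_hi / d_lo
    return ratios.std() / ratios.mean()

# Pure scaling changes all distances by the same factor, so the CV is ~0
X = np.random.default_rng(4).normal(size=(100, 30))
print(cv_of_distance_ratios(X, 0.5 * X) < 1e-12)  # -> True
```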
Figs. 4-6 show the experimental results. Fig. 4 shows the results of the different dimensionality reduction methods on the MNIST dataset without missing data. In Fig. 4, ortho denotes the autoencoder with the unit-orthogonality regularization term (third method), meanstd_hard the autoencoder with encoder weight standardization (first method), meanstd_soft the autoencoder that standardizes the encoder weights through a regularization term (second method), AE an ordinary autoencoder, and PCA the principal component analysis algorithm. Since the activation function and the initialization method of the autoencoder affect the test results, Fig. 4 compares the ReLU and sigmoid activation functions as well as uniform (uniform) and Gaussian (normal) initialization; the bold italic entries mark the best results. As Fig. 4 shows, the CV values of the three constraint methods proposed by the invention are all lower than that of the ordinary autoencoder, with the ortho method lowest, and their TopK scores are all higher than those of the ordinary autoencoder and the PCA algorithm, with the ortho method highest. This demonstrates the feasibility and superiority of the distance-preserving dimensionality reduction method of the invention.
Fig. 5 shows the results of the different dimensionality reduction methods on the MNIST dataset with artificially introduced missing data; the abscissa is the proportion of missing data and the ordinate is the TopK score. The sigmoid activation function and uniform initialization were used.
Fig. 6 shows the distribution of before/after distance ratios for three methods, the improved autoencoder ortho_AE, the ordinary autoencoder AE, and principal component analysis PCA, on the MNIST dataset without missing data; the abscissa is the ratio of the distance between a pair of sample points before reduction to the distance after reduction, and the ordinate is the probability density. As Fig. 6 shows, the CV values of ortho_AE, AE, and PCA are 0.153, 0.247, and 0.162 respectively; ortho_AE has the smallest CV value, indicating that its distance-preserving stability is the best.
The above disclosure describes only specific embodiments of the present invention, but the scope of the invention is not limited thereto; any changes or modifications that a person skilled in the art could readily conceive within the technical scope disclosed herein shall fall within the scope of the present invention.
Claims (7)
1. A data distance-preserving dimensionality reduction method for data containing missing data, characterized by comprising the following steps:
step 1: acquiring a sample data set, dividing it into a training sample set and a test sample set, and vectorizing all samples in the sample data set to form sample matrices; generating missing data matrices in one-to-one correspondence with the sample matrices, wherein each element of a missing data matrix is 1 or 0: 1 indicates that the data at that position of the sample matrix is normal, and 0 indicates that it is missing;
step 2: constructing an autoencoder, and selecting its activation function and initialization method;
step 3: designing a loss function for the autoencoder of step 2 according to the missing data matrix of step 1;
step 4: selecting a sample vector from the training sample set of step 1 as input to the autoencoder, and calculating the value of the loss function of step 3;
step 5: updating the weight matrix of the encoder in the autoencoder so that the updated encoder weight matrix has the random projection property; judging whether the preset number of training iterations has been reached: if the training count is less than or equal to the preset number, going to step 4; otherwise, going to step 6;
step 6: performing dimensionality reduction on the sample matrices of the test sample set using the trained autoencoder.
2. The data distance-preserving dimensionality reduction method according to claim 1, wherein in step 2 the network structure of the autoencoder is input layer - first hidden layer - second hidden layer - third hidden layer - output layer; the encoder of the autoencoder comprises input layer - first hidden layer - second hidden layer, and the decoder comprises second hidden layer - third hidden layer - output layer.
3. The data distance-preserving dimensionality reduction method according to claim 1, wherein in step 3 the loss function L(W_e, W_d, b_d) of the autoencoder is:

L(W_e, W_d, b_d) = (1/s) Σ_{t=1}^{s} || r_t ⊙ (x_t − x̂_t) ||²

where ||·|| denotes the two-norm of a vector, ⊙ denotes element-by-element multiplication of two vectors, x_t is the t-th input vector of the autoencoder, x̂_t = D(E(x_t; W_e); W_d, b_d) is the output vector of the autoencoder when x_t is the input, r_t is the missing data vector corresponding to the input vector x_t, x_t ∈ X, r_t ∈ R, X is a sample matrix containing k sample vectors each of length m, R is the missing data matrix corresponding to the sample matrix X, s is the size of one batch, D(·; W_d, b_d) denotes the decoder, E(·; W_e) denotes the encoder, W_e is the weight matrix of the encoder, W_d is the weight matrix of the decoder, and b_d is the bias term of the decoder.
4. The data distance-preserving dimensionality reduction method according to claim 1, wherein in step 5 the weight matrix of the encoder in the loss function is updated by weight standardization, so that the updated encoder weight matrix has mean 0 and variance 1; the weight standardization formula is:

w'_ei = (w_ei − μ_ei·E_ei) / √σ_ei

where w_ei is the i-th encoder weight matrix after the BP-algorithm update and before weight standardization, w'_ei is the matrix obtained from w_ei by weight standardization, μ_ei is the mean of the elements of w_ei, σ_ei is their variance, and E_ei is an all-ones matrix.
5. The data distance-preserving dimensionality reduction method according to claim 1, wherein a regularization term is added to the loss function of step 3, and the weight matrix of the encoder is updated according to the value of the loss function after the regularization term is added.
6. The data distance-preserving dimensionality reduction method according to claim 5, wherein the expression of the regularization term L_r is:

L_r = Σ_{i=1}^{c} (μ_ei² + (σ_ei − 1)²)
The expression of the loss function after adding the regularization term is:
L_C = L + α·L_r
where w_ei is the i-th encoder weight matrix, σ_ei is the variance of the elements of w_ei, μ_ei is their mean, c is the number of encoder weight matrices, L is the loss function without the regularization term, L_C is the loss function after adding the regularization term, and α is a hyperparameter.
7. The data distance-preserving dimensionality reduction method according to claim 5, wherein the expression of the regularization term L_r is:

L_r = Σ_{i=1}^{c} ||w_ei^T·w_ei − I||_F²
The expression of the loss function after adding the regularization term is:
L_C = L + α·L_r
where ||·||_F is the Frobenius norm of a matrix, w_ei is the i-th encoder weight matrix, w_ei^T is the transpose of w_ei, I is the identity matrix, c is the number of encoder weight matrices, L is the loss function without the regularization term, L_C is the loss function after adding the regularization term, and α is a hyperparameter.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911059239.5A CN110852366A (en) | 2019-11-01 | 2019-11-01 | Data distance-preserving dimension reduction method containing missing data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110852366A true CN110852366A (en) | 2020-02-28 |
Family
ID=69598532
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111967499A (en) * | 2020-07-21 | 2020-11-20 | 电子科技大学 | Data dimension reduction method based on self-learning |
CN111967499B (en) * | 2020-07-21 | 2023-04-07 | 电子科技大学 | Data dimension reduction method based on self-learning |
CN112183723A (en) * | 2020-09-17 | 2021-01-05 | 西北工业大学 | Data processing method for clinical detection data missing problem |
CN112488321A (en) * | 2020-12-07 | 2021-03-12 | 重庆邮电大学 | Antagonistic machine learning defense method oriented to generalized nonnegative matrix factorization algorithm |
CN112488321B (en) * | 2020-12-07 | 2022-07-01 | 重庆邮电大学 | Antagonistic machine learning defense method oriented to generalized nonnegative matrix factorization algorithm |
CN113704073A (en) * | 2021-09-02 | 2021-11-26 | 交通运输部公路科学研究所 | Method for detecting abnormal data of automobile maintenance record library |
CN113704073B (en) * | 2021-09-02 | 2024-06-04 | 交通运输部公路科学研究所 | Method for detecting abnormal data of automobile maintenance record library |
CN114722943A (en) * | 2022-04-11 | 2022-07-08 | 深圳市人工智能与机器人研究院 | Data processing method, device and equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110852366A (en) | Data distance-preserving dimension reduction method containing missing data | |
Zhang et al. | Blind image quality assessment using a deep bilinear convolutional neural network | |
Peng et al. | Robust semi-supervised nonnegative matrix factorization for image clustering | |
Li et al. | Deep independently recurrent neural network (indrnn) | |
Ristovski et al. | Continuous conditional random fields for efficient regression in large fully connected graphs | |
CN112464865A (en) | Facial expression recognition method based on pixel and geometric mixed features | |
EP2973241A2 (en) | Signal processing systems | |
WO2022105117A1 (en) | Method and device for image quality assessment, computer device, and storage medium | |
Xing et al. | A self-organizing incremental neural network based on local distribution learning | |
Yang | A CNN-based broad learning system | |
Zhou et al. | Improved cross-label suppression dictionary learning for face recognition | |
Wan et al. | Feature consistency training with JPEG compressed images | |
CN112749737A (en) | Image classification method and device, electronic equipment and storage medium | |
He et al. | Finger vein image deblurring using neighbors-based binary-GAN (NB-GAN) | |
Liu et al. | Modal regression-based graph representation for noise robust face hallucination | |
Li et al. | A graphical approach for filter pruning by exploring the similarity relation between feature maps | |
Liu et al. | Revisiting pseudo-label for single-positive multi-label learning | |
Yan et al. | A parameter-free framework for general supervised subspace learning | |
Zhao et al. | Exploiting channel similarity for network pruning | |
CN110288002B (en) | Image classification method based on sparse orthogonal neural network | |
Yao | A compressed deep convolutional neural networks for face recognition | |
CN111401440A (en) | Target classification recognition method and device, computer equipment and storage medium | |
CN114882288B (en) | Multi-view image classification method based on hierarchical image enhancement stacking self-encoder | |
Zhang et al. | Research On Face Image Clustering Based On Integrating Som And Spectral Clustering Algorithm | |
Yang et al. | Fine-grained image quality caption with hierarchical semantics degradation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | | |
SE01 | Entry into force of request for substantive examination | | |
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20200228 |