CN110852366A - Data distance-preserving dimension reduction method containing missing data - Google Patents
- Publication number
- CN110852366A CN110852366A CN201911059239.5A CN201911059239A CN110852366A CN 110852366 A CN110852366 A CN 110852366A CN 201911059239 A CN201911059239 A CN 201911059239A CN 110852366 A CN110852366 A CN 110852366A
- Authority
- CN
- China
- Prior art keywords
- encoder
- data
- matrix
- sample
- loss function
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06N3/045—Combinations of networks
- G06N3/048—Activation functions
- G06N3/08—Learning methods
Abstract
The invention discloses a distance-preserving dimension reduction method for data containing missing values, in the technical field of data processing. In this method, a missing-data matrix keeps the missing part of the original data out of the calculation of the auto-encoder's loss function, so the auto-encoder can reduce the dimension of data containing missing values and the influence of the missing data on the auto-encoder is avoided. At the same time, the strong automatic learning capacity of the auto-encoder effectively captures complex nonlinear relations in the original data, and constraining the update of the encoder weight matrix in the loss function gives the dimension reduction a distance-preserving property: the low-dimensional data retain the distribution information of the original high-dimensional data to the greatest extent, which facilitates subsequent data processing and saves processing time and space.
Description
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a data distance-preserving dimension reduction method containing missing data based on an automatic encoder.
Background
With the advent of the big-data era and the spread of electronic devices, massive high-dimensional data are generated. Directly analysing and processing such data generally incurs large time and space costs, so dimension reduction, which maps high-dimensional data to a low-dimensional space while retaining the original information, is increasingly popular. Algorithms such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are applied to reduce the dimension of high-dimensional data and bring great convenience to subsequent data processing. However, most data generated in practice contain missing values, and traditional dimension reduction methods cannot process data with missing values.
Distance-preserving dimension reduction means that the low-dimensional data after reduction maintain, to some extent, the Euclidean distances between the high-dimensional original data. Existing dimension reduction algorithms can retain high-dimensional information, but they do not explicitly preserve distances during dimension reduction. It is generally considered that maintaining the Euclidean distances of the original data preserves its distribution, so that the reduced data retain as much of the information between the original sample points as possible. Although traditional distance-preserving dimension reduction algorithms are widely used in data processing, they adopt only linear models and cannot capture complex nonlinear information in high-dimensional data. Moreover, real high-dimensional data not only exhibit complex nonlinear relationships but often also have missing values in some dimensions, and traditional distance-preserving methods cannot handle such data effectively.
In 2006, Hinton and Salakhutdinov published "Reducing the Dimensionality of Data with Neural Networks" and applied the auto-encoder to data dimensionality reduction. Such an auto-encoder is a special fully connected neural network with a symmetric structure, trained so that its output reproduces its input. For a three-layer auto-encoder with input x_t, the output is x̂_t = f(w_2 f(w_1 x_t + b_1) + b_2), where f is an activation function, w_1, w_2 are the weight matrices of the auto-encoder, b_1, b_2 are its biases, and s denotes the batch size. The loss function of the auto-encoder is

L = (1/2s) Σ_{t=1}^{s} ||x̂_t − x_t||².

The auto-encoder is trained by the back-propagation algorithm: the weights and biases are updated during training so that the loss reaches its minimum, and by continually learning from sample data the network can capture complex nonlinear information in high-dimensional data. To prevent the auto-encoder from learning an identity function instead of structural information between data, the number of hidden-layer nodes is usually restricted to be much smaller than the number of input-layer nodes. After training, the encoder part of the auto-encoder performs the dimension reduction, and the encoder's output is the reduced data.
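As an illustration of the structure just described, the following is a minimal NumPy sketch of a three-layer auto-encoder forward pass and its reconstruction loss. The layer sizes and random data are arbitrary placeholders, not the patent's actual network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, h, s = 8, 3, 4                 # input dim, hidden dim, batch size (toy values)
w1 = rng.normal(0, 0.1, (h, m))   # encoder weight matrix
w2 = rng.normal(0, 0.1, (m, h))   # decoder weight matrix
b1 = np.zeros(h)
b2 = np.zeros(m)

X = rng.random((s, m))            # one batch of s input vectors
C = sigmoid(X @ w1.T + b1)        # hidden code: the reduced representation
X_hat = sigmoid(C @ w2.T + b2)    # reconstruction of the input

# reconstruction loss: L = 1/(2s) * sum_t ||x_hat_t - x_t||^2
loss = np.sum((X_hat - X) ** 2) / (2 * s)
print(C.shape, round(loss, 4))
```

After training drives the loss down, only the encoder half (`X @ w1.T` plus activation) would be used to produce the reduced data.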
Although the auto-encoder can learn the complex nonlinear structure of high-dimensional data, it cannot directly process missing data. A common workaround is data imputation: the mean or mode of the affected feature is filled into the missing positions. But such filling lacks rationality and harms both the dimension reduction and subsequent data analysis. Moreover, the auto-encoder does not explicitly constrain the reduced data to preserve distances.
Random Projection (RP) is a linear dimension reduction approach with a distance-preserving property. Let x ∈ R^m be a data vector to be reduced and R ∈ R^{d×m} a random matrix; the reduced vector is c = Rx, and the data distances before and after reduction satisfy:
(1-α)||xl-xt||≤||cl-ct||≤(1+α)||xl-xt||
where α is a constant between 0 and 1 (the smaller α, the better the reduction), and c_l, c_t are the reduced images of the vectors x_l, x_t. Research on random projection has produced several ways of constructing the random matrix R. As early as 1984, Johnson and Lindenstrauss showed that if R is a random unit-orthogonal matrix, i.e. its columns are mutually orthogonal and of length 1, then the random projection realised by R is distance preserving. Arriaga and Vempala, in "An algorithmic theory of learning: Robust concepts and random projection", proposed a "neuron-friendly" random projection; their main contribution was to prove that a random matrix R whose elements have mean 0 and variance 1 is distance preserving. They also noted that the only necessary condition for distance preservation is that the elements of R have mean 0, although their variance strongly influences the quality of the distance preservation. Li et al., in "Very sparse random projections", proposed a sparse random matrix R whose elements randomly take values in {−1, 0, 1}, and verified that the random projection realised by this sparse matrix is distance preserving. None of this work considers that adding an activation function may affect the distance preservation of the projection; in 2014, Bruna et al., in "Signal recovery from pooling representations", proved that a single-layer neural network using the ReLU or sigmoid function realises a distance-preserving embedding, namely:
A||xl-xt||≤||cl-ct||≤B||xl-xt||,0<A≤B
a and B are constants, | | ★ | | | represents a two-norm of a vector, the above formula can be regarded as a relaxation version of distance keeping property proved by Johnson and Lindenstaus, and the neural network which uses ReLU or sigmood as an activation function and satisfies random projection of a weight matrix has certain distance keeping property.
Disclosure of Invention
Aiming at the problems in the prior art that traditional dimension reduction algorithms cannot reduce the dimension of missing data, cannot effectively learn the complex nonlinear structure of high-dimensional data, and do not preserve distances during dimension reduction, the invention provides a distance-preserving dimension reduction method for data containing missing values.
The invention solves the technical problems through the following technical scheme: a data distance-keeping dimension reduction method containing missing data comprises the following steps:
Step 1: acquiring a sample data set, dividing it into a training sample set and a testing sample set, and vectorizing all samples in the data set to form a sample matrix; generating a missing-data matrix in one-to-one correspondence with the sample matrix, in which each element takes the value 1 or 0: 1 indicates that the corresponding entry of the sample matrix is present, and 0 that it is missing;
Step 2: constructing an auto-encoder and selecting its activation function and initialization method;
Step 3: designing the loss function of the auto-encoder of step 2 according to the missing-data matrix of step 1;
Step 4: selecting a sample vector from the training sample set of step 1 as the input of the auto-encoder, and calculating the value of the loss function of step 3;
Step 5: updating the weight matrix of the encoder inside the auto-encoder so that the update has the random projection property; judging whether the preset number of training iterations has been reached: if not, returning to step 4, otherwise proceeding to step 6;
Step 6: performing dimension reduction on the sample matrix of the test sample set with the trained auto-encoder.
In the distance-preserving dimension reduction method for missing data of the invention, missing-data matrices in one-to-one correspondence with the sample matrices are generated from the sample matrices, and the loss function of the auto-encoder is designed through them, so that the missing data do not participate in the calculation of the loss function and their influence on the auto-encoder is avoided. The auto-encoder is trained on the training sample set and continually learns the feature information of its different samples, learning both the features of the missing data and the nonlinear information between samples. During training, the encoder weight matrix in the loss function is updated so that its update has the random projection property, giving the auto-encoder a distance-preserving mapping during dimension reduction: the Euclidean distances between data sample points after reduction remain, to a certain degree, proportional to the Euclidean distances between the corresponding original sample points. The method therefore achieves distance-preserving dimension reduction of high-dimensional data with missing values, effectively learns the nonlinear information between high-dimensional data, makes the reduced data keep as much as possible of the characteristic information of and between the original data, and saves time and space for processing the low-dimensional data.
Further, in step 2, the network structure of the auto-encoder is input layer-first hidden layer-second hidden layer-third hidden layer-output layer; the encoder of the auto-encoder is input layer-first hidden layer-second hidden layer, and the decoder of the auto-encoder is second hidden layer-third hidden layer-output layer.
Further, in step 3, the loss function L(W_e, W_d, b_d) of the auto-encoder is:

L(W_e, W_d, b_d) = (1/2s) Σ_{t=1}^{s} || r_t ⊙ (x̂_t − x_t) ||²

where ||·|| denotes the two-norm of a vector, ⊙ denotes element-wise multiplication of two vectors, x_t is the t-th input vector of the auto-encoder, x̂_t is the output vector when x_t is the input, and r_t is the missing-data vector corresponding to x_t, with x_t ∈ X and r_t ∈ R; X ∈ R^{k×m} is the sample matrix of k sample vectors, each of length m, R is the missing-data matrix corresponding to X, s is the batch size, D(·; W_d, b_d) denotes the decoder, E(·; W_e) the encoder (so that x̂_t = D(E(x_t; W_e); W_d, b_d)), W_e is the weight matrix of the encoder, W_d the weight matrix of the decoder, and b_d the bias term of the decoder.
According to this calculation formula, whatever the missing entries of the input vector x_t are, the element-wise multiplication by the missing-data vector r_t keeps the missing part out of the loss calculation. The influence of the missing data on the auto-encoder is therefore eliminated, and high-dimensional data containing missing values can be reduced in dimension without losing rationality.
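A minimal NumPy sketch of this masked loss (the function name and toy data are illustrative only) shows how entries flagged as missing drop out of the calculation entirely:

```python
import numpy as np

def masked_loss(x_hat, x, r):
    """Loss in which missing entries (r == 0) do not contribute:
    L = 1/(2s) * sum_t || r_t * (x_hat_t - x_t) ||^2"""
    s = x.shape[0]
    return np.sum((r * (x_hat - x)) ** 2) / (2 * s)

x = np.array([[1.0, 2.0, 0.0],      # third entry missing (placeholder 0)
              [0.5, 0.0, 1.5]])     # second entry missing
r = np.array([[1, 1, 0],
              [1, 0, 1]])           # mask: 1 = observed, 0 = missing
x_hat = np.array([[1.0, 2.0, 9.9],  # reconstruction wrong only at masked spots
                  [0.5, 9.9, 1.5]])

print(masked_loss(x_hat, x, r))     # 0.0: errors at masked positions are ignored
```

The unmasked squared error of this example would be large, yet the masked loss is exactly zero, which is the mechanism that shields the auto-encoder from the missing entries.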
Further, in step 5, the weight matrix of the encoder in the loss function is updated by weight normalization, so that the updated encoder weight matrix has mean 0 and variance 1; the specific weight normalization formula is:

w'_ei = (w_ei − μ_ei E_ei) / σ_ei

where w_ei is the i-th encoder weight matrix after the BP-algorithm update and before weight normalization, w'_ei is the weight-normalized version of w_ei, μ_ei is the mean of w_ei, σ_ei is the standard deviation of w_ei (so that the variance of w'_ei is 1), and E_ei is a matrix whose elements are all 1.
After this update, the encoder weight matrix w'_ei has mean 0 and variance 1, and therefore satisfies the condition under which random projection preserves distances, as stated in the background: the elements of the random matrix R have mean 0 and variance 1. The auto-encoder of the invention can thus reduce the dimension of data containing missing values while preserving distances, maintaining the original distribution of the high-dimensional data to the greatest extent.
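The weight normalization step can be sketched as follows, assuming division by the standard deviation so that the result has variance 1 (the helper name is illustrative):

```python
import numpy as np

def standardize_weights(w):
    """Shift and scale w so its elements have mean 0 and variance 1:
    w' = (w - mu * E) / sigma, with E the all-ones matrix."""
    mu = w.mean()
    sigma = w.std()
    return (w - mu) / sigma

rng = np.random.default_rng(2)
w = rng.uniform(-0.5, 0.5, size=(10, 784))   # an updated encoder weight matrix
w_prime = standardize_weights(w)
print(round(w_prime.mean(), 6), round(w_prime.std(), 6))  # ~0.0 and ~1.0
```

Subtracting the scalar mean implements the μ_ei E_ei term, since NumPy broadcasts the scalar across the whole matrix.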
Further, a regularization term is added on the basis of the loss function in the step 3, and the weight matrix of the encoder is updated according to the value of the loss function after the regularization term is added.
To avoid the instability that updating the encoder weight matrix by direct weight normalization can introduce into the training process, the encoder weight matrix can instead be updated by adding a regularization term, so that the updated matrix still has mean 0 and variance 1 and the distance-preserving condition is met; the dimension reduction of the auto-encoder then preserves distances, and the regularization term can also remove redundant information.
Further, the regularization term L_r has the expression:

L_r = Σ_{i=1}^{c} ( μ_ei² + (σ_ei − 1)² )
the expression of the loss function after adding the regularization term is:
L_C = L + α L_r
where w_ei is the i-th encoder weight matrix, σ_ei its standard deviation, μ_ei its mean, c the number of encoder weight matrices, L the loss function without the regularization term, L_C the loss function with it, and α a hyperparameter.
Further, the regularization term L_r has the expression:

L_r = Σ_{i=1}^{c} || w_ei^T w_ei − I ||_F²
the expression of the loss function after adding the regularization term is:
L_C = L + α L_r
where ||·||_F is the Frobenius norm of a matrix, w_ei is the i-th encoder weight matrix, w_ei^T is its transpose, I is the identity matrix, c is the number of encoder weight matrices, L is the loss function without the regularization term, L_C is the loss function with it, and α is a hyperparameter.
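The unit-orthogonal regularization term can be sketched as follows (the helper name is illustrative; it applies the Frobenius-norm penalty to each encoder weight matrix and vanishes exactly when the columns are orthonormal):

```python
import numpy as np

def orthogonal_penalty(weights):
    """L_r = sum_i || w_i^T w_i - I ||_F^2 over the encoder weight matrices."""
    total = 0.0
    for w in weights:
        g = w.T @ w                          # Gram matrix of the columns
        total += np.sum((g - np.eye(g.shape[0])) ** 2)
    return total

# A matrix with exactly orthonormal columns incurs (numerically) zero penalty.
q, _ = np.linalg.qr(np.random.default_rng(3).normal(size=(20, 5)))
print(orthogonal_penalty([q]))               # ~0.0
```

During training this scalar would be multiplied by the hyperparameter α and added to the reconstruction loss, so gradient descent pushes the encoder weights toward orthonormality.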
Advantageous effects
Compared with the prior art, in the distance-preserving dimension reduction method for data containing missing values provided by the invention, the missing-data matrix keeps the missing part of the original data out of the calculation of the auto-encoder's loss function, so the auto-encoder can reduce the dimension of data containing missing values and the influence of the missing data on the auto-encoder is avoided. At the same time, the strong automatic learning capacity of the auto-encoder effectively captures both the complex nonlinear information among the original data and the feature information of the original data. The encoder weight matrix in the loss function is then updated so that its mean is 0 and its variance is 1 (satisfying the condition under which random projection preserves distances), which gives the dimension reduction a distance-preserving property: the reduced low-dimensional data retain the distribution information and specific characteristics of the original high-dimensional data to the greatest extent, facilitating subsequent data processing and saving processing time and space.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only one embodiment of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
FIG. 1 is an exemplary diagram of a handwritten digital picture in an embodiment of the invention;
FIG. 2 is a flow chart of a distance preserving dimension reduction method in an embodiment of the invention;
FIG. 3 is a network architecture diagram of an autoencoder in an embodiment of the present invention;
FIG. 4 shows test results on the MNIST data set without missing data in an embodiment of the present invention;
FIG. 5 shows test results on the MNIST data set with missing data in an embodiment of the present invention;
FIG. 6 is a distribution diagram of the distance ratios before and after dimension reduction for the three methods on the MNIST data set without missing data in an embodiment of the present invention;
wherein 1-input layer, 2-first hidden layer, 3-second hidden layer, 4-third hidden layer, 5-output layer, 6-encoder, 7-decoder.
Detailed Description
The technical solutions in the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The MNIST data set is a large handwritten-digit database collected and organized by the American National Institute of Standards and Technology. It contains 70,000 pictures and is divided into a training set of 60,000 pictures and a test set of 10,000 pictures. Each picture is a 28 × 28-pixel handwritten digit from 0 to 9 (shown in fig. 1), rendered as white strokes on a black background: the black background is represented by 0 and the white strokes by floating-point numbers between 0 and 1, where values closer to 1 are whiter. As shown in fig. 2, the specific implementation steps of the invention are as follows:
1. loading a data set, dividing the data set into a training sample set and a testing sample set, vectorizing all samples (each sample is a digital picture consisting of 784 pixel points) in the data set to form a sample matrix X consisting of a plurality of sample vectors, wherein each sample vector in the sample matrix X is a one-dimensional array with the length of 784, and each element in the one-dimensional array is a floating point number between 0 and 1.
A missing-data matrix R in one-to-one correspondence with the sample matrix X is generated from X; each element of R takes the value 1 or 0, where 1 indicates that the corresponding entry of X is present and 0 that it is missing. Let x_t be the t-th sample vector, x̂_t the output vector when x_t is the input of the auto-encoder, and r_t the missing-data vector corresponding to x_t, with x_t ∈ X, r_t ∈ R, and X ∈ R^{k×m} representing k sample vectors, each of length m. In this example, m = 784.
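A possible way to construct such a missing-data matrix, assuming missing pixels are marked as NaN (the patent does not specify how missing values are encoded on disk), is:

```python
import numpy as np

# A hypothetical sample matrix with NaN marking missing entries.
X = np.array([[0.1, np.nan, 0.7],
              [np.nan, 0.3, 0.9]])

# Missing-data matrix R: 1 where the value is present, 0 where it is missing.
R = (~np.isnan(X)).astype(int)

# Replace missing entries with a placeholder (0) before feeding the network;
# R keeps these placeholders out of the loss later on.
X_filled = np.nan_to_num(X, nan=0.0)
print(R.tolist())   # [[1, 0, 1], [0, 1, 1]]
```

The placeholder value is irrelevant to training because the mask zeroes out its contribution to the loss; 0 is simply a convenient choice.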
2. And constructing an automatic encoder, and selecting an activation function and an initialization method of the automatic encoder.
As shown in fig. 3, the network structure of the auto-encoder is input layer 1-first hidden layer 2-second hidden layer 3-third hidden layer 4-output layer 5; the encoder 6 of the auto-encoder is input layer 1-first hidden layer 2-second hidden layer 3, and the decoder 7 is second hidden layer 3-third hidden layer 4-output layer 5. The number of nodes is 784 for the input layer 1, 200 for the first hidden layer 2, 10 for the second hidden layer 3, 200 for the third hidden layer 4, and 784 for the output layer 5, so the dimension of the reduced sample vector is 10. In this embodiment, the activation function of the auto-encoder is the sigmoid function and the initialization method is uniform-distribution initialization.
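The layer shapes above can be set up as follows. Since the exact range of the uniform-distribution initialization is not specified, a Glorot-style limit is assumed here as an illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
sizes = [784, 200, 10, 200, 784]      # input, h1, h2 (the code layer), h3, output

# Uniform initialization in [-limit, limit]; the Glorot-style limit below is
# an assumption, as the patent only states "uniform distribution initialization".
weights = []
for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    weights.append(rng.uniform(-limit, limit, size=(fan_out, fan_in)))

print([w.shape for w in weights])
# [(200, 784), (10, 200), (200, 10), (784, 200)]
```

The first two matrices belong to the encoder (784 → 200 → 10) and the last two to the decoder (10 → 200 → 784), giving a 10-dimensional code.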
3. Designing a loss function for an auto-encoder
The loss function expression of the auto-encoder is:

L = (1/2s) Σ_{t=1}^{s} || r_t ⊙ (x̂_t − x_t) ||²   (1)

In formula (1), ||·|| denotes the two-norm of a vector, ⊙ denotes element-wise multiplication of two vectors, x_t is the t-th input vector of the auto-encoder, x̂_t is the output vector when x_t is the input, r_t is the missing-data vector corresponding to x_t, s is the batch size, w_e1, w_e2 are the weight matrices of the encoder, w_d1, w_d2 are the weight matrices of the decoder, b_d1, b_d2 are the bias terms of the decoder, and f is the activation function, so that x̂_t = f(w_d2 f(w_d1 f(w_e2 f(w_e1 x_t)) + b_d1) + b_d2). To maintain the proportional (distance-preserving) property, the encoder of the auto-encoder has no bias terms.
As can be seen from equation (1), whatever the missing entries of the input vector x_t are, the element-wise multiplication by the missing-data vector r_t keeps the missing part out of the loss calculation; the influence of the missing data on the auto-encoder is thus eliminated, and the auto-encoder can reduce the dimension of high-dimensional data containing missing values without losing rationality.
4. Sample vectors are randomly selected from the training sample set as inputs of the auto-encoder; the training set is divided into n batches, each containing s sample vectors, and the value of the loss function of formula (1) is calculated.
5. The weight matrix of the encoder inside the auto-encoder is updated so that the update has the random projection property. After all batches have been trained, whether the preset number of training iterations has been reached is checked; if not, the procedure returns to step 4, otherwise it proceeds to step 6.
The invention provides three different methods for restricting the updating of the weight matrix of the encoder in the automatic encoder, so that the updating of the weight matrix of the encoder has the random projection characteristic.
The first method is weight normalization. Let w_e1 be the input-layer-to-first-hidden-layer weight matrix after the BP-algorithm update and before weight normalization, and w'_e1 its weight-normalized version; let w_e2 be the first-hidden-layer-to-second-hidden-layer weight matrix after the BP-algorithm update and before weight normalization, and w'_e2 its weight-normalized version. Let μ_e1, μ_e2 be the means of w_e1, w_e2 respectively, σ_e1, σ_e2 their standard deviations, and E_e1, E_e2 matrices whose elements are all 1. The weight normalization formulas are:

w'_e1 = (w_e1 − μ_e1 E_e1) / σ_e1
w'_e2 = (w_e2 − μ_e2 E_e2) / σ_e2
constraining encoder weight matrix w by weight normalization processing methode1、we2Updating of (2) to make the updated weight matrix w'e1、w′e2Has a mean value of 0 and a variance of 1, and satisfies the condition (random matrix w ') that the random projection in the background art has distance keeping property'e1、w′e2The mean value of the inner elements is 0, and the variance is 1), so that the dimension reduction processing of the automatic encoder has distance keeping property, the BP algorithm is firstly updated, and then the weight standardization processing is carried out, so as to reduce the copy error.
However, normalizing the weights directly in this way can affect the stability of training. This embodiment therefore proposes another method of constraining the update of the encoder weight matrices: adding a regularization term to the original auto-encoder loss function (1). One regularization term drives the mean of the encoder weight elements to 0 and their variance to 1; the other is a unit-orthogonal regularization term. The regularization term for encoder weight normalization (the second constraint method) is:

L_r = Σ_{i=1}^{c} ( μ_ei² + (σ_ei − 1)² )   (3)
The expression of the loss function after adding the regularization term is:
L_C = L + α·L_r    (4)
where w_ei is the i-th encoder weight matrix, σ_ei is the variance of the elements of w_ei, μ_ei is their mean, the number c of encoder weight matrices is 2, L is the loss function without the regularization term, L_C is the loss function after adding it, and α is a hyperparameter. The value of α is determined by testing: an initial value is set, and whether α is satisfactory is checked through the CV value or the TopK score; if not, testing continues until the CV value is lower or the TopK score is higher.
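A minimal sketch of this second constraint method (illustrative only; NumPy, the function names, and the matrix shapes are assumptions) computes L_r and the combined loss L_C = L + α·L_r:

```python
import numpy as np

def meanstd_regularizer(weight_matrices):
    """L_r = sum_i (mu_ei**2 + (sigma_ei - 1)**2): penalizes encoder weight
    matrices whose element mean deviates from 0 or whose element variance
    deviates from 1."""
    return sum(w.mean() ** 2 + (w.var() - 1.0) ** 2 for w in weight_matrices)

def total_loss(reconstruction_loss, weight_matrices, alpha):
    # L_C = L + alpha * L_r, as in formula (4)
    return reconstruction_loss + alpha * meanstd_regularizer(weight_matrices)

rng = np.random.default_rng(1)
w_e1, w_e2 = rng.normal(size=(784, 128)), rng.normal(size=(128, 10))
# The regularizer is non-negative, so L_C can never fall below L:
print(total_loss(0.5, [w_e1, w_e2], alpha=0.1) >= 0.5)  # -> True
```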
The second, unit-orthogonality regularization term (the third constraint method) is:

L_r = Σ_{i=1}^{c} ||w_ei^T·w_ei − I||_F²    (5)
The expression of the loss function after adding the regularization term is:
L_C = L + α·L_r    (6)
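A corresponding sketch of the third constraint method (again illustrative; NumPy and the shapes are assumptions) computes the unit-orthogonality penalty from the Gram matrix of each encoder weight matrix:

```python
import numpy as np

def ortho_regularizer(weight_matrices):
    """L_r = sum_i ||w_ei^T w_ei - I||_F^2: drives each encoder weight matrix
    toward having orthonormal columns."""
    total = 0.0
    for w in weight_matrices:
        gram = w.T @ w                      # (d_out, d_out) Gram matrix
        eye = np.eye(gram.shape[0])
        total += np.linalg.norm(gram - eye, "fro") ** 2
    return total

# A matrix with orthonormal columns incurs a (numerically) zero penalty:
q, _ = np.linalg.qr(np.random.default_rng(2).normal(size=(784, 128)))
print(ortho_regularizer([q]) < 1e-12)  # -> True
```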
After the regularization term is added, the value of the loss function is computed, and the encoder weight matrices are updated according to that value, driving their element mean toward 0 and variance toward 1. In this embodiment the autoencoder is updated with the BP algorithm. After each weight update, if the preset number of training iterations has not been reached, the procedure returns to step 4 to continue training and further reduce the loss; once the preset number of iterations is reached (the autoencoder has converged), the procedure moves to step 6, the autoencoder having been trained.
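Steps 4-5 repeatedly evaluate the loss and update the weights. The reconstruction part of that loss (formula (1), detailed in claim 3) masks out missing entries through the missing data vector r_t; a minimal sketch of the masked loss, assuming NumPy arrays and hypothetical toy values:

```python
import numpy as np

def masked_loss(x_batch, xhat_batch, r_batch):
    """Batch mean of ||r_t * (x_t - xhat_t)||^2: positions where r_t = 0
    (missing data) contribute no reconstruction error."""
    diff = r_batch * (x_batch - xhat_batch)   # elementwise mask (the circled-dot product)
    return np.mean(np.sum(diff ** 2, axis=1))

x    = np.array([[1.0, 2.0, 3.0]])
xhat = np.array([[1.0, 0.0, 3.0]])   # poor reconstruction at position 1
r    = np.array([[1.0, 0.0, 1.0]])   # position 1 is missing, so it is masked out
print(masked_loss(x, xhat, r))        # -> 0.0 (the only error falls on a missing entry)
```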
6. The trained autoencoder is used to perform dimensionality reduction on the sample matrices of the test sample set; the dimension of the output vectors after reduction is 10.
To evaluate the stability of the dimensionality reduction, this example uses two evaluation methods. The first is the TopK algorithm (prior art): using the distances between samples, it computes the K nearest neighbours of each sample (K = 10 in this example), compares how many of the K neighbours are shared before and after dimensionality reduction, and computes the TopK score. Let N_i^b and N_i^a denote the sets of K nearest neighbours of the i-th sample before and after dimensionality reduction respectively, let |N| denote the number of elements in a set N, and let n be the number of samples. The TopK score is:

TopK = (1/n) Σ_{i=1}^{n} |N_i^b ∩ N_i^a| / K
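The TopK evaluation described above can be sketched as follows (an illustration assuming NumPy and Euclidean distances; the helper names are hypothetical):

```python
import numpy as np

def topk_score(X_high, X_low, k=10):
    """Average fraction of each sample's k nearest neighbours (by Euclidean
    distance) that is preserved after dimensionality reduction."""
    def knn_sets(X):
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)               # a sample is not its own neighbour
        return [set(np.argsort(row)[:k]) for row in d]
    before, after = knn_sets(X_high), knn_sets(X_low)
    return np.mean([len(b & a) / k for b, a in zip(before, after)])

# An exact isometry (here: the identity map) keeps every neighbour set, so the score is 1.0
X = np.random.default_rng(3).normal(size=(50, 20))
print(topk_score(X, X.copy(), k=10))  # -> 1.0
```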
the dimension reduction method with higher TopK score can better save the distance between high-dimensional data before dimension reduction.
The second evaluation method is the CV method, which represents the stability of the dimensionality reduction more intuitively through the distribution of before/after distance ratios. Several pairs of sample points are selected at random; for each pair, the ratio of the distance before reduction to the distance after reduction is computed, and the distribution of these ratios over the batch is plotted. The coefficient of variation (CV), also called the coefficient of dispersion, is a normalized measure of the dispersion of a probability distribution, defined as the ratio of the standard deviation to the mean. In this embodiment the CV describes the dispersion of the distribution of before/after distance ratios; a dimensionality reduction method with a small CV value has good stability. Let x_t denote the t-th sample vector before reduction and c_t its reduced counterpart, c_t = E(x_t; W_e). The CV value and the TopK score of the reduced samples are computed, and the distribution of the before/after distance ratios is plotted.
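The CV evaluation can be sketched in the same spirit (illustrative; the pair-sampling scheme and names are assumptions, not the patent's exact procedure):

```python
import numpy as np

def cv_of_distance_ratios(X_high, X_low, n_pairs=1000, seed=0):
    """Coefficient of variation (std/mean) of before/after distance ratios over
    randomly chosen sample pairs; a smaller CV indicates a more stable,
    distance-preserving dimensionality reduction."""
    rng = np.random.default_rng(seed)
    n = X_high.shape[0]
    i = rng.integers(0, n, n_pairs)
    j = rng.integers(0, n, n_pairs)
    keep = i != j                                  # discard degenerate (zero-distance) pairs
    d_hi = np.linalg.norm(X_high[i[keep]] - X_high[j[keep]], axis=1)
    d_lo = np.linalg.norm(X_low[i[keep]] - X_low[j[keep]], axis=1)
    ratios = d_hi / d_lo
    return ratios.std() / ratios.mean()

# Pure scaling changes all distances by the same factor, so the CV is ~0
X = np.random.default_rng(4).normal(size=(100, 30))
print(cv_of_distance_ratios(X, 0.5 * X) < 1e-12)  # -> True
```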
Figs. 4-6 show the experimental results. Fig. 4 shows the results of the different dimensionality reduction methods on the MNIST dataset without missing data. In Fig. 4, ortho denotes the autoencoder with the unit-orthogonality regularization term (third method), meanstd_hard the autoencoder with encoder weight standardization (first method), meanstd_soft the autoencoder that standardizes the encoder weights through a regularization term (second method), AE an ordinary autoencoder, and PCA the principal component analysis algorithm. Since the activation function and the initialization method of the autoencoder affect the test results, Fig. 4 compares the ReLU and sigmoid activation functions as well as uniform (uniform) and Gaussian (normal) initialization; the bold italic entries mark the best results. As Fig. 4 shows, the CV values of the three constraint methods proposed by the invention are all lower than that of the ordinary autoencoder, with the ortho method lowest, and their TopK scores are all higher than those of the ordinary autoencoder and the PCA algorithm, with the ortho method highest. This demonstrates the feasibility and superiority of the distance-preserving dimensionality reduction method of the invention.
Fig. 5 shows the results of the different dimensionality reduction methods on the MNIST dataset with artificially introduced missing data; the abscissa is the proportion of missing data and the ordinate is the TopK score. The sigmoid activation function and uniform initialization were used.
Fig. 6 shows the distribution of before/after distance ratios for three methods, the improved autoencoder ortho_AE, the ordinary autoencoder AE, and principal component analysis PCA, on the MNIST dataset without missing data; the abscissa is the ratio of the distance between a pair of sample points before reduction to the distance after reduction, and the ordinate is the probability density. As Fig. 6 shows, the CV values of ortho_AE, AE, and PCA are 0.153, 0.247, and 0.162 respectively; ortho_AE has the smallest CV value, indicating that its distance-preserving stability is the best.
The above disclosure describes only specific embodiments of the present invention, but the scope of the invention is not limited thereto; any changes or modifications that a person skilled in the art could readily conceive within the technical scope disclosed herein shall fall within the scope of the present invention.
Claims (7)
1. A data distance-preserving dimensionality reduction method for data containing missing data, characterized by comprising the following steps:
step 1: acquiring a sample data set, dividing it into a training sample set and a test sample set, and vectorizing all samples in the sample data set to form sample matrices; generating missing data matrices in one-to-one correspondence with the sample matrices, wherein each element of a missing data matrix is 1 or 0: 1 indicates that the data at that position of the sample matrix is normal, and 0 indicates that it is missing;
step 2: constructing an autoencoder, and selecting its activation function and initialization method;
step 3: designing a loss function for the autoencoder of step 2 according to the missing data matrix of step 1;
step 4: selecting a sample vector from the training sample set of step 1 as input to the autoencoder, and calculating the value of the loss function of step 3;
step 5: updating the weight matrix of the encoder in the autoencoder so that the updated encoder weight matrix has the random projection property; judging whether the preset number of training iterations has been reached: if the training count is less than or equal to the preset number, going to step 4; otherwise, going to step 6;
step 6: performing dimensionality reduction on the sample matrices of the test sample set using the trained autoencoder.
2. The data distance-preserving dimensionality reduction method according to claim 1, wherein in step 2 the network structure of the autoencoder is input layer - first hidden layer - second hidden layer - third hidden layer - output layer; the encoder of the autoencoder comprises input layer - first hidden layer - second hidden layer, and the decoder comprises second hidden layer - third hidden layer - output layer.
3. The data distance-preserving dimensionality reduction method according to claim 1, wherein in step 3 the loss function L(W_e, W_d, b_d) of the autoencoder is:

L(W_e, W_d, b_d) = (1/s) Σ_{t=1}^{s} || r_t ⊙ (x_t − x̂_t) ||²

where ||·|| denotes the two-norm of a vector, ⊙ denotes element-by-element multiplication of two vectors, x_t is the t-th input vector of the autoencoder, x̂_t = D(E(x_t; W_e); W_d, b_d) is the output vector of the autoencoder when x_t is the input, r_t is the missing data vector corresponding to the input vector x_t, x_t ∈ X, r_t ∈ R, X is a sample matrix containing k sample vectors each of length m, R is the missing data matrix corresponding to the sample matrix X, s is the size of one batch, D(·; W_d, b_d) denotes the decoder, E(·; W_e) denotes the encoder, W_e is the weight matrix of the encoder, W_d is the weight matrix of the decoder, and b_d is the bias term of the decoder.
4. The data distance-preserving dimensionality reduction method according to claim 1, wherein in step 5 the weight matrix of the encoder in the loss function is updated by weight standardization, so that the updated encoder weight matrix has mean 0 and variance 1; the weight standardization formula is:

w'_ei = (w_ei − μ_ei·E_ei) / √σ_ei

where w_ei is the i-th encoder weight matrix after the BP-algorithm update and before weight standardization, w'_ei is the matrix obtained from w_ei by weight standardization, μ_ei is the mean of the elements of w_ei, σ_ei is their variance, and E_ei is an all-ones matrix.
5. The data distance-preserving dimensionality reduction method according to claim 1, wherein a regularization term is added to the loss function of step 3, and the weight matrix of the encoder is updated according to the value of the loss function after the regularization term is added.
6. The data distance-preserving dimensionality reduction method according to claim 5, wherein the expression of the regularization term L_r is:

L_r = Σ_{i=1}^{c} (μ_ei² + (σ_ei − 1)²)
The expression of the loss function after adding the regularization term is:
L_C = L + α·L_r
where w_ei is the i-th encoder weight matrix, σ_ei is the variance of the elements of w_ei, μ_ei is their mean, c is the number of encoder weight matrices, L is the loss function without the regularization term, L_C is the loss function after adding the regularization term, and α is a hyperparameter.
7. The data distance-preserving dimensionality reduction method according to claim 5, wherein the expression of the regularization term L_r is:

L_r = Σ_{i=1}^{c} ||w_ei^T·w_ei − I||_F²
The expression of the loss function after adding the regularization term is:
L_C = L + α·L_r
where ||·||_F is the Frobenius norm of a matrix, w_ei is the i-th encoder weight matrix, w_ei^T is the transpose of w_ei, I is the identity matrix, c is the number of encoder weight matrices, L is the loss function without the regularization term, L_C is the loss function after adding the regularization term, and α is a hyperparameter.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911059239.5A CN110852366A (en) | 2019-11-01 | 2019-11-01 | Data distance-preserving dimension reduction method containing missing data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110852366A true CN110852366A (en) | 2020-02-28 |
Family
ID=69598532
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111967499A (en) * | 2020-07-21 | 2020-11-20 | 电子科技大学 | Data dimension reduction method based on self-learning |
CN111967499B (en) * | 2020-07-21 | 2023-04-07 | 电子科技大学 | Data dimension reduction method based on self-learning |
CN112183723A (en) * | 2020-09-17 | 2021-01-05 | 西北工业大学 | Data processing method for clinical detection data missing problem |
CN112488321A (en) * | 2020-12-07 | 2021-03-12 | 重庆邮电大学 | Antagonistic machine learning defense method oriented to generalized nonnegative matrix factorization algorithm |
CN112488321B (en) * | 2020-12-07 | 2022-07-01 | 重庆邮电大学 | Antagonistic machine learning defense method oriented to generalized nonnegative matrix factorization algorithm |
CN113704073A (en) * | 2021-09-02 | 2021-11-26 | 交通运输部公路科学研究所 | Method for detecting abnormal data of automobile maintenance record library |
CN113704073B (en) * | 2021-09-02 | 2024-06-04 | 交通运输部公路科学研究所 | Method for detecting abnormal data of automobile maintenance record library |
CN114722943A (en) * | 2022-04-11 | 2022-07-08 | 深圳市人工智能与机器人研究院 | Data processing method, device and equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110852366A (en) | Data distance-preserving dimension reduction method containing missing data | |
Zhang et al. | Blind image quality assessment using a deep bilinear convolutional neural network | |
Peng et al. | Robust semi-supervised nonnegative matrix factorization for image clustering | |
Li et al. | Deep independently recurrent neural network (indrnn) | |
Ristovski et al. | Continuous conditional random fields for efficient regression in large fully connected graphs | |
CN112464865A (en) | Facial expression recognition method based on pixel and geometric mixed features | |
EP2973241A2 (en) | Signal processing systems | |
WO2022105117A1 (en) | Method and device for image quality assessment, computer device, and storage medium | |
Xing et al. | A self-organizing incremental neural network based on local distribution learning | |
Yang | A CNN-based broad learning system | |
Zhou et al. | Improved cross-label suppression dictionary learning for face recognition | |
Wan et al. | Feature consistency training with JPEG compressed images | |
CN112749737A (en) | Image classification method and device, electronic equipment and storage medium | |
He et al. | Finger vein image deblurring using neighbors-based binary-GAN (NB-GAN) | |
Liu et al. | Modal regression-based graph representation for noise robust face hallucination | |
Li et al. | A graphical approach for filter pruning by exploring the similarity relation between feature maps | |
Liu et al. | Revisiting pseudo-label for single-positive multi-label learning | |
Yan et al. | A parameter-free framework for general supervised subspace learning | |
Zhao et al. | Exploiting channel similarity for network pruning | |
CN110288002B (en) | Image classification method based on sparse orthogonal neural network | |
Yao | A compressed deep convolutional neural networks for face recognition | |
CN111401440A (en) | Target classification recognition method and device, computer equipment and storage medium | |
CN114882288B (en) | Multi-view image classification method based on hierarchical image enhancement stacking self-encoder | |
Zhang et al. | Research On Face Image Clustering Based On Integrating Som And Spectral Clustering Algorithm | |
Yang et al. | Fine-grained image quality caption with hierarchical semantics degradation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | | |
SE01 | Entry into force of request for substantive examination | | |
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20200228 |