CN110852366A - Data distance-preserving dimension reduction method containing missing data - Google Patents

Data distance-preserving dimension reduction method containing missing data

Info

Publication number
CN110852366A
CN110852366A
Authority
CN
China
Prior art keywords
encoder
data
matrix
sample
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911059239.5A
Other languages
Chinese (zh)
Inventor
从银川
谢鲲
欧阳与点
文吉刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN201911059239.5A priority Critical patent/CN110852366A/en
Publication of CN110852366A publication Critical patent/CN110852366A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a distance-preserving dimension reduction method for data containing missing values, and relates to the technical field of data processing. In this method, a missing-data matrix keeps the missing part of the original data out of the calculation of the auto-encoder's loss function, so the auto-encoder can perform dimension reduction on data containing missing values without being affected by them. At the same time, the strong learning capacity of the auto-encoder is used to capture the complex nonlinear relations in the original data, and the encoder weight matrix in the loss function is constrained during its updates so that the dimension reduction is distance preserving: the reduced low-dimensional data retain, to the greatest possible extent, the distribution information of the original high-dimensional data, which facilitates subsequent data processing and saves processing time and space.

Description

Data distance-preserving dimension reduction method containing missing data
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a data distance-preserving dimension reduction method containing missing data based on an automatic encoder.
Background
With the advent of the big-data era and the popularization of electronic equipment, massive amounts of high-dimensional data are generated. Directly analysing and processing high-dimensional data generally requires large amounts of time and space, so dimension reduction, as an algorithm that maps high-dimensional data into a low-dimensional space while retaining the original information, is increasingly favoured. Applying a dimension-reduction algorithm such as Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA) to high-dimensional data brings great convenience to subsequent data processing. However, most data generated in reality contain missing values, and traditional dimension-reduction methods cannot process data with missing values.
Distance-preserving dimension reduction means that the low-dimensional data obtained after reduction maintain, to a certain extent, the Euclidean distances between the high-dimensional original data. Existing dimension-reduction algorithms all have some ability to retain high-dimensional information, but they do not explicitly preserve distances during reduction. It is generally considered that preserving the Euclidean distances of the original data preserves its distribution, so that the reduced data retain as much of the information between the original sample points as possible. Although traditional distance-preserving dimension-reduction algorithms are widely used in data processing, they adopt only linear models and cannot capture complex nonlinear information in high-dimensional data. In addition, real high-dimensional data not only exhibit complex nonlinear relationships but also frequently have missing values in some dimensions, and traditional distance-preserving methods cannot process such data effectively.
In 2006, Hinton and Salakhutdinov published "Reducing the Dimensionality of Data with Neural Networks" and applied the auto-encoder to data dimensionality reduction. An auto-encoder with a dimension-reduction function is a special fully connected neural network with a symmetric structure whose output is expected to equal its input. For example, for a three-layer auto-encoder, the input is $x_t \in \mathbb{R}^m$, the hidden representation is $c_t = f(w_1 x_t + b_1)$, and the output is $\hat{x}_t = f(w_2 c_t + b_2)$, where f is an activation function, $w_1, w_2$ are the weight matrices of the auto-encoder, $b_1, b_2$ are its biases, and s denotes the size of a batch. The loss function of the auto-encoder is

$$L = \sum_{t=1}^{s} \left\| x_t - \hat{x}_t \right\|^2$$
The auto-encoder is trained with the back-propagation algorithm: the weights and biases are updated during training so that the value of the loss function reaches its minimum, and by continually learning from sample data the network can capture complex nonlinear information in high-dimensional data. To prevent the auto-encoder from learning an identity function instead of discovering structural information in the data, the number of hidden-layer nodes is usually made much smaller than the number of input-layer nodes. After training, the encoder part of the auto-encoder is used to reduce the dimension of the data, and the encoder output $c_t$ is the reduced-dimension data.
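As a plain illustration of this background (not taken from the patent; the array shapes and random initialisation below are assumptions), a forward pass of such a three-layer auto-encoder can be sketched in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, h, s = 784, 10, 32                        # input dimension, hidden (reduced) dimension, batch size
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(h, m)) * 0.01, np.zeros((h, 1))   # encoder weight matrix and bias
w2, b2 = rng.normal(size=(m, h)) * 0.01, np.zeros((m, 1))   # decoder weight matrix and bias

x = rng.random((m, s))                       # a batch of s input vectors x_t, stacked as columns
c = sigmoid(w1 @ x + b1)                     # hidden code c_t = f(w1 x_t + b1): the reduced data
x_hat = sigmoid(w2 @ c + b2)                 # reconstruction x_hat_t = f(w2 c_t + b2)
loss = np.sum((x - x_hat) ** 2)              # reconstruction loss summed over the batch
```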
Although the auto-encoder can learn the complex nonlinear structure of high-dimensional data, it cannot directly process missing data. To keep missing values from affecting dimension reduction, a data-filling method is commonly adopted, in which the mean or mode of the corresponding feature is filled into the missing position; however, such filling lacks rationality and harms the dimension-reduction result and subsequent data analysis. Moreover, the auto-encoder does not explicitly constrain distance preservation after dimension reduction.
Random Projection (RP) is a linear dimension-reduction idea with a distance-preserving property. Given a random matrix $R \in \mathbb{R}^{d \times m}$, the reduced vector of a high-dimensional vector $x$ is $c = Rx$, and the data distances before and after dimension reduction satisfy:

$$(1-\alpha)\|x_l - x_t\| \le \|c_l - c_t\| \le (1+\alpha)\|x_l - x_t\|$$
where α is a constant between 0 and 1 (the smaller α is, the better the dimension reduction), and $c_l, c_t$ are the reduced vectors of the vectors $x_l, x_t$ to be reduced. With continued research on random projection, many methods for constructing the random matrix R have appeared. As early as 1984, Johnson and Lindenstrauss proved that if R is a random unit orthogonal matrix, i.e. the columns of R are mutually orthogonal and of length 1, then the random projection realized by R is distance preserving. Arriaga and Vempala, in the article "An algorithmic theory of learning: Robust concepts and random projection", proposed a "neuron-friendly" random projection; its greatest contribution is the proof that a random matrix R whose elements have mean 0 and variance 1 yields a distance-preserving projection, and it also points out that the essential condition for the distance-preserving property is that the elements of R have mean 0, while the variance of the elements greatly influences the quality of the distance preservation. Li et al., in the article "Very sparse random projections", proposed a sparse random matrix R whose elements randomly take values in {-1, 0, 1} according to a prescribed probability distribution, and verified that the random projection realized by such a sparse matrix is also distance preserving. None of the above work considers that adding an activation function may affect the distance-preserving property of the random projection. In 2014, Bruna et al., in the article "Signal recovery from pooling representations", proved that a single-layer neural network using the ReLU or sigmoid activation function realizes a distance-preserving embedding, namely:

$$A\|x_l - x_t\| \le \|c_l - c_t\| \le B\|x_l - x_t\|, \qquad 0 < A \le B$$

where A and B are constants and $\|\cdot\|$ denotes the two-norm of a vector. The above formula can be regarded as a relaxed version of the distance-preserving property proved by Johnson and Lindenstrauss; thus a neural network that uses ReLU or sigmoid as its activation function and whose weight matrix satisfies the random-projection conditions has a certain distance-preserving property.
Disclosure of Invention
Aiming at the problems in the prior art that traditional dimension-reduction algorithms cannot reduce the dimension of data containing missing values, cannot effectively learn the complex nonlinear structural information of high-dimensional data, and do not preserve distances during dimension reduction, the invention provides a distance-preserving dimension reduction method for data containing missing values.
The invention solves the technical problems through the following technical scheme: a data distance-keeping dimension reduction method containing missing data comprises the following steps:
Step 1: acquiring a sample data set, dividing the sample data set into a training sample set and a testing sample set, and vectorizing all samples in the sample data set to form a sample matrix; generating a missing data matrix corresponding to the sample matrix one by one according to the sample matrix, wherein the value of each element in the missing data matrix is 1 or 0, 1 represents that the position data in the sample matrix is normal, and 0 represents that the position data in the sample matrix is missing;
Step 2: constructing an automatic encoder, and selecting an activation function and an initialization method of the automatic encoder;
Step 3: designing a loss function of the automatic encoder in step 2 according to the missing data matrix in step 1;
Step 4: selecting a sample vector from the training sample set in step 1 as the input of the automatic encoder, and calculating the value of the loss function in step 3;
Step 5: updating the weight matrix of the encoder in the automatic encoder, so that the updating of the weight matrix of the encoder has a random projection characteristic; judging whether the preset training times are reached, if the training times are less than or equal to the preset training times, turning to step 4, otherwise, turning to step 6;
Step 6: performing dimensionality reduction on the sample matrix in the test sample set by adopting the trained automatic encoder.
In the distance-preserving dimension reduction method for missing data according to the invention, missing-data matrices corresponding one-to-one to the sample matrices are generated from the sample matrices, and the loss function of the auto-encoder is designed around the missing-data matrix, so that missing data do not participate in the calculation of the loss function and the influence of missing data on the auto-encoder is avoided. The auto-encoder is trained on the training sample set and continuously learns the feature information of different samples, so as to learn the features of data with missing values and the nonlinear information between data. During training, the weight matrix of the encoder in the loss function is updated in such a way that the update has the random-projection characteristic, which gives the auto-encoder a distance-preserving mapping during dimension reduction: the Euclidean distances between data sample points after reduction remain, to a certain degree, proportionally consistent with the Euclidean distances between the corresponding original sample points before reduction. The method therefore achieves distance-preserving dimension reduction of high-dimensional data containing missing values, effectively learns the nonlinear information between high-dimensional data, keeps as much as possible of the feature information of each sample and of the relations between samples in the reduced data, and saves time and space in processing the low-dimensional data.
Further, in step 2, the network structure of the automatic encoder is input layer-first hidden layer-second hidden layer-third hidden layer-output layer, the network structure of the encoder of the automatic encoder is input layer-first hidden layer-second hidden layer, and the network structure of the decoder of the automatic encoder is second hidden layer-third hidden layer-output layer.
Further, in step 3, the loss function $L(W_e, W_d, b_d)$ of the automatic encoder is:

$$L(W_e, W_d, b_d) = \sum_{t=1}^{s} \left\| r_t \odot x_t - r_t \odot \hat{x}_t \right\|^2$$

where $\|\cdot\|$ represents the two-norm of a vector, ⊙ represents element-wise multiplication of two vectors, $x_t$ is the t-th input vector of the auto-encoder, $\hat{x}_t$ is the output vector of the auto-encoder when $x_t$ is the input, $r_t$ is the missing data vector corresponding to the input vector $x_t$, $x_t \in X$, $r_t \in R$, X is the sample matrix, R is the missing data matrix corresponding to the sample matrix X, $X = [x_1, x_2, \ldots, x_k] \in \mathbb{R}^{m \times k}$ contains k sample vectors each of length m, s is the size of one batch, $\hat{x}_t = D(E(x_t; W_e); W_d, b_d)$, where $D(\cdot\,; W_d, b_d)$ represents the decoder, $E(\cdot\,; W_e)$ represents the encoder, $W_e$ is the weight matrix of the encoder, $W_d$ is the weight matrix of the decoder, and $b_d$ represents the bias term of the decoder.
According to this formula, no matter what data are missing from the input vector $x_t$, the element-wise multiplication with the missing-data vector $r_t$ keeps the missing part out of the calculation of the loss function. The influence of missing data on the auto-encoder is therefore eliminated, and the dimension of high-dimensional data containing missing values can be reduced without loss of rationality.
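For illustration only, a minimal Python/PyTorch sketch of this masked loss (the function and variable names are our own, not from the patent, and summation over the batch is one reasonable reading of the formula):

```python
import torch

def masked_reconstruction_loss(x, x_hat, r):
    """Masked auto-encoder loss: entries marked missing (r == 0) do not contribute.

    x, x_hat, r are tensors of shape (batch_size, m); r holds 1 for observed
    entries and 0 for missing entries, like the missing-data matrix R.
    """
    diff = r * x - r * x_hat                 # element-wise masking, as in r_t ⊙ x_t - r_t ⊙ x_hat_t
    return (diff ** 2).sum()                 # squared two-norm summed over the batch

# tiny usage example with random stand-in data
x = torch.rand(4, 784)
r = (torch.rand(4, 784) > 0.2).float()
x_hat = torch.rand(4, 784)
print(masked_reconstruction_loss(x, x_hat, r))
```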
Further, in step 5, the weight matrix of the encoder in the loss function is updated by weight standardization, so that the updated encoder weight matrix has mean 0 and variance 1. The weight standardization formula is:

$$w'_{ei} = \frac{w_{ei} - \mu_{ei} E_{ei}}{\sigma_{ei}}$$

where $w_{ei}$ is the i-th encoder weight matrix after the BP-algorithm update and before weight standardization, $w'_{ei}$ is the matrix obtained from $w_{ei}$ by weight standardization, $\mu_{ei}$ is the mean of the elements of $w_{ei}$, $\sigma_{ei}$ is their standard deviation, and $E_{ei}$ is a matrix whose elements are all 1.
After this update, the updated encoder weight matrix $w'_{ei}$ has mean 0 and variance 1, which satisfies the condition under which random projection is distance preserving, as stated in the background art (the elements of the random matrix R have mean 0 and variance 1). The auto-encoder of the invention can therefore reduce the dimension of data containing missing values while preserving distances, and can retain the original distribution of the high-dimensional data to the greatest extent.
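A small NumPy sketch of this weight standardization step (illustrative only; it assumes the mean and standard deviation are taken over all entries of a weight matrix, and adds a small epsilon against division by zero):

```python
import numpy as np

def standardize_weights(w, eps=1e-8):
    """Shift and scale w so that its entries have mean 0 and variance 1."""
    mu = w.mean()                      # mu_ei: mean of the entries of w_ei
    sigma = w.std() + eps              # sigma_ei: standard deviation of the entries of w_ei
    return (w - mu) / sigma            # equivalent to (w_ei - mu_ei * E_ei) / sigma_ei

w_e1 = np.random.default_rng(0).normal(0.3, 2.0, size=(200, 784))
w_e1 = standardize_weights(w_e1)
print(round(w_e1.mean(), 6), round(w_e1.var(), 6))   # approximately 0 and 1
```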
Further, a regularization term is added to the loss function of step 3, and the weight matrix of the encoder is updated according to the value of the loss function with the regularization term added.
To avoid the effect on the stability of auto-encoder training caused by updating the encoder weight matrix directly through weight standardization, the encoder weight matrix can instead be constrained by adding a regularization term, so that the updated encoder weight matrix has mean 0 and variance 1, satisfies the condition for distance preservation, and makes the dimension-reduction processing of the auto-encoder distance preserving; the regularization term can also remove redundant information.
Further, the regularization term $L_r$ is expressed as:

$$L_r = \sum_{i=1}^{c} \left( \mu_{ei}^2 + (\sigma_{ei} - 1)^2 \right)$$

The expression of the loss function after adding the regularization term is:

$$L_C = L + \alpha L_r$$

where $w_{ei}$ is the i-th weight matrix of the encoder, $\sigma_{ei}$ is the variance of the weight matrix $w_{ei}$, $\mu_{ei}$ is the mean of the weight matrix $w_{ei}$, c is the number of encoder weight matrices, L is the loss function without the regularization term, $L_C$ is the loss function with the regularization term added, and α is a hyperparameter.
Further, the regularization term $L_r$ is expressed as:

$$L_r = \sum_{i=1}^{c} \left\| w_{ei}^{\mathrm{T}} w_{ei} - I \right\|_F^2$$

The expression of the loss function after adding the regularization term is:

$$L_C = L + \alpha L_r$$

where $\|\cdot\|_F$ is the F-norm of a matrix, $w_{ei}$ is the i-th weight matrix of the encoder, $w_{ei}^{\mathrm{T}}$ is the transpose of the weight matrix $w_{ei}$, I is the identity matrix, c is the number of encoder weight matrices, L is the loss function without the regularization term, $L_C$ is the loss function with the regularization term added, and α is a hyperparameter.
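Sketches of the two regularization terms in Python/PyTorch, matching the expressions given above (the exact algebraic form of the mean/variance penalty and the squared Frobenius norm are assumptions consistent with the description, not a verbatim transcription of the patent):

```python
import torch

def meanstd_regularizer(encoder_weights):
    """Penalize each encoder weight matrix whose entries deviate from mean 0, variance 1."""
    return sum(w.mean() ** 2 + (w.var() - 1.0) ** 2 for w in encoder_weights)

def orthogonal_regularizer(encoder_weights):
    """Penalize each encoder weight matrix whose columns deviate from unit orthogonality."""
    return sum(
        torch.linalg.matrix_norm(w.T @ w - torch.eye(w.shape[1]), ord="fro") ** 2
        for w in encoder_weights
    )

# either term would be added to the masked loss as L_C = L + alpha * L_r
```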
Advantageous effects
Compared with the prior art, in the distance-preserving dimension reduction method for data containing missing values provided by the invention, the missing part of the original data does not participate in the calculation of the auto-encoder's loss function thanks to the missing-data matrix, so the auto-encoder can reduce the dimension of data containing missing values without being affected by them. Meanwhile, the strong learning capacity of the auto-encoder effectively captures the complex nonlinear information between the original data and the feature information of each sample. The encoder weight matrix in the loss function is then updated so that its mean is 0 and its variance is 1 (satisfying the condition under which random projection is distance preserving), which makes the dimension-reduction processing distance preserving. The reduced low-dimensional data thus retain the distribution information and the sample-specific information of the original high-dimensional data to the greatest extent, which facilitates subsequent data processing and saves processing time and space.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings described below show only one embodiment of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is an exemplary diagram of a handwritten digital picture in an embodiment of the invention;
FIG. 2 is a flow chart of a distance preserving dimension reduction method in an embodiment of the invention;
FIG. 3 is a network architecture diagram of an autoencoder in an embodiment of the present invention;
FIG. 4 shows test results on the MNIST data set without missing data in an embodiment of the present invention;
FIG. 5 shows test results on the MNIST data set with missing data in an embodiment of the present invention;
FIG. 6 is a distribution diagram of the distance ratios before and after dimensionality reduction for the three methods on the MNIST data set without missing data in an embodiment of the present invention;
wherein 1-input layer, 2-first hidden layer, 3-second hidden layer, 4-third hidden layer, 5-output layer, 6-encoder, 7-decoder.
Detailed Description
The technical solutions in the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The MNIST data set is a large database of handwritten digits collected and organised by the U.S. National Institute of Standards and Technology. The data set contains 70,000 pictures and is divided into a training set of 60,000 pictures and a test set of 10,000 pictures. Each picture is a 28 x 28-pixel image of a handwritten digit from 0 to 9 (as shown in FIG. 1), drawn in white on a black background; a black pixel is represented by 0 and a white pixel by a floating-point number between 0 and 1, with values closer to 1 indicating a whiter colour. As shown in FIG. 2, the specific implementation steps of the present invention are as follows:
1. Load the data set and divide it into a training sample set and a test sample set. Vectorize all samples in the data set (each sample is a digit picture consisting of 784 pixels) to form a sample matrix X consisting of several sample vectors; each sample vector in X is a one-dimensional array of length 784, and each element of the array is a floating-point number between 0 and 1.
A missing data matrix R corresponding one-to-one to the sample matrix X is generated from X; the value of each element of R is 1 or 0, where 1 indicates that the data at that position in X are normal and 0 indicates that the data at that position are missing. Let $x_t$ be the t-th sample vector, $\hat{x}_t$ the output vector of the auto-encoder when $x_t$ is the input, and $r_t$ the missing data vector corresponding to $x_t$, with $x_t \in X$, $r_t \in R$ and $X = [x_1, x_2, \ldots, x_k] \in \mathbb{R}^{m \times k}$ containing k sample vectors each of length m. In this example, m is 784.
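A possible sketch of this step in NumPy (illustrative only; the missing positions are generated at random here, and the missing entries are simply set to 0 before being fed to the network, since the masked loss below ignores them; samples are stored as rows, whereas the text stacks them as columns of X):

```python
import numpy as np

rng = np.random.default_rng(0)
k, m = 60000, 784                                  # number of training samples, vector length
X = rng.random((k, m)).astype(np.float32)          # stand-in for the vectorized MNIST training images

missing_ratio = 0.2                                # fraction of entries marked as missing (assumed)
R = (rng.random((k, m)) >= missing_ratio).astype(np.float32)   # 1 = observed, 0 = missing
X_observed = X * R                                 # zero out the missing positions of the sample matrix
```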
2. Construct the automatic encoder and select its activation function and initialization method.
As shown in fig. 3, the network structure of the auto-encoder is input layer 1-first hidden layer 2-second hidden layer 3-third hidden layer 4-output layer 5; the network structure of the encoder 6 is input layer 1-first hidden layer 2-second hidden layer 3, and the network structure of the decoder 7 is second hidden layer 3-third hidden layer 4-output layer 5. The input layer 1 has 784 nodes, the first hidden layer 2 has 200 nodes, the second hidden layer 3 has 10 nodes, the third hidden layer 4 has 200 nodes, and the output layer 5 has 784 nodes, so the dimension of a sample vector after reduction is 10. In this embodiment, the activation function of the auto-encoder is the sigmoid function and the initialization method is uniform-distribution initialization.
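A minimal PyTorch sketch of this 784-200-10-200-784 auto-encoder (sigmoid activations throughout and no bias terms in the encoder layers, matching the loss-function description below; this is an illustrative reading rather than the patented implementation, and PyTorch's default uniform initialisation stands in for the uniform-distribution initialisation mentioned above):

```python
import torch
import torch.nn as nn

class MaskedAutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # encoder 6: input layer 1 (784) -> first hidden layer 2 (200) -> second hidden layer 3 (10)
        self.encoder = nn.Sequential(
            nn.Linear(784, 200, bias=False), nn.Sigmoid(),
            nn.Linear(200, 10, bias=False), nn.Sigmoid(),
        )
        # decoder 7: second hidden layer 3 (10) -> third hidden layer 4 (200) -> output layer 5 (784)
        self.decoder = nn.Sequential(
            nn.Linear(10, 200), nn.Sigmoid(),
            nn.Linear(200, 784), nn.Sigmoid(),
        )

    def forward(self, x):
        c = self.encoder(x)            # reduced 10-dimensional representation
        return self.decoder(c), c
```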
3. Design the loss function of the auto-encoder.
The loss function of the auto-encoder is:

$$L = \sum_{t=1}^{s} \left\| r_t \odot x_t - r_t \odot \hat{x}_t \right\|^2 \qquad (1)$$

In formula (1), $\|\cdot\|$ represents the two-norm of a vector, ⊙ represents element-wise multiplication of two vectors, $x_t$ is the t-th input vector of the auto-encoder, $\hat{x}_t$ is the output vector of the auto-encoder when $x_t$ is the input, $r_t$ is the missing data vector corresponding to the input vector $x_t$, and s is the size of one batch. The encoder output is $c_t = f(w_{e2} f(w_{e1} x_t))$ and the decoder output is $\hat{x}_t = f(w_{d2} f(w_{d1} c_t + b_{d1}) + b_{d2})$, where $w_{e1}, w_{e2}$ are the weight matrices of the encoder, $w_{d1}, w_{d2}$ are the weight matrices of the decoder, $b_{d1}, b_{d2}$ are the bias terms of the decoder, and f is the activation function; to maintain the distance-preserving property, the encoder of the auto-encoder has no bias terms.
As can be seen from formula (1), no matter what data are missing from the input vector $x_t$, the element-wise multiplication with the missing-data vector $r_t$ keeps the missing part out of the calculation of the loss function, so the influence of missing data on the auto-encoder is eliminated and the auto-encoder can reduce the dimension of high-dimensional data containing missing values without loss of rationality.
4. Randomly select sample vectors from the training sample set as inputs of the auto-encoder; divide the training set into n batches with s sample vectors per batch, and calculate the value of the loss function in formula (1).
5. Update the weight matrix of the encoder in the auto-encoder so that the update has the random-projection characteristic. After all batches have been trained, judge whether the preset number of training iterations has been reached; if the number of completed iterations is less than or equal to the preset number, return to step 4, otherwise go to step 6.
The invention provides three different methods for restricting the updating of the weight matrix of the encoder in the automatic encoder, so that the updating of the weight matrix of the encoder has the random projection characteristic.
The first method is weight standardization. Let $w_{e1}$ be the weight matrix from the input layer to the first hidden layer after the BP-algorithm update and before weight standardization, and $w'_{e1}$ the matrix obtained from $w_{e1}$ by weight standardization; let $w_{e2}$ be the weight matrix from the first hidden layer to the second hidden layer after the BP-algorithm update and before weight standardization, and $w'_{e2}$ the matrix obtained from $w_{e2}$ by weight standardization. Let $\mu_{e1}, \mu_{e2}$ be the means of $w_{e1}, w_{e2}$, $\sigma_{e1}, \sigma_{e2}$ their standard deviations, and $E_{e1}, E_{e2}$ matrices whose elements are all 1. The weight standardization formulas are:

$$w'_{e1} = \frac{w_{e1} - \mu_{e1} E_{e1}}{\sigma_{e1}}, \qquad w'_{e2} = \frac{w_{e2} - \mu_{e2} E_{e2}}{\sigma_{e2}} \qquad (2)$$

By constraining the updates of the encoder weight matrices $w_{e1}, w_{e2}$ with weight standardization, the updated weight matrices $w'_{e1}, w'_{e2}$ have mean 0 and variance 1, which satisfies the condition under which random projection is distance preserving, as stated in the background art (the elements of the random matrices $w'_{e1}, w'_{e2}$ have mean 0 and variance 1), so the dimension-reduction processing of the auto-encoder is distance preserving. The BP update is performed first and the weight standardization afterwards, so as to reduce the resulting error.
However, overwriting the encoder weight matrix through weight standardization after each update may affect the stability of the training process, as noted above. This embodiment therefore proposes another way to constrain the update of the encoder weight matrix: adding a regularization term to the original auto-encoder loss function (1). One regularization term drives the mean of the encoder weight elements to 0 and their variance to 1; the other is a unit-orthogonal regularization term. The regularization term for the first kind of encoder weight standardization (the second constraint method) is:

$$L_r = \sum_{i=1}^{c} \left( \mu_{ei}^2 + (\sigma_{ei} - 1)^2 \right) \qquad (3)$$

The expression of the loss function after adding the regularization term is:

$$L_C = L + \alpha L_r \qquad (4)$$

where $w_{ei}$ is the i-th weight matrix of the encoder, $\sigma_{ei}$ is the variance of the weight matrix $w_{ei}$, $\mu_{ei}$ is its mean, the number of encoder weight matrices c is 2, L is the loss function without the regularization term, $L_C$ is the loss function with the regularization term added, and α is a hyperparameter. The value of α is determined experimentally: an initial value is set, and whether α meets the requirement is checked through the CV value or the TopK score; if not, the test is continued until the CV value becomes smaller or the TopK score becomes higher.
The second, unit-orthogonal regularization term (the third constraint method) is:

$$L_r = \sum_{i=1}^{c} \left\| w_{ei}^{\mathrm{T}} w_{ei} - I \right\|_F^2 \qquad (5)$$

The expression of the loss function after adding the regularization term is:

$$L_C = L + \alpha L_r \qquad (6)$$

where $\|\cdot\|_F$ is the F-norm of a matrix, $w_{ei}^{\mathrm{T}}$ is the transpose of the weight matrix $w_{ei}$, and I is the identity matrix.
After the regularization term is added, the value of the loss function is calculated and the encoder weight matrix is updated according to this value, so that the mean of the encoder weight matrix tends to 0 and its variance to 1. In this embodiment the auto-encoder is updated with the BP algorithm. After the weight matrices of the auto-encoder have been updated, if the preset number of training iterations has not been reached, return to step 4 and continue training so that the value of the loss function decreases, until the preset number of iterations is reached (the auto-encoder converges); otherwise go to step 6, which indicates that the auto-encoder has been trained.
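Putting the pieces together, a self-contained sketch of the training procedure with the soft (regularization-term) constraint; the optimizer, learning rate, α, batch size, number of epochs, and the synthetic stand-in data are all assumptions for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
enc = nn.Sequential(nn.Linear(784, 200, bias=False), nn.Sigmoid(),
                    nn.Linear(200, 10, bias=False), nn.Sigmoid())
dec = nn.Sequential(nn.Linear(10, 200), nn.Sigmoid(),
                    nn.Linear(200, 784), nn.Sigmoid())
opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=0.1)

X = torch.rand(1000, 784)                          # stand-in for the vectorized training samples
R = (torch.rand(1000, 784) >= 0.2).float()         # missing-data matrix: 1 observed, 0 missing
alpha, batch_size, epochs = 0.01, 100, 5           # illustrative hyperparameters

for _ in range(epochs):
    for start in range(0, X.shape[0], batch_size):
        x, r = X[start:start + batch_size], R[start:start + batch_size]
        x_hat = dec(enc(x * r))                                         # forward pass on masked input
        recon = ((r * x - r * x_hat) ** 2).sum()                        # masked reconstruction loss (1)
        enc_w = [m.weight for m in enc if isinstance(m, nn.Linear)]
        reg = sum(w.mean() ** 2 + (w.var() - 1.0) ** 2 for w in enc_w)  # mean-0 / variance-1 penalty
        opt.zero_grad()
        (recon + alpha * reg).backward()                                # BP update of L_C = L + alpha * L_r
        opt.step()

C = enc(X * R)                                     # reduced 10-dimensional representations after training
```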
6. Use the trained auto-encoder to reduce the dimension of the sample matrix of the test sample set; the dimension of each output vector after reduction is 10.
To evaluate the distance-preserving stability of the dimension-reduction processing, two evaluation methods are used in this example. The first is the TopK algorithm (prior art): it computes the K nearest neighbours of each sample from the distances between samples (K = 10 in this example), compares how many of the K nearest neighbours of a sample coincide before and after dimension reduction, and computes the TopK score. Let $N_i^{\mathrm{before}}$ and $N_i^{\mathrm{after}}$ denote the sets of the first K nearest neighbours of the i-th sample before and after dimensionality reduction, |N| the number of elements in a set N, and n the number of samples; the TopK score is

$$\mathrm{TopK} = \frac{1}{nK} \sum_{i=1}^{n} \left| N_i^{\mathrm{before}} \cap N_i^{\mathrm{after}} \right|$$

A dimension-reduction method with a higher TopK score better preserves the distances between the high-dimensional data before reduction.
The second evaluation method is the CV evaluation method, which represents the stability of the dimension-reduction processing more intuitively through the distribution of the distance ratios before and after reduction. Several pairs of sample points are selected at random; for each pair the ratio of the distance before reduction to the distance after reduction is calculated, and the distribution of these ratios over the batch is plotted. The Coefficient of Variation (CV), also called the coefficient of dispersion, is a normalized measure of the dispersion of a probability distribution, defined as the ratio of the standard deviation to the mean. In this embodiment the coefficient of variation describes the dispersion of the distribution of the before/after distance ratios; a dimension-reduction method with a small CV value has good distance-preserving stability. Let $x_t$ denote the t-th sample vector to be reduced and $c_t = f(w_{e2} f(w_{e1} x_t))$ the reduced vector of the t-th sample. The CV value and the TopK score of the reduced samples are calculated, and the distribution of the distance ratios before and after reduction is plotted.
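And a matching sketch of the CV evaluation (the number of random pairs and the random seed are arbitrary choices):

```python
import numpy as np

def distance_ratio_cv(X_high, X_low, num_pairs=10000, seed=0):
    """Coefficient of variation of the before/after distance ratios over random sample pairs."""
    rng = np.random.default_rng(seed)
    n = X_high.shape[0]
    i, j = rng.integers(0, n, size=(2, num_pairs))
    keep = i != j                                                     # drop pairs of identical indices
    num = np.linalg.norm(X_high[i[keep]] - X_high[j[keep]], axis=1)   # distance before reduction
    den = np.linalg.norm(X_low[i[keep]] - X_low[j[keep]], axis=1)     # distance after reduction
    ratio = num / den
    return float(ratio.std() / ratio.mean())                          # CV = standard deviation / mean
```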
FIGS. 4-6 show the experimental results. FIG. 4 shows the results of the different dimension-reduction methods on the MNIST data set without missing data. In FIG. 4, ortho denotes the auto-encoder using the unit-orthogonal regularization term (the third method), meanstd_hard denotes the auto-encoder using encoder weight standardization (the first method), meanstd_soft denotes the auto-encoder using a regularization term to standardize the encoder weights (the second method), AE denotes an ordinary auto-encoder, and PCA denotes the principal component analysis algorithm. Because the activation function and the initialization method of the auto-encoder affect the test results, the ReLU and sigmoid activation functions are compared in FIG. 4, and uniform-distribution initialization (uniform) is compared with Gaussian-distribution initialization (normal); the bold italic entries in FIG. 4 mark the best results. As FIG. 4 shows, the CV values of the three constraint methods proposed by the invention are all lower than that of the ordinary auto-encoder, with the ortho method having the lowest CV value, and the TopK scores of the three constraint methods are all higher than those of the ordinary auto-encoder and the PCA algorithm, with the ortho method having the highest TopK score. This demonstrates the feasibility and superiority of the distance-preserving dimension reduction method of the invention with respect to preserving distances during dimension reduction.
FIG. 5 shows the results of the different dimension-reduction methods on the MNIST data set with artificially added missing data; the abscissa is the proportion of missing data and the ordinate is the TopK score. The activation function is the sigmoid function and the initialization method is uniform-distribution initialization.
FIG. 6 shows the distribution of the distance ratios before and after dimension reduction for the three methods (the improved auto-encoder ortho_AE, the ordinary auto-encoder AE, and principal component analysis PCA) on the MNIST data set without missing data. The abscissa is the ratio of the distances between the same pair of sample points before and after reduction, and the ordinate is the probability density. As can be seen from FIG. 6, the CV values of the improved auto-encoder ortho_AE, the ordinary auto-encoder AE, and PCA are 0.153, 0.247, and 0.162 respectively; the CV value of the improved auto-encoder ortho_AE is the smallest, indicating that its distance-preserving stability is the best.
The above disclosure describes only specific embodiments of the present invention, but the scope of the present invention is not limited thereto; any changes or modifications that a person skilled in the art can easily conceive within the technical scope of the present invention shall be covered by the scope of the present invention.

Claims (7)

1. A data distance-keeping dimension reduction method containing missing data is characterized by comprising the following steps:
Step 1: acquiring a sample data set, dividing the sample data set into a training sample set and a testing sample set, and vectorizing all samples in the sample data set to form a sample matrix; generating a missing data matrix corresponding to the sample matrix one by one according to the sample matrix, wherein the value of each element in the missing data matrix is 1 or 0, 1 represents that the position data in the sample matrix is normal, and 0 represents that the position data in the sample matrix is missing;
Step 2: constructing an automatic encoder, and selecting an activation function and an initialization method of the automatic encoder;
Step 3: designing a loss function of the automatic encoder in step 2 according to the missing data matrix in step 1;
Step 4: selecting a sample vector from the training sample set in step 1 as the input of the automatic encoder, and calculating the value of the loss function in step 3;
Step 5: updating the weight matrix of the encoder in the automatic encoder, so that the updating of the weight matrix of the encoder has a random projection characteristic; judging whether the preset training times are reached, if the training times are less than or equal to the preset training times, turning to step 4, otherwise, turning to step 6;
Step 6: performing dimensionality reduction on the sample matrix in the test sample set by adopting the trained automatic encoder.
2. The data distance-preserving dimension reduction method according to claim 1, wherein in step 2 the network structure of the automatic encoder is input layer-first hidden layer-second hidden layer-third hidden layer-output layer, the network structure of the encoder of the automatic encoder is input layer-first hidden layer-second hidden layer, and the network structure of the decoder of the automatic encoder is second hidden layer-third hidden layer-output layer.
3. The data distance-preserving dimension reduction method according to claim 1, wherein in step 3 the loss function $L(W_e, W_d, b_d)$ of the automatic encoder is:

$$L(W_e, W_d, b_d) = \sum_{t=1}^{s} \left\| r_t \odot x_t - r_t \odot \hat{x}_t \right\|^2$$

where $\|\cdot\|$ represents the two-norm of a vector, ⊙ represents element-wise multiplication of two vectors, $x_t$ is the t-th input vector of the auto-encoder, $\hat{x}_t$ is the output vector of the auto-encoder when $x_t$ is the input, $r_t$ is the missing data vector corresponding to the input vector $x_t$, $x_t \in X$, $r_t \in R$, X is the sample matrix, R is the missing data matrix corresponding to the sample matrix X, $X = [x_1, x_2, \ldots, x_k] \in \mathbb{R}^{m \times k}$ contains k sample vectors each of length m, s is the size of one batch, $\hat{x}_t = D(E(x_t; W_e); W_d, b_d)$, $D(\cdot\,; W_d, b_d)$ represents the decoder, $E(\cdot\,; W_e)$ represents the encoder, $W_e$ is the weight matrix of the encoder, $W_d$ is the weight matrix of the decoder, and $b_d$ represents the bias term of the decoder.
4. The data distance-preserving dimension reduction method according to claim 1, wherein in step 5 the weight matrix of the encoder in the loss function is updated by weight standardization, so that the updated encoder weight matrix has mean 0 and variance 1, the weight standardization formula being:

$$w'_{ei} = \frac{w_{ei} - \mu_{ei} E_{ei}}{\sigma_{ei}}$$

where $w_{ei}$ is the i-th encoder weight matrix after the BP-algorithm update and before weight standardization, $w'_{ei}$ is the matrix obtained from $w_{ei}$ by weight standardization, $\mu_{ei}$ is the mean of $w_{ei}$, $\sigma_{ei}$ is the standard deviation of $w_{ei}$, and $E_{ei}$ is a matrix whose elements are all 1.
5. The data distance-preserving dimension reduction method according to claim 1, wherein a regularization term is added to the loss function of step 3, and the weight matrix of the encoder is updated according to the value of the loss function with the regularization term added.
6. The data distance-preserving dimension reduction method according to claim 5, wherein the regularization term $L_r$ is expressed as:

$$L_r = \sum_{i=1}^{c} \left( \mu_{ei}^2 + (\sigma_{ei} - 1)^2 \right)$$

the expression of the loss function after adding the regularization term being:

$$L_C = L + \alpha L_r$$

where $w_{ei}$ is the i-th weight matrix of the encoder, $\sigma_{ei}$ is the variance of the weight matrix $w_{ei}$, $\mu_{ei}$ is the mean of the weight matrix $w_{ei}$, c is the number of encoder weight matrices, L is the loss function without the regularization term, $L_C$ is the loss function with the regularization term added, and α is a hyperparameter.
7. The data distance-preserving dimension reduction method according to claim 5, wherein the regularization term $L_r$ is expressed as:

$$L_r = \sum_{i=1}^{c} \left\| w_{ei}^{\mathrm{T}} w_{ei} - I \right\|_F^2$$

the expression of the loss function after adding the regularization term being:

$$L_C = L + \alpha L_r$$

where $\|\cdot\|_F$ is the F-norm of a matrix, $w_{ei}$ is the i-th weight matrix of the encoder, $w_{ei}^{\mathrm{T}}$ is the transpose of the weight matrix $w_{ei}$, I is the identity matrix, c is the number of encoder weight matrices, L is the loss function without the regularization term, $L_C$ is the loss function with the regularization term added, and α is a hyperparameter.
CN201911059239.5A 2019-11-01 2019-11-01 Data distance-preserving dimension reduction method containing missing data Pending CN110852366A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911059239.5A CN110852366A (en) 2019-11-01 2019-11-01 Data distance-preserving dimension reduction method containing missing data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911059239.5A CN110852366A (en) 2019-11-01 2019-11-01 Data distance-preserving dimension reduction method containing missing data

Publications (1)

Publication Number Publication Date
CN110852366A true CN110852366A (en) 2020-02-28

Family

ID=69598532

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911059239.5A Pending CN110852366A (en) 2019-11-01 2019-11-01 Data distance-preserving dimension reduction method containing missing data

Country Status (1)

Country Link
CN (1) CN110852366A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111967499A (en) * 2020-07-21 2020-11-20 电子科技大学 Data dimension reduction method based on self-learning
CN111967499B (en) * 2020-07-21 2023-04-07 电子科技大学 Data dimension reduction method based on self-learning
CN112183723A (en) * 2020-09-17 2021-01-05 西北工业大学 Data processing method for clinical detection data missing problem
CN112488321A (en) * 2020-12-07 2021-03-12 重庆邮电大学 Antagonistic machine learning defense method oriented to generalized nonnegative matrix factorization algorithm
CN112488321B (en) * 2020-12-07 2022-07-01 重庆邮电大学 Antagonistic machine learning defense method oriented to generalized nonnegative matrix factorization algorithm
CN113704073A (en) * 2021-09-02 2021-11-26 交通运输部公路科学研究所 Method for detecting abnormal data of automobile maintenance record library
CN113704073B (en) * 2021-09-02 2024-06-04 交通运输部公路科学研究所 Method for detecting abnormal data of automobile maintenance record library
CN114722943A (en) * 2022-04-11 2022-07-08 深圳市人工智能与机器人研究院 Data processing method, device and equipment


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20200228