CN112712855A

CN112712855A - Joint-training-based clustering method for gene microarrays containing missing values

Info

Publication number: CN112712855A
Application number: CN202011578976.9A
Authority: CN
Inventors: 马千里; 陈楚鑫
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2020-12-28
Filing date: 2020-12-28
Publication date: 2021-04-27
Anticipated expiration: 2040-12-28
Also published as: CN112712855B

Abstract

The invention discloses a joint-training-based clustering method for gene microarrays containing missing values, which comprises the following steps: calculating the missing rate of the gene microarray data, removing genes whose missing rate exceeds 10%, and then dividing the gene microarray data into a training set and a test set; constructing a deep neural network comprising a sequence-to-sequence encoding-decoding network and an adversarial learning module, and training the constructed deep neural network with the training set to determine its parameters; inputting the test set into the deep neural network to obtain deep feature representations of the gene microarray data containing missing values in the test set; and finally, applying the K-means clustering algorithm to the deep feature representations to obtain a clustering result for the gene microarray containing missing values. The invention provides an end-to-end framework for clustering gene expression data and removes the difficulty, inherent in traditional methods, of having to select and combine a suitable filling method and a suitable clustering method.

Description

Joint-training-based clustering method for gene microarrays containing missing values

Technical Field

The invention relates to the technical field of gene microarrays, and in particular to a joint-training-based clustering method for gene microarrays containing missing values.

Background

Clustering has important experimental value in gene microarray research. For example, once the expression sequences of a number of genes have been obtained, clustering those sequences is a common experimental technique: it assigns genes with similar phenotypes to the same cluster, which helps in further determining the functions of the genes.

Missing values often arise while gene expression sequence data are being acquired, owing to human oversight or uncontrolled experimental error. Missing values pose a significant obstacle to downstream studies of gene microarrays. Re-establishing the experimental environment to obtain complete gene expression sequences is expensive and impractical. It is therefore necessary to extract information useful for analysis from gene expression data that contain missing values.

When data contain missing values, a common approach is to fill in the missing values first and then cluster the filled data. However, this two-stage approach completely separates the filling process from the clustering process, and errors introduced during filling cannot be reduced during clustering, so clustering performance may degrade and a suboptimal solution may result. Recent studies have shown that methods with high filling accuracy do not necessarily achieve higher performance in downstream clustering tasks than methods with low filling accuracy. In addition, choosing the optimal combination of filling method and clustering method for a gene microarray containing missing values in an unsupervised setting is itself a difficult problem.

Disclosure of Invention

The invention aims to overcome the above defects in the prior art and provides a joint-training-based clustering method for gene microarrays containing missing values. The invention trains the filling process and the clustering process as a whole, uses an adversarial learning strategy to obtain more accurate values for the filling task, and uses an encoding-decoding network to encode gene microarray data into an effective low-dimensional representation, thereby improving clustering performance.

The purpose of the invention can be achieved by adopting the following technical scheme:

A joint-training-based clustering method for gene microarrays containing missing values, the clustering method comprising the following steps:

S1, obtaining a gene microarray dataset sample and dividing the gene microarray data into a training set and a test set, wherein the gene microarray dataset sample is a data matrix composed of N row vectors, each row vector of the data matrix corresponds to the expression data sequence of one gene, and N represents the number of genes recorded in the data matrix;

S2, constructing a joint-training-based deep neural network for gene microarrays containing missing values, wherein the deep neural network comprises a sequence-to-sequence encoding-decoding network and an adversarial learning module which are sequentially connected; the sequence-to-sequence encoding-decoding network consists of an encoder and a decoder, the encoder encodes gene microarray data into a deep feature representation, and the decoder uses the deep feature representation to reconstruct the input of the encoder; the adversarial learning module consists of a discriminator, and the predicted values generated by the encoder are combined with the input of the encoder, namely the observed values, to obtain the filled values. The encoding-decoding network is designed to learn a high-quality feature representation of the input gene microarray data by exploiting the feature learning capability of such networks. The discriminator is designed to provide a direct supervisory signal for the generation task of the encoder, bringing the filled values closer to the true values;

S3, training the deep neural network by using the training set, and determining learnable parameters in the deep neural network;

S4, inputting the test set into the deep neural network to obtain deep feature representations of the gene microarray data containing missing values in the test set;

S5, clustering the deep feature representations with the K-means clustering method to obtain a clustering result for the gene microarray containing missing values in the test set.

Further, before the step S1 of dividing the gene microarray dataset sample into the training set and the test set, the method further includes:

counting the missing rate of every gene in the gene microarray dataset sample, removing the genes whose missing rate exceeds 10%, and combining the remaining genes, in their original order, into a new data matrix.

Further, the step S2 process is as follows:

S21, constructing an encoder, wherein the encoder consists of a multi-layer bidirectional recurrent neural network, predicts missing values from the observed values in a data sample, and fills the predicted values into the corresponding missing positions so that the iteration of the recurrent neural network can proceed; the encoder encodes the data sample to obtain the deep feature representation of that data sample;

S22, constructing a decoder, wherein the decoder consists of a single-layer unidirectional recurrent neural network and reconstructs the input data sample from the deep feature representation;

S23, constructing a discriminator, wherein the discriminator consists of a multi-layer unidirectional recurrent neural network, receives the filled data sample, and outputs a confidence for the data at each time step of the data sample, the confidence representing the probability that the data at the corresponding time step is a true value.

Further, the learnable parameters in the deep neural network include: the number of layers and the number of hidden units of the encoder, the decoder and the discriminator, the training batch size, the dimension of the deep feature representation space, and the number of training iterations.

Further, in step S3, within one optimization iteration the encoding-decoding network is trained first and the adversarial learning module is trained next, and the iteration is repeated until the maximum number of iterations is reached, wherein the process of training the sequence-to-sequence encoding-decoding network is as follows:

S311, inputting the training samples into the encoder to obtain the predicted-value sequence of the encoder and the reconstructed sequence of the decoder;

S312, optimizing the parameters of the encoding-decoding network by reducing, through gradient descent, the values of the prediction loss function, the reconstruction loss function, and the predicted-value adversarial loss function, where the prediction loss function is defined as follows:

wherein S represents the training set size, X_i represents the gene microarray data matrix containing missing values that is input to the encoder, the encoder output is the predicted-value sequence matrix, and M_i represents the missing-value indicator matrix corresponding to the input of the encoder;

the reconstruction loss function is defined as follows:

wherein R_i represents the reconstructed data matrix output by the decoder;

the predicted-value adversarial loss function is defined as follows:

wherein D represents the discriminator, U_i represents the filled value obtained by combining the observed values and the predicted values, the multiplication symbol represents element-wise multiplication, and log(·) represents the logarithm operation; the filled value obtained by combining the observed values and the predicted values has the following expression:

wherein the process of training the adversarial learning module is as follows:

inputting the predicted-value sequence into the discriminator to obtain a true-value probability vector, wherein the true-value probability vector consists of the observed-value probabilities and the filled-value probabilities, and the parameters of the discriminator are optimized by reducing the value of the true-value probability loss function through gradient descent, the true-value probability loss function being defined as follows:

Compared with the prior art, the invention has the following advantages and effects:

1) The invention jointly optimizes the filling process and the clustering process and introduces an adversarial learning strategy to improve the accuracy of the filled values, so that the filling process is optimized toward the clustering process and better clustering performance is obtained.

2) The invention provides an end-to-end clustering framework for gene microarrays containing missing values. The filling process and the clustering process are integrated, breaking away from the traditional separated framework of filling first and then clustering, so that there is no need to select an optimal combination of filling method and clustering method for the task of clustering microarrays containing missing values.

Drawings

FIG. 1 is a flow chart of the joint-training-based clustering method for gene microarrays containing missing values disclosed in the embodiments of the present invention;

FIG. 2 is a schematic diagram of the gene microarray data clustering result obtained with the joint-training-based clustering method for gene microarrays containing missing values disclosed in the embodiments of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Examples

As shown in FIG. 1, this embodiment discloses a joint-training-based clustering method for gene microarrays containing missing values. First, a gene microarray dataset sample is obtained and the gene microarray data is divided into a training set and a test set, where the gene microarray dataset sample is a data matrix; in this embodiment the data matrix is composed of 500 row vectors, each row vector corresponds to the expression data sequence of one gene, and 500 is the number of genes recorded in the data matrix. Next, a joint-training-based deep neural network for gene microarrays containing missing values is constructed, comprising a sequence-to-sequence encoding-decoding network and an adversarial learning module. The deep neural network is trained with the training set to determine its learnable parameters. The test set is then input into the deep neural network to obtain deep feature representations of the gene microarray data containing missing values in the test set. Finally, the deep feature representations are clustered with the K-means clustering method to obtain a clustering result for the gene microarray containing missing values in the test set.

S1, preprocessing the acquired gene microarray data set sample

The missing rate of each gene in the gene microarray dataset is counted, and genes whose missing rate exceeds 10% are removed. Gene microarray datasets often contain many missing values, and genes with an excessively high missing rate are not only unfavorable for clustering but also tend to introduce excessive filling bias that affects the filling of other genes. Genes whose missing rate exceeds the 10% threshold are therefore removed, yielding gene microarray data of 384 rows with a more compact data distribution, where 384 is the number of genes remaining after removal.
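The preprocessing of step S1 can be illustrated with a minimal sketch. It assumes that missing entries are encoded as NaN (or None) and that the data matrix is held as a list of rows; the function names `is_missing`, `missing_rate` and `filter_genes` are hypothetical and are not taken from the invention.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing entries (an assumed encoding)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def missing_rate(row):
    """Fraction of entries in one gene's expression sequence that are missing."""
    return sum(is_missing(v) for v in row) / len(row)

def filter_genes(matrix, threshold=0.10):
    """Keep rows whose missing rate does not exceed the threshold, in original order."""
    return [row for row in matrix if missing_rate(row) <= threshold]

nan = float("nan")
data = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],  # 0% missing: kept
    [0.1, nan, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],  # 10% missing: kept
    [nan, nan, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],  # 20% missing: removed
]
filtered = filter_genes(data)
print(len(filtered))  # number of genes remaining after removal
```

A gene whose missing rate is exactly 10% is kept here, since the text removes only genes whose rate exceeds the threshold.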

S2, constructing a joint-training-based deep neural network for gene microarrays containing missing values, wherein the deep neural network comprises a sequence-to-sequence encoding-decoding network and an adversarial learning module;

The sequence-to-sequence encoding-decoding network consists of an encoder, which encodes gene microarray data into a deep feature representation, and a decoder, which uses the deep feature representation to reconstruct the input to the encoder. In addition, the encoder predicts each subsequent missing value from the information in the preceding observed values and fills the predicted value into the corresponding missing position so that the iteration of the recurrent neural network can proceed. The encoder consists of a multi-layer bidirectional recurrent neural network, and the decoder consists of a single-layer unidirectional recurrent neural network. The invention exploits the strength of recurrent neural networks in modeling sequence data to obtain a more effective deep feature representation of gene microarrays containing missing values.

The adversarial learning module consists of a discriminator. The predicted values generated by the encoder are combined with the input of the encoder, namely the observed values, to obtain the filled values. The discriminator receives the filled values as input and returns a positive supervisory signal for the observed values and a negative supervisory signal for the filled-in values. The supervisory signal returned by the discriminator prompts the encoder to generate more accurate predicted values. The discriminator consists of a multi-layer unidirectional recurrent neural network.
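The predict-and-fill behaviour of the encoder can be illustrated with a toy sketch. It is not the network of the invention: it assumes a single scalar, unidirectional recurrent cell with fixed, hypothetical weights (`w_in`, `w_rec`, `w_out`), whereas the encoder described above is multi-layer, bidirectional and trained. The sketch only shows how a predicted value takes the place of a missing input so that the recurrence can continue.

```python
import math

def encode_with_filling(sequence, mask, w_in=0.5, w_rec=0.5, w_out=1.0):
    """Run a toy scalar recurrent cell over one expression sequence.

    sequence: list of floats (entries at missing positions are ignored)
    mask:     list of 0/1 flags, 1 = observed, 0 = missing
    Returns (final hidden state, predictions, filled sequence).
    """
    h = 0.0
    predictions, filled = [], []
    for x_t, m_t in zip(sequence, mask):
        x_hat = w_out * h                        # predict the current value from earlier steps
        x_in = x_t if m_t == 1 else x_hat        # fall back to the prediction when missing
        h = math.tanh(w_in * x_in + w_rec * h)   # advance the recurrence
        predictions.append(x_hat)
        filled.append(x_in)
    return h, predictions, filled

# The second entry is missing; its stored value (0.0) is never read.
h, predictions, filled = encode_with_filling([0.5, 0.0, 0.2], [1, 0, 1])
print(filled)
```

In this sketch the final hidden state `h` plays the role of the deep feature representation, and `predictions` plays the role of the predicted-value sequence.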
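The combination of observed and predicted values, and the supervisory targets seen by the discriminator, can be sketched as follows. The sketch assumes a 0/1 indicator mask in which 1 marks an observed entry, and assumes the combination takes the usual masked form U = M * X + (1 - M) * X_hat; the function names are hypothetical.

```python
def combine(observed, predicted, mask):
    """Element-wise U = M * X + (1 - M) * X_hat for one gene sequence.

    Missing positions in `observed` must hold a finite placeholder (for
    example 0.0), because 0 * NaN would still be NaN.
    """
    return [m * x + (1 - m) * p for x, p, m in zip(observed, predicted, mask)]

def discriminator_targets(mask):
    """Positive signal (1.0) for observed entries, negative (0.0) for filled-in ones."""
    return [float(m) for m in mask]

x = [0.8, 0.0, 0.3]       # 0.0 is a placeholder at the missing position
x_hat = [0.7, 0.5, 0.4]   # hypothetical encoder predictions
m = [1, 0, 1]
u = combine(x, x_hat, m)
targets = discriminator_targets(m)
print(u, targets)
```

Observed entries pass through unchanged, so only the missing positions carry values produced by the encoder.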

S31, training the sequence-to-sequence encoding-decoding network.

S311, inputting the training samples into the encoder to obtain a predicted value sequence of the encoder and a reconstruction sequence of the decoder.

the reconstruction loss function is defined as follows:

wherein R_i represents the reconstructed data matrix output by the decoder;

the predicted-value adversarial loss function is defined as follows:

S32, training the adversarial learning module. The predicted-value sequence is input into the discriminator to obtain a true-value probability vector, which consists of the observed-value probabilities and the filled-value probabilities, and the parameters of the discriminator are optimized by reducing the value of the true-value probability loss function through gradient descent. The true-value probability loss function is defined as follows:

Within one optimization iteration, the encoding-decoding network is trained first and the adversarial learning module is trained next. The iteration is repeated until the maximum number of iterations is reached.
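The loss formulas referred to in steps S31 and S32 are not reproduced in this text. The following sketch therefore shows only one plausible form of such losses, under stated assumptions: the prediction and reconstruction losses are taken to be mean squared errors restricted to observed entries, the encoder-side adversarial loss is taken to be the negative log-confidence of the discriminator on filled-in entries, and the discriminator loss is taken to be a binary cross-entropy with the indicator mask as label. All function names are hypothetical and the forms may differ from those of the invention.

```python
import math

def masked_mse(target, estimate, mask):
    """Mean squared error over observed positions only (mask == 1)."""
    n_observed = sum(mask)
    return sum(m * (t - e) ** 2 for t, e, m in zip(target, estimate, mask)) / n_observed

def generator_adversarial_loss(confidence, mask, eps=1e-8):
    """Encoder-side loss: push the discriminator's confidence on filled entries toward 1."""
    n_missing = sum(1 - m for m in mask)
    if n_missing == 0:
        return 0.0
    return -sum((1 - m) * math.log(c + eps) for c, m in zip(confidence, mask)) / n_missing

def discriminator_loss(confidence, mask, eps=1e-8):
    """Binary cross-entropy with the mask as label: 1 = observed, 0 = filled in."""
    return -sum(m * math.log(c + eps) + (1 - m) * math.log(1 - c + eps)
                for c, m in zip(confidence, mask)) / len(mask)

x = [0.8, 0.0, 0.3]             # input with a placeholder at the missing position
x_hat = [0.7, 0.5, 0.4]         # hypothetical encoder predictions
mask = [1, 0, 1]
confidence = [0.9, 0.2, 0.8]    # hypothetical per-step discriminator outputs

pred_loss = masked_mse(x, x_hat, mask)
gen_loss = generator_adversarial_loss(confidence, mask)
disc_loss = discriminator_loss(confidence, mask)
print(pred_loss, gen_loss, disc_loss)
```

Under these assumptions, one optimization iteration would first lower `pred_loss`, the reconstruction counterpart of `masked_mse`, and `gen_loss` with respect to the encoder-decoder parameters, and then lower `disc_loss` with respect to the discriminator parameters.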

S5, clustering the deep feature representations with the K-means clustering method to obtain a clustering result for the gene microarray containing missing values in the test set. FIG. 2 shows the result of visualizing the deep feature representations after reducing them to a 2-dimensional plane with the t-SNE dimensionality reduction method, where circles and crosses represent gene microarray data points of different classes. FIG. 2 shows that the deep feature representations obtained with the joint-training-based deep neural network for gene microarrays containing missing values have clearly separable class boundaries, that the deep feature representations of gene microarray data of the same class lie close together, and that the representations exhibit an obvious cluster structure, which improves the performance of the clustering algorithm.
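The K-means step can be illustrated with a minimal sketch of Lloyd's algorithm. The 2-dimensional points below are made-up stand-ins for deep feature representations, and the function names, the seed and the iteration cap are assumptions for illustration; a library implementation would normally be used in practice.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # initialize from k distinct data points
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:              # converged
            break
        centers = new_centers
    return labels, centers

# Hypothetical deep features forming two well-separated groups.
features = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
            [5.0, 5.1], [5.1, 5.0], [4.9, 5.2]]
labels, centers = kmeans(features, k=2)
print(labels)
```

Which group receives label 0 depends on the random initialization, so only the grouping, not the label values, is meaningful.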

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. A joint-training-based clustering method for gene microarrays containing missing values, characterized by comprising the following steps:

S2, constructing a joint-training-based deep neural network for gene microarrays containing missing values, wherein the deep neural network comprises a sequence-to-sequence encoding-decoding network and an adversarial learning module which are sequentially connected; the sequence-to-sequence encoding-decoding network consists of an encoder and a decoder, the encoder encodes gene microarray data into a deep feature representation, and the decoder uses the deep feature representation to reconstruct the input of the encoder; the adversarial learning module consists of a discriminator, and the predicted values generated by the encoder are combined with the input of the encoder, namely the observed values, to obtain the filled values;

2. The joint-training-based clustering method for gene microarrays containing missing values as claimed in claim 1, wherein before the gene microarray dataset sample is divided into the training set and the test set in step S1, the method further comprises:

3. The joint-training-based clustering method for gene microarrays containing missing values as claimed in claim 1, wherein step S2 comprises the following steps:

4. The joint-training-based clustering method for gene microarrays containing missing values as claimed in claim 1, wherein the learnable parameters in the deep neural network comprise: the number of layers and the number of hidden units of the encoder, the decoder and the discriminator, the training batch size, the dimension of the deep feature representation space, and the number of training iterations.

5. The joint-training-based clustering method for gene microarrays containing missing values as claimed in claim 1, wherein in step S3, within one optimization iteration, the encoding-decoding network is trained first and the adversarial learning module is trained next, and the iteration is repeated until the maximum number of iterations is reached, wherein the process of training the sequence-to-sequence encoding-decoding network is as follows:

the reconstruction loss function is defined as follows:

wherein R_i represents the reconstructed data matrix output by the decoder;

the predicted-value adversarial loss function is defined as follows: