CN111276187B - Gene expression profile feature learning method based on self-encoder

Gene expression profile feature learning method based on self-encoder

Info

Publication number
CN111276187B
CN111276187B (application CN202010029068.8A)
Authority
CN
China
Prior art keywords
layer
data
sample
encoder
gene expression
Prior art date
Legal status
Active
Application number
CN202010029068.8A
Other languages
Chinese (zh)
Other versions
CN111276187A (en)
Inventor
彭绍亮
张磊
李非
毕夏安
周德山
肖港
辛彬
王子航
Current Assignee
Hunan University
Original Assignee
Hunan University
Priority date
Filing date
Publication date
Application filed by Hunan University
Priority to CN202010029068.8A
Publication of CN111276187A
Application granted
Publication of CN111276187B
Legal status: Active

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 50/00: ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Investigating Or Analysing Biological Materials (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to the field of computer science and discloses a gene expression profile feature learning method based on an autoencoder. The method exploits the ability of deep learning to capture data characteristics accurately: it combines the multi-channel structure of convolutional neural networks with the feature learning capability of autoencoders in a new multi-channel autoencoder model, which learns a feature representation of relatively low dimensionality that effectively distinguishes the original gene expression profile data.

Description

Gene expression profile feature learning method based on self-encoder
Technical Field
The invention relates to a gene expression profile feature learning method based on an autoencoder, and belongs to the field of computer science.
Background
The LINCS gene expression profile data cover the genome-wide expression levels of human cell lines under many different experimental conditions, capturing the overall cellular state of in vitro cell models under different drugs, doses and time points, and providing the basic data necessary for computational analysis and experimental validation. The data cover the silencing and over-expression of more than 4,000 genes and the perturbation of 77 typical cell lines with more than 7,000 small-molecule compounds, including the necessary biological replicate experiments. However, even slight differences in experimental conditions lead to large differences in the gene expression levels of the same cell line. Moreover, the human genome comprises more than twenty thousand genes: if all of their expression levels are used directly as sample features, redundancy may exist among the features, and using such high-dimensional features directly increases the difficulty of model training and hurts model performance. Although many deep learning models have been applied to gene expression data, which are characteristically high-dimensional with small sample sizes, their computational complexity is relatively high. Therefore, before a deep learning model is used, a suitable feature representation learning method is required to reduce the dimensionality of the gene expression data.
To solve this problem, it is necessary to devise a gene expression profile feature learning method that extracts, from high-dimensional sparse gene expression profile data, a feature representation of relatively low dimensionality that still distinguishes the original data effectively, so that it can be better applied to tasks such as gene expression profile classification.
Disclosure of Invention
In order to achieve the above object, the present invention provides a gene expression profile feature learning method based on an autoencoder. Feature learning divides broadly into traditional feature learning and neural-network-based feature learning. Traditional feature learning learns a projection matrix that maps the data linearly from a high-dimensional feature space into a low-dimensional space, improving the separability of the data and thereby obtaining a better representation of it. Neural-network-based feature learning improves the learning capability through linear transformations of the input data combined with the nonlinear activation functions of the neurons. The invention exploits the ability of deep learning to capture data characteristics accurately: it combines the multi-channel structure of convolutional neural networks with the feature learning capability of autoencoders in a new multi-channel autoencoder model, which learns a feature representation of relatively low dimensionality that effectively distinguishes the original gene expression profile data.
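For contrast, the traditional projection-matrix approach described above can be sketched with a PCA projection; this is a generic illustration under hypothetical data, not part of the claimed method.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical gene expression matrix: 100 samples x 978 landmark genes.
X = np.random.rand(100, 978)

# Learn a projection matrix and map the data linearly into a low-dimensional
# space; PCA stands in for the projection-matrix methods described above.
pca = PCA(n_components=8)
Z = pca.fit_transform(X)   # (100, 8) low-dimensional representation
W = pca.components_        # the learned projection matrix, shape (8, 978)
```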
The technical scheme adopted by the invention is as follows:
a gene expression profile characteristic learning method based on an autoencoder comprises the following steps:
step 1, importing the gene expression profile file from the Level 3 stage of the LINCS data set, and extracting the required data according to cell line, experiment type, perturbation condition and the like;
step 2, processing the extracted sample data into input data that can be used directly by the feature learning model;
step 3, extracting from the high-dimensional sparse gene expression profile data a feature representation of relatively low dimensionality that effectively distinguishes the original data;
This step combines the multi-channel structure of the convolutional neural network with the feature learning capability of the autoencoder in a new multi-channel autoencoder model, whose performance is greatly improved over the original autoencoder.
Step 4, verifying the quality of the features predicted by the model: taking the gene expression profile experimental group as the classification label, the prediction accuracy is computed with a KNN classification method; the higher the accuracy, the better the features learned by the multi-channel autoencoder perform in the classification task.
Preferably, in step 1, the required cell line, experiment type, perturbation condition and other information are screened as needed from the gene information, cell line information and gene expression profile sample information in the LINCS data set attachments; because each gene expression profile sample record is unique, the sample information serves as the screening criterion; finally, the samples holding the required data are located by their sample identifiers, and their expression profile values are the required data.
Preferably, the step 2 comprises the following steps:
step 2.1, reading the extracted gene expression profile sample data;
Step 2.2, manually adding label information according to sample class: the samples of the first class read are labeled 0, those of the second class are labeled 1, and so on for each subsequent class.
Preferably, the step 3 comprises the following steps:
step 3.1, reading the sample data and sample label data, and dividing them into a training set and a validation set;
step 3.2, initializing an Adam optimizer, setting the learning rate to 0.003, and selecting mean absolute error (MAE) as the loss function;
step 3.3, constructing a deep multi-channel autoencoder model, where the multi-channel autoencoder (MCAE) comprises an encoding process and a decoding process and adopts a directed three-layer neural network structure: an input layer, a hidden layer and an output layer, the input layer and the output layer having the same dimensionality n, and the hidden layer having dimensionality m;
step 3.4, feeding the training set into the multi-channel autoencoder for model training, saving the best weights according to the loss on the validation set, and loading the best weights after training to obtain the optimal model;
step 3.5, predicting the feature representation of the sample data using the optimal model.
Preferably, in step 3.3, the encoding process from the input layer to the hidden layer reduces the dimensionality of the input data, and the resulting code serves as the feature representation of the data. Denoting the encoding function by $f$:

$$h = \frac{1}{n}\sum_{i=1}^{n} h_i, \qquad h_i = f_i(x) = S_f(w_i x + p)$$

where $h$ is the final feature representation, $h_i$ is the feature representation of the $i$-th channel, $n$ is the number of channels, $S_f$ is the encoder activation function, taken as the ReLU function $S(x) = \max(0, x)$, $w_i$ is the weight matrix between the $i$-th channel of the input layer and the hidden layer, and $p \in \mathbb{R}^m$ is a bias term. The decoding process, from the hidden layer to the output layer, reconstructs the input data from the code produced by the hidden layer. Denoting the decoding function by $g$:

$$y = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad y_i = g(h_i) = S_g(w h_i + q)$$

where $y$ is the final reconstructed data, $y_i$ is the reconstructed data of the $i$-th channel, $n$ is the number of channels, $S_g$ is the decoder activation function, taken as the sigmoid function $S(x) = 1/(1 + e^{-x})$, $w$ is the weight matrix between the hidden layer and the output layer, and $q \in \mathbb{R}^n$ is a bias term.
superposing a plurality of multi-channel self-encoders layer by layer to obtain a depth multi-channel self-encoder model containing a plurality of hidden layers, and expressing by 2-dimensional features: a first layer 978 × 1 of the depth multichannel self-encoder, wherein 978 is a feature dimension, 1 is a channel number, 5 times of weights are initialized by using different random seeds, and the feature of 978 dimensions is input by using a Dense layer respectively to obtain a second layer 128 × 5, namely 5 layers 128 × 1; the second layer 128 x 5 initializes 4 times of weights with different random seeds, the first weight uses a Dense layer to input 5 128-dimensional features respectively to obtain 5 2-dimensional features, then the average value is calculated according to the corresponding position to obtain the first 2 x 1 of the third layer, and the rest is repeated to obtain the second, third and fourth 2 x 1 of the third layer, namely the third layer 2 x 4, the fourth layer 128 x 5 and the fifth layer 978 x 1.
Preferably, in step 4, the feature data and sample label data are first read and divided into a training set and a test set; a KNN classifier is then instantiated, with the n_neighbors parameter taken from 1 to 10 in turn; the KNN model is fitted on the training set; and finally the trained KNN model predicts the sample labels of the test set, the proportion of correctly predicted labels being the prediction accuracy.
Drawings
FIG. 1 is a schematic diagram of a multi-channel self-encoder according to the present invention;
FIG. 2 is a schematic diagram of a depth multi-channel self-encoder according to the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
A gene expression profile feature learning method based on an autoencoder comprises the following steps:
step 1, importing the gene expression profile file from the Level 3 stage of the LINCS data set, and extracting the required data according to cell line, experiment type, perturbation condition and the like;
step 2, processing the extracted sample data into input data that can be used directly by the feature learning model;
step 3, extracting from the high-dimensional sparse gene expression profile data a feature representation of relatively low dimensionality that effectively distinguishes the original data;
This step combines the multi-channel structure of the convolutional neural network with the feature learning capability of the autoencoder in a new multi-channel autoencoder model, whose performance is greatly improved over the original autoencoder.
Step 4, verifying the quality of the features predicted by the model: taking the gene expression profile experimental group as the classification label, the prediction accuracy is computed with a KNN classification method; the higher the accuracy, the better the features learned by the multi-channel autoencoder perform in the classification task.
Preferably, in step 1, the required cell line, experiment type, perturbation condition and other information are screened as needed from the gene information, cell line information and gene expression profile sample information in the LINCS data set attachments; because each gene expression profile sample record is unique, the sample information serves as the screening criterion; finally, the samples holding the required data are located by their sample identifiers, and their expression profile values are the required data.
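As a concrete sketch of this screening step, the following Python snippet reads a Level 3 GCTX file with the cmapPy package and filters samples by cell line, experiment type and perturbation. The file names (patterned on the public GSE92742 LINCS release), the metadata columns cell_id, pert_type and pert_iname, and the chosen filter values are illustrative assumptions, not details fixed by the patent.

```python
import pandas as pd
from cmapPy.pandasGEXpress.parse import parse  # pip install cmapPy

# Sample annotation file distributed alongside the Level 3 data
# (file and column names follow the GSE92742 release; treat as placeholders).
inst_info = pd.read_csv("GSE92742_Broad_LINCS_inst_info.txt", sep="\t")

# Screen by cell line, experiment type and perturbation condition.
# Each sample identifier (inst_id) is unique, so it serves as the lookup key.
wanted = inst_info[
    (inst_info["cell_id"] == "MCF7")
    & (inst_info["pert_type"] == "trt_cp")
    & (inst_info["pert_iname"].isin(["vorinostat", "sirolimus"]))
]

# Read only the expression-profile columns of the selected samples.
gctoo = parse("GSE92742_Broad_LINCS_Level3.gctx", cid=list(wanted["inst_id"]))
samples = gctoo.data_df.T  # rows = samples, columns = 978 landmark genes
samples.to_csv("extracted_samples.csv")
```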
Preferably, the step 2 comprises the following steps:
step 2.1, reading the extracted gene expression profile sample data;
Step 2.2, manually adding label information according to sample class: the samples of the first class read are labeled 0, those of the second class are labeled 1, and so on for each subsequent class.
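A minimal sketch of this labeling scheme, assuming the extracted profiles of each sample class are stored in separate CSV files (the file names are hypothetical); classes receive the labels 0, 1, 2, ... in the order in which they are read.

```python
import numpy as np
import pandas as pd

# One file of extracted expression profiles per sample class
# (hypothetical names); reading order determines the label.
class_files = ["class_vorinostat.csv", "class_sirolimus.csv"]

frames, labels = [], []
for label, path in enumerate(class_files):  # first class -> 0, second -> 1, ...
    df = pd.read_csv(path, index_col=0)
    frames.append(df)
    labels.extend([label] * len(df))

X = pd.concat(frames).to_numpy(dtype="float32")  # shape (n_samples, 978)
y = np.asarray(labels)                           # manually added label column
```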
Preferably, the step 3 comprises the following steps:
step 3.1, reading the sample data and sample label data, and dividing them into a training set and a validation set;
step 3.2, initializing an Adam optimizer, setting the learning rate to 0.003, and selecting mean absolute error (MAE) as the loss function;
step 3.3, constructing a deep multi-channel autoencoder model, where the multi-channel autoencoder (MCAE) comprises an encoding process and a decoding process and adopts a directed three-layer neural network structure: an input layer, a hidden layer and an output layer, the input layer and the output layer having the same dimensionality n, and the hidden layer having dimensionality m;
step 3.4, feeding the training set into the multi-channel autoencoder for model training, saving the best weights according to the loss on the validation set, and loading the best weights after training to obtain the optimal model;
step 3.5, predicting the feature representation of the sample data using the optimal model.
Preferably, in step 3.3, the encoding process from the input layer to the hidden layer reduces the dimensionality of the input data, and the resulting code serves as the feature representation of the data. Denoting the encoding function by $f$:

$$h = \frac{1}{n}\sum_{i=1}^{n} h_i, \qquad h_i = f_i(x) = S_f(w_i x + p)$$

where $h$ is the final feature representation, $h_i$ is the feature representation of the $i$-th channel, $n$ is the number of channels, $S_f$ is the encoder activation function, taken as the ReLU function $S(x) = \max(0, x)$, $w_i$ is the weight matrix between the $i$-th channel of the input layer and the hidden layer, and $p \in \mathbb{R}^m$ is a bias term. The decoding process, from the hidden layer to the output layer, reconstructs the input data from the code produced by the hidden layer. Denoting the decoding function by $g$:

$$y = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad y_i = g(h_i) = S_g(w h_i + q)$$

where $y$ is the final reconstructed data, $y_i$ is the reconstructed data of the $i$-th channel, $n$ is the number of channels, $S_g$ is the decoder activation function, taken as the sigmoid function $S(x) = 1/(1 + e^{-x})$, $w$ is the weight matrix between the hidden layer and the output layer, and $q \in \mathbb{R}^n$ is a bias term. A schematic diagram of the network structure is shown in fig. 1.
A deep multi-channel autoencoder model containing several hidden layers is obtained by superposing multiple multi-channel autoencoders layer by layer, illustrated with a 2-dimensional feature representation (as shown in fig. 2): the first layer of the deep multi-channel autoencoder is 978 × 1, where 978 is the feature dimension and 1 is the number of channels; five sets of weights are initialized with different random seeds, and each applies a Dense layer to the 978-dimensional input, giving the second layer of 128 × 5, i.e. five 128 × 1 blocks. From the second layer, four sets of weights are initialized with different random seeds; the first set applies a Dense layer to each of the five 128-dimensional features to obtain five 2-dimensional features, which are then averaged position-wise to give the first 2 × 1 block of the third layer. Repeating this for the remaining sets gives the second, third and fourth 2 × 1 blocks, so that the third layer is 2 × 4, followed by the fourth layer of 128 × 5 and the fifth layer of 978 × 1.
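The following Keras sketch assembles the five-layer embodiment just described (978 × 1 → 128 × 5 → 2 × 4 → 128 × 5 → 978 × 1), together with the Adam optimizer, 0.003 learning rate and MAE loss of step 3.2 and the best-weight checkpointing of step 3.4. The patent lists only the layer sizes, so the seed values, the way the decoder channels recombine, and the stand-in training data below are assumptions of this sketch.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Model, initializers, layers

IN_DIM, HID, CODE = 978, 128, 2   # layer sizes from the embodiment
N_ENC, N_CODE = 5, 4              # encoder channels and code channels

inp = layers.Input(shape=(IN_DIM,))                       # layer 1: 978 x 1

# Layer 2 (128 x 5): five Dense branches whose weights are initialized
# with different random seeds (seed values here are arbitrary).
branches = [
    layers.Dense(HID, activation="relu",
                 kernel_initializer=initializers.GlorotUniform(seed=s))(inp)
    for s in range(N_ENC)
]

# Layer 3 (2 x 4): each of four weight sets maps every 128-dim branch to
# 2 dims with one shared Dense layer, then averages the five outputs
# position-wise, as described above.
codes = []
for s in range(N_CODE):
    to_code = layers.Dense(CODE, activation="relu",
                           kernel_initializer=initializers.GlorotUniform(seed=100 + s))
    codes.append(layers.Average()([to_code(b) for b in branches]))
code = layers.Concatenate()(codes)                        # 2 x 4 = 8 values

# Layers 4 and 5 (128 x 5, then 978 x 1): the decoder mirrors the encoder;
# this mirroring is an assumption, since the text lists only the sizes.
dec = [
    layers.Dense(HID, activation="relu",
                 kernel_initializer=initializers.GlorotUniform(seed=200 + s))(code)
    for s in range(N_ENC)
]
out = layers.Average()([layers.Dense(IN_DIM, activation="sigmoid")(b) for b in dec])

autoencoder = Model(inp, out)
encoder = Model(inp, code)        # produces the learned feature representation
autoencoder.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.003),
                    loss="mae")   # step 3.2: Adam, learning rate 0.003, MAE

# Step 3.4 sketch with dummy stand-in data; real profiles come from step 2.
X = np.random.rand(256, IN_DIM).astype("float32")
X_tr, X_val = X[:200], X[200:]
ckpt = tf.keras.callbacks.ModelCheckpoint(
    "best.weights.h5", monitor="val_loss",
    save_best_only=True, save_weights_only=True)
autoencoder.fit(X_tr, X_tr, validation_data=(X_val, X_val),
                epochs=10, batch_size=64, callbacks=[ckpt])
autoencoder.load_weights("best.weights.h5")
features = encoder.predict(X)     # step 3.5: feature representation, (256, 8)
```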
Preferably, in step 4, the feature data and sample label data are first read and divided into a training set and a test set; a KNN classifier is then instantiated, with the n_neighbors parameter taken from 1 to 10 in turn; the KNN model is fitted on the training set; and finally the trained KNN model predicts the sample labels of the test set, the proportion of correctly predicted labels being the prediction accuracy.
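A scikit-learn sketch of this evaluation; the encoder model and the labeled arrays X and y are carried over from the sketches above and are therefore assumptions rather than interfaces fixed by the patent.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Feature data from the trained encoder plus the labels added in step 2.2
# (encoder, X and y carried over from the earlier sketches).
features = encoder.predict(X)
X_tr, X_te, y_tr, y_te = train_test_split(
    features, y, test_size=0.2, random_state=0, stratify=y)

for k in range(1, 11):              # n_neighbors taken from 1 to 10 in turn
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    acc = knn.score(X_te, y_te)     # fraction of correctly predicted labels
    print(f"n_neighbors={k}: accuracy={acc:.3f}")
```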
It will be appreciated by those of ordinary skill in the art that the embodiments described here are intended to help the reader understand the principles of the invention; the invention is not limited to the specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from its spirit, and such changes and combinations remain within the scope of the invention.

Claims (1)

1. A gene expression profile feature learning method based on an autoencoder, characterized by comprising the following steps:
step 1, importing the gene expression profile file from the Level 3 stage of the LINCS data set, and extracting the required data according to cell line, experiment type, perturbation condition and the like;
screening the required cell line, experiment type, perturbation condition and other information as needed from the gene information, cell line information and gene expression profile sample information in the LINCS data set attachments; because each gene expression profile sample record is unique, taking the sample information as the screening criterion; and finally locating the samples holding the required data by their sample identifiers, the expression profile values of these samples being the required data;
step 2, processing the extracted sample data into input data that can be used directly by the feature learning model, which specifically comprises the following steps:
step 2.1, reading the extracted gene expression profile sample data;
step 2.2, manually adding label information according to sample class: the samples of the first class read are labeled 0, those of the second class are labeled 1, and so on for each subsequent class;
step 3, extracting from the high-dimensional sparse gene expression profile data a feature representation of relatively low dimensionality that effectively distinguishes the original data, which specifically comprises the following steps:
step 3.1, reading the sample data and sample label data, and dividing them into a training set and a validation set;
step 3.2, initializing an Adam optimizer, setting the learning rate to 0.003, and selecting mean absolute error (MAE) as the loss function;
step 3.3, constructing a deep multi-channel autoencoder model, wherein the multi-channel autoencoder comprises an encoding process and a decoding process and adopts a directed three-layer neural network structure: an input layer, a hidden layer and an output layer, the input layer and the output layer having the same dimensionality n, and the hidden layer having dimensionality m;
the encoding process from the input layer to the hidden layer reduces the dimensionality of the input data, the resulting code serving as the feature representation of the data; denoting the encoding function by $f$:

$$h = \frac{1}{n}\sum_{i=1}^{n} h_i, \qquad h_i = f_i(x) = S_f(w_i x + p)$$

where $h$ is the final feature representation, $h_i$ is the feature representation of the $i$-th channel, $n$ is the number of channels, $S_f$ is the encoder activation function, taken as the ReLU function $S(x) = \max(0, x)$, $w_i$ is the weight matrix between the $i$-th channel of the input layer and the hidden layer, and $p \in \mathbb{R}^m$ is a bias term; the decoding process from the hidden layer to the output layer reconstructs the input data from the code produced by the hidden layer; denoting the decoding function by $g$:

$$y = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad y_i = g(h_i) = S_g(w h_i + q)$$

where $y$ is the final reconstructed data, $y_i$ is the reconstructed data of the $i$-th channel, $n$ is the number of channels, $S_g$ is the decoder activation function, taken as the sigmoid function $S(x) = 1/(1 + e^{-x})$, $w$ is the weight matrix between the hidden layer and the output layer, and $q \in \mathbb{R}^n$ is a bias term;
superposing a plurality of multi-channel autoencoders layer by layer to obtain a deep multi-channel autoencoder model containing a plurality of hidden layers, illustrated with a 2-dimensional feature representation: the first layer of the deep multi-channel autoencoder is 978 × 1, where 978 is the feature dimension and 1 is the number of channels; five sets of weights are initialized with different random seeds, and each applies a Dense layer to the 978-dimensional input, giving the second layer of 128 × 5, i.e. five 128 × 1 blocks; from the second layer, four sets of weights are initialized with different random seeds, the first set applying a Dense layer to each of the five 128-dimensional features to obtain five 2-dimensional features, which are then averaged position-wise to give the first 2 × 1 block of the third layer; repeating this for the remaining sets gives the second, third and fourth 2 × 1 blocks, so that the third layer is 2 × 4, followed by the fourth layer of 128 × 5 and the fifth layer of 978 × 1;
step 3.4, feeding the training set into the multi-channel autoencoder for model training, saving the best weights according to the loss on the validation set, and loading the best weights after training to obtain the optimal model;
step 3.5, predicting the feature representation of the sample data using the optimal model;
step 4, verifying the quality of the features predicted by the model: taking the gene expression profile experimental group as the classification label, the prediction accuracy is computed with a KNN classification method, where higher accuracy indicates that the features learned by the multi-channel autoencoder perform better in the classification task;
firstly, reading the feature data and sample label data and dividing them into a training set and a test set; then instantiating a KNN classifier, taking the n_neighbors parameter from 1 to 10 in turn; then fitting the KNN model on the training set; and finally using the trained KNN model to predict the sample labels of the test set, the proportion of correctly predicted labels being the prediction accuracy.
CN202010029068.8A · Priority 2020-01-12 · Filed 2020-01-12 · Gene expression profile feature learning method based on self-encoder · Active · CN111276187B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010029068.8A (CN111276187B) | 2020-01-12 | 2020-01-12 | Gene expression profile feature learning method based on self-encoder

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202010029068.8A (CN111276187B) | 2020-01-12 | 2020-01-12 | Gene expression profile feature learning method based on self-encoder

Publications (2)

Publication Number | Publication Date
CN111276187A (en) | 2020-06-12
CN111276187B (en) | 2021-09-10

Family

ID=71001828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010029068.8A Active CN111276187B (en) 2020-01-12 2020-01-12 Gene expression profile feature learning method based on self-encoder

Country Status (1)

Country Link
CN (1) CN111276187B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111785326B (en) * 2020-06-28 2024-02-06 西安电子科技大学 Gene expression profile prediction method after drug action based on generation of antagonism network
CN111882066B (en) * 2020-07-23 2023-11-14 浙江大学 Inverse fact reasoning equipment based on deep characterization learning
CN114496303A (en) * 2022-01-06 2022-05-13 湖南大学 Anticancer drug screening method based on multichannel neural network
CN117095744A (en) * 2023-08-21 2023-11-21 上海信诺佰世医学检验有限公司 Copy number variation detection method based on single-sample high-throughput transcriptome sequencing data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104819846A (en) * 2015-04-10 2015-08-05 北京航空航天大学 Rolling bearing sound signal fault diagnosis method based on short-time Fourier transform and sparse laminated automatic encoder
CN110561192A (en) * 2019-09-11 2019-12-13 大连理工大学 Deep hole boring cutter state monitoring method based on stacking self-encoder

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104819846A (en) * 2015-04-10 2015-08-05 北京航空航天大学 Rolling bearing sound signal fault diagnosis method based on short-time Fourier transform and sparse laminated automatic encoder
CN110561192A (en) * 2019-09-11 2019-12-13 大连理工大学 Deep hole boring cutter state monitoring method based on stacking self-encoder

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Learning Classifiers from Synthetic Data Using a Multichannel Autoencoder; Xi Zhang et al.; arXiv:1503.03163v1; 2015-03-11; pp. 1-11 *
Learning influential genes on cancer gene expression data with stacked denoising autoencoders; Teixeira et al.; 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); 2017-12-18; pp. 1-5 *

Also Published As

Publication Number | Publication Date
CN111276187A (en) | 2020-06-12

Similar Documents

Publication Publication Date Title
CN111276187B (en) Gene expression profile feature learning method based on self-encoder
CN111667884B (en) Convolutional neural network model for predicting protein interactions using protein primary sequences based on attention mechanism
CN107622182B (en) Method and system for predicting local structural features of protein
CN109086805B (en) Clustering method based on deep neural network and pairwise constraints
CN110751044B (en) Urban noise identification method based on deep network migration characteristics and augmented self-coding
CN107742061B (en) Protein interaction prediction method, system and device
CN111538761A (en) Click rate prediction method based on attention mechanism
CN114927162A (en) Multi-set correlation phenotype prediction method based on hypergraph representation and Dirichlet distribution
CN112699960A (en) Semi-supervised classification method and equipment based on deep learning and storage medium
CN113889192B (en) Single-cell RNA-seq data clustering method based on deep noise reduction self-encoder
CN116386899A (en) Graph learning-based medicine disease association relation prediction method and related equipment
Wang et al. Human mitochondrial genome compression using machine learning techniques
CN116386729A (en) scRNA-seq data dimension reduction method based on graph neural network
CN114880538A (en) Attribute graph community detection method based on self-supervision
CN114783526A (en) Depth unsupervised single cell clustering method based on Gaussian mixture graph variation self-encoder
Tabus et al. Classification and feature gene selection using the normalized maximum likelihood model for discrete regression
CN114093419A (en) RBP binding site prediction method based on multitask deep learning
CN113362900A (en) Mixed model for predicting N4-acetylcytidine
CN112735604B (en) Novel coronavirus classification method based on deep learning algorithm
Tabus et al. Normalized maximum likelihood models for Boolean regression with application to prediction and classification in genomics
CN115019876A (en) Gene expression prediction method and device
CN115579068A (en) Pre-training and deep clustering-based metagenome species reconstruction method
CN115348182A (en) Long-term spectrum prediction method based on depth stack self-encoder
CN111599412B (en) DNA replication initiation region identification method based on word vector and convolutional neural network
CN114334013A (en) Single cell clustering method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant