CN111243658B

CN111243658B - Biomolecular network construction and optimization method based on deep learning

Info

Publication number: CN111243658B
Application number: CN202010013935.9A
Authority: CN
Inventors: 余国先; 严杨扬; 王峻
Original assignee: Southwest University
Current assignee: Southwest University
Priority date: 2020-01-07
Filing date: 2020-01-07
Publication date: 2022-07-22
Anticipated expiration: 2040-01-07
Also published as: CN111243658A

Abstract

The invention relates to a deep learning-based method for constructing and optimizing a multi-level biomolecular network, which belongs to the field of artificial intelligence and comprises the following steps: step one: building a software and hardware environment; step two: collecting and preprocessing multi-level biomolecular network data, and preliminarily establishing a multi-level biomolecular network; step three: collecting biomolecular feature data of each biological level, performing the corresponding feature coding, and dividing the data into a training set and a test set; step four: constructing a network optimization model according to the network optimization target and the existing features; step five: training with the multi-level biomolecular network data and the processed biomolecular feature data of each level, solving the parameters of each layer in the model, and stopping training and storing the parameters of each layer when the expected effect is achieved; step six: deploying the trained neural network model for multi-level biomolecular network optimization. The invention solves the problems that existing biomolecular networks have poor extensibility and cannot describe a complex biological system in depth.

Description

Biomolecular network construction and optimization method based on deep learning

Technical Field

The invention belongs to the technical field of deep learning and system biology, and relates to a biomolecular network construction and optimization method based on deep learning.

Background

Biomolecular networks are a core research area of systems biology, the basis for efficient integrated analysis of biological big data, and one of the innovative application fields of artificial intelligence in biological data mining. A biological system is formed by complex interactions among various biomolecules: genes are transcribed into RNA, which is then translated and modified to form various protein isoforms; these isoforms have different structures and therefore perform different biological functions. The dynamic interaction of these biomolecules forms the complex biological system and its complex functional mechanisms.

A variety of high-throughput techniques exist to identify molecular interactions and to represent them with different network models. A multi-level biomolecular network constructed from multi-omics data can describe well the functional relationships between molecules and the spatio-temporal state of a biological system. A large number of studies show that molecules with the same function tend to form local clusters and modules in a complex network, and that this local modular structure greatly benefits research on molecular function, on the organization of biomolecules within cells, and on precision medicine. However, many known biomolecular networks can provide only partial information about complex physiological phenomena. For example, typical complex diseases (breast cancer, lung cancer, etc.) are usually not caused by a single gene variation or by the loss of a single pairwise gene interaction; they are caused by abnormalities in multiple genes and in intracellular and intercellular molecular interactions.

Biomolecular networks have been constructed and modeled for more than 20 years. However, existing network biology research still generally focuses in isolation on molecular networks at the genome, transcriptome, metabolome or proteome level, which makes it difficult to analyze the pathology of complex diseases comprehensively from the perspective of multi-level molecular networks. The construction, optimization and visual analysis of multi-level biomolecular networks not only helps reveal the biomolecular mechanisms of complex diseases and various life phenomena, but also supports fields such as precision treatment of complex diseases and drug research and development.

Keras is an open-source artificial neural network library written in Python that can serve as a high-level application programming interface for TensorFlow, Microsoft CNTK and Theano, and is used to design, debug, evaluate, apply and visualize deep learning models. Its code is object-oriented, fully modular and extensible, and it supports mainstream algorithms in modern artificial intelligence. In terms of hardware and development environment, Keras supports multi-GPU parallel computing on multiple operating systems, which helps optimize a multi-level biomolecular network more efficiently.

Disclosure of Invention

In view of the above, the present invention aims to provide a more comprehensive and reasonably designed method for constructing and optimizing a multi-level biomolecular network, which adopts a deep learning algorithm to construct a model for recognizing the interaction information of the existing biological nodes and comprehensively considering the characteristics of biological node sequences, expression quantities, etc., and then uses the model to optimize the multi-level biomolecular network, thereby having the advantage of comprehensively considering the biological multi-level interaction relationship.

In order to achieve the purpose, the invention provides the following technical scheme:

a biomolecular network construction and optimization method based on deep learning comprises the following steps:

step one: constructing a software and hardware environment suitable for Keras deep learning operation;

step two: collecting and preprocessing biomolecular network data of multiple layers, aligning the multiple layers of networks, and initially establishing a multiple layers of biomolecular networks;

step three: collecting biological molecule characteristic data of each biological level, carrying out corresponding characteristic coding, and dividing a data set into a training set and a testing set;

step four: according to the network optimization target and the existing characteristics, a network optimization model is constructed by adopting a deep learning algorithm;

step five: on the set deep learning Keras operating environment, training according to the model set up in the fourth step by using the prepared aligned multi-level biomolecule network data and the processed characteristic data of each level of biomolecules, solving the parameters of each layer in the model, stopping training when the expected effect is achieved, and storing the parameters of each layer;

step six: and deploying the trained neural network model for multi-level biomolecular network optimization.

Further, the software and hardware environment suitable for Keras deep learning operation built in step one comprises: hardware consisting of a server with 32 GB of memory and two NVIDIA Tesla K40C discrete graphics cards with 12 GB of video memory each, or a higher configuration; and software consisting of the 64-bit Ubuntu 16.04 operating system and the third-party libraries on which Keras depends.

Further, in the second step, multi-level biomolecular correlation interaction data is collected from a plurality of public databases, and the specific data is shown in table 1:

TABLE 1 biomolecular interaction correlation data

After the collection and sorting of the biomolecule interaction related data are completed, because the data dimensions of the data sources are not uniform, the data dimensions are uniformly aligned firstly. The method comprises the following steps: taking the gene as a reference, taking intersection of genes in each data source data, aligning respective data dimensions of the remaining biomolecules to preliminarily establish a multi-level biomolecule network, and then carrying out one-hot coding on node associated data of each level.

Further, in the third step, biological sequence data of each level are also collected and feature-coded, and expression level (FPKM) data of each biomolecule are collected and concatenated with the biological sequence features as supplementary features.

Further, in the fourth step, a multi-level biological network optimization algorithm based on the Keras deep learning framework is constructed; the model structure is determined according to the network optimization task, and the add() function is called repeatedly to insert convolutional layers, max pooling layers, fully connected layers and activation functions into the model container created by the Sequential() function, so as to construct a deep convolutional neural network model.

Further, in the fifth step, the parameters of each layer of the deep network are iteratively updated by a gradient descent algorithm through a training data set, the performance of the model obtained through training is evaluated by using a test data set, and after the expected performance is achieved, the training is stopped and the parameters of each layer are stored.

And further, optimizing the established multi-level biomolecular network by using the established network optimization model in the sixth step and the model parameters obtained by training in the fifth step, and supplementing the associated edges with high predicted probability values.

The invention has the beneficial effects that: the method for constructing and optimizing the multi-level biomolecular network based on deep learning disclosed by the invention constructs a multi-level biomolecular network optimization model by adopting a deep learning algorithm, breaks through the defect that the existing biomolecular network construction generally only focuses on the integration of 1-2 physiological-level molecular data, comprehensively considers the interaction relationship among multi-level biomolecular nodes, introduces characteristics such as biomolecular sequence information, expression quantity information and the like, and has the advantage of higher accuracy of network optimization.

Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof.

Drawings

For a better understanding of the objects, aspects and advantages of the present invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:

FIG. 1 is a flow chart of an embodiment of the present invention;

FIG. 2 is a technical route diagram for collecting and processing multi-level biomolecular network data according to the present invention;

FIG. 3 is a flow chart of training a multi-level network optimization model according to the present invention;

FIG. 4 is a network structure of a multi-level network optimization model in the present invention.

Detailed Description

The following embodiments of the present invention are provided by way of specific examples, and other advantages and effects of the present invention will be readily apparent to those skilled in the art from the disclosure herein. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.

Wherein the showings are for the purpose of illustration only and not for the purpose of limiting the invention, shown in the drawings are schematic representations and not in the form of actual drawings; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.

The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.

As shown in fig. 1 to 4, the method for constructing and optimizing a deep learning-based multi-level biomolecular network according to the present embodiment adopts the following technical solutions:

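step one: constructing a software and hardware environment suitable for Keras deep learning operation;

step two: collecting and preprocessing multi-level biomolecular network data, aligning multi-level networks, and preliminarily establishing the multi-level biomolecular network;

step three: collecting biological molecular characteristic data of each biological level, carrying out corresponding characteristic coding, and dividing a data set into a training set and a test set;

step four: according to a network optimization target and the existing characteristics, a network optimization model is constructed by adopting a deep learning algorithm;

step five: in the configured Keras deep learning environment, training the model set up in step four with the prepared aligned multi-level biomolecular network data and the processed characteristic data of each level of biomolecules, solving the parameters of each layer in the model, stopping training when the expected effect is achieved, and storing the parameters of each layer;

step six: deploying the trained neural network model for multi-level biomolecular network optimization.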

Optionally, the software and hardware environment suitable for Keras deep learning operation built in step one is as follows: the hardware is a server with 32 GB of memory and two NVIDIA Tesla K40C discrete graphics cards with 12 GB of video memory each, or a higher configuration; the software is the 64-bit Ubuntu 16.04 operating system and the third-party libraries on which Keras depends.

Optionally, multi-level biomolecular correlation interaction data is collected from a plurality of public databases in step two. Specific data are shown in table 1:

TABLE 1 biomolecular interaction correlation data

Serial number	Type of biomolecule	Post-alignment sample dimension	Data source
1	LncRNA-Disease	240×412	Lncrnadisease
2	LncRNA-miRNA	240×495	Starbase
3	LncRNA-Gene	240×12663	Lncrna2target
4	LncRNA-GO	240×6428	GeneRIF
5	miRNA-Disease	495×412	Hmdd
6	miRNA-Gene	495×12663	Mirtarbase
7	Gene-Gene	12663×12663	Biogrid
8	Gene-GO	12663×6428	Geneontology
9	Gene-Disease	12663×412	Disgenet
10	Isoform-Gene	31408×12663	ENCODE

After the biomolecular interaction data have been collected and sorted, the data dimensions are first aligned, because they are not uniform across data sources. For example, for Gene-Disease association data collected from the Disgenet database and Isoform-Gene association data collected from the ENCODE database, the genes are taken as the reference, the intersection of the genes in Gene-Disease and Isoform-Gene is taken, and the data dimensions of the remaining biomolecules are aligned accordingly, so as to preliminarily establish a multi-level biomolecular network. The node association data of each level are then one-hot coded.

Optionally, in the third step, biological sequence data of each level are also collected and feature-coded, and expression level (FPKM) data of each biomolecule are collected and concatenated with the biological sequence features as supplementary features.

Optionally, a multi-level biological network optimization algorithm based on the Keras deep learning framework is constructed in the fourth step. The model structure is determined according to the network optimization task, and the add() function is called repeatedly to insert convolutional layers, max pooling layers, fully connected layers and activation functions into the model container created by the Sequential() function, so as to construct a deep convolutional neural network model.
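A minimal sketch of how the Sequential()/add() construction described in step four might look is given below. The patent does not give layer sizes, kernel sizes, the optimizer or the loss, so the one-dimensional convolution, the 32 filters, the 128 hidden units, the "sgd" optimizer and the binary cross-entropy loss are all illustrative assumptions, as is the function name build_model.

```python
def build_model(input_len, n_features, n_labels):
    """Build a small convolutional network with Sequential() and add().

    All hyperparameters below are assumed values for illustration only.
    """
    # Imported inside the function so that this file can be loaded
    # even on a machine where TensorFlow is not installed.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential()
    model.add(keras.Input(shape=(input_len, n_features)))
    model.add(layers.Conv1D(filters=32, kernel_size=3))   # convolutional layer
    model.add(layers.Activation("relu"))                  # nonlinear layer
    model.add(layers.MaxPooling1D(pool_size=2))           # max pooling layer
    model.add(layers.Flatten())
    model.add(layers.Dense(128))                          # fully connected layer
    model.add(layers.Activation("relu"))
    model.add(layers.Dense(n_labels))                     # one output per label
    model.add(layers.Activation("sigmoid"))               # probabilities in (0, 1)
    # Gradient descent on a per-label binary cross-entropy loss (assumed).
    model.compile(optimizer="sgd", loss="binary_crossentropy")
    return model
```

With the aligned Isoform-Disease data of Table 1, n_labels would plausibly be 412 (the number of diseases), although the patent does not state the output dimension explicitly.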

Optionally, in the fifth step, a training data set is used to iteratively update parameters of each layer of the deep network by using a gradient descent algorithm, a test data set is used to perform performance evaluation on the model obtained by training, and after expected performance is achieved, training is stopped and parameters of each layer are stored.

Optionally, the established network optimization model and the model parameters obtained by training in the fifth step are used for optimizing the established multi-level biomolecular network in the sixth step, and the associated edges with high predicted probability values are supplemented.

In the specific implementation, a multi-level biomolecular network optimization model in the form of a convolutional neural network is constructed on the Keras deep learning framework. In one embodiment, data preparation for the multi-level biomolecular network consists of collecting multi-level biomolecular association data from the public databases listed in Table 1. Because the dimensions of the association data in each database are not uniform and each database uses its own biological naming scheme, the data are first renamed uniformly according to the database naming mapping files. On this basis, the data dimensions are aligned. Taking Disease-Gene-Isoform as an example, for the Gene-Disease association data collected from the Disgenet database and the Isoform-Gene association data collected from the ENCODE database, the genes are taken as the reference, the intersection of the genes in Gene-Disease and Isoform-Gene is taken, and the data dimensions of the remaining biomolecules are then aligned accordingly to obtain an aligned Disease-Gene-Isoform network. The data dimensions of the remaining biomolecular networks are aligned in the same way, so as to preliminarily establish a multi-level biomolecular network. The node association data of each level are then one-hot coded.
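The gene-based alignment described above can be sketched as follows. This is only an illustration under stated assumptions: the function name align_on_shared_genes, the dictionary-of-sets input format and the toy identifiers are hypothetical, and the binary 0/1 matrices are one reading of the "one-hot coding" of association data mentioned in the text.

```python
def align_on_shared_genes(gene_disease, isoform_gene):
    """Align two association tables on the genes they have in common.

    gene_disease: dict mapping gene -> set of associated diseases
    isoform_gene: dict mapping isoform -> set of associated genes
    Returns the shared genes, the remaining diseases and isoforms,
    and two binary association matrices indexed by those lists.
    """
    genes_in_isoform_table = set().union(*isoform_gene.values())
    shared = sorted(set(gene_disease) & genes_in_isoform_table)
    shared_set = set(shared)

    # Keep only diseases and isoforms linked to at least one shared gene.
    diseases = sorted({d for g in shared for d in gene_disease[g]})
    isoforms = sorted(i for i, gs in isoform_gene.items() if gs & shared_set)

    # Binary (0/1) association matrices over the aligned dimensions.
    gene_disease_matrix = [
        [1 if d in gene_disease[g] else 0 for d in diseases] for g in shared
    ]
    isoform_gene_matrix = [
        [1 if g in isoform_gene[i] else 0 for g in shared] for i in isoforms
    ]
    return shared, diseases, isoforms, gene_disease_matrix, isoform_gene_matrix
```

Sorting the identifiers gives every matrix a deterministic row and column order, so that matrices built from different data sources share the same gene axis.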

Collecting only biomolecular association data is not sufficient to optimize a multi-level biomolecular network effectively, so sequence data of each level are also collected and feature-coded, and expression level (FPKM) data of the biomolecules of each level are collected and concatenated with the biological sequence features as supplementary features. Here, (X, Y) is defined as the sample data. Taking the Isoform-Disease association network as an example, X is the concatenation of the association relationships (one-hot coded) between Isoform molecules and other biomolecules with the sequence information and expression level features of the Isoform molecules, and Y is the association relationship (one-hot coded) in the Isoform-Disease association network.
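As a rough illustration of how one input vector X might be assembled, the sketch below concatenates an association row, a one-hot coded sequence and an FPKM value. The patent does not specify the sequence encoding, so the per-nucleotide one-hot scheme, the fixed sequence length with zero padding, and the names one_hot_sequence and build_sample are assumptions made for this example.

```python
NUCLEOTIDES = "ACGU"  # assumed alphabet for RNA sequences


def one_hot_sequence(seq, length):
    """One-hot encode a sequence, truncated or zero-padded to `length`."""
    encoded = []
    for pos in range(length):
        base = seq[pos] if pos < len(seq) else None
        # Four values per position; all zeros for padding or unknown bases.
        encoded.extend(1.0 if base == n else 0.0 for n in NUCLEOTIDES)
    return encoded


def build_sample(assoc_row, seq, fpkm, seq_length=4):
    """Concatenate association, sequence and expression features into X."""
    return (
        [float(v) for v in assoc_row]
        + one_hot_sequence(seq, seq_length)
        + [float(fpkm)]
    )
```

The label Y for the same Isoform would then be its row in the Isoform-Disease association matrix.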

The convolutional neural network model mainly comprises a convolutional layer, a pooling layer, a full-connection layer and a sigmoid layer for classification.

A convolutional layer: each convolutional layer is composed of a plurality of convolution kernels and has the function of extracting features; the most important parameters of the layer are the size of the convolution kernels and the number of convolution kernels. The convolution kernel is denoted $C_{m \times n}$, its size is m × n, and the stride of the convolution kernel is denoted s.

The convolution operation can be described as:

$$x_j^{l} = f_{\text{nonlinear}}\left(\sum_{i} x_i^{l-1} * k_j^{l} + b_j^{l}\right)$$

Here, $x_i^{l-1}$ is the output of the i-th convolution kernel of the layer l-1 network, $x^{l-1}$ is the input to the layer l network, $x_j^{l}$ is the output of the j-th convolution kernel of the current layer, $k_j^{l}$ is the parameter of the j-th convolution kernel, $b_j^{l}$ is the bias parameter corresponding to that convolution kernel, and $f_{\text{nonlinear}}$ is the nonlinear operation performed on the convolved data. Common nonlinear activation functions are ReLU, sigmoid and tanh.
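The convolution step described above can be sketched in plain Python: each output value is the sum, over all input feature maps, of element-wise products with an m × n kernel slice, plus a bias, passed through a nonlinearity. The function name conv2d_single, the nested-list data layout and the absence of padding are assumptions for illustration; as in most deep learning libraries, the kernel is applied without flipping.

```python
def relu(x):
    """Nonlinear operation applied to each convolved value."""
    return max(0.0, x)


def conv2d_single(inputs, kernels, bias, stride=1):
    """Compute one output feature map of a convolutional layer.

    inputs:  list of 2D input feature maps (outputs of the previous layer)
    kernels: list of m x n kernel slices, one per input feature map
    bias:    scalar bias of this output map
    stride:  step s by which the kernel is moved
    """
    m, n = len(kernels[0]), len(kernels[0][0])
    rows, cols = len(inputs[0]), len(inputs[0][0])
    output = []
    for r in range(0, rows - m + 1, stride):
        out_row = []
        for c in range(0, cols - n + 1, stride):
            total = bias
            # Sum the contributions of every input feature map.
            for x, k in zip(inputs, kernels):
                for a in range(m):
                    for b in range(n):
                        total += x[r + a][c + b] * k[a][b]
            out_row.append(relu(total))
        output.append(out_row)
    return output
```

A full layer would call this once per convolution kernel, giving one output feature map per kernel.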

A pooling layer: the pooling layer uses a pooling kernel to down-sample the output of the preceding convolutional layer, that is, it reduces the data dimension of the convolutional layer output and thereby reduces the scale of the model parameters. The main parameters of the pooling layer are the size of the pooling kernel, the stride of the pooling kernel and the pooling mode. The pooling mode adopted by the invention is max pooling, in which the maximum value within the range of the pooling kernel is taken as the output; this mode can greatly reduce the deviation of the estimated mean value caused by parameter errors of the convolutional layer. The max pooling operation can be described as:

$$y = \max_{x_i \in R} x_i$$

Here, $R$ is the region of the feature map covered by the pooling kernel, and $x_i$ denotes all feature values within that region.
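A small sketch of max pooling on a single feature map follows. The function name max_pool2d, the square pooling kernel and the default size and stride of 2 are assumptions chosen for illustration; the patent does not state the values it uses.

```python
def max_pool2d(feature_map, pool=2, stride=2):
    """Down-sample a 2D feature map by taking the maximum in each window."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            # Maximum over the region covered by the pooling kernel.
            max(
                feature_map[r + a][c + b]
                for a in range(pool)
                for b in range(pool)
            )
            for c in range(0, cols - pool + 1, stride)
        ]
        for r in range(0, rows - pool + 1, stride)
    ]
```

With a 2 × 2 kernel and stride 2, each dimension of the feature map is halved, which is where the reduction in data dimension comes from.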

Non-linear layer: the data is non-linearly manipulated to increase the complexity of the network. Common nonlinear operations are ReLU, sigmoid, and tanh. The nonlinear operation employed by the present invention is ReLU, which can be described as:

f(x) = max(0, x)

Fully connected layer: in a fully connected layer, every neuron of the previous layer is connected to every neuron of the next layer. The number of outputs of the last fully connected layer equals the number of categories in the data, that is, each output of the last fully connected layer corresponds to one category label. This fully connected layer is used to build a supervised classifier.

Sigmoid activation function: the sigmoid function serves as the activation function of the output fully connected layer and smoothly maps the real numbers to the interval (0, 1). The function value can be interpreted as the probability of belonging to the predicted class. In addition, the sigmoid function is monotonically increasing and continuously differentiable, and its derivative has a very simple form, so it is widely used in classification tasks.
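The properties mentioned above can be checked with a few lines of Python. The sigmoid is σ(x) = 1 / (1 + e^(-x)) and its derivative is σ(x)(1 - σ(x)); the two-branch form below is just a common way to avoid overflow for large negative inputs, and the function names are chosen for this example.

```python
import math


def sigmoid(x):
    """Map a real number to a value between 0 and 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # Equivalent form that avoids overflow in exp() for very negative x.
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_derivative(x):
    """Derivative of the sigmoid, expressed through the sigmoid itself."""
    s = sigmoid(x)
    return s * (1.0 - s)
```

Because the derivative reuses the forward value, back propagation through a sigmoid output layer needs no extra exponential evaluation.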

The convolutional neural network model is divided into a forward process and a backward process. The forward process is that a category label is output from input data through a plurality of convolution operations, pooling operations, nonlinear operations and full connection, and is compared with a real category label to obtain an error as a loss function. The backward process is a process of backward propagation of errors, and the gradient of the errors relative to all the parameters of the full-connection layer, the non-linear layer, the pooling layer and the convolution layer is calculated layer by layer reversely from the obtained errors.

The convolutional neural network model is trained with a gradient descent algorithm: the gradient of the error with respect to the parameters of each layer is calculated by error back propagation, and the parameters of each layer are updated along the direction in which the error decreases fastest until convergence is reached. The network training is shown in FIG. 3.
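The forward/backward cycle can be illustrated on the simplest possible case. The sketch below is not the patent's network: it trains a single sigmoid output unit with batch gradient descent on a binary cross-entropy loss, which is a deliberately reduced stand-in chosen so that the gradient can be written by hand. The learning rate, the epoch count and the function names are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Forward pass: weighted sum followed by the sigmoid."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def loss(w, b, samples, labels):
    """Mean binary cross-entropy between predictions and true labels."""
    total = 0.0
    for x, y in zip(samples, labels):
        p = predict(w, b, x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(samples)


def train(samples, labels, lr=0.5, epochs=200):
    """Batch gradient descent; returns the learned weights and bias."""
    w, b = [0.0] * len(samples[0]), 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(samples, labels):
            err = predict(w, b, x) - y          # dLoss/dz for sigmoid + BCE
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        # Step against the gradient, i.e. in the direction of steepest descent.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b
```

In a deep network the same update rule applies to every layer, with the per-layer gradients obtained by back propagation instead of the closed-form expression used here.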

A network structure description file is then written according to the determined network structure of the deep multi-level biomolecular network optimization model and the parameters of each network layer; the network structure is shown in FIG. 4. Finally, the test-set data are input into the trained optimization model, the model outputs the probability that each Isoform is associated with each Disease, and associations with a high predicted probability are selected and added to the original network.
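The final supplementing step might be sketched as follows. The patent only says that associations with a "high" predicted probability are added, so the threshold of 0.9, the function name supplement_edges and the nested-list matrix format are assumptions for illustration.

```python
def supplement_edges(adjacency, probabilities, threshold=0.9):
    """Add predicted edges to a binary association matrix.

    adjacency:     known associations (0/1), e.g. Isoform x Disease
    probabilities: model outputs of the same shape, values between 0 and 1
    threshold:     assumed cut-off for a "high" predicted probability
    Returns the updated matrix and the list of newly added (row, column) edges.
    """
    updated = [row[:] for row in adjacency]  # leave the original network intact
    added = []
    for i, prob_row in enumerate(probabilities):
        for j, p in enumerate(prob_row):
            if updated[i][j] == 0 and p >= threshold:
                updated[i][j] = 1
                added.append((i, j))
    return updated, added
```

Returning the list of added edges separately keeps the predicted associations distinguishable from the experimentally recorded ones.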

The invention relates to a method for constructing and optimizing a multi-level biomolecular network based on deep learning, which constructs a multi-level biomolecular network optimization model by adopting a deep learning algorithm, breaks through the defect that the existing biomolecular network construction generally only focuses on the integration of molecular data of 1-2 physiological layers, comprehensively considers the interaction relation among multi-level biomolecular nodes, introduces the characteristics of biomolecular sequence information, expression quantity information and the like, and has the advantage of higher accuracy of network optimization.

Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims

1. A biomolecular network construction and optimization method based on deep learning is characterized in that: the method comprises the following steps:

step two: collecting and preprocessing biomolecular network data of multiple layers, aligning the multiple layers of networks, and initially establishing a multiple layers of biomolecular networks; in the second step, multi-level biomolecule correlation interaction data are collected from a plurality of public databases, and the specific data comprise:

Type of biomolecule	Post-alignment sample dimension	Data source
LncRNA-Disease	240×412	Lncrnadisease
LncRNA-miRNA	240×495	Starbase
LncRNA-Gene	240×12663	Lncrna2target
LncRNA-GO	240×6428	GeneRIF
miRNA-Disease	495×412	Hmdd
miRNA-Gene	495×12663	Mirtarbase
Gene-Gene	12663×12663	Biogrid
Gene-GO	12663×6428	Geneontology
Gene-Disease	12663×412	Disgenet
Isoform-Gene	31408×12663	ENCODE

After the collection and arrangement of the biomolecule interaction related data are completed, because the data dimensions of the data sources are not uniform, the data dimensions are first uniformly aligned, and the method comprises the following steps: taking the gene as reference, taking the intersection of the genes in the data of each data source, aligning each data dimension to the remaining biomolecules, preliminarily establishing a multi-level biomolecule network, and performing one-hot coding on the node associated data of each level.

2. The biomolecular network construction and optimization method based on deep learning of claim 1, wherein: the software and hardware environment which is built in the step one and is suitable for Keras deep learning operation comprises the following: the hardware is a server with 32 GB of memory and two NVIDIA Tesla K40C discrete graphics cards with 12 GB of video memory each, or a higher configuration; the software is the 64-bit Ubuntu 16.04 operating system and the third-party libraries on which Keras depends.

3. The deep learning-based biomolecular network construction and optimization method according to claim 1, wherein: and continuously collecting biological sequence data of each level in the third step for feature coding, and simultaneously collecting data of each biological molecule expression quantity and biological sequence feature splicing as supplementary features.

4. The biomolecular network construction and optimization method based on deep learning of claim 1, wherein: in the fourth step, a multi-level biological network optimization algorithm based on the Keras deep learning framework is constructed; the model structure is determined according to the network optimization task, and the add() function is called repeatedly to insert convolutional layers, max pooling layers, fully connected layers and activation functions into the model container created by the Sequential() function, so as to construct a deep convolutional neural network model.

5. The deep learning-based biomolecular network construction and optimization method according to claim 1, wherein: and step five, iteratively updating parameters of each layer of the deep network by using a gradient descent algorithm through a training data set, evaluating the performance of the model obtained by training through a test data set, stopping training and storing the parameters of each layer after the expected performance is achieved.

6. The biomolecular network construction and optimization method based on deep learning of claim 1, wherein: and step six, optimizing the established multi-level biomolecular network by using the established network optimization model and the model parameters obtained by training in the step five, and supplementing the associated edges with high predicted probability values.