CN116013409A - method, system and storage medium for miRNA target gene prediction and model training thereof - Google Patents


Info

Publication number
CN116013409A
Authority
CN
China
Prior art keywords
sequence
mirna
target gene
base
vector
Prior art date
Legal status
Pending
Application number
CN202211615351.4A
Other languages
Chinese (zh)
Inventor
张紫阳
王诺
王勇斯
黎彦伶
范文涛
温韵洁
全智慧
裘宇容
Current Assignee
Guangzhou Huayin Medical Laboratory Center Co Ltd
Original Assignee
Guangzhou Huayin Medical Laboratory Center Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Huayin Medical Laboratory Center Co Ltd filed Critical Guangzhou Huayin Medical Laboratory Center Co Ltd

Classifications

    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation


Abstract

The application provides a method, a system and a storage medium for miRNA target gene prediction and model training thereof. The training method of the miRNA target gene prediction model comprises the following steps: projecting input data formed by splicing the sequence of a miRNA and the sequence of an mRNA into an A base space, a U base space, a G base space and a C base space, thereby obtaining an A base vector, a U base vector, a G base vector and a C base vector; extracting a feature tensor from the A base vector, the U base vector, the G base vector and the C base vector using the four sequentially connected stages of convolution, activation and pooling layers of the miRNA target gene prediction model; inputting the extracted feature tensor into a fully connected layer to obtain a prediction result; comparing the obtained prediction result with a reference result; and optimizing the miRNA target gene prediction model based on the result of the comparison.

Description

method, system and storage medium for miRNA target gene prediction and model training thereof
Technical Field
The present application relates to bioinformatics, and more particularly, to a method, system, and storage medium for miRNA target gene prediction and model training thereof.
Background
MicroRNAs (hereinafter "miRNAs") are a class of non-coding single-stranded RNA molecules encoded by endogenous genes, typically about 18-25 nt in length. miRNAs play important regulatory roles in biological development. miRNAs achieve negative regulation of target gene expression mainly through post-transcriptional regulation, the specific modes of action being mainly translational inhibition (common in animals) and degradation of the target transcript (common in plants). miRNAs play an extremely important role in tumorigenesis, biological development, organ formation, viral defense, epigenetic regulation, metabolism, and the like. Understanding which target genes a miRNA regulates is therefore of great significance for the prevention and treatment of tumors and other diseases.
However, miRNAs have a very complex regulatory network: one miRNA can often regulate multiple target genes, and the same target gene can be regulated by multiple miRNAs. Currently, software such as miRanda, RNAhybrid, PITA and TargetScan is widely used in the industry to predict the target genes of miRNAs. The idea of such software is to calculate the complementary pairing of the miRNA and a candidate target gene and to judge, according to the thermodynamic stability of the miRNA-target duplex, whether the candidate is a gene that interacts with the miRNA. Although such software considers the base complementarity of the specific sequences of the miRNA and the target gene, the cross-species conservation of the untranslated region of the target gene, and the thermodynamic stability of the miRNA-target duplex, and although such methods are computationally cheap enough to be applied to any species, the algorithms do not truly reflect the biological mechanism of the miRNA-target interaction. As a result, the accuracy of such software for miRNA target gene prediction is often less than 40%. This greatly increases the workload of downstream experimental verification, with correspondingly high time and economic costs. In addition, existing miRNA target gene prediction software (for example, miRanda, PITA and RNAhybrid) generally requires the user to input information such as a thermodynamic energy threshold, a score threshold, and the upstream and downstream target positions to be considered; this information biases the prediction result to some extent (especially since the mechanism of action of miRNAs on target genes is not yet truly understood) and also makes the software inconvenient to use. Therefore, there is a need in the market for a more efficient, accurate and convenient method for predicting miRNA target genes.
Disclosure of Invention
The application provides a training method of a miRNA target gene prediction model, which comprises the following steps: projecting input data formed by splicing the sequence of a miRNA and the sequence of an mRNA into an A base space, a U base space, a G base space and a C base space, thereby obtaining an A base vector, a U base vector, a G base vector and a C base vector; extracting a feature tensor from the A base vector, the U base vector, the G base vector and the C base vector using the four sequentially connected stages of convolution, activation and pooling layers of the miRNA target gene prediction model; inputting the extracted feature tensor into a fully connected layer to obtain a prediction result; comparing the obtained prediction result with a reference result; and optimizing the miRNA target gene prediction model based on the result of the comparison.
According to an embodiment of the present application, the sequences of the miRNAs and the mRNAs include mutually interacting miRNA and mRNA sequence pairs in a positive training set and randomly generated miRNA and mRNA sequences in a negative training set, wherein: the mutually interacting miRNA and mRNA sequence pairs in the positive training set are miRNA sequences, verified by low-throughput experiments, extracted from at least one of the ENCORI, miRDB, miRTarBase, miRNet and miRWalk databases, together with the corresponding mRNA sequences; the miRNA and mRNA sequences in the negative training set are randomly generated sequences, from which are excluded the mutually interacting miRNA and mRNA sequence pairs of the positive training set as well as the miRNA and mRNA sequence pairs predicted to interact by miRanda, RNAhybrid and PITA.
According to an embodiment of the present application, the number of pairs of mutually interacting miRNA and mRNA sequences in the positive training set is identical to the number of pairs of randomly generated miRNA and mRNA sequences in the negative training set.
According to an embodiment of the present application, the four sequentially connected stages of convolution, activation and pooling layers include: a first stage with 16 4-channel convolution kernels, a relu activation function, and a max pooling layer with window 2; a second stage with 32 16-channel convolution kernels, a relu activation function, and a max pooling layer with window 2; a third stage with 64 32-channel convolution kernels, a relu activation function, and a max pooling layer with window 2; and a fourth stage with 128 64-channel convolution kernels, a relu activation function, and a max pooling layer with window 2.
According to an embodiment of the present application, inputting the extracted feature tensor into the fully connected layer to obtain the prediction result includes: reducing the extracted feature tensor to a one-dimensional first vector; randomly dropping half of the elements of the first vector to form a second vector; inputting the second vector into a first fully connected layer with relu as the activation function to obtain a third vector; randomly dropping half of the elements of the third vector to form a fourth vector; and inputting the fourth vector into a second fully connected layer with sigmoid as the activation function to obtain the prediction result.
According to an embodiment of the present application, optimizing the miRNA target gene prediction model based on the result of the comparison comprises: correcting the parameters of each layer of the miRNA target gene prediction model using binary cross-entropy as the loss function and using the Adam optimization function.
The application also provides a training system of the miRNA target gene prediction model, which comprises: a memory storing executable instructions; and one or more processors in communication with the memory to execute the executable instructions to: project input data formed by splicing the sequence of a miRNA and the sequence of an mRNA into an A base space, a U base space, a G base space and a C base space, thereby obtaining an A base vector, a U base vector, a G base vector and a C base vector; extract a feature tensor from the A base vector, the U base vector, the G base vector and the C base vector using the four sequentially connected stages of convolution, activation and pooling layers of the miRNA target gene prediction model; input the extracted feature tensor into a fully connected layer to obtain a prediction result; compare the obtained prediction result with a reference result; and optimize the miRNA target gene prediction model based on the result of the comparison.
The present application also provides a computer-readable storage medium for training of miRNA target gene prediction models, characterized in that the computer-readable storage medium stores executable instructions executable by one or more processors to: project input data formed by splicing the sequence of a miRNA and the sequence of an mRNA into an A base space, a U base space, a G base space and a C base space, thereby obtaining an A base vector, a U base vector, a G base vector and a C base vector; extract a feature tensor from the A base vector, the U base vector, the G base vector and the C base vector using the four sequentially connected stages of convolution, activation and pooling layers of the miRNA target gene prediction model; input the extracted feature tensor into a fully connected layer to obtain a prediction result; compare the obtained prediction result with a reference result; and optimize the miRNA target gene prediction model based on the result of the comparison.
The application also provides a method for predicting miRNA target genes, which comprises the following steps: inputting a target miRNA sequence and a target mRNA sequence into a miRNA target gene database integrated from existing databases to query whether the target miRNA sequence interacts with the target mRNA sequence; and, in response to no interaction between the target miRNA sequence and the target mRNA sequence being found in the miRNA target gene database, inputting the target miRNA sequence and the target mRNA sequence into a miRNA target gene prediction model trained according to the training method provided herein to predict whether the target miRNA sequence interacts with the target mRNA sequence.
According to an embodiment of the present application, the miRNA target gene database integrated from existing databases includes: mutually interacting miRNA sequences and corresponding mRNA sequences, verified by low-throughput experiments, extracted from at least one of the ENCORI, miRDB, miRTarBase, miRNet and miRWalk databases; and miRNA sequences and corresponding mRNA sequences verified by high-throughput sequencing after screening with a preset threshold.
The training method of the miRNA target gene prediction model creatively converts the interaction judgment of the miRNA sequence and the mRNA sequence into the calculation task of image-like processing, so that the task of miRNA target gene prediction can be performed by using a Convolutional Neural Network (CNN), and a more efficient, accurate and convenient miRNA target gene prediction scheme is provided.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings, in which:
FIG. 1 is a flow chart of a method of training a predictive model of miRNA target genes according to an embodiment of the present application;
FIG. 2 is a schematic illustration of projecting sequence data into a base space according to an embodiment of the present application;
FIG. 3 is a flow chart of a method of predicting a miRNA target gene according to an embodiment of the present application; and
FIG. 4 is a schematic block diagram of a miRNA target gene prediction model training system and a miRNA target gene prediction system according to an embodiment of the present application;
FIG. 5 is a schematic comparison of miRNA target gene prediction results according to an embodiment of the present application.
Detailed Description
For a better understanding of the present application, a more detailed description of the technical solution of the present application will be made with reference to the accompanying drawings. It should be understood that the detailed description is merely illustrative of exemplary embodiments of the application and is not intended to limit the scope of the application in any way. Like reference numerals refer to like elements throughout the specification. The expression "and/or" includes any or all combinations of one or more of the associated listed items.
It should be noted that in this specification, expressions of "first", "second", "third", etc. are used only to distinguish one feature from another feature, and do not represent any limitation on the feature. Thus, a first vector discussed below may also be referred to as a second vector without departing from the teachings of the present application. And vice versa.
In the drawings, the size, proportion and shape of the figures have been slightly adjusted for convenience of explanation. The figures are merely examples and are not drawn to scale. As used herein, the terms "about", "approximately" and similar terms are used as terms of approximation, not as terms of degree, and are intended to account for the inherent deviations in measured or calculated values that would be recognized by one of ordinary skill in the art.
It will be further understood that terms such as "comprises," "comprising," "includes," "including," "having," "contains," and/or "containing" are open-ended, rather than closed-ended, terms that specify the presence of the stated features, elements, and/or components, but do not preclude the presence or addition of one or more other features, elements, components, and/or groups thereof. Furthermore, when a statement such as "at least one of" appears after a list of features, it modifies the entire list of features rather than just a single feature in the list. Furthermore, when describing embodiments of the present application, use of "may" means "one or more embodiments of the present application". Also, the term "exemplary" is intended to refer to an example or illustration.
Unless otherwise defined, all terms (including engineering and technical terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
In addition, features in the embodiments and examples of the present application may be combined with each other without conflict. In addition, unless explicitly defined or contradicted by context, the particular steps included in the methods described herein are not necessarily limited to the order described, but may be performed in any order or in parallel. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
As described above, the mechanism of interaction between a miRNA and its target gene is not yet completely understood; thus, the algorithms of existing software for miRNA target gene prediction do not truly reflect the biological mechanism of the miRNA-target interaction. This results in the low prediction accuracy of existing miRNA target gene prediction software.
Deep learning is a rapidly developing field in which classification features can be learned autonomously from data; it is therefore widely applied to complex feature recognition and classification tasks. The breakthrough development of deep learning in recent years offers a new possibility for miRNA target gene prediction.
In tasks such as natural language recognition, recurrent neural networks (RNNs) in deep learning are widely used. miRNA and mRNA sequences share some characteristics with natural language in terms of composition; for example, like natural language, they are sequences composed of characters in order. Thus, when using deep learning to process miRNA target gene prediction, it might seem natural to model the task with RNNs.
However, the applicant has appreciated that gene sequences differ from natural language in many ways, including but not limited to the fact that gene sequences do not have the strong contextual semantic associations that natural language has. In fact, gene sequences exhibit strong disorder. Thus, the present application innovatively proposes deconstructing miRNA and mRNA sequences based on image concepts.
Specifically, both miRNA and mRNA sequences consist of adenine (A), uracil (U), guanine (G) and cytosine (C). Thus, the present application proposes that the different base channels of miRNA and mRNA sequences can be processed by analogy with the R, G and B color channels of an image. In this case, a Convolutional Neural Network (CNN) can be used to model miRNA target gene prediction, with key features extracted by convolution from the input layer.
FIG. 1 is a flow chart of a method of training a predictive model of miRNA target genes according to an embodiment of the present application.
In step S1010, input data obtained by splicing the sequence of the miRNA and the sequence of the mRNA is projected into the A base space, the U base space, the G base space, and the C base space, thereby obtaining an A base vector, a U base vector, a G base vector, and a C base vector.
When the input data is projected into the A base space, the U base space, the G base space, and the C base space, only the elements at the base positions whose base matches the name of the base space are assigned 1, and the elements at the remaining base positions are assigned 0. Specifically, referring to fig. 2, assume that the input data 2000 formed by splicing the sequence of the miRNA and the sequence of the mRNA is [ACUGUACG]. The A base vector 2100 obtained by projecting the input data 2000 into the A base space is [10000100]; the U base vector 2200 obtained by projecting the input data 2000 into the U base space is [00101000]; the G base vector 2300 obtained by projecting the input data 2000 into the G base space is [00010001]; and the C base vector 2400 obtained by projecting the input data 2000 into the C base space is [01000010]. After this projection, the A, U, G and C base vectors 2100, 2200, 2300, 2400 are analogous to the input data of the different color channels (e.g., the R, G, B channels) in an image processing task.
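The projection described above amounts to a simple one-hot encoding, one vector per base space. A minimal sketch (the function name is illustrative, not from the patent):

```python
def project_to_base_spaces(seq):
    """Project a spliced miRNA+mRNA sequence into A/U/G/C base vectors.

    For each base space, positions holding that base become 1, all
    other positions become 0.
    """
    seq = seq.upper()
    return {base: [1 if ch == base else 0 for ch in seq] for base in "AUGC"}

vectors = project_to_base_spaces("ACUGUACG")
# A base vector: [1, 0, 0, 0, 0, 1, 0, 0]
# U base vector: [0, 0, 1, 0, 1, 0, 0, 0]
# G base vector: [0, 0, 0, 1, 0, 0, 0, 1]
# C base vector: [0, 1, 0, 0, 0, 0, 1, 0]
```

The four resulting vectors can then be stacked into a 4-channel input, exactly as R, G and B channels are stacked in an image tensor.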
In step S1020, a feature tensor is extracted from the A base vector, the U base vector, the G base vector, and the C base vector using the four sequentially connected stages of convolution, activation and pooling layers of the miRNA target gene prediction model. In step S1020, feature extraction can be performed on the different base vectors by analogy with the feature extraction performed on the input vectors of the different color channels in an image processing task.
In step S1030, the extracted feature tensor is input into the fully connected layer to obtain a prediction result. The fully connected layer is connected to all nodes of the previous layer, thereby integrating all the extracted features.
In step S1040, the obtained prediction result is compared with a reference result (Ground Truth). The prediction result may be the confidence that the miRNA sequence and the mRNA sequence interact.
Finally, in step S1050, the miRNA target gene prediction model is optimized based on the result of the comparison. For example, the parameters of each layer of the miRNA target gene prediction model can be corrected by back propagation based on the result of the comparison.
The training method of the miRNA target gene prediction model creatively converts the interaction judgment of the miRNA sequence and the mRNA sequence into the calculation task of image-like processing, so that the task of miRNA target gene prediction can be performed by using a Convolutional Neural Network (CNN), and a more efficient and accurate miRNA target gene prediction scheme is provided.
According to the present application, the sequences of the miRNAs and the mRNAs include mutually interacting miRNA and mRNA sequence pairs in a positive training set and randomly generated miRNA and mRNA sequences in a negative training set, wherein: the mutually interacting miRNA and mRNA sequence pairs in the positive training set are miRNA sequences, verified by low-throughput experiments, extracted from at least one of the ENCORI, miRDB, miRTarBase, miRNet and miRWalk databases, together with the corresponding mRNA sequences; the miRNA and mRNA sequences in the negative training set are randomly generated sequences, from which are excluded the mutually interacting miRNA and mRNA sequence pairs of the positive training set as well as the miRNA and mRNA sequence pairs predicted to interact by miRanda, RNAhybrid and PITA.
To better train the miRNA target gene prediction model, a labeled training set needs to be prepared.
According to the present application, gene data can first be collected from authoritative databases such as ENCORI, miRDB, miRNet, miRTarBase and miRWalk (hereinafter the "existing databases"), and these data can then be screened. Data in these databases that have been verified by low-throughput experiments such as fluorescence quantitative PCR, Northern blot, luciferase reporter, Western blot and CLIP can be selected as positive data. To better train the miRNA target gene prediction model, the data from the different databases can be cleaned and integrated in advance. For example, entries whose mRNA sequence is shorter than 10 nt can be screened out, and redundant data from different databases removed. For another example, the format of the data can be unified as [name of the miRNA; sequence of the miRNA; name of the mRNA; sequence of the mRNA].
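The cleaning and unification described above might be sketched as follows (the semicolon-delimited text format and the function names are illustrative assumptions; the 10 nt screening threshold is from the text, and the example miRNA name is hypothetical):

```python
def make_record(mirna_name, mirna_seq, mrna_name, mrna_seq):
    """Unified record: [name of miRNA; sequence of miRNA; name of mRNA; sequence of mRNA]."""
    return f"{mirna_name};{mirna_seq};{mrna_name};{mrna_seq}"

def is_valid_entry(mrna_seq, min_len=10):
    # Entries whose mRNA sequence is shorter than 10 nt are screened out.
    return len(mrna_seq) >= min_len

rec = make_record("hsa-miR-example", "UAGCUUAUCAGACUGAUGUUGA",
                  "GENE-example", "AUGCAUGCAUGC")
```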
When integrating the data of the existing databases, for databases that record the action site, the sequence of the action site is extracted directly, and the sequence of the miRNA target gene is randomly padded to about 90 bp; for databases without action sites, RNAhybrid is used to find the interaction site, and the corresponding sequence is then padded in the same way.
The mRNA sequence is padded in this way to simulate situations that may occur when the model is actually used. In actual use, the length of the mRNA sequence input into the miRNA target gene prediction model is usually greater than 90 bp, for example more than 500 bp. If the mRNA sequence input during actual use were shorter than the mRNA sequences used during training, the model might fail to predict the miRNA target gene for lack of sufficiently long mRNA sequence information. Conversely, if the input mRNA sequence is longer than the mRNA sequences used during training, the input mRNA sequence can be split into groups by a sliding-window method, each group combined one by one with the miRNA sequence (usually around 20 bp in length), and the combinations input into the miRNA target gene prediction model to predict the target gene.
According to the application, the length of the sliding window is about 90 bp, and the step of each slide is 20 bp. For example, assuming that the mRNA sequence input into the miRNA target gene prediction model is 130 bp long, the sequence data obtained through the sliding window are the data from 1 bp to 90 bp, from 21 bp to 110 bp, and from 41 bp to 130 bp, respectively.
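The sliding-window grouping described above can be sketched as follows (window and step sizes are from the text; the handling of a trailing partial window is an assumption, since the text does not specify it):

```python
def sliding_windows(mrna_seq, window=90, step=20):
    """Split an mRNA sequence into overlapping windows.

    Returns (start, end, subsequence) tuples with 1-based inclusive
    coordinates, matching the 1-90 / 21-110 / 41-130 example in the text.
    """
    n = len(mrna_seq)
    if n <= window:
        return [(1, n, mrna_seq)]
    windows = []
    start = 0
    while start + window <= n:
        windows.append((start + 1, start + window, mrna_seq[start:start + window]))
        start += step
    # Assumption: if the last window does not reach the sequence end,
    # add a final window anchored at the end so no tail bases are lost.
    if windows[-1][1] < n:
        windows.append((n - window + 1, n, mrna_seq[n - window:]))
    return windows

coords = [(s, e) for s, e, _ in sliding_windows("A" * 130)]
# coords == [(1, 90), (21, 110), (41, 130)]
```

Each windowed subsequence would then be spliced with the ~20 bp miRNA sequence to form one model input of length about 110.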
When preparing negative data, miRNA and mRNA sequences can be randomly generated and paired. However, in this process, the miRNA-mRNA sequence pairs in the existing databases whose interaction has been verified should be excluded, as should the miRNA-mRNA sequence pairs predicted to interact by miRanda, RNAhybrid and PITA (even when the confidence of these predicted interactions is low).
After the positive and negative data are obtained, they can be assembled into a data packet at a ratio of 1:1. Most of the data in the packet is partitioned into a training set (Training set) for training the miRNA target gene prediction model, while the remaining data is partitioned into a test set (Testing set) for testing the training result of the model.
For example, according to one embodiment of the present application, 96.8% of the data in the data packet is selected into the training set, while the remaining 3.2% of the data is selected into the test set.
According to the application, the four sequentially connected stages of convolution, activation and pooling layers comprise: a first stage with 16 4-channel convolution kernels, a relu activation function, and a max pooling layer with window 2; a second stage with 32 16-channel convolution kernels, a relu activation function, and a max pooling layer with window 2; a third stage with 64 32-channel convolution kernels, a relu activation function, and a max pooling layer with window 2; and a fourth stage with 128 64-channel convolution kernels, a relu activation function, and a max pooling layer with window 2.
After projection, the data in the training set, that is, the input data obtained by splicing the sequence of the miRNA and the sequence of the mRNA, are converted into an A base vector, a U base vector, a G base vector, and a C base vector. Since the length of a miRNA is generally about 20 bp and the length of an mRNA in the training set is generally padded to about 90 bp, the lengths of the A, U, G and C base vectors are set to 110 in this application.
The base vectors of these four channels are input into the first-stage convolution layer, which comprises 16 4-channel convolution kernels that convolve the four-channel base vectors. According to the present application, the data edges can be zero-padded before convolution so that the size of the convolved output is consistent with the size of the base vector before convolution. The convolved data can then be passed through a relu activation function to provide non-linearity. The activated data are input into the max pooling layer with window 2 to reduce the data size and prevent overfitting. In the application scenario of the present application, max pooling is more effective at preventing overfitting than schemes such as average pooling. Thus, after the first stage, an intermediate feature tensor of 16 channels with a length of 55 is obtained.
Similarly, the 16-channel intermediate feature tensor is input into the second-stage convolution layer, which comprises 32 16-channel convolution kernels that convolve the 16-channel intermediate feature tensor. The data edges can again be zero-padded before convolution so that the size of the convolved output is consistent with the size of the intermediate feature tensor before convolution. The convolved data can then be passed through a relu activation function to provide non-linearity, and the activated data are input into the max pooling layer with window 2 to reduce the data size and prevent overfitting. Thus, after the second stage, an intermediate feature tensor of 32 channels with a length of 28 is obtained.
The 32-channel intermediate feature tensor is input into the third-stage convolution layer, which comprises 64 32-channel convolution kernels that convolve the 32-channel intermediate feature tensor. The data edges can again be zero-padded before convolution so that the size of the convolved output is consistent with the size of the intermediate feature tensor before convolution. The convolved data can then be passed through a relu activation function to provide non-linearity, and the activated data are input into the max pooling layer with window 2 to reduce the data size and prevent overfitting. Thus, after the third stage, an intermediate feature tensor of 64 channels with a length of 14 is obtained.
The 64-channel intermediate feature tensor is input to the fourth-stage convolution layer, which includes 128 64-channel convolution kernels that convolve the 64-channel intermediate feature tensor. Zero padding may again be applied to the data edges before convolution, so that the convolved data has the same size as the intermediate feature tensor before convolution. The convolved data may then be passed through a ReLU activation function to introduce non-linearity, and the activated data is input to a max pooling layer with a window of 2 to reduce the data size and prevent overfitting. Thus, after the fourth stage, a feature tensor of 128 channels with a length of 7 is obtained.
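The four stages above can be sketched in PyTorch as follows. This is a minimal illustration, not the patented implementation: the spliced input length of 110 and the kernel size of 3 are assumptions chosen to be consistent with the stated intermediate lengths (55, 28, 14, 7), and `ceil_mode=True` pooling is assumed so that the 55 → 28 step comes out as described.

```python
import torch
import torch.nn as nn

def conv_stage(in_ch, out_ch):
    # Zero-padded convolution (output length equals input length),
    # ReLU non-linearity, then max pooling with a window of 2.
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool1d(2, ceil_mode=True),  # ceil_mode reproduces 55 -> 28
    )

stages = nn.Sequential(
    conv_stage(4, 16),    # length 110 -> 55, 16 channels
    conv_stage(16, 32),   # length 55 -> 28, 32 channels
    conv_stage(32, 64),   # length 28 -> 14, 64 channels
    conv_stage(64, 128),  # length 14 -> 7, 128 channels
)

x = torch.zeros(1, 4, 110)   # one 4-channel base-vector input (assumed length 110)
out = stages(x)
print(out.shape)             # torch.Size([1, 128, 7])
```

Note that 128 channels × length 7 gives the 896 elements flattened in the next step.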
The 128-channel feature tensor may be flattened to obtain a one-dimensional vector with a length of 896, hereinafter referred to as the first vector. Half of its elements may then be randomly dropped to form a second vector; this random dropping of half the elements effectively prevents overfitting. The second vector is input to a first fully connected layer of 128 units with ReLU as the activation function. In this application, the data processed by the first fully connected layer is referred to as the third vector. Half of the elements of the third vector may again be randomly dropped to form a fourth vector. The fourth vector can then be input to a second fully connected layer consisting of 1 unit with sigmoid as the activation function, thereby obtaining the miRNA target gene prediction result.
After the prediction result is compared with the reference result, the difference between the two is used as a feedback value to optimize the model parameters. Specifically, the Adam optimizer may be used to update the parameters of each layer of the miRNA target gene prediction model, with binary cross-entropy as the loss function.
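The classification head and one optimization step can be sketched as follows, again as a hedged PyTorch illustration of the flatten / dropout / fully-connected / sigmoid pipeline and the Adam update with binary cross-entropy (batch size 8 and default Adam hyperparameters are assumptions):

```python
import torch
import torch.nn as nn

# Head: flatten 128x7 -> 896, drop half the elements, 128-unit ReLU layer,
# drop half again, then a single sigmoid unit giving the prediction.
head = nn.Sequential(
    nn.Flatten(),         # (N, 128, 7) -> (N, 896): the first vector
    nn.Dropout(0.5),      # randomly drop half the elements: second vector
    nn.Linear(896, 128),  # first fully connected layer: third vector
    nn.ReLU(),
    nn.Dropout(0.5),      # drop half again: fourth vector
    nn.Linear(128, 1),    # second fully connected layer, 1 unit
    nn.Sigmoid(),         # interaction probability
)

optimizer = torch.optim.Adam(head.parameters())
loss_fn = nn.BCELoss()    # binary cross-entropy loss

features = torch.rand(8, 128, 7)              # feature tensors from the conv stages
labels = torch.randint(0, 2, (8, 1)).float()  # reference results

pred = head(features)
loss = loss_fn(pred, labels)  # difference between prediction and reference
optimizer.zero_grad()
loss.backward()
optimizer.step()              # Adam updates the parameters of each layer
print(pred.shape)             # torch.Size([8, 1])
```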
Fig. 3 is a flow chart of a method of predicting miRNA target genes according to an embodiment of the present application.
Referring to fig. 3, the method 3000 of predicting miRNA target genes includes two steps. In step S3010, the target miRNA sequence and the target mRNA sequence are input to a miRNA target gene database integrated from existing databases, to query whether the target miRNA sequence interacts with the target mRNA sequence. The existing databases may include ENCORI, miRDB, miRTarBase, miRNet, and miRWalk. The integrated database includes not only mutually interacting miRNA sequences and corresponding mRNA sequences that were extracted from the existing databases and verified by low-throughput experiments, but also miRNA sequences and corresponding mRNA sequences verified by high-throughput sequencing after screening by a preset threshold. The threshold may be set relatively high to ensure the quality of the database.
If an interaction between the target miRNA sequence and the target mRNA sequence is found in the integrated miRNA target gene database, the query result is returned directly. If no interaction is found in the miRNA target gene database, then in step S3020 the target miRNA sequence and the target mRNA sequence are input into a miRNA target gene prediction model trained with the training method described above, to predict whether the target miRNA sequence and the target mRNA sequence interact.
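The two-step flow of fig. 3 can be sketched as follows. This is a hypothetical illustration only: `predict_target`, the dictionary-backed `database`, and the `model_predict` callable are names introduced here and do not appear in the specification, and the toy sequences are placeholders.

```python
def predict_target(mirna_seq, mrna_seq, database, model_predict):
    """Step S3010: query the integrated miRNA target gene database;
    step S3020: fall back to the trained prediction model on a miss."""
    key = (mirna_seq, mrna_seq)
    if key in database:
        return database[key]   # known interaction: return the query result directly
    return model_predict(mirna_seq, mrna_seq)  # otherwise predict with the CNN

# Toy usage with a stub model that always predicts "no interaction":
db = {("AUGGC", "UUCGA"): True}
stub_model = lambda mirna, mrna: False

print(predict_target("AUGGC", "UUCGA", db, stub_model))  # True (database hit)
print(predict_target("AAAA", "CCCC", db, stub_model))    # False (model fallback)
```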
This method of predicting miRNA target genes makes full use of existing gene databases and, through the strong learning ability of the CNN, can also predict interaction relationships between miRNAs and mRNAs that are not recorded in those databases. It is therefore a comparatively efficient and accurate method of miRNA target gene prediction.
A validation example of the miRNA target gene prediction model proposed in the present application is described below with reference to Table 1.
First, the species "human" is selected on the miRBase official website, and a human-derived miRNA, such as hsa-miR-125b-5p, is arbitrarily selected from it.
TABLE 1 Information related to hsa-miR-125b-5p on the miRBase official website
(Table 1 is reproduced as images in the original publication.)
Then, the mRNA with the highest matching degree (STARD13) is selected from the database and its sequence is acquired.
The sequence of the above miRNA and the sequence of the mRNA are input into the miRNA target gene prediction model proposed in the present application. To verify the quality of the model, the data in miRBase was not used as the training set. The final output of the model is: True; 0.76700366, where True indicates that the miRNA interacts with the mRNA and the number indicates the probability of interaction.
As a control, the sequence of the above miRNA and a randomly selected sequence having no interaction relationship with the miRNA were input into the miRNA target gene prediction model proposed in the present application; the final output of the model was: False (no interaction).
The present application also provides a miRNA target gene prediction system and a training system for its model, which may be implemented in the form of a mobile terminal, a personal computer (PC), a tablet computer, a server, or the like. Referring now to fig. 4, a schematic diagram of a system suitable for implementing embodiments of the present application is shown.
As shown in fig. 4, the computer system includes one or more processors and a communication section, for example one or more central processing units (CPUs) 401 and/or one or more graphics processing units (GPUs) 413, which may perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) 402 or loaded from a storage section 408 into a random access memory (RAM) 403. The communication section 412 may include, but is not limited to, a network card, which may include, but is not limited to, an InfiniBand (IB) network card.
The processor may communicate with the ROM 402 and/or the RAM 403 to execute the executable instructions. It is connected to the communication section 412 through the bus 404 and communicates with other target devices through the communication section 412, so as to perform operations corresponding to any of the methods set forth in the embodiments of the present application, for example: projecting input data formed by splicing the sequence of a miRNA and the sequence of an mRNA into an A base space, a U base space, a G base space and a C base space, thereby obtaining an A base vector, a U base vector, a G base vector and a C base vector; extracting feature tensors from the A base vector, the U base vector, the G base vector and the C base vector using the four sequentially connected stages of convolution, activation and pooling layers of a miRNA target gene prediction model; inputting the extracted feature tensors into fully connected layers to obtain a prediction result; comparing the obtained prediction result with a reference result; and optimizing the miRNA target gene prediction model based on the result of the comparison.
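The projection operation listed above — mapping the spliced RNA sequence into the four base spaces — can be sketched as a simple one-hot encoding. This is a minimal sketch: the function name `to_base_vectors` and the handling of characters outside A/U/G/C (left as all-zero columns) are assumptions, and the exact splicing and padding scheme follows the specification.

```python
import numpy as np

BASES = "AUGC"

def to_base_vectors(seq):
    """Project an RNA sequence into the A, U, G and C base spaces:
    each channel is 1 where that base occurs and 0 elsewhere."""
    vecs = np.zeros((4, len(seq)), dtype=np.float32)
    for i, ch in enumerate(seq.upper()):
        if ch in BASES:
            vecs[BASES.index(ch), i] = 1.0
    return vecs

v = to_base_vectors("AUGCA")
print(v[0])  # A-base channel: [1. 0. 0. 0. 1.]
```

The resulting 4-channel array is the form consumed by the first-stage convolution layer described above.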
In addition, various programs and data required for device operation can be stored in the RAM 403. The CPU 401, the ROM 402 and the RAM 403 are connected to each other through the bus 404. Where the RAM 403 is present, the ROM 402 is an optional module. The RAM 403 stores executable instructions, or executable instructions are written into the ROM 402 at run time, the executable instructions causing the processor 401 to perform the operations corresponding to the method described above. An input/output (I/O) interface 405 is also connected to the bus 404. The communication section 412 may be provided integrally, or may be provided as a plurality of sub-modules (e.g., a plurality of IB network cards) linked on the bus.
The following components are connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card or a modem. The communication section 409 performs communication processing via a network such as the Internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 410 as needed.
It should be noted that the architecture shown in fig. 4 is only one optional implementation. In practice, the number and types of components in fig. 4 may be selected, deleted, added or replaced according to actual needs. Different functional components may be provided separately or integrated: for example, the GPU and the CPU may be provided separately, or the GPU may be integrated on the CPU; likewise, the communication section 412 may be provided separately, or may be integrated on the CPU or the GPU. All such alternative embodiments fall within the scope of the present disclosure.
In particular, the process described above with reference to the flowchart of fig. 1 may be implemented as a computer program product according to the present application. For example, the present application proposes a computer program product comprising computer-readable instructions which, when executed by a processor, implement the following operations: projecting input data formed by splicing the sequence of a miRNA and the sequence of an mRNA into an A base space, a U base space, a G base space and a C base space, thereby obtaining an A base vector, a U base vector, a G base vector and a C base vector; extracting feature tensors from the A base vector, the U base vector, the G base vector and the C base vector using the four sequentially connected stages of convolution, activation and pooling layers of a miRNA target gene prediction model; inputting the extracted feature tensors into fully connected layers to obtain a prediction result; comparing the obtained prediction result with a reference result; and optimizing the miRNA target gene prediction model based on the result of the comparison.
In such embodiments, the computer program product may be downloaded and installed from a network via the communication section 409, and/or read and installed from the removable medium 411. When the computer program product is executed by the central processing unit (CPU) 401, the above-described functions defined in the method of the present application are performed.
The technical solutions of the present application may be implemented in many ways. For example, they may be implemented by software, hardware, firmware, or any combination thereof. The order of steps used to describe the methods is provided only to describe the technical solutions more clearly; unless specifically limited otherwise, the method steps of the present application are not restricted to the order specifically described above. Furthermore, in some embodiments, the present application may also be implemented as a storage medium storing a computer program product.
Fig. 5 compares three sets of target gene prediction results: those obtained using only the miRNA target gene prediction model trained according to the embodiments of the present application, those obtained using both that model and the miRNA target gene database, and those obtained using existing target gene prediction software.
As shown in fig. 5, miRNA target gene prediction was performed on the ENCORI, miRNet, and miRTarBase databases, respectively. For each database, the results from left to right are: predictions by PITA, predictions by miRanda, predictions using only the miRNA target gene prediction model according to the embodiment of the present application, and predictions using both that model and the miRNA target gene database. It should be noted that the model was trained on screened gene data from the existing databases, while the full databases in fig. 5 also include data not screened as positive data. Accordingly, the results obtained using both the model and the database are not exactly 100%, but deviate only slightly from 100%. As can be seen from fig. 5, the prediction accuracy of the trained model alone already exceeds that of conventional target gene prediction software, and the prediction effect obtained by combining the miRNA target gene database with the model far exceeds conventional target gene prediction software.
The above description is merely illustrative of implementations of the present application and of the technical principles applied. Those skilled in the art should understand that the scope of protection of the present application is not limited to the specific combinations of the above technical features; it also encompasses other technical solutions formed by any combination of the above technical features or their equivalents without departing from the technical concept, for example technical solutions formed by replacing the above features with technical features having similar functions disclosed in the present application (but not limited thereto).

Claims (10)

1. A training method of a miRNA target gene prediction model, the training method comprising:
projecting input data formed by splicing the sequence of miRNA and the sequence of mRNA into an A base space, a U base space, a G base space and a C base space, thereby obtaining an A base vector, a U base vector, a G base vector and a C base vector;
extracting feature tensors from the A base vector, the U base vector, the G base vector and the C base vector using four sequentially connected stages of convolution, activation and pooling layers of a miRNA target gene prediction model;
inputting the extracted feature tensor into a full connection layer to obtain a prediction result;
comparing the obtained prediction result with a reference result; and
optimizing the miRNA target gene prediction model based on the result of the comparison.
2. The training method of claim 1, wherein the sequence of the miRNA and the sequence of the mRNA comprise sequences of the miRNA and sequences of the mRNA that interact with each other in a positive training set and sequences of the miRNA and sequences of the mRNA that are randomly generated in a negative training set, wherein:
the sequences of the miRNAs and the sequences of the mRNAs that interact with each other in the positive training set are the sequences of miRNAs and corresponding mRNAs extracted from at least one of the ENCORI, miRDB, miRTarBase, miRNet and miRWalk databases and verified by low-throughput experiments;
the sequences of the miRNAs and the sequences of the mRNAs in the negative training set are randomly generated sequences, and the randomly generated sequences exclude the mutually interacting miRNA and mRNA sequences of the positive training set as well as the interacting miRNA and mRNA sequences predicted by miRanda, RNAhybrid and PITA.
3. The training method of claim 2, wherein the number of pairs of mutually interacting miRNA sequences and mRNA sequences in the positive training set is the same as the number of pairs of randomly generated miRNA sequences and mRNA sequences in the negative training set.
4. The training method of claim 1, wherein the four stages of sequentially connected convolutional layer, active layer, and pooling layer comprise:
a first stage with 16 4-channel convolution kernels, a relu activation function, and a max pooling layer with window 2;
a second stage with 32 16-channel convolution kernels, a relu activation function, and a max-pooling layer with window 2;
a third stage with 64 32-channel convolution kernels, a relu activation function, and a max-pooling layer with window 2; and
a fourth stage with 128 64-channel convolution kernels, a relu activation function, and a max-pooling layer with window 2.
5. The training method of claim 1, wherein inputting the extracted feature tensor into the fully connected layer to obtain the prediction result comprises:
the extracted feature tensor is reduced to be a one-dimensional first vector;
randomly pruning half of the elements from the first vector to form a second vector;
inputting the second vector to a first fully connected layer with relu as an activation function to obtain a third vector;
randomly pruning half of the elements from the third vector to form a fourth vector;
the fourth vector is input to a second fully connected layer with sigmoid as an activation function to obtain the prediction result.
6. The training method of claim 1, wherein optimizing the miRNA target gene prediction model based on the results of the alignment comprises:
correcting the parameters of each layer of the miRNA target gene prediction model using the Adam optimization function, with binary cross-entropy as the loss function.
7. A training system for a miRNA target gene prediction model, the training system comprising:
a memory storing executable instructions; and
one or more processors in communication with the memory to execute the executable instructions to:
projecting input data formed by splicing the sequence of miRNA and the sequence of mRNA into an A base space, a U base space, a G base space and a C base space, thereby obtaining an A base vector, a U base vector, a G base vector and a C base vector;
extracting feature tensors from the A base vector, the U base vector, the G base vector and the C base vector using four sequentially connected stages of convolution, activation and pooling layers of a miRNA target gene prediction model;
inputting the extracted feature tensor into a full connection layer to obtain a prediction result;
comparing the obtained prediction result with a reference result; and
optimizing the miRNA target gene prediction model based on the result of the comparison.
8. A computer-readable storage medium for training of miRNA target gene prediction models, the computer-readable storage medium storing executable instructions executable by one or more processors to perform the following operations:
projecting input data formed by splicing the sequence of miRNA and the sequence of mRNA into an A base space, a U base space, a G base space and a C base space, thereby obtaining an A base vector, a U base vector, a G base vector and a C base vector;
extracting feature tensors from the A base vector, the U base vector, the G base vector and the C base vector using four sequentially connected stages of convolution, activation and pooling layers of a miRNA target gene prediction model;
inputting the extracted feature tensor into a full connection layer to obtain a prediction result;
comparing the obtained prediction result with a reference result; and
optimizing the miRNA target gene prediction model based on the result of the comparison.
9. A method for predicting a miRNA target gene, the method comprising:
inputting a target miRNA sequence and a target mRNA sequence into a miRNA target gene database integrated based on an existing database to query whether the target miRNA sequence interacts with the target mRNA sequence;
in response to not querying the miRNA target gene database for interaction of the target miRNA sequence with the target mRNA sequence, inputting the target miRNA sequence and the target mRNA sequence into a miRNA target gene prediction model trained in accordance with the training method of claim 1 to predict whether the target miRNA sequence interacts with the target mRNA sequence.
10. The method of claim 9, wherein the miRNA target gene database integrated based on the existing database comprises:
the sequence of miRNAs and the corresponding sequences of mRNAs which are extracted from at least one database in ENCORI, miRDB, miRTarBase, miRNet, miRWalk and are verified by low-throughput experiments and interact with each other; and
the sequence of miRNA which passes through the high-throughput sequencing verification after the screening of a preset threshold value and the sequence of corresponding mRNA.
CN202211615351.4A 2022-12-15 2022-12-15 method, system and storage medium for miRNA target gene prediction and model training thereof Pending CN116013409A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211615351.4A CN116013409A (en) 2022-12-15 2022-12-15 method, system and storage medium for miRNA target gene prediction and model training thereof


Publications (1)

Publication Number Publication Date
CN116013409A true CN116013409A (en) 2023-04-25

Family

ID=86036435

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211615351.4A Pending CN116013409A (en) 2022-12-15 2022-12-15 method, system and storage medium for miRNA target gene prediction and model training thereof

Country Status (1)

Country Link
CN (1) CN116013409A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination