CN111370055A - Intron retention prediction model establishing method and prediction method thereof - Google Patents

Intron retention prediction model establishing method and prediction method thereof

Info

Publication number
CN111370055A
Authority
CN
China
Prior art keywords
intron
prediction model
training
sequence
splice site
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010146731.2A
Other languages
Chinese (zh)
Other versions
CN111370055B (en)
Inventor
Hong-Dong Li (李洪东)
Jiantao Zheng (郑剑涛)
Cuixiang Lin (林翠香)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202010146731.2A
Publication of CN111370055A
Application granted
Publication of CN111370055B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a method for establishing an intron retention prediction model, which comprises: collecting simulated data and real data related to intron retention; merging all independent introns in the genome and defining the result as a standard template; acquiring an image data set of the intron sequence read-distribution patterns in the simulated data and preprocessing it to obtain a processed data set; dividing the processed data set into a training set and a test set according to a set proportion; and training a neural network model on the training set to obtain the finally established neural network intron retention prediction model. The invention also discloses a prediction method comprising the intron retention prediction model establishing method. The invention can visualize and predict introns based on the intron-retention read-distribution pattern, with high reliability and good accuracy.

Description

Intron retention prediction model establishing method and prediction method thereof
Technical Field
The invention specifically relates to an intron retention prediction model establishing method and a prediction method thereof.
Background
Intron retention is one form of alternative splicing, in which an intron of the precursor mRNA is not spliced out and remains in the mature mRNA. Intron retention was long regarded as the result of mis-splicing and received little attention. Many recent studies, however, have shown that intron retention is associated with gene expression regulation and with complex diseases (e.g., Alzheimer's disease). With the development of high-throughput sequencing technology, many methods for intron retention detection have been proposed, most notably iREAD and IRFinder. iREAD detects intron retention by computing an entropy value under the assumption that reads over a retained intron are uniformly distributed, and applies relatively stringent filtering criteria. IRFinder instead measures intron retention by calculating the IR-ratio, the proportion of transcripts in which the intron is retained.
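To illustrate the entropy criterion, the following minimal sketch computes the Shannon entropy of reads binned along an intron; this is a schematic illustration of the idea only, not iREAD's actual implementation, and the bin counts are invented:

```python
import math

def read_entropy(bin_counts):
    """Shannon entropy of the read counts binned along an intron.
    Uniform coverage gives the maximum value log2(#bins); reads piled
    into a few bins give a low value."""
    total = sum(bin_counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in bin_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

print(read_entropy([25, 25, 25, 25]))  # 2.0  (perfectly uniform coverage)
print(read_entropy([97, 1, 1, 1]))     # ~0.24 (skewed, unlikely retention)
```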
Although these methods have been successfully applied in real settings, analyses based on sequence features are more or less limited by the biases that intron retention can introduce. Their robustness is therefore insufficient and their reliability low, which has constrained the development of related techniques.
Disclosure of Invention
The invention aims to provide a method for establishing an intron retention prediction model with high reliability and high accuracy.
The invention also aims to provide a prediction method comprising the intron-retention prediction model building method.
The invention provides a method for establishing an intron retention prediction model, which comprises the following steps:
S1, collecting simulated data and real data related to intron retention;
S2, merging all independent introns in the genome and defining the result as a standard template;
S3, acquiring an image data set of the read-distribution patterns of the intron sequences in the simulated data obtained in step S1, and preprocessing it to obtain a processed data set;
S4, dividing the processed data set obtained in step S3 into a training set and a test set according to a set proportion;
and S5, training the neural network model with the training set obtained in step S4, thereby obtaining the finally established neural network intron retention prediction model.
The method for establishing the intron retention prediction model further comprises the following steps:
S6, calculating evaluation parameters of the neural network intron retention prediction model obtained in step S5 on the test set obtained in step S4;
S7, acquiring an image test set of the intron sequence read-distribution patterns of the real data obtained in step S1;
S8, predicting intron retention on the test set obtained in step S7 with the neural network intron retention prediction model obtained in step S5, thereby obtaining a predicted intron retention set;
S9, acquiring, for the predicted intron retention set obtained in step S8, the 5' end sequence of W1+N1 bases consisting of W1 bases on the exon side and N1 bases on the intron side;
S10, acquiring, for the predicted intron retention set obtained in step S8, the 3' end sequence of W2+N2 bases consisting of W2 bases on the exon side and N2 bases on the intron side;
S11, calculating splice site strengths from the 5' end sequences of W1+N1 bases obtained in step S9 and the 3' end sequences of W2+N2 bases obtained in step S10, thereby obtaining a 5' end average splice site strength value and a 3' end average splice site strength value;
and S12, evaluating the neural network intron retention prediction model established in step S5 according to the 5' end and 3' end average splice site strength values obtained in step S11.
Step S1, collecting simulated data and real data related to intron retention, specifically: generating with the BEER algorithm a simulated data sequence file SIMU30 containing a determined number of introns, with a sequencing depth of 30 million reads, a read length of 100 bases, 15000 generated genes and 69338 introns; and a real data sequence file APP from the Tau and APP mouse model study of the Accelerating Medicines Partnership for Alzheimer's Disease (AMP-AD) project, with a sequencing depth of 100 million reads and a read length of 101 bases.
Merging all independent introns in the genome to define a standard template in step S2 specifically comprises the following steps:
A. extracting the set of all independent introns, independent_intron, from the release-75 annotation GTF file of the GRCm38 mouse genome; an independent intron is defined as an intron that does not overlap with any exon of any transcript isoform;
B. merging, gene by gene, the introns of the independent intron set independent_intron obtained in step A whose coordinate intervals overlap, to obtain the final independent intron set intron_cluster.
Extracting the set of all independent introns independent_intron in step A specifically comprises merging all exons within a chromosome and then deleting all exons from the gene regions, so that the remaining gaps are the independent introns (see the sketch below).
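A minimal sketch of steps A and B on in-memory intervals follows; the interval helpers are illustrative, and the patent's actual implementation parses the GTF annotation:

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based half-open [start, end)

def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping intervals (used to merge all exons, step A,
    and to merge overlapping introns into clusters, step B)."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def independent_introns(gene: Interval, exons: List[Interval]) -> List[Interval]:
    """Subtract the merged exons from the gene region; the gaps that
    remain overlap no exon of any transcript, i.e. independent introns."""
    introns: List[Interval] = []
    cursor = gene[0]
    for ex_start, ex_end in merge_intervals(exons):
        if ex_start > cursor:
            introns.append((cursor, ex_start))
        cursor = max(cursor, ex_end)
    if cursor < gene[1]:
        introns.append((cursor, gene[1]))
    return introns

# Example: one gene whose two transcripts have partially overlapping exons.
gene = (100, 1000)
exons = [(100, 250), (200, 300), (500, 700), (900, 1000)]
print(independent_introns(gene, exons))  # [(300, 500), (700, 900)]
```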
Step S3, acquiring the image data set of the intron sequence read-distribution patterns in the simulated data obtained in step S1 and preprocessing it into the processed data set, specifically comprises the following steps (a code sketch follows the list):
a. performing IGV visualization on each intron in the simulated data sequence file SIMU30 obtained in step S1 to obtain a primary visualization image;
b. saving two sequence visualization images of 40 bases each per intron, one spanning 20 bases to each side of the 5' end and one spanning 20 bases to each side of the 3' end; the height of each visualization image is 100 mm, and the height of the bar chart representing base read abundance is normalized;
c. cropping, from each image obtained in step b, the region spanning pixels 131-231 vertically and pixels 280-1070 horizontally;
d. merging the two images cropped in step c side by side to obtain the final processed data set.
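Steps c and d can be sketched with Pillow as follows; the file names are hypothetical, and the pixel coordinates are the ones given above:

```python
from PIL import Image  # Pillow

def crop_and_merge(png_5p: str, png_3p: str, out_png: str) -> None:
    """Crop the informative band from the two IGV snapshots of one intron
    (rows 131-231, columns 280-1070 of the 621x1150-pixel originals) and
    join the 5' and 3' panels side by side."""
    box = (280, 131, 1070, 231)  # Pillow box: (left, upper, right, lower)
    left = Image.open(png_5p).convert("RGB").crop(box)
    right = Image.open(png_3p).convert("RGB").crop(box)
    merged = Image.new("RGB", (left.width + right.width, left.height))
    merged.paste(left, (0, 0))
    merged.paste(right, (left.width, 0))
    merged.save(out_png)  # final 1580x100-pixel training image

# crop_and_merge("intron42_5p.png", "intron42_3p.png", "intron42.png")
```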
Step S4, dividing the processed data set obtained in step S3 into a training set and a test set according to a set proportion, specifically: in the simulated data sequence file SIMU30 obtained in step S1, an intron whose total read count is greater than a first set value, whose FPKM (Fragments Per Kilobase of transcript per Million mapped reads) is greater than a second set value, and whose continuous read count is greater than a third set value is defined as a positive sample, and the remaining introns are negative samples; X2 positive samples and X2 negative samples are then randomly drawn to form the final data set, which is divided into a training set and a test set according to the set proportion; X2 is a positive integer. A sketch of this step follows.
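A minimal sketch of the labeling and splitting, using the example threshold values given in the embodiment; the dictionary field names are assumptions:

```python
import random

def label_and_split(introns, total_min=10, fpkm_min=0.3, cont_min=1,
                    x2=5000, train_frac=0.7, seed=0):
    """Label introns with the three thresholds of step S4 (defaults are
    the embodiment's example values), draw a balanced sample of X2
    positives and X2 negatives, and split it 7:3 into train/test."""
    pos, neg = [], []
    for it in introns:  # each intron: dict with assumed field names
        if (it["total_reads"] > total_min and it["fpkm"] > fpkm_min
                and it["continuous_reads"] > cont_min):
            pos.append(it)
        else:
            neg.append(it)
    rng = random.Random(seed)
    data = ([(s, 1) for s in rng.sample(pos, x2)]
            + [(s, 0) for s in rng.sample(neg, x2)])
    rng.shuffle(data)
    cut = int(train_frac * len(data))
    return data[:cut], data[cut:]
```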
The neural network model in step S5 is specifically a VGG16 network structure model.
Step S5, training the neural network model with the training set obtained in step S4 to obtain the finally established neural network intron retention prediction model, specifically comprises the following steps (a Keras-style sketch follows the list):
(1) obtaining a VGG16 network structure model pre-trained on the ImageNet task together with its weight parameter file; the network structure comprises 13 convolutional layers;
(2) loading the network and weights obtained in step (1) as a pre-trained network, but freezing it so that it does not participate in training;
(3) defining a binary classification network and training it on the training set obtained in step S4; the classification network has 3 layers in total: the first 2 are fully connected layers with 256 and 64 neurons respectively, each followed by a Dropout layer to prevent overfitting, with neuron drop probabilities of 0.5 and 0.3 respectively, and the last layer is a sigmoid layer for binary classification;
(4) after the classification network is trained, unfreezing the last 3 convolutional layers of the pre-trained network and training the classification network and the pre-trained network together on the training set obtained in step S4 again to fine-tune the weights;
(5) the parameters of the model training process are set as follows:
the total number of model parameters is 33 million, of which 26 million are trainable and 7 million are frozen;
the loss function is the binary cross-entropy loss, calculated as
$L = -\frac{1}{N}\sum_{i=1}^{N}\left[t_i \log y_i + (1 - t_i)\log(1 - y_i)\right]$
where $i$ indexes the $N$ samples, $t_i$ is the true label of sample $i$, and $y_i$ is the predicted label of sample $i$;
the optimizer is RMSprop, and the learning rate is 2e-5The number of iterations is 30;
the evaluation index is the accuracy, calculated as:
$\mathrm{accuracy} = \frac{\mathrm{TruePositive} + \mathrm{TrueNegative}}{\mathrm{AllSamples}}$
where TruePositive is the number of samples predicted positive that are truly positive, TrueNegative is the number of samples predicted negative that are truly negative, and AllSamples is the total number of samples;
a ReduceLROnPlateau callback monitors the evaluation metric every 2 iterations and, if it has not improved, reduces the learning rate by 50%;
and if the evaluation accuracy does not improve within 10 iterations, training is stopped early.
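A minimal Keras sketch of steps (1)-(5), assuming Keras's bundled ImageNet VGG16 weights stand in for the pre-trained network of step (1); the input size (matching the merged 100x1580-pixel images) and the variable names are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers, callbacks

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(100, 1580, 3))
base.trainable = False  # step (2): freeze the pre-trained network

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),  # step (3): binary head
])
model.compile(optimizer=optimizers.RMSprop(learning_rate=2e-5),
              loss="binary_crossentropy", metrics=["accuracy"])

cbs = [
    callbacks.ReduceLROnPlateau(monitor="val_accuracy", factor=0.5, patience=2),
    callbacks.EarlyStopping(monitor="val_accuracy", patience=10),
]
# model.fit(train_images, train_labels, validation_split=0.1,
#           epochs=30, callbacks=cbs)

# Step (4): unfreeze the last 3 convolutional layers and fine-tune.
base.trainable = True
for layer in base.layers[:-4]:  # freeze all but block5_conv1..3 (+ pool)
    layer.trainable = False
model.compile(optimizer=optimizers.RMSprop(learning_rate=2e-5),
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, validation_split=0.1,
#           epochs=30, callbacks=cbs)
```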
In step S6, the evaluation parameters of the neural network intron retention prediction model are calculated on the test set obtained in step S4; specifically, the AUC value of the model is calculated on that test set, for example as sketched below.
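For instance, with scikit-learn (a hypothetical helper; the patent does not prescribe a particular library):

```python
from sklearn.metrics import roc_auc_score

def evaluate_auc(model, test_images, test_labels):
    """Step S6: AUC of the trained model's sigmoid scores on the test set."""
    scores = model.predict(test_images).ravel()
    return roc_auc_score(test_labels, scores)
```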
Step S7, acquiring the image test set of the intron sequence read-distribution patterns of the real data obtained in step S1, specifically: the real data sequence file APP obtained in step S1 is input into the prediction tools iREAD and IRFinder, yielding two intron retention prediction sets IR1 and IR2 respectively; IR1 and IR2 are mapped onto the independent intron set intron_cluster by the rule of maximal coordinate-interval overlap (see the sketch below), and their intersection IC is taken; IGV visualization, image cropping and merging are then applied to each intron coordinate in the intersection IC, thereby obtaining the real-data intron sequence read-distribution image test set real_test.
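A minimal sketch of the maximal-overlap mapping rule, with introns as (start, end) pairs; the names are illustrative:

```python
def overlap_len(a, b):
    """Length of the overlap of two (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def map_to_clusters(predicted, clusters):
    """Map each predicted intron to the intron_cluster interval with the
    maximum coordinate overlap; introns overlapping nothing are dropped."""
    mapped = set()
    for p in predicted:
        best = max(clusters, key=lambda c: overlap_len(p, c))
        if overlap_len(p, best) > 0:
            mapped.add(best)
    return mapped

# Intersection IC of the two tools' mapped prediction sets:
# ic = map_to_clusters(ir1, clusters) & map_to_clusters(ir2, clusters)
```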
Calculating splice site strengths in step S11 from the 5' end sequences of W1+N1 bases obtained in step S9 and the 3' end sequences of W2+N2 bases obtained in step S10, to obtain the 5' end and 3' end average splice site strength values, specifically: the 5' end sequence set score5ss obtained in step S9 and the 3' end sequence set score3ss obtained in step S10 are input into the MaxEntScan model and scored with the maximum entropy model, giving a strength value for each splice site; the splice site strengths of the 5' end sequences and of the 3' end sequences are then averaged to obtain the final 5' end and 3' end average splice site strength values (see the sketch below).
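A possible sketch of this scoring step, assuming the MaxEntScan Perl scripts (score5.pl for 5' sites, score3.pl for 3' sites) are installed locally; the helper name and the output parsing are assumptions, not the patent's code:

```python
import statistics
import subprocess
import tempfile

def average_splice_site_strength(seqs, script="score5.pl"):
    """Score a set of splice-site sequences with a MaxEntScan script and
    return their mean strength. Assumes the script reads a file of
    sequences and prints one 'sequence<TAB>score' line per input."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
        fh.write("\n".join(seqs))
        path = fh.name
    out = subprocess.run(["perl", script, path], capture_output=True,
                         text=True, check=True).stdout
    scores = [float(line.split("\t")[-1]) for line in out.splitlines() if line]
    return statistics.mean(scores)
```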
In step S12, the neural network intron retention prediction model established in step S5 is evaluated according to the 5' end and 3' end average splice site strength values obtained in step S11: the smaller these two values are, the better the prediction effect of the model.
The invention also provides a prediction method comprising the intron retention prediction model building method, and specifically comprises the following steps:
and S13, predicting the intron retention result by adopting the neural network intron retention prediction model obtained in the step S5.
According to the intron retention prediction model establishing method and the prediction method thereof, the deep-learning approach based on the intron-retention read-distribution pattern predicts intron retention in a more general and more interpretable manner; by combining deep model construction with transfer learning, knowledge from a large-scale image classification task is migrated to complete and improve the intron retention prediction task; meanwhile, for real data sets lacking a gold standard, the average splice site strength of the 5' end and 3' end sequences of the predicted retained introns is proposed as a measure of overall prediction quality. The method can therefore visualize and predict introns based on the intron-retention read-distribution pattern, with high reliability and good accuracy.
Drawings
FIG. 1 is a schematic flow chart of a method for building an intron-retention prediction model according to the present invention.
FIG. 2 is a diagram illustrating the result of visualizing the distribution pattern of intron-retained readings according to the present invention.
FIG. 3 is a schematic structural diagram of the deep learning model VGG16 of the present invention.
FIG. 4 is a flow chart of a prediction method according to the present invention.
Detailed Description
FIG. 1 is a schematic flow chart of a method for building an intron-retention prediction model according to the present invention: the invention provides a method for establishing an intron retention prediction model, which comprises the following steps:
S1, collecting simulated data and real data related to intron retention; specifically, the BEER algorithm is used to generate a simulated data sequence file SIMU30 containing a determined number of introns, with a sequencing depth of 30 million reads, a read length of 100 bases, 15000 generated genes and 69338 introns; and a real data sequence file APP from the Tau and APP mouse model study of the Accelerating Medicines Partnership for Alzheimer's Disease (AMP-AD) project, with a sequencing depth of 100 million reads and a read length of 101 bases;
S2, merging all independent introns in the genome and defining the result as a standard template; the invention is applied here to mouse data, so the genome may be the mouse genome; the template is defined by the following steps:
A. extracting the set of all independent introns, independent_intron, from the release-75 annotation GTF file of the GRCm38 mouse genome; an independent intron is defined as an intron that does not overlap with any exon of any transcript isoform;
wherein extracting the set of all independent introns independent_intron specifically comprises merging all exons within a chromosome and then deleting all exons from the gene regions, thereby obtaining all independent introns;
B. merging, gene by gene, the introns of the independent intron set independent_intron obtained in step A whose coordinate intervals overlap, to obtain the final independent intron set intron_cluster;
S3, acquiring an image data set of the read-distribution patterns of the intron sequences in the simulated data obtained in step S1, and preprocessing it to obtain a processed data set; specifically, the data set is obtained and processed with the following steps:
a. performing IGV visualization on each intron in the simulated data sequence file SIMU30 obtained in step S1 to obtain a primary visualization image;
b. because intron lengths vary greatly, two sequence visualization images of 40 bases each are saved per intron, one spanning 20 bases to each side of the 5' end and one spanning 20 bases to each side of the 3' end; the height of each visualization image is 100 mm, and the height of the bar chart representing base read abundance is normalized;
c. each single-segment visualization image obtained in step b is originally 621 pixels tall and 1150 pixels wide; the region spanning pixels 131-231 vertically and pixels 280-1070 horizontally is therefore cropped;
d. the two images cropped in step c are merged side by side to obtain the final processed data set; the visualization result is shown in FIG. 2;
S4, dividing the processed data set obtained in step S3 into a training set and a test set according to a set proportion; specifically, in the simulated data sequence file SIMU30 obtained in step S1, an intron whose total read count is greater than a first set value (e.g., 10), whose FPKM (Fragments Per Kilobase of transcript per Million mapped reads) is greater than a second set value (e.g., 0.3), and whose continuous read count is greater than a third set value (e.g., 1) is defined as a positive sample, and the remaining introns are negative samples; X2 (e.g., 5000) positive samples and X2 negative samples are then randomly drawn to form the final data set, which is divided into a training set and a test set according to the set proportion (e.g., 7:3); X2 is a positive integer.
S5, training the neural network model with the training set obtained in step S4, thereby obtaining the finally established neural network intron retention prediction model; the prediction model is preferably a VGG16 model, and when VGG16 is selected the model can be trained with the following steps:
(1) obtaining a VGG16 network structure model pre-trained on the ImageNet task (shown in FIG. 3) together with its weight parameter file; the network structure comprises 13 convolutional layers;
(2) loading the network and weights obtained in step (1) as a pre-trained network, but freezing it so that it does not participate in training;
(3) defining a binary classification network and training it on the training set obtained in step S4; the classification network has 3 layers in total: the first 2 are fully connected layers with 256 and 64 neurons respectively, each followed by a Dropout layer to prevent overfitting, with neuron drop probabilities of 0.5 and 0.3 respectively, and the last layer is a sigmoid layer for binary classification;
(4) after the classification network is trained, unfreezing the last 3 convolutional layers of the pre-trained network and training the classification network and the pre-trained network together on the training set obtained in step S4 again to fine-tune the weights;
(5) the parameters of the model training process are set as follows:
the total number of model parameters is 33 million, of which 26 million are trainable and 7 million are frozen;
the loss function is the binary cross-entropy loss, calculated as
$L = -\frac{1}{N}\sum_{i=1}^{N}\left[t_i \log y_i + (1 - t_i)\log(1 - y_i)\right]$
where $i$ indexes the $N$ samples, $t_i$ is the true label of sample $i$, and $y_i$ is the predicted label of sample $i$;
the optimizer is RMSprop, and the learning rate is 2e-5The number of iterations is 30;
the evaluation index is the accuracy, calculated as:
$\mathrm{accuracy} = \frac{\mathrm{TruePositive} + \mathrm{TrueNegative}}{\mathrm{AllSamples}}$
where TruePositive is the number of samples predicted positive that are truly positive, TrueNegative is the number of samples predicted negative that are truly negative, and AllSamples is the total number of samples;
a ReduceLROnPlateau callback monitors the evaluation metric every 2 iterations and, if it has not improved, reduces the learning rate by 50%;
and if the evaluation accuracy does not improve within 10 iterations, training is stopped early;
S6, calculating evaluation parameters (preferably the AUC value) of the neural network intron retention prediction model obtained in step S5 on the test set obtained in step S4;
S7, acquiring an image test set of the intron sequence read-distribution patterns of the real data obtained in step S1; specifically, the real data sequence file APP obtained in step S1 is input into the prediction tools iREAD and IRFinder, yielding two intron retention prediction sets IR1 and IR2 respectively; IR1 and IR2 are mapped onto the independent intron set intron_cluster by the rule of maximal coordinate-interval overlap, and their intersection IC is taken; IGV visualization, image cropping, merging and similar operations are then applied to each intron coordinate in the intersection IC, giving the real-data intron sequence read-distribution image test set real_test;
S8, predicting intron retention on the test set obtained in step S7 with the neural network intron retention prediction model obtained in step S5, thereby obtaining a predicted intron retention set;
S9, acquiring, for the predicted intron retention set obtained in step S8, the 5' end sequence of W1+N1 bases consisting of W1 bases on the exon side and N1 bases on the intron side;
S10, acquiring, for the predicted intron retention set obtained in step S8, the 3' end sequence of W2+N2 bases consisting of W2 bases on the exon side and N2 bases on the intron side;
S11, calculating splice site strengths from the 5' end sequences of W1+N1 bases obtained in step S9 and the 3' end sequences of W2+N2 bases obtained in step S10, thereby obtaining a 5' end average splice site strength value and a 3' end average splice site strength value; specifically, the 5' end sequence set score5ss obtained in step S9 and the 3' end sequence set score3ss obtained in step S10 are input into the MaxEntScan model and scored with the maximum entropy model, giving a strength value for each splice site; the splice site strengths of the 5' end sequences and of the 3' end sequences are then averaged to obtain the final 5' end and 3' end average splice site strength values;
S12, evaluating the neural network intron retention prediction model established in step S5 according to the 5' end and 3' end average splice site strength values obtained in step S11; specifically, the smaller these two values are, the better the prediction effect of the model.
The method of the invention was verified as follows:
the invention is evaluated on the simulation data SIMU30 and the real data set APP, while the tools compared with the invention are iREAD and IRFinder.
1) SIMU30 simulation dataset experimental analysis
On the 3000 test-set samples of the SIMU30 simulated data, the invention achieved an accuracy of 0.925 and an AUC of 0.975.
2) APP true data set experimental analysis
Because the real data lack a gold standard, on the one hand the prediction labels of other methods can only be used as stand-in true labels to test the AUC gap between the VGG16 model and those methods; on the other hand, custom evaluation indices can be defined to verify the effectiveness of the invention. For the AUC evaluation, the VGG16 model of the invention was compared with iREAD and IRFinder on the real-data image test set real_test; see Table 1. real_test contains 68326 samples. With iREAD as the gold standard there are 2816 positive and 65510 negative samples, and the AUC of the VGG16 model is superior to IRFinder's. With IRFinder as the gold standard (19044 positive and 49282 negative samples), the invention likewise outperforms iREAD.
TABLE 1. AUC evaluation results of the invention, iREAD and IRFinder.
In addition, the invention defines the 5' end and 3' end splice site strengths to measure the prediction effect of the VGG16 model; the lower the average splice site strength, the better the overall prediction effect of the model. The evaluation results for the average splice site strengths are shown in Table 2.
TABLE 2. Evaluation results of the average splice site strengths of the invention, iREAD and IRFinder.
The results in Table 2 show that, although the invention is slightly inferior to IRFinder and iREAD in average splice site strength, the average splice site strength of IRFinder and iREAD rises as the number of introns involved in the calculation increases, whereas that of the invention falls. This reflects that the VGG16 model designed by the invention is more robust than IRFinder and iREAD.
FIG. 4 is a schematic flow chart of the prediction method of the present invention: the prediction method provided by the invention comprises the intron retention prediction model building method, and specifically comprises the following steps:
S1, collecting simulated data and real data related to intron retention; specifically, the BEER algorithm is used to generate a simulated data sequence file SIMU30 containing a determined number of introns, with a sequencing depth of 30 million reads, a read length of 100 bases, 15000 generated genes and 69338 introns; and a real data sequence file APP from the Tau and APP mouse model study of the Accelerating Medicines Partnership for Alzheimer's Disease (AMP-AD) project, with a sequencing depth of 100 million reads and a read length of 101 bases;
S2, merging all independent introns in the genome and defining the result as a standard template, by the following steps:
A. extracting the set of all independent introns, independent_intron, from the release-75 annotation GTF file of the GRCm38 mouse genome; an independent intron is defined as an intron that does not overlap with any exon of any transcript isoform;
wherein extracting the set of all independent introns independent_intron specifically comprises merging all exons within a chromosome and then deleting all exons from the gene regions, thereby obtaining all independent introns;
B. merging, gene by gene, the introns of the independent intron set independent_intron obtained in step A whose coordinate intervals overlap, to obtain the final independent intron set intron_cluster;
S3, acquiring an image data set of the read-distribution patterns of the intron sequences in the simulated data obtained in step S1, and preprocessing it to obtain a processed data set; specifically, the data set is obtained and processed with the following steps:
a. performing IGV visualization on each intron in the simulated data sequence file SIMU30 obtained in step S1 to obtain a primary visualization image;
b. because intron lengths vary greatly, two sequence visualization images of 40 bases each are saved per intron, one spanning 20 bases to each side of the 5' end and one spanning 20 bases to each side of the 3' end; the height of each visualization image is 100 mm, and the height of the bar chart representing base read abundance is normalized;
c. each single-segment visualization image obtained in step b is originally 621 pixels tall and 1150 pixels wide; the region spanning pixels 131-231 vertically and pixels 280-1070 horizontally is therefore cropped;
d. the two images cropped in step c are merged side by side to obtain the final processed data set; the visualization result is shown in FIG. 2;
S4, dividing the processed data set obtained in step S3 into a training set and a test set according to a set proportion; specifically, in the simulated data sequence file SIMU30 obtained in step S1, an intron whose total read count is greater than a first set value (e.g., 10), whose FPKM (Fragments Per Kilobase of transcript per Million mapped reads) is greater than a second set value (e.g., 0.3), and whose continuous read count is greater than a third set value (e.g., 1) is defined as a positive sample, and the remaining introns are negative samples; X2 (e.g., 5000) positive samples and X2 negative samples are then randomly drawn to form the final data set, which is divided into a training set and a test set according to the set proportion (e.g., 7:3); X2 is a positive integer.
S5, training the neural network model with the training set obtained in step S4, thereby obtaining the finally established neural network intron retention prediction model; the prediction model is preferably a VGG16 model, and when VGG16 is selected the model can be trained with the following steps:
(1) obtaining a VGG16 network structure model pre-trained on the ImageNet task (shown in FIG. 3) together with its weight parameter file; the network structure comprises 13 convolutional layers;
(2) loading the network and weights obtained in step (1) as a pre-trained network, but freezing it so that it does not participate in training;
(3) defining a binary classification network and training it on the training set obtained in step S4; the classification network has 3 layers in total: the first 2 are fully connected layers with 256 and 64 neurons respectively, each followed by a Dropout layer to prevent overfitting, with neuron drop probabilities of 0.5 and 0.3 respectively, and the last layer is a sigmoid layer for binary classification;
(4) after the classification network is trained, unfreezing the last 3 convolutional layers of the pre-trained network and training the classification network and the pre-trained network together on the training set obtained in step S4 again to fine-tune the weights;
(5) the parameters of the model training process are set as follows:
the total number of model parameters is 33 million, of which 26 million are trainable and 7 million are frozen;
the loss function is the binary cross-entropy loss, calculated as
$L = -\frac{1}{N}\sum_{i=1}^{N}\left[t_i \log y_i + (1 - t_i)\log(1 - y_i)\right]$
where $i$ indexes the $N$ samples, $t_i$ is the true label of sample $i$, and $y_i$ is the predicted label of sample $i$;
the optimizer is RMSprop, and the learning rate is 2e-5The number of iterations is 30;
the evaluation index is the accuracy, calculated as:
$\mathrm{accuracy} = \frac{\mathrm{TruePositive} + \mathrm{TrueNegative}}{\mathrm{AllSamples}}$
where TruePositive is the number of samples predicted positive that are truly positive, TrueNegative is the number of samples predicted negative that are truly negative, and AllSamples is the total number of samples;
a ReduceLROnPlateau callback monitors the evaluation metric every 2 iterations and, if it has not improved, reduces the learning rate by 50%;
and if the evaluation accuracy does not improve within 10 iterations, training is stopped early;
S6, calculating evaluation parameters (preferably the AUC value) of the neural network intron retention prediction model obtained in step S5 on the test set obtained in step S4;
S7, acquiring an image test set of the intron sequence read-distribution patterns of the real data obtained in step S1; specifically, the real data sequence file APP obtained in step S1 is input into the prediction tools iREAD and IRFinder, yielding two intron retention prediction sets IR1 and IR2 respectively; IR1 and IR2 are mapped onto the independent intron set intron_cluster by the rule of maximal coordinate-interval overlap, and their intersection IC is taken; IGV visualization, image cropping, merging and similar operations are then applied to each intron coordinate in the intersection IC, giving the real-data intron sequence read-distribution image test set real_test;
S8, predicting intron retention on the test set obtained in step S7 with the neural network intron retention prediction model obtained in step S5, thereby obtaining a predicted intron retention set;
S9, acquiring, for the predicted intron retention set obtained in step S8, the 5' end sequence of W1+N1 bases consisting of W1 bases on the exon side and N1 bases on the intron side;
S10, acquiring, for the predicted intron retention set obtained in step S8, the 3' end sequence of W2+N2 bases consisting of W2 bases on the exon side and N2 bases on the intron side;
S11, calculating splice site strengths from the 5' end sequences of W1+N1 bases obtained in step S9 and the 3' end sequences of W2+N2 bases obtained in step S10, thereby obtaining a 5' end average splice site strength value and a 3' end average splice site strength value; specifically, the 5' end sequence set score5ss obtained in step S9 and the 3' end sequence set score3ss obtained in step S10 are input into the MaxEntScan model and scored with the maximum entropy model, giving a strength value for each splice site; the splice site strengths of the 5' end sequences and of the 3' end sequences are then averaged to obtain the final 5' end and 3' end average splice site strength values;
S12, evaluating the neural network intron retention prediction model established in step S5 according to the 5' end and 3' end average splice site strength values obtained in step S11; specifically, the smaller these two values are, the better the prediction effect of the model;
and S13, predicting the intron retention result by adopting the neural network intron retention prediction model obtained in the step S5.

Claims (14)

1. An intron retention prediction model establishing method, comprising the following steps:
S1, collecting simulated data and real data related to intron retention;
S2, merging all independent introns in the genome and defining the result as a standard template;
S3, acquiring an image data set of the read-distribution patterns of the intron sequences in the simulated data obtained in step S1, and preprocessing it to obtain a processed data set;
S4, dividing the processed data set obtained in step S3 into a training set and a test set according to a set proportion;
and S5, training the neural network model with the training set obtained in step S4, thereby obtaining the finally established neural network intron retention prediction model.
2. The intron retention prediction model establishing method according to claim 1, further comprising the following steps:
S6, calculating evaluation parameters of the neural network intron retention prediction model obtained in step S5 on the test set obtained in step S4;
S7, acquiring an image test set of the intron sequence read-distribution patterns of the real data obtained in step S1;
S8, predicting intron retention on the test set obtained in step S7 with the neural network intron retention prediction model obtained in step S5, thereby obtaining a predicted intron retention set;
S9, acquiring, for the predicted intron retention set obtained in step S8, the 5' end sequence of W1+N1 bases consisting of W1 bases on the exon side and N1 bases on the intron side;
S10, acquiring, for the predicted intron retention set obtained in step S8, the 3' end sequence of W2+N2 bases consisting of W2 bases on the exon side and N2 bases on the intron side;
S11, calculating splice site strengths from the 5' end sequences of W1+N1 bases obtained in step S9 and the 3' end sequences of W2+N2 bases obtained in step S10, thereby obtaining a 5' end average splice site strength value and a 3' end average splice site strength value;
and S12, evaluating the neural network intron retention prediction model established in step S5 according to the 5' end and 3' end average splice site strength values obtained in step S11.
3. The intron retention prediction model establishing method according to claim 2, wherein collecting simulated data and real data related to intron retention in step S1 specifically comprises: generating with the BEER algorithm a simulated data sequence file SIMU30 containing a determined number of introns, with a sequencing depth of 30 million reads, a read length of 100 bases, 15000 generated genes and 69338 introns; and collecting a real data sequence file APP from the Tau and APP mouse model study of the Accelerating Medicines Partnership for Alzheimer's Disease (AMP-AD) project, with a sequencing depth of 100 million reads and a read length of 101 bases.
4. The intron retention prediction model establishing method according to claim 3, wherein merging all independent introns in the genome as the standard template in step S2 is defined by the following steps:
A. extracting the set of all independent introns, independent_intron, from the release-75 annotation GTF file of the GRCm38 mouse genome; an independent intron is defined as an intron that does not overlap with any exon of any transcript isoform;
B. merging, gene by gene, the introns of the independent intron set independent_intron obtained in step A whose coordinate intervals overlap, to obtain the final independent intron set intron_cluster.
5. The intron retention prediction model establishing method according to claim 4, wherein extracting the set of all independent introns independent_intron in step A specifically comprises merging all exons within a chromosome and then deleting all exons from the gene regions, thereby obtaining all independent introns.
6. The intron retention prediction model establishing method according to claim 5, wherein acquiring the image data set of the intron sequence read-distribution patterns in the simulated data obtained in step S1 and preprocessing it into the processed data set in step S3 comprises the following steps:
a. performing IGV visualization on each intron in the simulated data sequence file SIMU30 obtained in step S1 to obtain a primary visualization image;
b. saving two sequence visualization images of 40 bases each per intron, one spanning 20 bases to each side of the 5' end and one spanning 20 bases to each side of the 3' end; the height of each visualization image is 100 mm, and the height of the bar chart representing base read abundance is normalized;
c. cropping, from each image obtained in step b, the region spanning pixels 131-231 vertically and pixels 280-1070 horizontally;
d. merging the two images cropped in step c side by side to obtain the final processed data set.
7. The intron retention prediction model establishing method according to claim 6, wherein dividing the processed data set obtained in step S3 into the training set and the test set according to the set proportion in step S4 specifically comprises: in the simulated data sequence file SIMU30 obtained in step S1, defining an intron whose total read count is greater than a first set value, whose FPKM is greater than a second set value and whose continuous read count is greater than a third set value as a positive sample, and the remaining introns as negative samples; randomly drawing X2 positive samples and X2 negative samples to form the final data set; and dividing the data set into a training set and a test set according to the set proportion; X2 being a positive integer.
8. The method according to claim 7, wherein the neural network model of step S5 is a VGG16 network structure model.
9. The intron retention prediction model establishing method according to claim 8, wherein training the neural network model with the training set obtained in step S4 to obtain the finally established neural network intron retention prediction model in step S5 comprises the following steps:
(1) obtaining a VGG16 network structure model pre-trained on the ImageNet task together with its weight parameter file; the network structure comprises 13 convolutional layers;
(2) loading the network and weights obtained in step (1) as a pre-trained network, but freezing it so that it does not participate in training;
(3) defining a binary classification network and training it on the training set obtained in step S4; the classification network has 3 layers in total: the first 2 are fully connected layers with 256 and 64 neurons respectively, each followed by a Dropout layer to prevent overfitting, with neuron drop probabilities of 0.5 and 0.3 respectively, and the last layer is a sigmoid layer for binary classification;
(4) after the classification network is trained, unfreezing the last 3 convolutional layers of the pre-trained network and training the classification network and the pre-trained network together on the training set obtained in step S4 again to fine-tune the weights;
(5) the parameters of the model training process are set as follows:
the total number of model parameters is 33 million, of which 26 million are trainable and 7 million are frozen;
the loss function is the binary cross-entropy loss, calculated as
$L = -\frac{1}{N}\sum_{i=1}^{N}\left[t_i \log y_i + (1 - t_i)\log(1 - y_i)\right]$
where $i$ indexes the $N$ samples, $t_i$ is the true label of sample $i$, and $y_i$ is the predicted label of sample $i$;
the optimizer is RMSprop, and the learning rate is 2e-5The number of iterations is 30;
the evaluation index is the accuracy, calculated as:
$\mathrm{accuracy} = \frac{\mathrm{TruePositive} + \mathrm{TrueNegative}}{\mathrm{AllSamples}}$
where TruePositive is the number of samples predicted positive that are truly positive, TrueNegative is the number of samples predicted negative that are truly negative, and AllSamples is the total number of samples;
a ReduceLROnPlateau callback monitors the evaluation metric every 2 iterations and, if it has not improved, reduces the learning rate by 50%;
and if the evaluation accuracy does not improve within 10 iterations, training is stopped early.
10. The intron retention prediction model establishing method according to claim 9, wherein calculating the evaluation parameters of the neural network intron retention prediction model on the test set obtained in step S4 in step S6 specifically comprises calculating the AUC value of the model on that test set.
11. The intron retention prediction model establishing method according to claim 10, wherein acquiring the image test set of the intron sequence read-distribution patterns of the real data obtained in step S1 in step S7 specifically comprises: inputting the real data sequence file APP obtained in step S1 into the prediction tools iREAD and IRFinder, obtaining two intron retention prediction sets IR1 and IR2 respectively; mapping IR1 and IR2 onto the independent intron set intron_cluster by the rule of maximal coordinate-interval overlap and taking their intersection IC; and applying IGV visualization, image cropping and merging to each intron coordinate in the intersection IC, thereby obtaining the real-data intron sequence read-distribution image test set real_test.
12. The intron retention prediction model establishing method according to claim 11, wherein calculating the splice site strengths in step S11 from the 5' end sequences of W1+N1 bases obtained in step S9 and the 3' end sequences of W2+N2 bases obtained in step S10, to obtain the 5' end and 3' end average splice site strength values, specifically comprises: inputting the 5' end sequence set score5ss obtained in step S9 and the 3' end sequence set score3ss obtained in step S10 into the MaxEntScan model and scoring them with the maximum entropy model, thereby obtaining a strength value for each splice site; and averaging the splice site strengths of the 5' end sequences and of the 3' end sequences to obtain the final 5' end and 3' end average splice site strength values.
13. The intron retention prediction model establishing method according to claim 12, wherein the neural network intron retention prediction model established in step S5 is evaluated in step S12 according to the 5' end and 3' end average splice site strength values obtained in step S11; specifically, the smaller these two values are, the better the prediction effect of the model.
14. A prediction method comprising the method for establishing the intron retention prediction model according to any one of claims 1 to 13, and specifically comprising the steps of:
and S13, predicting the intron retention result by adopting the neural network intron retention prediction model obtained in the step S5.
CN202010146731.2A 2020-03-05 2020-03-05 Intron retention prediction model establishment method and prediction method thereof Active CN111370055B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010146731.2A CN111370055B (en) 2020-03-05 2020-03-05 Intron retention prediction model establishment method and prediction method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010146731.2A CN111370055B (en) 2020-03-05 2020-03-05 Intron retention prediction model establishment method and prediction method thereof

Publications (2)

Publication Number Publication Date
CN111370055A true CN111370055A (en) 2020-07-03
CN111370055B CN111370055B (en) 2023-05-23

Family

ID=71208615

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010146731.2A Active CN111370055B (en) 2020-03-05 2020-03-05 Intron retention prediction model establishment method and prediction method thereof

Country Status (1)

Country Link
CN (1) CN111370055B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20220082545A * 2020-12-10 2022-06-17 Chung-Ang University Industry-Academic Cooperation Foundation Method for diagnosing degenerative brain disease by detecting intron retention using transcriptome analysis
WO2023238973A1 * 2022-06-10 2023-12-14 Chung-Ang University Industry-Academic Cooperation Foundation Method for diagnosing degenerative brain disease by detecting intron retention using transcriptome analysis

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999066302A2 (en) * 1998-06-17 1999-12-23 Musc Foundation For Research Development Recognition of protein coding regions in genomic dna sequences
US20030077586A1 (en) * 2001-08-30 2003-04-24 Compaq Computer Corporation Method and apparatus for combining gene predictions using bayesian networks
WO2008097632A2 * 2007-02-08 2008-08-14 Jivan Biologics, Inc. Methods for determining splice variant types and amounts
US20120185172A1 (en) * 2011-01-18 2012-07-19 Barash Joseph Method, system and apparatus for data processing
CN105975809A * 2016-05-13 2016-09-28 Wankangyuan (Tianjin) Gene Technology Co., Ltd. SNV detection method affecting RNA splicing
CN107849547A * 2015-05-16 2018-03-27 Genzyme Corporation Gene editing of deep intronic mutations
CN110010201A * 2019-04-16 2019-07-12 Shandong Agricultural University RNA alternative splicing site recognition method and system
WO2019226804A1 (en) * 2018-05-23 2019-11-28 Envisagenics, Inc. Systems and methods for analysis of alternative splicing
CN110800062A * 2017-10-16 2020-02-14 Illumina, Inc. Deep convolutional neural network for variant classification

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999066302A2 (en) * 1998-06-17 1999-12-23 Musc Foundation For Research Development Recognition of protein coding regions in genomic dna sequences
US20030077586A1 (en) * 2001-08-30 2003-04-24 Compaq Computer Corporation Method and apparatus for combining gene predictions using bayesian networks
WO2008097632A2 * 2007-02-08 2008-08-14 Jivan Biologics, Inc. Methods for determining splice variant types and amounts
US20120185172A1 (en) * 2011-01-18 2012-07-19 Barash Joseph Method, system and apparatus for data processing
CN107849547A * 2015-05-16 2018-03-27 Genzyme Corporation Gene editing of deep intronic mutations
CN105975809A * 2016-05-13 2016-09-28 Wankangyuan (Tianjin) Gene Technology Co., Ltd. SNV detection method affecting RNA splicing
CN110800062A * 2017-10-16 2020-02-14 Illumina, Inc. Deep convolutional neural network for variant classification
WO2019226804A1 (en) * 2018-05-23 2019-11-28 Envisagenics, Inc. Systems and methods for analysis of alternative splicing
CN112912961A * 2018-05-23 2021-06-04 Envisagenics, Inc. Systems and methods for analysis of alternative splicing
CN110010201A * 2019-04-16 2019-07-12 Shandong Agricultural University RNA alternative splicing site recognition method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HONG-DONG LI et al.: "iREAD: a tool for intron retention detection from RNA-seq data" *
XING Yongqiang; ZHANG Lirong; LUO Liaofu; CHEN Wei: "Prediction of alternative splice sites for cassette exons and intron retention in the mouse genome" *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20220082545A * 2020-12-10 2022-06-17 Chung-Ang University Industry-Academic Cooperation Foundation Method for diagnosing degenerative brain disease by detecting intron retention using transcriptome analysis
KR102605084B1 2020-12-10 2023-11-24 Chung-Ang University Industry-Academic Cooperation Foundation Method for diagnosing degenerative brain disease by detecting intron retention using transcriptome analysis
WO2023238973A1 * 2022-06-10 2023-12-14 Chung-Ang University Industry-Academic Cooperation Foundation Method for diagnosing degenerative brain disease by detecting intron retention using transcriptome analysis

Also Published As

Publication number Publication date
CN111370055B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN110785814A (en) Predicting quality of sequencing results using deep neural networks
CN112232413B (en) High-dimensional data feature selection method based on graph neural network and spectral clustering
CN106909901A (en) The method and device of detection object from image
EP2320343A2 (en) System and process for validating, aligning and reordering one or more genetic sequence maps using at least one ordered restriction map
US20190180844A1 (en) Method for deep learning-based biomarker discovery with conversion data of genome sequences
CN112687327B (en) Cancer survival analysis system based on multitasking and multi-mode
US11461584B2 (en) Discrimination device and machine learning method
US20220277811A1 (en) Detecting False Positive Variant Calls In Next-Generation Sequencing
CN111370055A (en) Intron retention prediction model establishing method and prediction method thereof
JP2014505935A (en) DNA sequence data analysis method
CN110110663A (en) A kind of age recognition methods and system based on face character
CN113228191A (en) System and method for identifying chromosomal abnormalities in embryos
CN114822698B (en) Knowledge reasoning-based biological large sample data set analysis method and system
CN112669899A (en) 16S and metagenome sequencing data correlation analysis method, system and equipment
CN111180013B (en) Device for detecting blood disease fusion gene
EP4016533A1 (en) Method and apparatus for machine learning based identification of structural variants in cancer genomes
CN111045920B (en) Workload-aware multi-branch software change-level defect prediction method
CN116959585A (en) Deep learning-based whole genome prediction method
CN114446393B (en) Method, electronic device and computer storage medium for predicting liver cancer feature type
CN115831219A (en) Quality prediction method, device, equipment and storage medium
CN115167965A (en) Transaction progress bar processing method and device
CN113782092A (en) Method and device for generating life prediction model and storage medium
KR102072894B1 (en) Abnormal sequence identification method based on intron and exon
EP3971902A1 (en) Base mutation detection method and apparatus based on sequencing data, and storage medium
CN114705148B (en) Road bending point detection method and device based on secondary screening

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant