CN111370055B - Intron retention prediction model establishment method and prediction method thereof - Google Patents

Intron retention prediction model establishment method and prediction method thereof

Info

Publication number
CN111370055B
CN111370055B (application CN202010146731.2A)
Authority
CN
China
Prior art keywords
intron
prediction model
training
introns
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010146731.2A
Other languages
Chinese (zh)
Other versions
CN111370055A (en)
Inventor
李洪东
郑剑涛
林翠香
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN202010146731.2A
Publication of CN111370055A
Application granted
Publication of CN111370055B
Legal status: Active


Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00: ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a method for establishing an intron retention prediction model. The method comprises: collecting simulated data and real data related to intron retention; merging all independent introns in the genome and defining the result as a standard template; acquiring the intron sequence read distribution pattern picture data set from the simulated data and processing it to obtain a processed data set; dividing the processed data set into a training set and a test set according to a set proportion; and training a neural network model on the training set to obtain the final neural network intron retention prediction model. The invention also discloses a prediction method that incorporates this model establishment method. The method visualizes and predicts introns based on the distribution pattern of intron retention reads, and has high reliability and good accuracy.

Description

Intron retention prediction model establishment method and prediction method thereof
Technical Field
The invention specifically relates to a method for establishing an intron retention prediction model and a prediction method using the model.
Background
Intron retention is a form of alternative splicing in which an intron of the pre-mRNA is not spliced out and remains in the mature mRNA. Intron retention was long regarded as the result of mis-splicing and received little attention. Many recent studies have shown that intron retention is associated with the regulation of gene expression and with complex diseases (e.g., Alzheimer's disease). With the development of high-throughput sequencing technology, many methods for intron retention detection have been proposed, among which iREAD and IRFinder are representative. iREAD detects intron retention by computing an entropy value under the assumption that the reads of a retained intron are evenly distributed, and applies relatively stringent filtering criteria. IRFinder detects intron retention by computing the IR-ratio, the proportion of transcripts in which the intron is retained.
Although these methods have been applied successfully in real settings, analyses based on sequence features are more or less limited by the biases that intron retention may introduce, which leaves the methods insufficiently robust. The reliability of current methods is therefore not high, and this restricts the development of the related technology.
Disclosure of Invention
The invention aims to provide a method for establishing an intron retention prediction model with high reliability and good accuracy.
The second object of the invention is to provide a prediction method that incorporates the intron retention prediction model establishment method.
The method for establishing the intron retention prediction model provided by the invention comprises the following steps:
S1, collecting simulated data and real data related to intron retention;
S2, merging all independent introns in the genome and defining the result as a standard template;
S3, acquiring the intron sequence read distribution pattern picture data set from the simulated data obtained in step S1, and preprocessing it to obtain a processed data set;
S4, dividing the processed data set obtained in step S3 into a training set and a test set according to a set proportion;
S5, training the neural network model with the training set obtained in step S4 to obtain the final neural network intron retention prediction model.
The method for establishing the intron retention prediction model further comprises the following steps:
S6, calculating evaluation parameters of the neural network intron retention prediction model obtained in step S5 on the test set obtained in step S4;
S7, acquiring the intron sequence read distribution pattern picture test set of the real data obtained in step S1;
S8, predicting intron retention results on the test set obtained in step S7 with the neural network intron retention prediction model obtained in step S5, thereby obtaining a predicted intron retention set;
S9, for each intron coordinate in the predicted intron retention set obtained in step S8, acquiring the 5' end sequence consisting of W1 bases on the exon side and N1 bases on the intron side of the 5' splice site, W1+N1 bases in total;
S10, for each intron coordinate in the predicted intron retention set obtained in step S8, acquiring the 3' end sequence consisting of W2 bases on the exon side and N2 bases on the intron side of the 3' splice site, W2+N2 bases in total;
S11, calculating splice site strengths from the 5' end sequences of W1+N1 bases obtained in step S9 and the 3' end sequences of W2+N2 bases obtained in step S10, thereby obtaining a 5' end average splice site strength value and a 3' end average splice site strength value;
S12, evaluating the neural network intron retention prediction model established in step S5 according to the 5' end average splice site strength value and the 3' end average splice site strength value obtained in step S11.
Step S1, collecting simulated data and real data related to intron retention, is specifically: generating, with the BEER algorithm, a simulated data sequence file SIMU30 containing a determined number of introns, where SIMU30 has a sequencing depth of 30 million reads and a read length of 100 bases and is set to generate 15000 genes and 69338 introns; and collecting a real data sequence file APP from the Tau and APP mouse model study of the Accelerating Medicines Partnership for Alzheimer's Disease (AMP-AD), with a sequencing depth of 100 million reads and a read length of 101 bases.
Step S2, merging all independent introns in the genome and defining the result as a standard template, is specifically performed by the following steps:
A. extracting the set of all independent introns, independent_introns, from the release-75 annotation gtf file of the GRCm38 mouse genome; independent introns are defined as introns that do not overlap with any homotypic exon;
B. in the independent intron set independent_introns obtained in step A, merging introns whose coordinate intervals overlap, gene by gene, to obtain the final independent intron set.
The extraction of all independent introns in step A is specifically performed by merging all exons within a chromosome and then deleting all exons from the gene regions, leaving all independent introns.
Step S3, acquiring the intron sequence read distribution pattern picture data set from the simulated data obtained in step S1 and preprocessing it to obtain a processed data set, specifically acquires and preprocesses the data set by the following steps:
a. performing IGV (Integrative Genomics Viewer) visualization on each intron in the simulated data sequence file SIMU30 obtained in step S1 to obtain a preliminary visualization image;
b. saving, for the 5' end and the 3' end of each intron, a visualization image of the sequence covering 20 bases on each side of the splice site, i.e. two images of 40 bases each; the height of the visualization image is 100 mm, and the heights of the bar charts representing base abundance are standardized;
c. cropping, from each image obtained in step b, the region spanning pixels 131-231 vertically and pixels 280-1070 horizontally;
d. horizontally merging the cropped images obtained in step c to obtain the final processed data set.
Step S4, dividing the processed data set obtained in step S3 into a training set and a test set according to a set proportion, is specifically: in the simulated data sequence file SIMU30 obtained in step S1, defining introns whose total sequence reads exceed a first set value, whose FPKM (Fragments Per Kilobase of transcript per Million mapped reads) exceeds a second set value, and whose continuous reads exceed a third set value as positive samples, with the remaining introns as negative samples; then randomly drawing X2 positive samples and X2 negative samples to form the final data set; and dividing the data set into a training set and a test set according to a set proportion; X2 is a positive integer.
The neural network model in step S5 is specifically a VGG16 network structure model.
Step S5, training the neural network model with the training set obtained in step S4 to obtain the final neural network intron retention prediction model, specifically trains the model by the following steps:
(1) Obtaining a VGG16 network structure model trained on the ImageNet task together with its weight parameter file; the network structure model comprises 13 convolutional layers;
(2) Loading the network and weights obtained in step (1) as a pre-trained network, and freezing it so that it does not participate in training;
(3) Defining a binary classification network and training it on the training set obtained in step S4; the classification network has 3 layers: the first 2 are fully connected layers with 256 and 64 neurons respectively, each followed by a Dropout layer to prevent overfitting, with neuron drop probabilities of 0.5 and 0.3 respectively; the last layer is a sigmoid layer used for binary classification;
(4) After the classification network is trained, unfreezing the last 3 convolutional layers of the pre-trained network, training the classification network and the pre-trained network together on the training set obtained in step S4 again, and fine-tuning the weights;
(5) The parameters of the model training process are set as follows:
the total number of model parameters is about 33 million, of which about 26 million are trainable and about 7 million are non-trainable;
the loss function is the binary cross-entropy loss, calculated as
$$\mathrm{loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[t_i \log y_i + (1 - t_i)\log(1 - y_i)\right]$$
where i indexes the samples, N is the number of samples, $t_i$ is the true label of sample i, and $y_i$ is the predicted label of sample i;
the optimizer is RMSprop, the learning rate is 2e-5, and the number of iterations is 30;
the evaluation metric is accuracy, calculated as
$$\mathrm{accuracy} = \frac{\mathrm{TruePositive} + \mathrm{TrueNegative}}{\mathrm{TotalSamples}}$$
where TruePositive is the number of samples predicted positive and truly positive, TrueNegative is the number of samples predicted negative and truly negative, and TotalSamples is the total number of samples;
a ReduceLROnPlateau callback is set to check the monitored metric every 2 iterations; if it has not improved, the learning rate is reduced by 50%;
if the evaluation metric accuracy does not improve within 10 iterations, training is stopped early.
Step S6 calculates evaluation parameters of the neural network intron retention prediction model on the test set obtained in step S4, specifically the AUC value of the model on that test set.
Step S7, acquiring the intron sequence read distribution pattern picture test set of the real data obtained in step S1, is specifically: inputting the real data sequence file APP obtained in step S1 into the prediction tools iREAD and IRFinder to obtain two intron retention prediction sets IR1 and IR2 respectively; mapping IR1 and IR2 onto the independent intron set intron_cluster by the rule of largest overlap of matched coordinate intervals, and taking the intersection of the two to obtain the intersection set IC; and then performing IGV visualization, picture cropping, and merging on each intron coordinate in IC, thereby obtaining the intron sequence read distribution pattern picture test set real_test of the real data.
Step S11, calculating splice site strengths from the 5' end sequences of W1+N1 bases obtained in step S9 and the 3' end sequences of W2+N2 bases obtained in step S10 to obtain a 5' end average splice site strength value and a 3' end average splice site strength value, is specifically: inputting the 5' end sequence set score5ss obtained in step S9 and the 3' end sequence set score3ss obtained in step S10 into the MaxEntScan model and scoring with the maximum entropy model, thereby obtaining the strength value of each given splice site; and then averaging the splice site strengths of the 5' end sequences and of the 3' end sequences to obtain the final 5' end average splice site strength value and 3' end average splice site strength value.
Step S12 evaluates the neural network intron retention prediction model established in step S5 according to the 5' end and 3' end average splice site strength values obtained in step S11: the smaller the model's 5' end and 3' end average splice site strength values, the better its prediction effect.
The invention also provides a prediction method that incorporates the intron retention prediction model establishment method, specifically comprising the following step:
S13, predicting intron retention results with the neural network intron retention prediction model obtained in step S5.
The intron retention prediction model establishment method and prediction method provided by the invention adopt a deep-learning approach based on the intron retention read distribution pattern, which makes intron retention prediction more general and easier. Building on the read distribution pattern, the method applies transfer learning to a deep learning model, transferring knowledge learned on a large-scale image classification task to complete and improve the intron retention prediction task. In addition, prediction performance is evaluated on a real data set without a gold standard by computing the average splice site strength over the 5' and 3' end sequences of the predicted retained introns, which measures the overall prediction effect. The method can therefore visualize and predict introns based on the distribution pattern of intron retention reads, with high reliability and good accuracy.
Drawings
FIG. 1 is a flow chart of the intron retention prediction model establishment method of the invention.
FIG. 2 shows the visualization of the intron retention read distribution pattern of the invention.
FIG. 3 is a schematic diagram of the structure of the deep learning model VGG16 used by the invention.
FIG. 4 is a flow chart of the prediction method of the invention.
Detailed Description
FIG. 1 shows the flow of the intron retention prediction model establishment method of the invention, which comprises the following steps:
S1, collecting simulated data and real data related to intron retention; specifically, the BEER algorithm is used to generate a simulated data sequence file SIMU30 containing a determined number of introns; SIMU30 has a sequencing depth of 30 million reads and a read length of 100 bases, and is set to generate 15000 genes and 69338 introns; the real data sequence file APP comes from the Tau and APP mouse model study of the Accelerating Medicines Partnership for Alzheimer's Disease (AMP-AD), with a sequencing depth of 100 million reads and a read length of 101 bases;
S2, merging all independent introns in the genome and defining the result as a standard template; the invention is applied in particular to mice, so the genome may be the mouse genome; this step is specifically performed as follows (a sketch of the interval arithmetic is given after step B):
A. extracting the set of all independent introns, independent_introns, from the release-75 annotation gtf file of the GRCm38 mouse genome; independent introns are defined as introns that do not overlap with any homotypic exon;
wherein all independent introns are extracted by merging all exons within a chromosome and then deleting all exons from the gene regions, leaving all independent introns;
B. in the independent intron set independent_introns obtained in step A, merging introns whose coordinate intervals overlap, gene by gene, to obtain the final independent intron set intron_cluster;
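The following minimal sketch illustrates this interval arithmetic. It assumes exons and gene regions are available as (start, end) coordinate tuples; the function and variable names are illustrative, not taken from the patent.

```python
def merge_intervals(intervals):
    """Merge overlapping coordinate intervals (used for exons and for step B)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def independent_introns(gene_region, exons):
    """Delete all (merged) exons from a gene region; the gaps that remain
    are introns overlapping no exon, i.e. independent introns."""
    gene_start, gene_end = gene_region
    introns, cursor = [], gene_start
    for ex_start, ex_end in merge_intervals(exons):
        if ex_start > cursor:
            introns.append((cursor, ex_start))
        cursor = max(cursor, ex_end)
    if cursor < gene_end:
        introns.append((cursor, gene_end))
    return introns

# Step B: per gene, merge introns whose coordinate intervals overlap.
intron_cluster = merge_intervals(
    independent_introns((1000, 9000), [(1000, 1200), (2500, 2700), (8600, 9000)])
)
```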
S3, acquiring the intron sequence read distribution pattern picture data set from the simulated data obtained in step S1, and preprocessing it to obtain a processed data set; the data set is acquired and preprocessed by the following steps (an image-processing sketch follows the list):
a. performing IGV (Integrative Genomics Viewer) visualization on each intron in the simulated data sequence file SIMU30 obtained in step S1 to obtain a preliminary visualization image;
b. because intron lengths vary enormously, saving, for the 5' end and the 3' end of each intron, a visualization image of the sequence covering 20 bases on each side of the splice site, i.e. two images of 40 bases each; the height of the visualization image is 100 mm, and the heights of the bar charts representing base abundance are standardized;
c. since the visualization image of a single-end sequence is originally 621 pixels high and 1150 pixels wide, cropping from each image the region spanning pixels 131-231 vertically and pixels 280-1070 horizontally;
d. horizontally merging the cropped images obtained in step c to obtain the final processed data set; the visualization result is shown in FIG. 2;
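A hedged sketch of steps c and d, using the Pillow imaging library (an implementation assumption; the pixel ranges follow the text, while the file names are hypothetical):

```python
from PIL import Image

def crop_panel(path):
    """Crop the read-distribution region from one single-end IGV snapshot
    (originally 1150 x 621 pixels): rows 131-231, columns 280-1070."""
    return Image.open(path).crop((280, 131, 1070, 231))  # (left, upper, right, lower)

def merge_panels(path_5p, path_3p, out_path):
    """Place the 5' end crop and the 3' end crop side by side."""
    left, right = crop_panel(path_5p), crop_panel(path_3p)
    canvas = Image.new("RGB", (left.width + right.width, left.height), "white")
    canvas.paste(left, (0, 0))
    canvas.paste(right, (left.width, 0))
    canvas.save(out_path)

merge_panels("intron_5p.png", "intron_3p.png", "intron_merged.png")
```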
S4, dividing the processed data set obtained in step S3 into a training set and a test set according to a set proportion; specifically, in the simulated data sequence file SIMU30 obtained in step S1, introns whose total sequence reads exceed a first set value (e.g., 10), whose FPKM (Fragments Per Kilobase of transcript per Million mapped reads) exceeds a second set value (e.g., 0.3), and whose continuous reads exceed a third set value (e.g., 1) are defined as positive samples, and the remaining introns are negative samples; X2 (e.g., 5000) positive samples and X2 negative samples are then drawn at random to form the final data set, which is divided into a training set and a test set according to a set proportion (e.g., 7:3); X2 is a positive integer. A labeling-and-split sketch follows.
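The labeling and 7:3 split can be sketched as follows, assuming the per-intron statistics have been tabulated in a pandas DataFrame (the column names are illustrative); the thresholds and X2 = 5000 are the example values from the text:

```python
import pandas as pd

def build_dataset(introns: pd.DataFrame, x2: int = 5000, seed: int = 0):
    """Label introns as positive/negative, balance the classes, split 7:3."""
    positive = (
        (introns["total_reads"] > 10)
        & (introns["fpkm"] > 0.3)
        & (introns["continuous_reads"] > 1)
    )
    pos = introns[positive].sample(n=x2, random_state=seed).assign(label=1)
    neg = introns[~positive].sample(n=x2, random_state=seed).assign(label=0)
    data = pd.concat([pos, neg]).sample(frac=1, random_state=seed)  # shuffle
    n_train = int(len(data) * 0.7)
    return data.iloc[:n_train], data.iloc[n_train:]  # training set, test set
```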
S5, training the neural network model with the training set obtained in step S4 to obtain the final neural network intron retention prediction model; in practice the prediction model is preferably the VGG16 model, and when VGG16 is selected the model can be trained by the following steps (summarized in the code sketch after the parameter list):
(1) Obtaining a VGG16 network structure model trained on the ImageNet task (shown in FIG. 3) together with its weight parameter file; the network structure model comprises 13 convolutional layers;
(2) Loading the network and weights obtained in step (1) as a pre-trained network, and freezing it so that it does not participate in training;
(3) Defining a binary classification network and training it on the training set obtained in step S4; the classification network has 3 layers: the first 2 are fully connected layers with 256 and 64 neurons respectively, each followed by a Dropout layer to prevent overfitting, with neuron drop probabilities of 0.5 and 0.3 respectively; the last layer is a sigmoid layer used for binary classification;
(4) After the classification network is trained, unfreezing the last 3 convolutional layers of the pre-trained network, training the classification network and the pre-trained network together on the training set obtained in step S4 again, and fine-tuning the weights;
(5) The parameters of the model training process are set as follows:
the total number of model parameters is about 33 million, of which about 26 million are trainable and about 7 million are non-trainable;
the loss function is the binary cross-entropy loss, calculated as
$$\mathrm{loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[t_i \log y_i + (1 - t_i)\log(1 - y_i)\right]$$
where i indexes the samples, N is the number of samples, $t_i$ is the true label of sample i, and $y_i$ is the predicted label of sample i;
the optimizer is RMSprop, the learning rate is 2e-5, and the number of iterations is 30;
the evaluation metric is accuracy, calculated as
$$\mathrm{accuracy} = \frac{\mathrm{TruePositive} + \mathrm{TrueNegative}}{\mathrm{TotalSamples}}$$
where TruePositive is the number of samples predicted positive and truly positive, TrueNegative is the number of samples predicted negative and truly negative, and TotalSamples is the total number of samples;
a ReduceLROnPlateau callback is set to check the monitored metric every 2 iterations; if it has not improved, the learning rate is reduced by 50%;
if the evaluation metric accuracy does not improve within 10 iterations, training is stopped early.
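A minimal Keras sketch of this two-stage transfer-learning scheme (freeze VGG16, train the classifier head, then unfreeze the final convolutional block and fine-tune). The layer sizes, dropout rates, RMSprop learning rate, and callbacks follow the text; the input size and data pipeline are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                input_shape=(224, 224, 3))
base.trainable = False  # stage 1: the pre-trained network is frozen

model = keras.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),  # binary classification
])
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=2e-5),
              loss="binary_crossentropy", metrics=["accuracy"])

callbacks = [
    keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2),
    keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=10),
]
# model.fit(train_images, train_labels, epochs=30,
#           validation_data=(val_images, val_labels), callbacks=callbacks)

# Stage 2: unfreeze the last three conv layers (block5_conv1..3; the trailing
# pooling layer has no weights) and fine-tune the whole stack jointly.
base.trainable = True
for layer in base.layers[:-4]:
    layer.trainable = False
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=2e-5),
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(... same training set ...)
```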
S6, calculating evaluation parameters (preferably the AUC value) of the neural network intron retention prediction model obtained in step S5 on the test set obtained in step S4 (see the short sketch below);
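Continuing from the training sketch above, the AUC can be computed with scikit-learn, assuming test_images and test_labels are the held-out arrays from step S4:

```python
from sklearn.metrics import roc_auc_score

probs = model.predict(test_images).ravel()  # sigmoid outputs in [0, 1]
auc = roc_auc_score(test_labels, probs)
```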
S7, acquiring the intron sequence read distribution pattern picture test set of the real data obtained in step S1; specifically, the real data sequence file APP obtained in step S1 is input into the prediction tools iREAD and IRFinder to obtain two intron retention prediction sets IR1 and IR2 respectively; IR1 and IR2 are mapped onto the independent intron set intron_cluster by the rule of largest overlap of matched coordinate intervals, and the intersection of the two is taken to obtain the intersection set IC; IGV visualization, picture cropping, and merging are then performed on each intron coordinate in IC, yielding the intron sequence read distribution pattern picture test set real_test of the real data (a sketch of the mapping-and-intersection step follows);
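The mapping-and-intersection step might look as follows, assuming the iREAD and IRFinder outputs have been parsed into (chrom, start, end) tuples; all names are illustrative:

```python
def overlap(a, b):
    """Length of the coordinate overlap between two (chrom, start, end) intervals."""
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))

def map_to_cluster(predictions, intron_cluster):
    """Map each prediction to the intron_cluster entry with the largest overlap."""
    mapped = set()
    for pred in predictions:
        hits = [c for c in intron_cluster
                if c[0] == pred[0] and overlap(pred, c) > 0]
        if hits:
            mapped.add(max(hits, key=lambda c: overlap(pred, c)))
    return mapped

# Toy inputs; in practice these come from parsing the iREAD / IRFinder outputs.
ir1 = [("chr1", 100, 200)]
ir2 = [("chr1", 120, 210)]
intron_cluster = [("chr1", 90, 220), ("chr1", 400, 500)]
ic = map_to_cluster(ir1, intron_cluster) & map_to_cluster(ir2, intron_cluster)
```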
S8, predicting intron retention results on the test set obtained in step S7 with the neural network intron retention prediction model obtained in step S5, thereby obtaining a predicted intron retention set;
S9, for each intron coordinate in the predicted intron retention set obtained in step S8, acquiring the 5' end sequence consisting of W1 bases on the exon side and N1 bases on the intron side of the 5' splice site, W1+N1 bases in total;
S10, for each intron coordinate in the predicted intron retention set obtained in step S8, acquiring the 3' end sequence consisting of W2 bases on the exon side and N2 bases on the intron side of the 3' splice site, W2+N2 bases in total;
S11, calculating splice site strengths from the 5' end sequences of W1+N1 bases obtained in step S9 and the 3' end sequences of W2+N2 bases obtained in step S10, thereby obtaining a 5' end average splice site strength value and a 3' end average splice site strength value; the 5' end sequence set score5ss obtained in step S9 and the 3' end sequence set score3ss obtained in step S10 are input into the MaxEntScan model and scored with the maximum entropy model, giving the strength value of each splice site; the splice site strengths of the 5' end sequences and of the 3' end sequences are then averaged, yielding the final 5' end average splice site strength value and 3' end average splice site strength value (see the averaging sketch below);
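Only the averaging is shown concretely below; score5 and score3 stand for hypothetical wrappers around MaxEntScan's maximum-entropy scoring scripts (score5.pl and score3.pl), which is an assumption about how the tool is invoked:

```python
def mean_splice_strength(seqs, score_fn):
    """Average splice-site strength over a set of splice-site sequences."""
    scores = [score_fn(s) for s in seqs]
    return sum(scores) / len(scores)

# Assumed inputs: score5ss_seqs / score3ss_seqs are the sequence sets from
# steps S9 / S10, and score5 / score3 the hypothetical MaxEntScan wrappers.
# avg_5p = mean_splice_strength(score5ss_seqs, score5)
# avg_3p = mean_splice_strength(score3ss_seqs, score3)
```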
S12, evaluating the neural network intron retention prediction model established in step S5 according to the 5' end and 3' end average splice site strength values obtained in step S11; specifically, the smaller the model's 5' end and 3' end average splice site strength values, the better its prediction effect.
The method of the invention is verified as follows:
the invention is evaluated on the simulation data SIMU30 and the real data set APP, while the tools compared to the invention are iREAD and irFinder.
1) Experimental analysis on the SIMU30 simulated data set
On the 3000 test-set samples of the SIMU30 simulated data, the prediction accuracy of the invention reaches 0.925 and the AUC reaches 0.975.
2) Experimental analysis on the APP real data set
Because the real data lacks a gold standard, the prediction labels of the other methods can on the one hand be used as true labels to test how the AUC of the VGG16 model differs from the other methods; on the other hand, custom evaluation metrics can be defined to verify the effectiveness of the invention. For the AUC evaluation, Table 1 compares the VGG16 model of the invention with iREAD and IRFinder on the real data picture test set real_test, which contains 68326 samples. With iREAD as the gold standard (2816 positive samples, 65510 negative samples), the AUC of the VGG16 model is superior to IRFinder. With IRFinder as the gold standard (19044 positive samples, 49282 negative samples), the invention is likewise superior to iREAD.
TABLE 1. AUC evaluation results of the invention compared with iREAD and IRFinder
(The table is reproduced as an image in the original document.)
In addition, the invention defines the 5' end and 3' end splice site strengths to measure the prediction effect of the VGG16 model: the lower the average splice site strength, the better the model's overall prediction effect. The average splice site strength evaluation results are shown in Table 2.
TABLE 2. Average splice site strength evaluation results of the invention compared with iREAD and IRFinder
(The table is reproduced as an image in the original document.)
The results in Table 2 show that, although the invention scores slightly worse than IRFinder and iREAD in average splice site strength, as the number of introns involved in the calculation increases, the average splice site strength of IRFinder and iREAD rises while that of the invention falls. The VGG16 model designed by the invention is therefore more robust than IRFinder and iREAD.
FIG. 4 shows the flow of the prediction method of the invention. The prediction method incorporating the intron retention prediction model establishment method comprises the following steps:
S1, collecting simulated data and real data related to intron retention; specifically, the BEER algorithm is used to generate a simulated data sequence file SIMU30 containing a determined number of introns; SIMU30 has a sequencing depth of 30 million reads and a read length of 100 bases, and is set to generate 15000 genes and 69338 introns; the real data sequence file APP comes from the Tau and APP mouse model study of the Accelerating Medicines Partnership for Alzheimer's Disease (AMP-AD), with a sequencing depth of 100 million reads and a read length of 101 bases;
S2, merging all independent introns in the genome and defining the result as a standard template; this step is specifically performed as follows:
A. extracting the set of all independent introns, independent_introns, from the release-75 annotation gtf file of the GRCm38 mouse genome; independent introns are defined as introns that do not overlap with any homotypic exon;
wherein all independent introns are extracted by merging all exons within a chromosome and then deleting all exons from the gene regions, leaving all independent introns;
B. in the independent intron set independent_introns obtained in step A, merging introns whose coordinate intervals overlap, gene by gene, to obtain the final independent intron set intron_cluster;
S3, acquiring the intron sequence read distribution pattern picture data set from the simulated data obtained in step S1, and preprocessing it to obtain a processed data set; the data set is acquired and preprocessed by the following steps:
a. performing IGV (Integrative Genomics Viewer) visualization on each intron in the simulated data sequence file SIMU30 obtained in step S1 to obtain a preliminary visualization image;
b. because intron lengths vary enormously, saving, for the 5' end and the 3' end of each intron, a visualization image of the sequence covering 20 bases on each side of the splice site, i.e. two images of 40 bases each; the height of the visualization image is 100 mm, and the heights of the bar charts representing base abundance are standardized;
c. since the visualization image of a single-end sequence is originally 621 pixels high and 1150 pixels wide, cropping from each image the region spanning pixels 131-231 vertically and pixels 280-1070 horizontally;
d. horizontally merging the cropped images obtained in step c to obtain the final processed data set; the visualization result is shown in FIG. 2;
S4, dividing the processed data set obtained in step S3 into a training set and a test set according to a set proportion; specifically, in the simulated data sequence file SIMU30 obtained in step S1, introns whose total sequence reads exceed a first set value (e.g., 10), whose FPKM (Fragments Per Kilobase of transcript per Million mapped reads) exceeds a second set value (e.g., 0.3), and whose continuous reads exceed a third set value (e.g., 1) are defined as positive samples, and the remaining introns are negative samples; X2 (e.g., 5000) positive samples and X2 negative samples are then drawn at random to form the final data set, which is divided into a training set and a test set according to a set proportion (e.g., 7:3); X2 is a positive integer.
S5, training the neural network model with the training set obtained in step S4 to obtain the final neural network intron retention prediction model; in practice the prediction model is preferably the VGG16 model, and when VGG16 is selected the model can be trained by the following steps:
(1) Obtaining a VGG16 network structure model trained on the ImageNet task (shown in FIG. 3) together with its weight parameter file; the network structure model comprises 13 convolutional layers;
(2) Loading the network and weights obtained in step (1) as a pre-trained network, and freezing it so that it does not participate in training;
(3) Defining a binary classification network and training it on the training set obtained in step S4; the classification network has 3 layers: the first 2 are fully connected layers with 256 and 64 neurons respectively, each followed by a Dropout layer to prevent overfitting, with neuron drop probabilities of 0.5 and 0.3 respectively; the last layer is a sigmoid layer used for binary classification;
(4) After the classification network is trained, unfreezing the last 3 convolutional layers of the pre-trained network, training the classification network and the pre-trained network together on the training set obtained in step S4 again, and fine-tuning the weights;
(5) The parameters of the model training process are set as follows:
the total number of model parameters is about 33 million, of which about 26 million are trainable and about 7 million are non-trainable;
the loss function is the binary cross-entropy loss, calculated as
$$\mathrm{loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[t_i \log y_i + (1 - t_i)\log(1 - y_i)\right]$$
where i indexes the samples, N is the number of samples, $t_i$ is the true label of sample i, and $y_i$ is the predicted label of sample i;
the optimizer is RMSprop, the learning rate is 2e-5, and the number of iterations is 30;
the evaluation metric is accuracy, calculated as
$$\mathrm{accuracy} = \frac{\mathrm{TruePositive} + \mathrm{TrueNegative}}{\mathrm{TotalSamples}}$$
where TruePositive is the number of samples predicted positive and truly positive, TrueNegative is the number of samples predicted negative and truly negative, and TotalSamples is the total number of samples;
a ReduceLROnPlateau callback is set to check the monitored metric every 2 iterations; if it has not improved, the learning rate is reduced by 50%;
if the evaluation metric accuracy does not improve within 10 iterations, training is stopped early.
S6, calculating evaluation parameters (preferably the AUC value) of the neural network intron retention prediction model obtained in step S5 on the test set obtained in step S4;
S7, acquiring the intron sequence read distribution pattern picture test set of the real data obtained in step S1; specifically, the real data sequence file APP obtained in step S1 is input into the prediction tools iREAD and IRFinder to obtain two intron retention prediction sets IR1 and IR2 respectively; IR1 and IR2 are mapped onto the independent intron set intron_cluster by the rule of largest overlap of matched coordinate intervals, and the intersection of the two is taken to obtain the intersection set IC; IGV visualization, picture cropping, and merging are then performed on each intron coordinate in IC, yielding the intron sequence read distribution pattern picture test set real_test of the real data;
S8, predicting intron retention results on the test set obtained in step S7 with the neural network intron retention prediction model obtained in step S5, thereby obtaining a predicted intron retention set;
S9, for each intron coordinate in the predicted intron retention set obtained in step S8, acquiring the 5' end sequence consisting of W1 bases on the exon side and N1 bases on the intron side of the 5' splice site, W1+N1 bases in total;
S10, for each intron coordinate in the predicted intron retention set obtained in step S8, acquiring the 3' end sequence consisting of W2 bases on the exon side and N2 bases on the intron side of the 3' splice site, W2+N2 bases in total;
S11, calculating splice site strengths from the 5' end sequences of W1+N1 bases obtained in step S9 and the 3' end sequences of W2+N2 bases obtained in step S10, thereby obtaining a 5' end average splice site strength value and a 3' end average splice site strength value; the 5' end sequence set score5ss obtained in step S9 and the 3' end sequence set score3ss obtained in step S10 are input into the MaxEntScan model and scored with the maximum entropy model, giving the strength value of each splice site; the splice site strengths of the 5' end sequences and of the 3' end sequences are then averaged, yielding the final 5' end average splice site strength value and 3' end average splice site strength value;
S12, evaluating the neural network intron retention prediction model established in step S5 according to the 5' end and 3' end average splice site strength values obtained in step S11; specifically, the smaller the model's 5' end and 3' end average splice site strength values, the better its prediction effect;
S13, predicting intron retention results with the neural network intron retention prediction model obtained in step S5.

Claims (12)

1. An intron retention prediction model establishment method, comprising the following steps:
S1, collecting simulated data and real data related to intron retention; specifically, a BEER algorithm is used to generate a simulated data sequence file SIMU30 containing a determined number of introns; SIMU30 has a sequencing depth of 30 million reads and a read length of 100 bases, and is set to generate 15000 genes and 69338 introns; the real data sequence file APP comes from the Tau and APP mouse model study of the Accelerating Medicines Partnership for Alzheimer's Disease (AMP-AD), with a sequencing depth of 100 million reads and a read length of 101 bases;
S2, merging all independent introns in the genome and defining the result as a standard template;
S3, acquiring the intron sequence read distribution pattern picture data set from the simulated data obtained in step S1, and preprocessing it to obtain a processed data set; the data set is acquired and preprocessed by the following steps:
a. performing IGV (Integrative Genomics Viewer) visualization on each intron in the simulated data sequence file SIMU30 obtained in step S1 to obtain a preliminary visualization image;
b. saving, for the 5' end and the 3' end of each intron, a visualization image of the sequence covering 20 bases on each side of the splice site, i.e. two images of 40 bases each; the height of the visualization image is 100 mm, and the heights of the bar charts representing base abundance are standardized;
c. cropping, from each image obtained in step b, the region spanning pixels 131-231 vertically and pixels 280-1070 horizontally;
d. horizontally merging the cropped images obtained in step c to obtain the final processed data set;
S4, dividing the processed data set obtained in step S3 into a training set and a test set according to a set proportion;
S5, training the neural network model with the training set obtained in step S4 to obtain the final neural network intron retention prediction model.
2. The intron retention prediction model establishment method according to claim 1, further comprising the following steps:
S6, calculating evaluation parameters of the neural network intron retention prediction model obtained in step S5 on the test set obtained in step S4;
S7, acquiring the intron sequence read distribution pattern picture test set of the real data obtained in step S1;
S8, predicting intron retention results on the test set obtained in step S7 with the neural network intron retention prediction model obtained in step S5, thereby obtaining a predicted intron retention set;
S9, for each intron coordinate in the predicted intron retention set obtained in step S8, acquiring the 5' end sequence consisting of W1 bases on the exon side and N1 bases on the intron side of the 5' splice site, W1+N1 bases in total;
S10, for each intron coordinate in the predicted intron retention set obtained in step S8, acquiring the 3' end sequence consisting of W2 bases on the exon side and N2 bases on the intron side of the 3' splice site, W2+N2 bases in total;
S11, calculating splice site strengths from the 5' end sequences of W1+N1 bases obtained in step S9 and the 3' end sequences of W2+N2 bases obtained in step S10, thereby obtaining a 5' end average splice site strength value and a 3' end average splice site strength value;
S12, evaluating the neural network intron retention prediction model established in step S5 according to the 5' end average splice site strength value and the 3' end average splice site strength value obtained in step S11.
3. The intron retention prediction model establishment method according to claim 2, wherein step S2, merging all independent introns in the genome and defining the result as a standard template, is specifically performed by the following steps:
A. extracting the set of all independent introns, independent_introns, from the release-75 annotation gtf file of the GRCm38 mouse genome; independent introns are defined as introns that do not overlap with any homotypic exon;
B. in the independent intron set independent_introns obtained in step A, merging introns whose coordinate intervals overlap, gene by gene, to obtain the final independent intron set.
4. The method according to claim 3, wherein step A extracts all independent introns, independent_introns, specifically by merging all exons within a chromosome and then deleting all exons from the gene regions, thereby obtaining all independent introns.
5. The intron retention prediction model establishment method according to claim 4, wherein in step S4 the processed data set obtained in step S3 is divided into a training set and a test set according to a set proportion; specifically, in the simulated data sequence file SIMU30 obtained in step S1, introns whose total sequence reads exceed a first set value, whose FPKM exceeds a second set value, and whose continuous reads exceed a third set value are defined as positive samples, and the remaining introns are negative samples; X2 positive samples and X2 negative samples are then drawn at random to form the final data set, which is divided into a training set and a test set according to a set proportion; X2 is a positive integer.
6. The method of claim 5, wherein the neural network model in step S5 is a VGG16 network structure model.
7. The intron retention prediction model establishment method according to claim 6, wherein in step S5 the training set obtained in step S4 is used to train the neural network model to obtain the final neural network intron retention prediction model, specifically by the following steps:
(1) Obtaining a VGG16 network structure model trained on the ImageNet task together with its weight parameter file; the network structure model comprises 13 convolutional layers;
(2) Loading the network and weights obtained in step (1) as a pre-trained network, and freezing it so that it does not participate in training;
(3) Defining a binary classification network and training it on the training set obtained in step S4; the classification network has 3 layers: the first 2 are fully connected layers with 256 and 64 neurons respectively, each followed by a Dropout layer to prevent overfitting, with neuron drop probabilities of 0.5 and 0.3 respectively; the last layer is a sigmoid layer used for binary classification;
(4) After the classification network is trained, unfreezing the last 3 convolutional layers of the pre-trained network, training the classification network and the pre-trained network together on the training set obtained in step S4 again, and fine-tuning the weights;
(5) The parameters of the model training process are set as follows:
the total number of model parameters is about 33 million, of which about 26 million are trainable and about 7 million are non-trainable;
the loss function is the binary cross-entropy loss, calculated as
$$\mathrm{loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[t_i \log y_i + (1 - t_i)\log(1 - y_i)\right]$$
where i indexes the samples, N is the number of samples, $t_i$ is the true label of sample i, and $y_i$ is the predicted label of sample i;
the optimizer is RMSprop, the learning rate is 2e-5, and the number of iterations is 30;
the evaluation metric is accuracy, calculated as
$$\mathrm{accuracy} = \frac{\mathrm{TruePositive} + \mathrm{TrueNegative}}{\mathrm{TotalSamples}}$$
where TruePositive is the number of samples predicted positive and truly positive, TrueNegative is the number of samples predicted negative and truly negative, and TotalSamples is the total number of samples;
a ReduceLROnPlateau callback is set to check the monitored metric every 2 iterations; if it has not improved, the learning rate is reduced by 50%;
if the evaluation metric accuracy does not improve within 10 iterations, training is stopped early.
8. The intron retention prediction model establishment method according to claim 7, wherein the evaluation parameters of the neural network intron retention prediction model calculated in step S6 on the test set obtained in step S4 are specifically the AUC values of the model on that test set.
9. The intron retention prediction model establishment method according to claim 8, wherein in step S7 the intron sequence read distribution pattern picture test set of the real data obtained in step S1 is acquired specifically as follows: the real data sequence file APP obtained in step S1 is input into the prediction tools iREAD and IRFinder to obtain two intron retention prediction sets IR1 and IR2 respectively; IR1 and IR2 are mapped onto the independent intron set intron_cluster by the rule of largest overlap of matched coordinate intervals, and the intersection of the two is taken to obtain the intersection set IC; IGV visualization, picture cropping, and merging are then performed on each intron coordinate in IC, yielding the intron sequence read distribution pattern picture test set real_test of the real data.
10. The intron retention prediction model establishment method according to claim 9, wherein in step S11 the splice site strengths are calculated from the 5' end sequences of W1+N1 bases obtained in step S9 and the 3' end sequences of W2+N2 bases obtained in step S10 to obtain a 5' end average splice site strength value and a 3' end average splice site strength value; specifically, the 5' end sequence set score5ss obtained in step S9 and the 3' end sequence set score3ss obtained in step S10 are input into the MaxEntScan model and scored with the maximum entropy model, giving the strength value of each splice site; the splice site strengths of the 5' end sequences and of the 3' end sequences are then averaged, yielding the final 5' end average splice site strength value and 3' end average splice site strength value.
11. The intron retention prediction model establishment method according to claim 10, wherein in step S12 the neural network intron retention prediction model established in step S5 is evaluated according to the 5' end and 3' end average splice site strength values obtained in step S11: the smaller the model's 5' end and 3' end average splice site strength values, the better its prediction effect.
12. A prediction method comprising the intron retention prediction model establishment method according to any one of claims 1 to 11, further comprising the following step:
S13, predicting intron retention results with the neural network intron retention prediction model obtained in step S5.
CN202010146731.2A 2020-03-05 2020-03-05 Intron retention prediction model establishment method and prediction method thereof Active CN111370055B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010146731.2A CN111370055B (en) 2020-03-05 2020-03-05 Intron retention prediction model establishment method and prediction method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010146731.2A CN111370055B (en) 2020-03-05 2020-03-05 Intron retention prediction model establishment method and prediction method thereof

Publications (2)

Publication Number Publication Date
CN111370055A CN111370055A (en) 2020-07-03
CN111370055B (en) 2023-05-23

Family

ID=71208615

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010146731.2A Active CN111370055B (en) 2020-03-05 2020-03-05 Intron retention prediction model establishment method and prediction method thereof

Country Status (1)

Country Link
CN (1) CN111370055B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102605084B1 * 2020-12-10 2023-11-24 Chung-Ang University Industry-Academic Cooperation Foundation (중앙대학교 산학협력단) Method for diagnosing degenerative brain disease by detecting intron retention using transcriptome analysis
WO2023238973A1 * 2022-06-10 2023-12-14 Chung-Ang University Industry-Academic Cooperation Foundation (중앙대학교 산학협력단) Method for diagnosing degenerative brain disease by detecting intron retention using transcriptome analysis

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999066302A2 (en) * 1998-06-17 1999-12-23 Musc Foundation For Research Development Recognition of protein coding regions in genomic dna sequences
WO2019226804A1 (en) * 2018-05-23 2019-11-28 Envisagenics, Inc. Systems and methods for analysis of alternative splicing

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6807491B2 (en) * 2001-08-30 2004-10-19 Hewlett-Packard Development Company, L.P. Method and apparatus for combining gene predictions using bayesian networks
WO2008097632A2 * 2007-02-08 2008-08-14 Jivan Biologics, Inc. Methods for determining splice variant types and amounts
US20120185172A1 (en) * 2011-01-18 2012-07-19 Barash Joseph Method, system and apparatus for data processing
PT3298134T (en) * 2015-05-16 2023-08-18 Genzyme Corp Gene editing of deep intronic mutations
CN105975809A * 2016-05-13 2016-09-28 Wankangyuan (Tianjin) Gene Technology Co., Ltd. (万康源(天津)基因科技有限公司) SNV detection method affecting RNA splicing
NZ759818A (en) * 2017-10-16 2022-04-29 Illumina Inc Semi-supervised learning for training an ensemble of deep convolutional neural networks
CN110010201A (en) * 2019-04-16 2019-07-12 山东农业大学 A kind of site recognition methods of RNA alternative splicing and system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999066302A2 (en) * 1998-06-17 1999-12-23 Musc Foundation For Research Development Recognition of protein coding regions in genomic dna sequences
WO2019226804A1 (en) * 2018-05-23 2019-11-28 Envisagenics, Inc. Systems and methods for analysis of alternative splicing

Also Published As

Publication number Publication date
CN111370055A (en) 2020-07-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant