CN111370055B - Intron retention prediction model establishment method and prediction method thereof - Google Patents

Intron retention prediction model establishment method and prediction method thereof

Info

Publication number
CN111370055B
CN111370055B (application CN202010146731.2A)
Authority
CN
China
Prior art keywords
intron
prediction model
training
introns
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010146731.2A
Other languages
Chinese (zh)
Other versions
CN111370055A (en)
Inventor
李洪东
郑剑涛
林翠香
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN202010146731.2A
Publication of CN111370055A
Application granted
Publication of CN111370055B
Legal status: Active


Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00: ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a method for establishing an intron retention prediction model. The method comprises: collecting simulated data and real data related to intron retention; merging all independent introns in the genome and defining the result as a standard template; acquiring the intron sequence read distribution pattern picture data set from the simulated data and processing it to obtain a processed data set; dividing the processed data set into a training set and a test set according to a set proportion; and training a neural network model on the training set to obtain the final neural network intron retention prediction model. The invention also discloses a prediction method that incorporates this model establishment method. The method visualizes and predicts introns based on the distribution pattern of intron retention reads, and has high reliability and good accuracy.

Description

Intron retention prediction model establishment method and prediction method thereof
Technical Field
The invention specifically relates to a method for establishing an intron retention prediction model and a prediction method using the model.
Background
Intron retention is a form of alternative splicing in which an intron of the pre-mRNA is not spliced out and remains in the mature mRNA. Intron retention was long regarded as the result of mis-splicing and received little attention. Many recent studies have shown that intron retention is associated with the regulation of gene expression and with complex diseases (e.g., Alzheimer's disease). With the development of high-throughput sequencing technology, many methods for intron retention detection have been proposed, among which iREAD and IRFinder are representative. iREAD detects intron retention by computing an entropy value under the assumption that the reads of a retained intron are evenly distributed, and applies relatively stringent filtering criteria. IRFinder detects intron retention by computing the IR-ratio, the proportion of transcripts in which the intron is retained.
Although these methods have been applied successfully in real settings, analyses based on sequence features are more or less limited by the biases that intron retention may introduce, which leaves the methods insufficiently robust. The reliability of current methods is therefore not high, and this restricts the development of the related technology.
Disclosure of Invention
The invention aims to provide a method for establishing an intron retention prediction model with high reliability and good accuracy.
The second object of the invention is to provide a prediction method that incorporates the intron retention prediction model establishment method.
The method for establishing the intron retention prediction model provided by the invention comprises the following steps:
S1, collecting simulated data and real data related to intron retention;
S2, merging all independent introns in the genome and defining the result as a standard template;
S3, acquiring the intron sequence read distribution pattern picture data set from the simulated data obtained in step S1, and preprocessing it to obtain a processed data set;
S4, dividing the processed data set obtained in step S3 into a training set and a test set according to a set proportion;
S5, training the neural network model with the training set obtained in step S4 to obtain the final neural network intron retention prediction model.
The method for establishing the intron retention prediction model further comprises the following steps:
S6, calculating evaluation parameters of the neural network intron retention prediction model obtained in step S5 on the test set obtained in step S4;
S7, acquiring the intron sequence read distribution pattern picture test set of the real data obtained in step S1;
S8, predicting intron retention results on the test set obtained in step S7 with the neural network intron retention prediction model obtained in step S5, thereby obtaining a predicted intron retention set;
S9, for each intron coordinate in the predicted intron retention set obtained in step S8, acquiring the 5' end sequence consisting of W1 bases on the exon side and N1 bases on the intron side of the 5' splice site, W1+N1 bases in total;
S10, for each intron coordinate in the predicted intron retention set obtained in step S8, acquiring the 3' end sequence consisting of W2 bases on the exon side and N2 bases on the intron side of the 3' splice site, W2+N2 bases in total;
S11, calculating splice site strengths from the 5' end sequences of W1+N1 bases obtained in step S9 and the 3' end sequences of W2+N2 bases obtained in step S10, thereby obtaining a 5' end average splice site strength value and a 3' end average splice site strength value;
S12, evaluating the neural network intron retention prediction model established in step S5 according to the 5' end average splice site strength value and the 3' end average splice site strength value obtained in step S11.
Step S1, collecting simulated data and real data related to intron retention, is specifically: generating, with the BEER algorithm, a simulated data sequence file SIMU30 containing a determined number of introns, where SIMU30 has a sequencing depth of 30 million reads and a read length of 100 bases and is set to generate 15000 genes and 69338 introns; and collecting a real data sequence file APP from the Tau and APP mouse model study of the Accelerating Medicines Partnership for Alzheimer's Disease (AMP-AD), with a sequencing depth of 100 million reads and a read length of 101 bases.
Step S2, merging all independent introns in the genome and defining the result as a standard template, is specifically performed by the following steps:
A. extracting the set of all independent introns, independent_introns, from the release-75 annotation gtf file of the GRCm38 mouse genome; independent introns are defined as introns that do not overlap with any homotypic exon;
B. in the independent intron set independent_introns obtained in step A, merging introns whose coordinate intervals overlap, gene by gene, to obtain the final independent intron set.
The extraction of all independent introns in step A is specifically performed by merging all exons within a chromosome and then deleting all exons from the gene regions, leaving all independent introns.
Step S3, acquiring the intron sequence read distribution pattern picture data set from the simulated data obtained in step S1 and preprocessing it to obtain a processed data set, specifically acquires and preprocesses the data set by the following steps:
a. performing IGV (Integrative Genomics Viewer) visualization on each intron in the simulated data sequence file SIMU30 obtained in step S1 to obtain a preliminary visualization image;
b. saving, for the 5' end and the 3' end of each intron, a visualization image of the sequence covering 20 bases on each side of the splice site, i.e. two images of 40 bases each; the height of the visualization image is 100 mm, and the heights of the bar charts representing base abundance are standardized;
c. cropping, from each image obtained in step b, the region spanning pixels 131-231 vertically and pixels 280-1070 horizontally;
d. horizontally merging the cropped images obtained in step c to obtain the final processed data set.
Step S4, dividing the processed data set obtained in step S3 into a training set and a test set according to a set proportion, is specifically: in the simulated data sequence file SIMU30 obtained in step S1, defining introns whose total sequence reads exceed a first set value, whose FPKM (Fragments Per Kilobase of transcript per Million mapped reads) exceeds a second set value, and whose continuous reads exceed a third set value as positive samples, with the remaining introns as negative samples; then randomly drawing X2 positive samples and X2 negative samples to form the final data set; and dividing the data set into a training set and a test set according to a set proportion; X2 is a positive integer.
The neural network model in step S5 is specifically a VGG16 network structure model.
Step S5, training the neural network model with the training set obtained in step S4 to obtain the final neural network intron retention prediction model, specifically trains the model by the following steps:
(1) Obtaining a VGG16 network structure model trained on the ImageNet task together with its weight parameter file; the network structure model comprises 13 convolutional layers;
(2) Loading the network and weights obtained in step (1) as a pre-trained network, and freezing it so that it does not participate in training;
(3) Defining a binary classification network and training it on the training set obtained in step S4; the classification network has 3 layers: the first 2 are fully connected layers with 256 and 64 neurons respectively, each followed by a Dropout layer to prevent overfitting, with neuron drop probabilities of 0.5 and 0.3 respectively; the last layer is a sigmoid layer used for binary classification;
(4) After the classification network is trained, unfreezing the last 3 convolutional layers of the pre-trained network, training the classification network and the pre-trained network together on the training set obtained in step S4 again, and fine-tuning the weights;
(5) The parameters of the model training process are set as follows:
the total number of model parameters is about 33 million, of which about 26 million are trainable and about 7 million are non-trainable;
the loss function is the binary cross-entropy loss, calculated as
$$\mathrm{loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[t_i \log y_i + (1 - t_i)\log(1 - y_i)\right]$$
where i indexes the samples, N is the number of samples, $t_i$ is the true label of sample i, and $y_i$ is the predicted label of sample i;
the optimizer is RMSprop, the learning rate is 2e-5, and the number of iterations is 30;
the evaluation metric is accuracy, calculated as
$$\mathrm{accuracy} = \frac{\mathrm{TruePositive} + \mathrm{TrueNegative}}{\mathrm{TotalSamples}}$$
where TruePositive is the number of samples predicted positive and truly positive, TrueNegative is the number of samples predicted negative and truly negative, and TotalSamples is the total number of samples;
a ReduceLROnPlateau callback is set to check the monitored metric every 2 iterations; if it has not improved, the learning rate is reduced by 50%;
if the evaluation metric accuracy does not improve within 10 iterations, training is stopped early.
Step S6 calculates evaluation parameters of the neural network intron retention prediction model on the test set obtained in step S4, specifically the AUC value of the model on that test set.
Step S7, acquiring the intron sequence read distribution pattern picture test set of the real data obtained in step S1, is specifically: inputting the real data sequence file APP obtained in step S1 into the prediction tools iREAD and IRFinder to obtain two intron retention prediction sets IR1 and IR2 respectively; mapping IR1 and IR2 onto the independent intron set intron_cluster by the rule of largest overlap of matched coordinate intervals, and taking the intersection of the two to obtain the intersection set IC; and then performing IGV visualization, picture cropping, and merging on each intron coordinate in IC, thereby obtaining the intron sequence read distribution pattern picture test set real_test of the real data.
Step S11, calculating splice site strengths from the 5' end sequences of W1+N1 bases obtained in step S9 and the 3' end sequences of W2+N2 bases obtained in step S10 to obtain a 5' end average splice site strength value and a 3' end average splice site strength value, is specifically: inputting the 5' end sequence set score5ss obtained in step S9 and the 3' end sequence set score3ss obtained in step S10 into the MaxEntScan model and scoring with the maximum entropy model, thereby obtaining the strength value of each given splice site; and then averaging the splice site strengths of the 5' end sequences and of the 3' end sequences to obtain the final 5' end average splice site strength value and 3' end average splice site strength value.
Step S12 evaluates the neural network intron retention prediction model established in step S5 according to the 5' end and 3' end average splice site strength values obtained in step S11: the smaller the model's 5' end and 3' end average splice site strength values, the better its prediction effect.
The invention also provides a prediction method that incorporates the intron retention prediction model establishment method, specifically comprising the following step:
S13, predicting intron retention results with the neural network intron retention prediction model obtained in step S5.
The intron retention prediction model establishment method and prediction method provided by the invention adopt a deep-learning approach based on the intron retention read distribution pattern, which makes intron retention prediction more general and easier. Building on the read distribution pattern, the method applies transfer learning to a deep learning model, transferring knowledge learned on a large-scale image classification task to complete and improve the intron retention prediction task. In addition, prediction performance is evaluated on a real data set without a gold standard by computing the average splice site strength over the 5' and 3' end sequences of the predicted retained introns, which measures the overall prediction effect. The method can therefore visualize and predict introns based on the distribution pattern of intron retention reads, with high reliability and good accuracy.
Drawings
FIG. 1 is a flow chart of the intron retention prediction model establishment method of the invention.
FIG. 2 shows the visualization of the intron retention read distribution pattern of the invention.
FIG. 3 is a schematic diagram of the structure of the deep learning model VGG16 used by the invention.
FIG. 4 is a flow chart of the prediction method of the invention.
Detailed Description
FIG. 1 shows the flow of the intron retention prediction model establishment method of the invention, which comprises the following steps:
S1, collecting simulated data and real data related to intron retention; specifically, the BEER algorithm is used to generate a simulated data sequence file SIMU30 containing a determined number of introns; SIMU30 has a sequencing depth of 30 million reads and a read length of 100 bases, and is set to generate 15000 genes and 69338 introns; the real data sequence file APP comes from the Tau and APP mouse model study of the Accelerating Medicines Partnership for Alzheimer's Disease (AMP-AD), with a sequencing depth of 100 million reads and a read length of 101 bases;
S2, merging all independent introns in the genome and defining the result as a standard template; the invention is applied in particular to mice, so the genome may be the mouse genome; this step is specifically performed as follows (a sketch of the interval arithmetic is given after step B):
A. extracting the set of all independent introns, independent_introns, from the release-75 annotation gtf file of the GRCm38 mouse genome; independent introns are defined as introns that do not overlap with any homotypic exon;
wherein all independent introns are extracted by merging all exons within a chromosome and then deleting all exons from the gene regions, leaving all independent introns;
B. in the independent intron set independent_introns obtained in step A, merging introns whose coordinate intervals overlap, gene by gene, to obtain the final independent intron set intron_cluster;
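The following minimal sketch illustrates this interval arithmetic. It assumes exons and gene regions are available as (start, end) coordinate tuples; the function and variable names are illustrative, not taken from the patent.

```python
def merge_intervals(intervals):
    """Merge overlapping coordinate intervals (used for exons and for step B)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def independent_introns(gene_region, exons):
    """Delete all (merged) exons from a gene region; the gaps that remain
    are introns overlapping no exon, i.e. independent introns."""
    gene_start, gene_end = gene_region
    introns, cursor = [], gene_start
    for ex_start, ex_end in merge_intervals(exons):
        if ex_start > cursor:
            introns.append((cursor, ex_start))
        cursor = max(cursor, ex_end)
    if cursor < gene_end:
        introns.append((cursor, gene_end))
    return introns

# Step B: per gene, merge introns whose coordinate intervals overlap.
intron_cluster = merge_intervals(
    independent_introns((1000, 9000), [(1000, 1200), (2500, 2700), (8600, 9000)])
)
```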
S3, acquiring the intron sequence read distribution pattern picture data set from the simulated data obtained in step S1, and preprocessing it to obtain a processed data set; the data set is acquired and preprocessed by the following steps (an image-processing sketch follows the list):
a. performing IGV (Integrative Genomics Viewer) visualization on each intron in the simulated data sequence file SIMU30 obtained in step S1 to obtain a preliminary visualization image;
b. because intron lengths vary enormously, saving, for the 5' end and the 3' end of each intron, a visualization image of the sequence covering 20 bases on each side of the splice site, i.e. two images of 40 bases each; the height of the visualization image is 100 mm, and the heights of the bar charts representing base abundance are standardized;
c. since the visualization image of a single-end sequence is originally 621 pixels high and 1150 pixels wide, cropping from each image the region spanning pixels 131-231 vertically and pixels 280-1070 horizontally;
d. horizontally merging the cropped images obtained in step c to obtain the final processed data set; the visualization result is shown in FIG. 2;
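A hedged sketch of steps c and d, using the Pillow imaging library (an implementation assumption; the pixel ranges follow the text, while the file names are hypothetical):

```python
from PIL import Image

def crop_panel(path):
    """Crop the read-distribution region from one single-end IGV snapshot
    (originally 1150 x 621 pixels): rows 131-231, columns 280-1070."""
    return Image.open(path).crop((280, 131, 1070, 231))  # (left, upper, right, lower)

def merge_panels(path_5p, path_3p, out_path):
    """Place the 5' end crop and the 3' end crop side by side."""
    left, right = crop_panel(path_5p), crop_panel(path_3p)
    canvas = Image.new("RGB", (left.width + right.width, left.height), "white")
    canvas.paste(left, (0, 0))
    canvas.paste(right, (left.width, 0))
    canvas.save(out_path)

merge_panels("intron_5p.png", "intron_3p.png", "intron_merged.png")
```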
S4, dividing the processed data set obtained in step S3 into a training set and a test set according to a set proportion; specifically, in the simulated data sequence file SIMU30 obtained in step S1, introns whose total sequence reads exceed a first set value (e.g., 10), whose FPKM (Fragments Per Kilobase of transcript per Million mapped reads) exceeds a second set value (e.g., 0.3), and whose continuous reads exceed a third set value (e.g., 1) are defined as positive samples, and the remaining introns are negative samples; X2 (e.g., 5000) positive samples and X2 negative samples are then drawn at random to form the final data set, which is divided into a training set and a test set according to a set proportion (e.g., 7:3); X2 is a positive integer. A labeling-and-split sketch follows.
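The labeling and 7:3 split can be sketched as follows, assuming the per-intron statistics have been tabulated in a pandas DataFrame (the column names are illustrative); the thresholds and X2 = 5000 are the example values from the text:

```python
import pandas as pd

def build_dataset(introns: pd.DataFrame, x2: int = 5000, seed: int = 0):
    """Label introns as positive/negative, balance the classes, split 7:3."""
    positive = (
        (introns["total_reads"] > 10)
        & (introns["fpkm"] > 0.3)
        & (introns["continuous_reads"] > 1)
    )
    pos = introns[positive].sample(n=x2, random_state=seed).assign(label=1)
    neg = introns[~positive].sample(n=x2, random_state=seed).assign(label=0)
    data = pd.concat([pos, neg]).sample(frac=1, random_state=seed)  # shuffle
    n_train = int(len(data) * 0.7)
    return data.iloc[:n_train], data.iloc[n_train:]  # training set, test set
```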
S5, training the neural network model with the training set obtained in step S4 to obtain the final neural network intron retention prediction model; in practice the prediction model is preferably the VGG16 model, and when VGG16 is selected the model can be trained by the following steps (summarized in the code sketch after the parameter list):
(1) Obtaining a VGG16 network structure model trained on the ImageNet task (shown in FIG. 3) together with its weight parameter file; the network structure model comprises 13 convolutional layers;
(2) Loading the network and weights obtained in step (1) as a pre-trained network, and freezing it so that it does not participate in training;
(3) Defining a binary classification network and training it on the training set obtained in step S4; the classification network has 3 layers: the first 2 are fully connected layers with 256 and 64 neurons respectively, each followed by a Dropout layer to prevent overfitting, with neuron drop probabilities of 0.5 and 0.3 respectively; the last layer is a sigmoid layer used for binary classification;
(4) After the classification network is trained, unfreezing the last 3 convolutional layers of the pre-trained network, training the classification network and the pre-trained network together on the training set obtained in step S4 again, and fine-tuning the weights;
(5) The parameters of the model training process are set as follows:
the total number of model parameters is about 33 million, of which about 26 million are trainable and about 7 million are non-trainable;
the loss function is the binary cross-entropy loss, calculated as
$$\mathrm{loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[t_i \log y_i + (1 - t_i)\log(1 - y_i)\right]$$
where i indexes the samples, N is the number of samples, $t_i$ is the true label of sample i, and $y_i$ is the predicted label of sample i;
the optimizer is RMSprop, the learning rate is 2e-5, and the number of iterations is 30;
the evaluation metric is accuracy, calculated as
$$\mathrm{accuracy} = \frac{\mathrm{TruePositive} + \mathrm{TrueNegative}}{\mathrm{TotalSamples}}$$
where TruePositive is the number of samples predicted positive and truly positive, TrueNegative is the number of samples predicted negative and truly negative, and TotalSamples is the total number of samples;
a ReduceLROnPlateau callback is set to check the monitored metric every 2 iterations; if it has not improved, the learning rate is reduced by 50%;
if the evaluation metric accuracy does not improve within 10 iterations, training is stopped early.
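A minimal Keras sketch of this two-stage transfer-learning scheme (freeze VGG16, train the classifier head, then unfreeze the final convolutional block and fine-tune). The layer sizes, dropout rates, RMSprop learning rate, and callbacks follow the text; the input size and data pipeline are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                input_shape=(224, 224, 3))
base.trainable = False  # stage 1: the pre-trained network is frozen

model = keras.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),  # binary classification
])
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=2e-5),
              loss="binary_crossentropy", metrics=["accuracy"])

callbacks = [
    keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2),
    keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=10),
]
# model.fit(train_images, train_labels, epochs=30,
#           validation_data=(val_images, val_labels), callbacks=callbacks)

# Stage 2: unfreeze the last three conv layers (block5_conv1..3; the trailing
# pooling layer has no weights) and fine-tune the whole stack jointly.
base.trainable = True
for layer in base.layers[:-4]:
    layer.trainable = False
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=2e-5),
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(... same training set ...)
```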
S6, calculating evaluation parameters (preferably the AUC value) of the neural network intron retention prediction model obtained in step S5 on the test set obtained in step S4 (see the short sketch below);
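Continuing from the training sketch above, the AUC can be computed with scikit-learn, assuming test_images and test_labels are the held-out arrays from step S4:

```python
from sklearn.metrics import roc_auc_score

probs = model.predict(test_images).ravel()  # sigmoid outputs in [0, 1]
auc = roc_auc_score(test_labels, probs)
```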
S7, acquiring the intron sequence read distribution pattern picture test set of the real data obtained in step S1; specifically, the real data sequence file APP obtained in step S1 is input into the prediction tools iREAD and IRFinder to obtain two intron retention prediction sets IR1 and IR2 respectively; IR1 and IR2 are mapped onto the independent intron set intron_cluster by the rule of largest overlap of matched coordinate intervals, and the intersection of the two is taken to obtain the intersection set IC; IGV visualization, picture cropping, and merging are then performed on each intron coordinate in IC, yielding the intron sequence read distribution pattern picture test set real_test of the real data (a sketch of the mapping-and-intersection step follows);
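The mapping-and-intersection step might look as follows, assuming the iREAD and IRFinder outputs have been parsed into (chrom, start, end) tuples; all names are illustrative:

```python
def overlap(a, b):
    """Length of the coordinate overlap between two (chrom, start, end) intervals."""
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))

def map_to_cluster(predictions, intron_cluster):
    """Map each prediction to the intron_cluster entry with the largest overlap."""
    mapped = set()
    for pred in predictions:
        hits = [c for c in intron_cluster
                if c[0] == pred[0] and overlap(pred, c) > 0]
        if hits:
            mapped.add(max(hits, key=lambda c: overlap(pred, c)))
    return mapped

# Toy inputs; in practice these come from parsing the iREAD / IRFinder outputs.
ir1 = [("chr1", 100, 200)]
ir2 = [("chr1", 120, 210)]
intron_cluster = [("chr1", 90, 220), ("chr1", 400, 500)]
ic = map_to_cluster(ir1, intron_cluster) & map_to_cluster(ir2, intron_cluster)
```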
S8, predicting intron retention results on the test set obtained in step S7 with the neural network intron retention prediction model obtained in step S5, thereby obtaining a predicted intron retention set;
S9, for each intron coordinate in the predicted intron retention set obtained in step S8, acquiring the 5' end sequence consisting of W1 bases on the exon side and N1 bases on the intron side of the 5' splice site, W1+N1 bases in total;
S10, for each intron coordinate in the predicted intron retention set obtained in step S8, acquiring the 3' end sequence consisting of W2 bases on the exon side and N2 bases on the intron side of the 3' splice site, W2+N2 bases in total;
S11, calculating splice site strengths from the 5' end sequences of W1+N1 bases obtained in step S9 and the 3' end sequences of W2+N2 bases obtained in step S10, thereby obtaining a 5' end average splice site strength value and a 3' end average splice site strength value; the 5' end sequence set score5ss obtained in step S9 and the 3' end sequence set score3ss obtained in step S10 are input into the MaxEntScan model and scored with the maximum entropy model, giving the strength value of each splice site; the splice site strengths of the 5' end sequences and of the 3' end sequences are then averaged, yielding the final 5' end average splice site strength value and 3' end average splice site strength value (see the averaging sketch below);
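Only the averaging is shown concretely below; score5 and score3 stand for hypothetical wrappers around MaxEntScan's maximum-entropy scoring scripts (score5.pl and score3.pl), which is an assumption about how the tool is invoked:

```python
def mean_splice_strength(seqs, score_fn):
    """Average splice-site strength over a set of splice-site sequences."""
    scores = [score_fn(s) for s in seqs]
    return sum(scores) / len(scores)

# Assumed inputs: score5ss_seqs / score3ss_seqs are the sequence sets from
# steps S9 / S10, and score5 / score3 the hypothetical MaxEntScan wrappers.
# avg_5p = mean_splice_strength(score5ss_seqs, score5)
# avg_3p = mean_splice_strength(score3ss_seqs, score3)
```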
S12, evaluating the neural network intron retention prediction model established in step S5 according to the 5' end and 3' end average splice site strength values obtained in step S11; specifically, the smaller the model's 5' end and 3' end average splice site strength values, the better its prediction effect.
The method of the invention is verified as follows:
the invention is evaluated on the simulation data SIMU30 and the real data set APP, while the tools compared to the invention are iREAD and irFinder.
1) Experimental analysis on the SIMU30 simulated data set
On the 3000 test-set samples of the SIMU30 simulated data, the prediction accuracy of the invention reaches 0.925 and the AUC reaches 0.975.
2) Experimental analysis on the APP real data set
Because the real data lacks a gold standard, the prediction labels of the other methods can on the one hand be used as true labels to test how the AUC of the VGG16 model differs from the other methods; on the other hand, custom evaluation metrics can be defined to verify the effectiveness of the invention. For the AUC evaluation, Table 1 compares the VGG16 model of the invention with iREAD and IRFinder on the real data picture test set real_test, which contains 68326 samples. With iREAD as the gold standard (2816 positive samples, 65510 negative samples), the AUC of the VGG16 model is superior to IRFinder. With IRFinder as the gold standard (19044 positive samples, 49282 negative samples), the invention is likewise superior to iREAD.
TABLE 1. AUC evaluation results of the invention compared with iREAD and IRFinder
(The table is reproduced as an image in the original document.)
In addition, the invention defines the 5' end and 3' end splice site strengths to measure the prediction effect of the VGG16 model: the lower the average splice site strength, the better the model's overall prediction effect. The average splice site strength evaluation results are shown in Table 2.
TABLE 2. Average splice site strength evaluation results of the invention compared with iREAD and IRFinder
(The table is reproduced as an image in the original document.)
The results in Table 2 show that, although the invention scores slightly worse than IRFinder and iREAD in average splice site strength, as the number of introns involved in the calculation increases, the average splice site strength of IRFinder and iREAD rises while that of the invention falls. The VGG16 model designed by the invention is therefore more robust than IRFinder and iREAD.
FIG. 4 shows the flow of the prediction method of the invention. The prediction method incorporating the intron retention prediction model establishment method comprises the following steps:
S1, collecting simulated data and real data related to intron retention; specifically, the BEER algorithm is used to generate a simulated data sequence file SIMU30 containing a determined number of introns; SIMU30 has a sequencing depth of 30 million reads and a read length of 100 bases, and is set to generate 15000 genes and 69338 introns; the real data sequence file APP comes from the Tau and APP mouse model study of the Accelerating Medicines Partnership for Alzheimer's Disease (AMP-AD), with a sequencing depth of 100 million reads and a read length of 101 bases;
S2, merging all independent introns in the genome and defining the result as a standard template; this step is specifically performed as follows:
A. extracting the set of all independent introns, independent_introns, from the release-75 annotation gtf file of the GRCm38 mouse genome; independent introns are defined as introns that do not overlap with any homotypic exon;
wherein all independent introns are extracted by merging all exons within a chromosome and then deleting all exons from the gene regions, leaving all independent introns;
B. in the independent intron set independent_introns obtained in step A, merging introns whose coordinate intervals overlap, gene by gene, to obtain the final independent intron set intron_cluster;
S3, acquiring the intron sequence read distribution pattern picture data set from the simulated data obtained in step S1, and preprocessing it to obtain a processed data set; the data set is acquired and preprocessed by the following steps:
a. performing IGV (Integrative Genomics Viewer) visualization on each intron in the simulated data sequence file SIMU30 obtained in step S1 to obtain a preliminary visualization image;
b. because intron lengths vary enormously, saving, for the 5' end and the 3' end of each intron, a visualization image of the sequence covering 20 bases on each side of the splice site, i.e. two images of 40 bases each; the height of the visualization image is 100 mm, and the heights of the bar charts representing base abundance are standardized;
c. since the visualization image of a single-end sequence is originally 621 pixels high and 1150 pixels wide, cropping from each image the region spanning pixels 131-231 vertically and pixels 280-1070 horizontally;
d. horizontally merging the cropped images obtained in step c to obtain the final processed data set; the visualization result is shown in FIG. 2;
S4, dividing the processed data set obtained in step S3 into a training set and a test set according to a set proportion; specifically, in the simulated data sequence file SIMU30 obtained in step S1, introns whose total sequence reads exceed a first set value (e.g., 10), whose FPKM (Fragments Per Kilobase of transcript per Million mapped reads) exceeds a second set value (e.g., 0.3), and whose continuous reads exceed a third set value (e.g., 1) are defined as positive samples, and the remaining introns are negative samples; X2 (e.g., 5000) positive samples and X2 negative samples are then drawn at random to form the final data set, which is divided into a training set and a test set according to a set proportion (e.g., 7:3); X2 is a positive integer.
S5, training the neural network model with the training set obtained in step S4 to obtain the final neural network intron retention prediction model; in practice the prediction model is preferably the VGG16 model, and when VGG16 is selected the model can be trained by the following steps:
(1) Obtaining a VGG16 network structure model trained on the ImageNet task (shown in FIG. 3) together with its weight parameter file; the network structure model comprises 13 convolutional layers;
(2) Loading the network and weights obtained in step (1) as a pre-trained network, and freezing it so that it does not participate in training;
(3) Defining a binary classification network and training it on the training set obtained in step S4; the classification network has 3 layers: the first 2 are fully connected layers with 256 and 64 neurons respectively, each followed by a Dropout layer to prevent overfitting, with neuron drop probabilities of 0.5 and 0.3 respectively; the last layer is a sigmoid layer used for binary classification;
(4) After the classification network is trained, unfreezing the last 3 convolutional layers of the pre-trained network, training the classification network and the pre-trained network together on the training set obtained in step S4 again, and fine-tuning the weights;
(5) The parameters of the model training process are set as follows:
the total number of model parameters is about 33 million, of which about 26 million are trainable and about 7 million are non-trainable;
the loss function is the binary cross-entropy loss, calculated as
$$\mathrm{loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[t_i \log y_i + (1 - t_i)\log(1 - y_i)\right]$$
where i indexes the samples, N is the number of samples, $t_i$ is the true label of sample i, and $y_i$ is the predicted label of sample i;
the optimizer is RMSprop, the learning rate is 2e-5, and the number of iterations is 30;
the evaluation metric is accuracy, calculated as
$$\mathrm{accuracy} = \frac{\mathrm{TruePositive} + \mathrm{TrueNegative}}{\mathrm{TotalSamples}}$$
where TruePositive is the number of samples predicted positive and truly positive, TrueNegative is the number of samples predicted negative and truly negative, and TotalSamples is the total number of samples;
a ReduceLROnPlateau callback is set to check the monitored metric every 2 iterations; if it has not improved, the learning rate is reduced by 50%;
if the evaluation metric accuracy does not improve within 10 iterations, training is stopped early.
S6, calculating evaluation parameters (preferably the AUC value) of the neural network intron retention prediction model obtained in step S5 on the test set obtained in step S4;
S7, acquiring the intron sequence read distribution pattern picture test set of the real data obtained in step S1; specifically, the real data sequence file APP obtained in step S1 is input into the prediction tools iREAD and IRFinder to obtain two intron retention prediction sets IR1 and IR2 respectively; IR1 and IR2 are mapped onto the independent intron set intron_cluster by the rule of largest overlap of matched coordinate intervals, and the intersection of the two is taken to obtain the intersection set IC; IGV visualization, picture cropping, and merging are then performed on each intron coordinate in IC, yielding the intron sequence read distribution pattern picture test set real_test of the real data;
S8, predicting intron retention results on the test set obtained in step S7 with the neural network intron retention prediction model obtained in step S5, thereby obtaining a predicted intron retention set;
S9, for each intron coordinate in the predicted intron retention set obtained in step S8, acquiring the 5' end sequence consisting of W1 bases on the exon side and N1 bases on the intron side of the 5' splice site, W1+N1 bases in total;
S10, for each intron coordinate in the predicted intron retention set obtained in step S8, acquiring the 3' end sequence consisting of W2 bases on the exon side and N2 bases on the intron side of the 3' splice site, W2+N2 bases in total;
S11, calculating splice site strengths from the 5' end sequences of W1+N1 bases obtained in step S9 and the 3' end sequences of W2+N2 bases obtained in step S10, thereby obtaining a 5' end average splice site strength value and a 3' end average splice site strength value; the 5' end sequence set score5ss obtained in step S9 and the 3' end sequence set score3ss obtained in step S10 are input into the MaxEntScan model and scored with the maximum entropy model, giving the strength value of each splice site; the splice site strengths of the 5' end sequences and of the 3' end sequences are then averaged, yielding the final 5' end average splice site strength value and 3' end average splice site strength value;
S12, evaluating the neural network intron retention prediction model established in step S5 according to the 5' end and 3' end average splice site strength values obtained in step S11; specifically, the smaller the model's 5' end and 3' end average splice site strength values, the better its prediction effect;
S13, predicting intron retention results with the neural network intron retention prediction model obtained in step S5.

Claims (12)

1. An intron retention prediction model establishment method, comprising the following steps:
S1, collecting simulated data and real data related to intron retention; specifically, a BEER algorithm is used to generate a simulated data sequence file SIMU30 containing a determined number of introns; SIMU30 has a sequencing depth of 30 million reads and a read length of 100 bases, and is set to generate 15000 genes and 69338 introns; the real data sequence file APP comes from the Tau and APP mouse model study of the Accelerating Medicines Partnership for Alzheimer's Disease (AMP-AD), with a sequencing depth of 100 million reads and a read length of 101 bases;
S2, merging all independent introns in the genome and defining the result as a standard template;
S3, acquiring the intron sequence read distribution pattern picture data set from the simulated data obtained in step S1, and preprocessing it to obtain a processed data set; the data set is acquired and preprocessed by the following steps:
a. performing IGV (Integrative Genomics Viewer) visualization on each intron in the simulated data sequence file SIMU30 obtained in step S1 to obtain a preliminary visualization image;
b. saving, for the 5' end and the 3' end of each intron, a visualization image of the sequence covering 20 bases on each side of the splice site, i.e. two images of 40 bases each; the height of the visualization image is 100 mm, and the heights of the bar charts representing base abundance are standardized;
c. cropping, from each image obtained in step b, the region spanning pixels 131-231 vertically and pixels 280-1070 horizontally;
d. horizontally merging the cropped images obtained in step c to obtain the final processed data set;
S4, dividing the processed data set obtained in step S3 into a training set and a test set according to a set proportion;
S5, training the neural network model with the training set obtained in step S4 to obtain the final neural network intron retention prediction model.
2. The intron retention prediction model establishment method according to claim 1, further comprising the following steps:
S6, calculating evaluation parameters of the neural network intron retention prediction model obtained in step S5 on the test set obtained in step S4;
S7, acquiring the intron sequence read distribution pattern picture test set of the real data obtained in step S1;
S8, predicting intron retention results on the test set obtained in step S7 with the neural network intron retention prediction model obtained in step S5, thereby obtaining a predicted intron retention set;
S9, for each intron coordinate in the predicted intron retention set obtained in step S8, acquiring the 5' end sequence consisting of W1 bases on the exon side and N1 bases on the intron side of the 5' splice site, W1+N1 bases in total;
S10, for each intron coordinate in the predicted intron retention set obtained in step S8, acquiring the 3' end sequence consisting of W2 bases on the exon side and N2 bases on the intron side of the 3' splice site, W2+N2 bases in total;
S11, calculating splice site strengths from the 5' end sequences of W1+N1 bases obtained in step S9 and the 3' end sequences of W2+N2 bases obtained in step S10, thereby obtaining a 5' end average splice site strength value and a 3' end average splice site strength value;
S12, evaluating the neural network intron retention prediction model established in step S5 according to the 5' end average splice site strength value and the 3' end average splice site strength value obtained in step S11.
3. The intron retention prediction model establishment method according to claim 2, wherein step S2, merging all independent introns in the genome and defining the result as a standard template, is specifically performed by the following steps:
A. extracting the set of all independent introns, independent_introns, from the release-75 annotation gtf file of the GRCm38 mouse genome; independent introns are defined as introns that do not overlap with any homotypic exon;
B. in the independent intron set independent_introns obtained in step A, merging introns whose coordinate intervals overlap, gene by gene, to obtain the final independent intron set.
4. The method according to claim 3, wherein step A extracts all independent introns, independent_introns, specifically by merging all exons within a chromosome and then deleting all exons from the gene regions, thereby obtaining all independent introns.
5. The intron retention prediction model establishment method according to claim 4, wherein in step S4 the processed data set obtained in step S3 is divided into a training set and a test set according to a set proportion; specifically, in the simulated data sequence file SIMU30 obtained in step S1, introns whose total sequence reads exceed a first set value, whose FPKM exceeds a second set value, and whose continuous reads exceed a third set value are defined as positive samples, and the remaining introns are negative samples; X2 positive samples and X2 negative samples are then drawn at random to form the final data set, which is divided into a training set and a test set according to a set proportion; X2 is a positive integer.
6. The method of claim 5, wherein the neural network model in step S5 is a VGG16 network structure model.
7. The intron retention prediction model establishment method according to claim 6, wherein in step S5 the training set obtained in step S4 is used to train the neural network model to obtain the final neural network intron retention prediction model, specifically by the following steps:
(1) Obtaining a VGG16 network structure model trained on the ImageNet task together with its weight parameter file; the network structure model comprises 13 convolutional layers;
(2) Loading the network and weights obtained in step (1) as a pre-trained network, and freezing it so that it does not participate in training;
(3) Defining a binary classification network and training it on the training set obtained in step S4; the classification network has 3 layers: the first 2 are fully connected layers with 256 and 64 neurons respectively, each followed by a Dropout layer to prevent overfitting, with neuron drop probabilities of 0.5 and 0.3 respectively; the last layer is a sigmoid layer used for binary classification;
(4) After the classification network is trained, unfreezing the last 3 convolutional layers of the pre-trained network, training the classification network and the pre-trained network together on the training set obtained in step S4 again, and fine-tuning the weights;
(5) The parameters of the model training process are set as follows:
the total number of model parameters is about 33 million, of which about 26 million are trainable and about 7 million are non-trainable;
the loss function is the binary cross-entropy loss, calculated as
$$\mathrm{loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[t_i \log y_i + (1 - t_i)\log(1 - y_i)\right]$$
where i indexes the samples, N is the number of samples, $t_i$ is the true label of sample i, and $y_i$ is the predicted label of sample i;
the optimizer is RMSprop, the learning rate is 2e-5, and the number of iterations is 30;
the evaluation metric is accuracy, calculated as
$$\mathrm{accuracy} = \frac{\mathrm{TruePositive} + \mathrm{TrueNegative}}{\mathrm{TotalSamples}}$$
where TruePositive is the number of samples predicted positive and truly positive, TrueNegative is the number of samples predicted negative and truly negative, and TotalSamples is the total number of samples;
a ReduceLROnPlateau callback is set to check the monitored metric every 2 iterations; if it has not improved, the learning rate is reduced by 50%;
if the evaluation metric accuracy does not improve within 10 iterations, training is stopped early.
8. The intron retention prediction model establishment method according to claim 7, wherein the evaluation parameters of the neural network intron retention prediction model calculated in step S6 on the test set obtained in step S4 are specifically the AUC values of the model on that test set.
9. The intron retention prediction model establishment method according to claim 8, wherein in step S7 the intron sequence read distribution pattern picture test set of the real data obtained in step S1 is acquired specifically as follows: the real data sequence file APP obtained in step S1 is input into the prediction tools iREAD and IRFinder to obtain two intron retention prediction sets IR1 and IR2 respectively; IR1 and IR2 are mapped onto the independent intron set intron_cluster by the rule of largest overlap of matched coordinate intervals, and the intersection of the two is taken to obtain the intersection set IC; IGV visualization, picture cropping, and merging are then performed on each intron coordinate in IC, yielding the intron sequence read distribution pattern picture test set real_test of the real data.
10. The intron retention prediction model establishment method according to claim 9, wherein in step S11 the splice site strengths are calculated from the 5' end sequences of W1+N1 bases obtained in step S9 and the 3' end sequences of W2+N2 bases obtained in step S10 to obtain a 5' end average splice site strength value and a 3' end average splice site strength value; specifically, the 5' end sequence set score5ss obtained in step S9 and the 3' end sequence set score3ss obtained in step S10 are input into the MaxEntScan model and scored with the maximum entropy model, giving the strength value of each splice site; the splice site strengths of the 5' end sequences and of the 3' end sequences are then averaged, yielding the final 5' end average splice site strength value and 3' end average splice site strength value.
11. The intron retention prediction model establishment method according to claim 10, wherein in step S12 the neural network intron retention prediction model established in step S5 is evaluated according to the 5' end and 3' end average splice site strength values obtained in step S11: the smaller the model's 5' end and 3' end average splice site strength values, the better its prediction effect.
12. A prediction method comprising the intron retention prediction model establishment method according to any one of claims 1 to 11, further comprising the following step:
S13, predicting intron retention results with the neural network intron retention prediction model obtained in step S5.
CN202010146731.2A 2020-03-05 2020-03-05 Intron retention prediction model establishment method and prediction method thereof Active CN111370055B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010146731.2A CN111370055B (en) 2020-03-05 2020-03-05 Intron retention prediction model establishment method and prediction method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010146731.2A CN111370055B (en) 2020-03-05 2020-03-05 Intron retention prediction model establishment method and prediction method thereof

Publications (2)

Publication Number Publication Date
CN111370055A CN111370055A (en) 2020-07-03
CN111370055B (en) 2023-05-23

Family

ID=71208615

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010146731.2A Active CN111370055B (en) 2020-03-05 2020-03-05 Intron retention prediction model establishment method and prediction method thereof

Country Status (1)

Country Link
CN (1) CN111370055B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102605084B1 * 2020-12-10 2023-11-24 Chung-Ang University Industry-Academic Cooperation Foundation (중앙대학교 산학협력단) Method for diagnosing degenerative brain disease by detecting intron retention using transcriptome analysis
WO2023238973A1 * 2022-06-10 2023-12-14 Chung-Ang University Industry-Academic Cooperation Foundation (중앙대학교 산학협력단) Method for diagnosing degenerative brain disease by detecting intron retention using transcriptome analysis

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999066302A2 (en) * 1998-06-17 1999-12-23 Musc Foundation For Research Development Recognition of protein coding regions in genomic dna sequences
WO2019226804A1 (en) * 2018-05-23 2019-11-28 Envisagenics, Inc. Systems and methods for analysis of alternative splicing

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6807491B2 (en) * 2001-08-30 2004-10-19 Hewlett-Packard Development Company, L.P. Method and apparatus for combining gene predictions using bayesian networks
WO2008097632A2 * 2007-02-08 2008-08-14 Jivan Biologics, Inc. Methods for determining splice variant types and amounts
US20120185172A1 (en) * 2011-01-18 2012-07-19 Barash Joseph Method, system and apparatus for data processing
PT3298134T (en) * 2015-05-16 2023-08-18 Genzyme Corp Gene editing of deep intronic mutations
CN105975809A * 2016-05-13 2016-09-28 Wankangyuan (Tianjin) Gene Technology Co., Ltd. (万康源(天津)基因科技有限公司) SNV detection method affecting RNA splicing
NZ759818A (en) * 2017-10-16 2022-04-29 Illumina Inc Semi-supervised learning for training an ensemble of deep convolutional neural networks
CN110010201A (en) * 2019-04-16 2019-07-12 山东农业大学 A kind of site recognition methods of RNA alternative splicing and system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999066302A2 (en) * 1998-06-17 1999-12-23 Musc Foundation For Research Development Recognition of protein coding regions in genomic dna sequences
WO2019226804A1 (en) * 2018-05-23 2019-11-28 Envisagenics, Inc. Systems and methods for analysis of alternative splicing

Also Published As

Publication number Publication date
CN111370055A (en) 2020-07-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant