CN111370055A - Intron retention prediction model establishing method and prediction method thereof - Google Patents

Intron retention prediction model establishing method and prediction method thereof

Info

Publication number
CN111370055A
Authority
CN
China
Prior art keywords
intron
prediction model
training
sequence
splice site
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010146731.2A
Other languages
Chinese (zh)
Other versions
CN111370055B (en)
Inventor
Hong-Dong Li (李洪东)
Jiantao Zheng (郑剑涛)
Cuixiang Lin (林翠香)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202010146731.2A
Publication of CN111370055A
Application granted
Publication of CN111370055B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a method for establishing an intron retention prediction model, which comprises: collecting simulated data and real data related to intron retention; merging all independent introns in the genome and defining the result as a standard template; acquiring an image data set of the intron sequence read-distribution patterns in the simulated data and preprocessing it to obtain a processed data set; dividing the processed data set into a training set and a test set according to a set proportion; and training a neural network model on the training set to obtain the finally established neural network intron retention prediction model. The invention also discloses a prediction method comprising the intron retention prediction model establishing method. The invention can visualize and predict introns based on the intron-retention read-distribution pattern, with high reliability and good accuracy.

Description

Intron retention prediction model establishing method and prediction method thereof
Technical Field
The invention specifically relates to an intron retention prediction model establishing method and a prediction method thereof.
Background
Intron retention is one form of alternative splicing, in which an intron of the precursor mRNA is not spliced out and remains in the mature mRNA. Intron retention was long regarded as the result of mis-splicing and received little attention. Many recent studies, however, have shown that intron retention is associated with gene expression regulation and with complex diseases (e.g., Alzheimer's disease). With the development of high-throughput sequencing technology, many methods for intron retention detection have been proposed, most notably iREAD and IRFinder. iREAD detects intron retention by computing an entropy value under the assumption that reads over a retained intron are uniformly distributed, and applies relatively stringent filtering criteria. IRFinder instead measures intron retention by calculating the IR-ratio, the proportion of transcripts in which the intron is retained.
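To illustrate the entropy criterion, the following minimal sketch computes the Shannon entropy of reads binned along an intron; this is a schematic illustration of the idea only, not iREAD's actual implementation, and the bin counts are invented:

```python
import math

def read_entropy(bin_counts):
    """Shannon entropy of the read counts binned along an intron.
    Uniform coverage gives the maximum value log2(#bins); reads piled
    into a few bins give a low value."""
    total = sum(bin_counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in bin_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

print(read_entropy([25, 25, 25, 25]))  # 2.0  (perfectly uniform coverage)
print(read_entropy([97, 1, 1, 1]))     # ~0.24 (skewed, unlikely retention)
```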
Although these methods have been successfully applied in real settings, analyses based on sequence features are more or less limited by the biases that intron retention can introduce. Their robustness is therefore insufficient and their reliability low, which has constrained the development of related techniques.
Disclosure of Invention
The invention aims to provide a method for establishing an intron retention prediction model with high reliability and high accuracy.
The invention also aims to provide a prediction method comprising the intron-retention prediction model building method.
The invention provides a method for establishing an intron retention prediction model, which comprises the following steps:
S1, collecting simulated data and real data related to intron retention;
S2, merging all independent introns in the genome and defining the result as a standard template;
S3, acquiring an image data set of the read-distribution patterns of the intron sequences in the simulated data obtained in step S1, and preprocessing it to obtain a processed data set;
S4, dividing the processed data set obtained in step S3 into a training set and a test set according to a set proportion;
and S5, training the neural network model with the training set obtained in step S4, thereby obtaining the finally established neural network intron retention prediction model.
The method for establishing the intron retention prediction model further comprises the following steps:
S6, calculating evaluation parameters of the neural network intron retention prediction model obtained in step S5 on the test set obtained in step S4;
S7, acquiring an image test set of the intron sequence read-distribution patterns of the real data obtained in step S1;
S8, predicting intron retention on the test set obtained in step S7 with the neural network intron retention prediction model obtained in step S5, thereby obtaining a predicted intron retention set;
S9, acquiring, for the predicted intron retention set obtained in step S8, the 5' end sequence of W1+N1 bases consisting of W1 bases on the exon side and N1 bases on the intron side;
S10, acquiring, for the predicted intron retention set obtained in step S8, the 3' end sequence of W2+N2 bases consisting of W2 bases on the exon side and N2 bases on the intron side;
S11, calculating splice site strengths from the 5' end sequences of W1+N1 bases obtained in step S9 and the 3' end sequences of W2+N2 bases obtained in step S10, thereby obtaining a 5' end average splice site strength value and a 3' end average splice site strength value;
and S12, evaluating the neural network intron retention prediction model established in step S5 according to the 5' end and 3' end average splice site strength values obtained in step S11.
Step S1, collecting simulated data and real data related to intron retention, specifically: generating with the BEER algorithm a simulated data sequence file SIMU30 containing a determined number of introns, with a sequencing depth of 30 million reads, a read length of 100 bases, 15000 generated genes and 69338 introns; and a real data sequence file APP from the Tau and APP mouse model study of the Accelerating Medicines Partnership for Alzheimer's Disease (AMP-AD) project, with a sequencing depth of 100 million reads and a read length of 101 bases.
Merging all independent introns in the genome to define a standard template in step S2 specifically comprises the following steps:
A. extracting the set of all independent introns, independent_intron, from the release-75 annotation GTF file of the GRCm38 mouse genome; an independent intron is defined as an intron that does not overlap with any exon of any transcript isoform;
B. merging, gene by gene, the introns of the independent intron set independent_intron obtained in step A whose coordinate intervals overlap, to obtain the final independent intron set intron_cluster.
Extracting the set of all independent introns independent_intron in step A specifically comprises merging all exons within a chromosome and then deleting all exons from the gene regions, so that the remaining gaps are the independent introns (see the sketch below).
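A minimal sketch of steps A and B on in-memory intervals follows; the interval helpers are illustrative, and the patent's actual implementation parses the GTF annotation:

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based half-open [start, end)

def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping intervals (used to merge all exons, step A,
    and to merge overlapping introns into clusters, step B)."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def independent_introns(gene: Interval, exons: List[Interval]) -> List[Interval]:
    """Subtract the merged exons from the gene region; the gaps that
    remain overlap no exon of any transcript, i.e. independent introns."""
    introns: List[Interval] = []
    cursor = gene[0]
    for ex_start, ex_end in merge_intervals(exons):
        if ex_start > cursor:
            introns.append((cursor, ex_start))
        cursor = max(cursor, ex_end)
    if cursor < gene[1]:
        introns.append((cursor, gene[1]))
    return introns

# Example: one gene whose two transcripts have partially overlapping exons.
gene = (100, 1000)
exons = [(100, 250), (200, 300), (500, 700), (900, 1000)]
print(independent_introns(gene, exons))  # [(300, 500), (700, 900)]
```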
Step S3, acquiring the image data set of the intron sequence read-distribution patterns in the simulated data obtained in step S1 and preprocessing it into the processed data set, specifically comprises the following steps (a code sketch follows the list):
a. performing IGV visualization on each intron in the simulated data sequence file SIMU30 obtained in step S1 to obtain a primary visualization image;
b. saving two sequence visualization images of 40 bases each per intron, one spanning 20 bases to each side of the 5' end and one spanning 20 bases to each side of the 3' end; the height of each visualization image is 100 mm, and the height of the bar chart representing base read abundance is normalized;
c. cropping, from each image obtained in step b, the region spanning pixels 131-231 vertically and pixels 280-1070 horizontally;
d. merging the two images cropped in step c side by side to obtain the final processed data set.
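Steps c and d can be sketched with Pillow as follows; the file names are hypothetical, and the pixel coordinates are the ones given above:

```python
from PIL import Image  # Pillow

def crop_and_merge(png_5p: str, png_3p: str, out_png: str) -> None:
    """Crop the informative band from the two IGV snapshots of one intron
    (rows 131-231, columns 280-1070 of the 621x1150-pixel originals) and
    join the 5' and 3' panels side by side."""
    box = (280, 131, 1070, 231)  # Pillow box: (left, upper, right, lower)
    left = Image.open(png_5p).convert("RGB").crop(box)
    right = Image.open(png_3p).convert("RGB").crop(box)
    merged = Image.new("RGB", (left.width + right.width, left.height))
    merged.paste(left, (0, 0))
    merged.paste(right, (left.width, 0))
    merged.save(out_png)  # final 1580x100-pixel training image

# crop_and_merge("intron42_5p.png", "intron42_3p.png", "intron42.png")
```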
Step S4, dividing the processed data set obtained in step S3 into a training set and a test set according to a set proportion, specifically: in the simulated data sequence file SIMU30 obtained in step S1, an intron whose total read count is greater than a first set value, whose FPKM (Fragments Per Kilobase of transcript per Million mapped reads) is greater than a second set value, and whose continuous read count is greater than a third set value is defined as a positive sample, and the remaining introns are negative samples; X2 positive samples and X2 negative samples are then randomly drawn to form the final data set, which is divided into a training set and a test set according to the set proportion; X2 is a positive integer. A sketch of this step follows.
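A minimal sketch of the labeling and splitting, using the example threshold values given in the embodiment; the dictionary field names are assumptions:

```python
import random

def label_and_split(introns, total_min=10, fpkm_min=0.3, cont_min=1,
                    x2=5000, train_frac=0.7, seed=0):
    """Label introns with the three thresholds of step S4 (defaults are
    the embodiment's example values), draw a balanced sample of X2
    positives and X2 negatives, and split it 7:3 into train/test."""
    pos, neg = [], []
    for it in introns:  # each intron: dict with assumed field names
        if (it["total_reads"] > total_min and it["fpkm"] > fpkm_min
                and it["continuous_reads"] > cont_min):
            pos.append(it)
        else:
            neg.append(it)
    rng = random.Random(seed)
    data = ([(s, 1) for s in rng.sample(pos, x2)]
            + [(s, 0) for s in rng.sample(neg, x2)])
    rng.shuffle(data)
    cut = int(train_frac * len(data))
    return data[:cut], data[cut:]
```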
The neural network model in step S5 is specifically a VGG16 network structure model.
Step S5, training the neural network model with the training set obtained in step S4 to obtain the finally established neural network intron retention prediction model, specifically comprises the following steps (a Keras-style sketch follows the list):
(1) obtaining a VGG16 network structure model pre-trained on the ImageNet task together with its weight parameter file; the network structure comprises 13 convolutional layers;
(2) loading the network and weights obtained in step (1) as a pre-trained network, but freezing it so that it does not participate in training;
(3) defining a binary classification network and training it on the training set obtained in step S4; the classification network has 3 layers in total: the first 2 are fully connected layers with 256 and 64 neurons respectively, each followed by a Dropout layer to prevent overfitting, with neuron drop probabilities of 0.5 and 0.3 respectively, and the last layer is a sigmoid layer for binary classification;
(4) after the classification network is trained, unfreezing the last 3 convolutional layers of the pre-trained network and training the classification network and the pre-trained network together on the training set obtained in step S4 again to fine-tune the weights;
(5) the parameters of the model training process are set as follows:
the total number of model parameters is 33 million, of which 26 million are trainable and 7 million are frozen;
the loss function is the binary cross-entropy loss, calculated as
$L = -\frac{1}{N}\sum_{i=1}^{N}\left[t_i \log y_i + (1 - t_i)\log(1 - y_i)\right]$
where $i$ indexes the $N$ samples, $t_i$ is the true label of sample $i$, and $y_i$ is the predicted label of sample $i$;
the optimizer is RMSprop, and the learning rate is 2e-5The number of iterations is 30;
the evaluation index is the accuracy, calculated as:
$\mathrm{accuracy} = \frac{\mathrm{TruePositive} + \mathrm{TrueNegative}}{\mathrm{AllSamples}}$
where TruePositive is the number of samples predicted positive that are truly positive, TrueNegative is the number of samples predicted negative that are truly negative, and AllSamples is the total number of samples;
a ReduceLROnPlateau callback monitors the evaluation metric every 2 iterations and, if it has not improved, reduces the learning rate by 50%;
and if the evaluation accuracy does not improve within 10 iterations, training is stopped early.
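A minimal Keras sketch of steps (1)-(5), assuming Keras's bundled ImageNet VGG16 weights stand in for the pre-trained network of step (1); the input size (matching the merged 100x1580-pixel images) and the variable names are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers, callbacks

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(100, 1580, 3))
base.trainable = False  # step (2): freeze the pre-trained network

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),  # step (3): binary head
])
model.compile(optimizer=optimizers.RMSprop(learning_rate=2e-5),
              loss="binary_crossentropy", metrics=["accuracy"])

cbs = [
    callbacks.ReduceLROnPlateau(monitor="val_accuracy", factor=0.5, patience=2),
    callbacks.EarlyStopping(monitor="val_accuracy", patience=10),
]
# model.fit(train_images, train_labels, validation_split=0.1,
#           epochs=30, callbacks=cbs)

# Step (4): unfreeze the last 3 convolutional layers and fine-tune.
base.trainable = True
for layer in base.layers[:-4]:  # freeze all but block5_conv1..3 (+ pool)
    layer.trainable = False
model.compile(optimizer=optimizers.RMSprop(learning_rate=2e-5),
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, validation_split=0.1,
#           epochs=30, callbacks=cbs)
```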
In step S6, the evaluation parameters of the neural network intron retention prediction model are calculated on the test set obtained in step S4; specifically, the AUC value of the model is calculated on that test set, for example as sketched below.
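For instance, with scikit-learn (a hypothetical helper; the patent does not prescribe a particular library):

```python
from sklearn.metrics import roc_auc_score

def evaluate_auc(model, test_images, test_labels):
    """Step S6: AUC of the trained model's sigmoid scores on the test set."""
    scores = model.predict(test_images).ravel()
    return roc_auc_score(test_labels, scores)
```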
Step S7, acquiring the image test set of the intron sequence read-distribution patterns of the real data obtained in step S1, specifically: the real data sequence file APP obtained in step S1 is input into the prediction tools iREAD and IRFinder, yielding two intron retention prediction sets IR1 and IR2 respectively; IR1 and IR2 are mapped onto the independent intron set intron_cluster by the rule of maximal coordinate-interval overlap (see the sketch below), and their intersection IC is taken; IGV visualization, image cropping and merging are then applied to each intron coordinate in the intersection IC, thereby obtaining the real-data intron sequence read-distribution image test set real_test.
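A minimal sketch of the maximal-overlap mapping rule, with introns as (start, end) pairs; the names are illustrative:

```python
def overlap_len(a, b):
    """Length of the overlap of two (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def map_to_clusters(predicted, clusters):
    """Map each predicted intron to the intron_cluster interval with the
    maximum coordinate overlap; introns overlapping nothing are dropped."""
    mapped = set()
    for p in predicted:
        best = max(clusters, key=lambda c: overlap_len(p, c))
        if overlap_len(p, best) > 0:
            mapped.add(best)
    return mapped

# Intersection IC of the two tools' mapped prediction sets:
# ic = map_to_clusters(ir1, clusters) & map_to_clusters(ir2, clusters)
```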
Calculating splice site strengths in step S11 from the 5' end sequences of W1+N1 bases obtained in step S9 and the 3' end sequences of W2+N2 bases obtained in step S10, to obtain the 5' end and 3' end average splice site strength values, specifically: the 5' end sequence set score5ss obtained in step S9 and the 3' end sequence set score3ss obtained in step S10 are input into the MaxEntScan model and scored with the maximum entropy model, giving a strength value for each splice site; the splice site strengths of the 5' end sequences and of the 3' end sequences are then averaged to obtain the final 5' end and 3' end average splice site strength values (see the sketch below).
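A possible sketch of this scoring step, assuming the MaxEntScan Perl scripts (score5.pl for 5' sites, score3.pl for 3' sites) are installed locally; the helper name and the output parsing are assumptions, not the patent's code:

```python
import statistics
import subprocess
import tempfile

def average_splice_site_strength(seqs, script="score5.pl"):
    """Score a set of splice-site sequences with a MaxEntScan script and
    return their mean strength. Assumes the script reads a file of
    sequences and prints one 'sequence<TAB>score' line per input."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
        fh.write("\n".join(seqs))
        path = fh.name
    out = subprocess.run(["perl", script, path], capture_output=True,
                         text=True, check=True).stdout
    scores = [float(line.split("\t")[-1]) for line in out.splitlines() if line]
    return statistics.mean(scores)
```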
In step S12, the neural network intron retention prediction model established in step S5 is evaluated according to the 5' end and 3' end average splice site strength values obtained in step S11: the smaller these two values are, the better the prediction effect of the model.
The invention also provides a prediction method comprising the intron retention prediction model building method, and specifically comprises the following steps:
and S13, predicting the intron retention result by adopting the neural network intron retention prediction model obtained in the step S5.
According to the intron retention prediction model establishing method and the prediction method thereof, the deep-learning approach based on the intron-retention read-distribution pattern predicts intron retention in a more general and more interpretable manner; by combining deep model construction with transfer learning, knowledge from a large-scale image classification task is migrated to complete and improve the intron retention prediction task; meanwhile, for real data sets lacking a gold standard, the average splice site strength of the 5' end and 3' end sequences of the predicted retained introns is proposed as a measure of overall prediction quality. The method can therefore visualize and predict introns based on the intron-retention read-distribution pattern, with high reliability and good accuracy.
Drawings
FIG. 1 is a schematic flow chart of a method for building an intron-retention prediction model according to the present invention.
FIG. 2 is a diagram illustrating the result of visualizing the distribution pattern of intron-retained readings according to the present invention.
FIG. 3 is a schematic structural diagram of the deep learning model VGG16 of the present invention.
FIG. 4 is a flow chart of a prediction method according to the present invention.
Detailed Description
FIG. 1 is a schematic flow chart of a method for building an intron-retention prediction model according to the present invention: the invention provides a method for establishing an intron retention prediction model, which comprises the following steps:
S1, collecting simulated data and real data related to intron retention; specifically, the BEER algorithm is used to generate a simulated data sequence file SIMU30 containing a determined number of introns, with a sequencing depth of 30 million reads, a read length of 100 bases, 15000 generated genes and 69338 introns; and a real data sequence file APP from the Tau and APP mouse model study of the Accelerating Medicines Partnership for Alzheimer's Disease (AMP-AD) project, with a sequencing depth of 100 million reads and a read length of 101 bases;
S2, merging all independent introns in the genome and defining the result as a standard template; the invention is applied here to mouse data, so the genome may be the mouse genome; the template is defined by the following steps:
A. extracting the set of all independent introns, independent_intron, from the release-75 annotation GTF file of the GRCm38 mouse genome; an independent intron is defined as an intron that does not overlap with any exon of any transcript isoform;
wherein extracting the set of all independent introns independent_intron specifically comprises merging all exons within a chromosome and then deleting all exons from the gene regions, thereby obtaining all independent introns;
B. merging, gene by gene, the introns of the independent intron set independent_intron obtained in step A whose coordinate intervals overlap, to obtain the final independent intron set intron_cluster;
S3, acquiring an image data set of the read-distribution patterns of the intron sequences in the simulated data obtained in step S1, and preprocessing it to obtain a processed data set; specifically, the data set is obtained and processed with the following steps:
a. performing IGV visualization on each intron in the simulated data sequence file SIMU30 obtained in step S1 to obtain a primary visualization image;
b. because intron lengths vary greatly, two sequence visualization images of 40 bases each are saved per intron, one spanning 20 bases to each side of the 5' end and one spanning 20 bases to each side of the 3' end; the height of each visualization image is 100 mm, and the height of the bar chart representing base read abundance is normalized;
c. each single-segment visualization image obtained in step b is originally 621 pixels tall and 1150 pixels wide; the region spanning pixels 131-231 vertically and pixels 280-1070 horizontally is therefore cropped;
d. the two images cropped in step c are merged side by side to obtain the final processed data set; the visualization result is shown in FIG. 2;
S4, dividing the processed data set obtained in step S3 into a training set and a test set according to a set proportion; specifically, in the simulated data sequence file SIMU30 obtained in step S1, an intron whose total read count is greater than a first set value (e.g., 10), whose FPKM (Fragments Per Kilobase of transcript per Million mapped reads) is greater than a second set value (e.g., 0.3), and whose continuous read count is greater than a third set value (e.g., 1) is defined as a positive sample, and the remaining introns are negative samples; X2 (e.g., 5000) positive samples and X2 negative samples are then randomly drawn to form the final data set, which is divided into a training set and a test set according to the set proportion (e.g., 7:3); X2 is a positive integer.
S5, training the neural network model with the training set obtained in step S4, thereby obtaining the finally established neural network intron retention prediction model; the prediction model is preferably a VGG16 model, and when VGG16 is selected the model can be trained with the following steps:
(1) obtaining a VGG16 network structure model pre-trained on the ImageNet task (shown in FIG. 3) together with its weight parameter file; the network structure comprises 13 convolutional layers;
(2) loading the network and weights obtained in step (1) as a pre-trained network, but freezing it so that it does not participate in training;
(3) defining a binary classification network and training it on the training set obtained in step S4; the classification network has 3 layers in total: the first 2 are fully connected layers with 256 and 64 neurons respectively, each followed by a Dropout layer to prevent overfitting, with neuron drop probabilities of 0.5 and 0.3 respectively, and the last layer is a sigmoid layer for binary classification;
(4) after the classification network is trained, unfreezing the last 3 convolutional layers of the pre-trained network and training the classification network and the pre-trained network together on the training set obtained in step S4 again to fine-tune the weights;
(5) the parameters of the model training process are set as follows:
the total number of model parameters is 33 million, of which 26 million are trainable and 7 million are frozen;
the loss function is the binary cross-entropy loss, calculated as
$L = -\frac{1}{N}\sum_{i=1}^{N}\left[t_i \log y_i + (1 - t_i)\log(1 - y_i)\right]$
where $i$ indexes the $N$ samples, $t_i$ is the true label of sample $i$, and $y_i$ is the predicted label of sample $i$;
the optimizer is RMSprop, and the learning rate is 2e-5The number of iterations is 30;
the evaluation index is the accuracy, calculated as:
$\mathrm{accuracy} = \frac{\mathrm{TruePositive} + \mathrm{TrueNegative}}{\mathrm{AllSamples}}$
where TruePositive is the number of samples predicted positive that are truly positive, TrueNegative is the number of samples predicted negative that are truly negative, and AllSamples is the total number of samples;
a ReduceLROnPlateau callback monitors the evaluation metric every 2 iterations and, if it has not improved, reduces the learning rate by 50%;
and if the evaluation accuracy does not improve within 10 iterations, training is stopped early;
S6, calculating evaluation parameters (preferably the AUC value) of the neural network intron retention prediction model obtained in step S5 on the test set obtained in step S4;
S7, acquiring an image test set of the intron sequence read-distribution patterns of the real data obtained in step S1; specifically, the real data sequence file APP obtained in step S1 is input into the prediction tools iREAD and IRFinder, yielding two intron retention prediction sets IR1 and IR2 respectively; IR1 and IR2 are mapped onto the independent intron set intron_cluster by the rule of maximal coordinate-interval overlap, and their intersection IC is taken; IGV visualization, image cropping, merging and similar operations are then applied to each intron coordinate in the intersection IC, giving the real-data intron sequence read-distribution image test set real_test;
S8, predicting intron retention on the test set obtained in step S7 with the neural network intron retention prediction model obtained in step S5, thereby obtaining a predicted intron retention set;
S9, acquiring, for the predicted intron retention set obtained in step S8, the 5' end sequence of W1+N1 bases consisting of W1 bases on the exon side and N1 bases on the intron side;
S10, acquiring, for the predicted intron retention set obtained in step S8, the 3' end sequence of W2+N2 bases consisting of W2 bases on the exon side and N2 bases on the intron side;
S11, calculating splice site strengths from the 5' end sequences of W1+N1 bases obtained in step S9 and the 3' end sequences of W2+N2 bases obtained in step S10, thereby obtaining a 5' end average splice site strength value and a 3' end average splice site strength value; specifically, the 5' end sequence set score5ss obtained in step S9 and the 3' end sequence set score3ss obtained in step S10 are input into the MaxEntScan model and scored with the maximum entropy model, giving a strength value for each splice site; the splice site strengths of the 5' end sequences and of the 3' end sequences are then averaged to obtain the final 5' end and 3' end average splice site strength values;
S12, evaluating the neural network intron retention prediction model established in step S5 according to the 5' end and 3' end average splice site strength values obtained in step S11; specifically, the smaller these two values are, the better the prediction effect of the model.
The method of the invention was verified as follows:
the invention is evaluated on the simulation data SIMU30 and the real data set APP, while the tools compared with the invention are iREAD and IRFinder.
1) SIMU30 simulation dataset experimental analysis
On the 3000 test-set samples of the SIMU30 simulated data, the invention achieved an accuracy of 0.925 and an AUC of 0.975.
2) APP true data set experimental analysis
Because the real data lack a gold standard, on the one hand the prediction labels of other methods can only be used as stand-in true labels to test the AUC gap between the VGG16 model and those methods; on the other hand, custom evaluation indices can be defined to verify the effectiveness of the invention. For the AUC evaluation, the VGG16 model of the invention was compared with iREAD and IRFinder on the real-data image test set real_test; see Table 1. real_test contains 68326 samples. With iREAD as the gold standard there are 2816 positive and 65510 negative samples, and the AUC of the VGG16 model is superior to IRFinder's. With IRFinder as the gold standard (19044 positive and 49282 negative samples), the invention likewise outperforms iREAD.
TABLE 1. AUC evaluation results of the invention, iREAD and IRFinder.
In addition, the invention defines the 5' end and 3' end splice site strengths to measure the prediction effect of the VGG16 model; the lower the average splice site strength, the better the overall prediction effect of the model. The evaluation results for the average splice site strengths are shown in Table 2.
TABLE 2. Evaluation results of the average splice site strengths of the invention, iREAD and IRFinder.
The results in Table 2 show that, although the invention is slightly inferior to IRFinder and iREAD in average splice site strength, the average splice site strength of IRFinder and iREAD rises as the number of introns involved in the calculation increases, whereas that of the invention falls. This reflects that the VGG16 model designed by the invention is more robust than IRFinder and iREAD.
FIG. 4 is a schematic flow chart of the prediction method of the present invention: the prediction method provided by the invention comprises the intron retention prediction model building method, and specifically comprises the following steps:
S1, collecting simulated data and real data related to intron retention; specifically, the BEER algorithm is used to generate a simulated data sequence file SIMU30 containing a determined number of introns, with a sequencing depth of 30 million reads, a read length of 100 bases, 15000 generated genes and 69338 introns; and a real data sequence file APP from the Tau and APP mouse model study of the Accelerating Medicines Partnership for Alzheimer's Disease (AMP-AD) project, with a sequencing depth of 100 million reads and a read length of 101 bases;
S2, merging all independent introns in the genome and defining the result as a standard template, by the following steps:
A. extracting the set of all independent introns, independent_intron, from the release-75 annotation GTF file of the GRCm38 mouse genome; an independent intron is defined as an intron that does not overlap with any exon of any transcript isoform;
wherein extracting the set of all independent introns independent_intron specifically comprises merging all exons within a chromosome and then deleting all exons from the gene regions, thereby obtaining all independent introns;
B. merging, gene by gene, the introns of the independent intron set independent_intron obtained in step A whose coordinate intervals overlap, to obtain the final independent intron set intron_cluster;
S3, acquiring an image data set of the read-distribution patterns of the intron sequences in the simulated data obtained in step S1, and preprocessing it to obtain a processed data set; specifically, the data set is obtained and processed with the following steps:
a. performing IGV visualization on each intron in the simulated data sequence file SIMU30 obtained in step S1 to obtain a primary visualization image;
b. because intron lengths vary greatly, two sequence visualization images of 40 bases each are saved per intron, one spanning 20 bases to each side of the 5' end and one spanning 20 bases to each side of the 3' end; the height of each visualization image is 100 mm, and the height of the bar chart representing base read abundance is normalized;
c. each single-segment visualization image obtained in step b is originally 621 pixels tall and 1150 pixels wide; the region spanning pixels 131-231 vertically and pixels 280-1070 horizontally is therefore cropped;
d. the two images cropped in step c are merged side by side to obtain the final processed data set; the visualization result is shown in FIG. 2;
S4, dividing the processed data set obtained in step S3 into a training set and a test set according to a set proportion; specifically, in the simulated data sequence file SIMU30 obtained in step S1, an intron whose total read count is greater than a first set value (e.g., 10), whose FPKM (Fragments Per Kilobase of transcript per Million mapped reads) is greater than a second set value (e.g., 0.3), and whose continuous read count is greater than a third set value (e.g., 1) is defined as a positive sample, and the remaining introns are negative samples; X2 (e.g., 5000) positive samples and X2 negative samples are then randomly drawn to form the final data set, which is divided into a training set and a test set according to the set proportion (e.g., 7:3); X2 is a positive integer.
S5, training the neural network model with the training set obtained in step S4, thereby obtaining the finally established neural network intron retention prediction model; the prediction model is preferably a VGG16 model, and when VGG16 is selected the model can be trained with the following steps:
(1) obtaining a VGG16 network structure model pre-trained on the ImageNet task (shown in FIG. 3) together with its weight parameter file; the network structure comprises 13 convolutional layers;
(2) loading the network and weights obtained in step (1) as a pre-trained network, but freezing it so that it does not participate in training;
(3) defining a binary classification network and training it on the training set obtained in step S4; the classification network has 3 layers in total: the first 2 are fully connected layers with 256 and 64 neurons respectively, each followed by a Dropout layer to prevent overfitting, with neuron drop probabilities of 0.5 and 0.3 respectively, and the last layer is a sigmoid layer for binary classification;
(4) after the classification network is trained, unfreezing the last 3 convolutional layers of the pre-trained network and training the classification network and the pre-trained network together on the training set obtained in step S4 again to fine-tune the weights;
(5) the parameters of the model training process are set as follows:
the total number of model parameters is 33 million, of which 26 million are trainable and 7 million are frozen;
the loss function is the binary cross-entropy loss, calculated as
$L = -\frac{1}{N}\sum_{i=1}^{N}\left[t_i \log y_i + (1 - t_i)\log(1 - y_i)\right]$
where $i$ indexes the $N$ samples, $t_i$ is the true label of sample $i$, and $y_i$ is the predicted label of sample $i$;
the optimizer is RMSprop, and the learning rate is 2e-5The number of iterations is 30;
the evaluation index is the accuracy, calculated as:
$\mathrm{accuracy} = \frac{\mathrm{TruePositive} + \mathrm{TrueNegative}}{\mathrm{AllSamples}}$
where TruePositive is the number of samples predicted positive that are truly positive, TrueNegative is the number of samples predicted negative that are truly negative, and AllSamples is the total number of samples;
a ReduceLROnPlateau callback monitors the evaluation metric every 2 iterations and, if it has not improved, reduces the learning rate by 50%;
and if the evaluation accuracy does not improve within 10 iterations, training is stopped early;
S6, calculating evaluation parameters (preferably the AUC value) of the neural network intron retention prediction model obtained in step S5 on the test set obtained in step S4;
S7, acquiring an image test set of the intron sequence read-distribution patterns of the real data obtained in step S1; specifically, the real data sequence file APP obtained in step S1 is input into the prediction tools iREAD and IRFinder, yielding two intron retention prediction sets IR1 and IR2 respectively; IR1 and IR2 are mapped onto the independent intron set intron_cluster by the rule of maximal coordinate-interval overlap, and their intersection IC is taken; IGV visualization, image cropping, merging and similar operations are then applied to each intron coordinate in the intersection IC, giving the real-data intron sequence read-distribution image test set real_test;
S8, predicting intron retention on the test set obtained in step S7 with the neural network intron retention prediction model obtained in step S5, thereby obtaining a predicted intron retention set;
S9, acquiring, for the predicted intron retention set obtained in step S8, the 5' end sequence of W1+N1 bases consisting of W1 bases on the exon side and N1 bases on the intron side;
S10, acquiring, for the predicted intron retention set obtained in step S8, the 3' end sequence of W2+N2 bases consisting of W2 bases on the exon side and N2 bases on the intron side;
S11, calculating splice site strengths from the 5' end sequences of W1+N1 bases obtained in step S9 and the 3' end sequences of W2+N2 bases obtained in step S10, thereby obtaining a 5' end average splice site strength value and a 3' end average splice site strength value; specifically, the 5' end sequence set score5ss obtained in step S9 and the 3' end sequence set score3ss obtained in step S10 are input into the MaxEntScan model and scored with the maximum entropy model, giving a strength value for each splice site; the splice site strengths of the 5' end sequences and of the 3' end sequences are then averaged to obtain the final 5' end and 3' end average splice site strength values;
S12, evaluating the neural network intron retention prediction model established in step S5 according to the 5' end and 3' end average splice site strength values obtained in step S11; specifically, the smaller these two values are, the better the prediction effect of the model;
and S13, predicting the intron retention result by adopting the neural network intron retention prediction model obtained in the step S5.

Claims (14)

1. An intron retention prediction model establishing method, comprising the following steps:
S1, collecting simulated data and real data related to intron retention;
S2, merging all independent introns in the genome and defining the result as a standard template;
S3, acquiring an image data set of the read-distribution patterns of the intron sequences in the simulated data obtained in step S1, and preprocessing it to obtain a processed data set;
S4, dividing the processed data set obtained in step S3 into a training set and a test set according to a set proportion;
and S5, training the neural network model with the training set obtained in step S4, thereby obtaining the finally established neural network intron retention prediction model.
2. The intron retention prediction model establishing method according to claim 1, further comprising the following steps:
S6, calculating evaluation parameters of the neural network intron retention prediction model obtained in step S5 on the test set obtained in step S4;
S7, acquiring an image test set of the intron sequence read-distribution patterns of the real data obtained in step S1;
S8, predicting intron retention on the test set obtained in step S7 with the neural network intron retention prediction model obtained in step S5, thereby obtaining a predicted intron retention set;
S9, acquiring, for the predicted intron retention set obtained in step S8, the 5' end sequence of W1+N1 bases consisting of W1 bases on the exon side and N1 bases on the intron side;
S10, acquiring, for the predicted intron retention set obtained in step S8, the 3' end sequence of W2+N2 bases consisting of W2 bases on the exon side and N2 bases on the intron side;
S11, calculating splice site strengths from the 5' end sequences of W1+N1 bases obtained in step S9 and the 3' end sequences of W2+N2 bases obtained in step S10, thereby obtaining a 5' end average splice site strength value and a 3' end average splice site strength value;
and S12, evaluating the neural network intron retention prediction model established in step S5 according to the 5' end and 3' end average splice site strength values obtained in step S11.
3. The intron retention prediction model establishing method according to claim 2, wherein collecting simulated data and real data related to intron retention in step S1 specifically comprises: generating with the BEER algorithm a simulated data sequence file SIMU30 containing a determined number of introns, with a sequencing depth of 30 million reads, a read length of 100 bases, 15000 generated genes and 69338 introns; and collecting a real data sequence file APP from the Tau and APP mouse model study of the Accelerating Medicines Partnership for Alzheimer's Disease (AMP-AD) project, with a sequencing depth of 100 million reads and a read length of 101 bases.
4. The intron retention prediction model establishing method according to claim 3, wherein merging all independent introns in the genome as the standard template in step S2 is defined by the following steps:
A. extracting the set of all independent introns, independent_intron, from the release-75 annotation GTF file of the GRCm38 mouse genome; an independent intron is defined as an intron that does not overlap with any exon of any transcript isoform;
B. merging, gene by gene, the introns of the independent intron set independent_intron obtained in step A whose coordinate intervals overlap, to obtain the final independent intron set intron_cluster.
5. The intron retention prediction model establishing method according to claim 4, wherein extracting the set of all independent introns independent_intron in step A specifically comprises merging all exons within a chromosome and then deleting all exons from the gene regions, thereby obtaining all independent introns.
6. The intron retention prediction model establishing method according to claim 5, wherein acquiring the image data set of the intron sequence read-distribution patterns in the simulated data obtained in step S1 and preprocessing it into the processed data set in step S3 comprises the following steps:
a. performing IGV visualization on each intron in the simulated data sequence file SIMU30 obtained in step S1 to obtain a primary visualization image;
b. saving two sequence visualization images of 40 bases each per intron, one spanning 20 bases to each side of the 5' end and one spanning 20 bases to each side of the 3' end; the height of each visualization image is 100 mm, and the height of the bar chart representing base read abundance is normalized;
c. cropping, from each image obtained in step b, the region spanning pixels 131-231 vertically and pixels 280-1070 horizontally;
d. merging the two images cropped in step c side by side to obtain the final processed data set.
7. The intron retention prediction model establishing method according to claim 6, wherein dividing the processed data set obtained in step S3 into the training set and the test set according to the set proportion in step S4 specifically comprises: in the simulated data sequence file SIMU30 obtained in step S1, defining an intron whose total read count is greater than a first set value, whose FPKM is greater than a second set value and whose continuous read count is greater than a third set value as a positive sample, and the remaining introns as negative samples; randomly drawing X2 positive samples and X2 negative samples to form the final data set; and dividing the data set into a training set and a test set according to the set proportion; X2 being a positive integer.
8. The method according to claim 7, wherein the neural network model of step S5 is a VGG16 network structure model.
9. The intron retention prediction model establishing method according to claim 8, wherein training the neural network model with the training set obtained in step S4 to obtain the finally established neural network intron retention prediction model in step S5 comprises the following steps:
(1) obtaining a VGG16 network structure model pre-trained on the ImageNet task together with its weight parameter file; the network structure comprises 13 convolutional layers;
(2) loading the network and weights obtained in step (1) as a pre-trained network, but freezing it so that it does not participate in training;
(3) defining a binary classification network and training it on the training set obtained in step S4; the classification network has 3 layers in total: the first 2 are fully connected layers with 256 and 64 neurons respectively, each followed by a Dropout layer to prevent overfitting, with neuron drop probabilities of 0.5 and 0.3 respectively, and the last layer is a sigmoid layer for binary classification;
(4) after the classification network is trained, unfreezing the last 3 convolutional layers of the pre-trained network and training the classification network and the pre-trained network together on the training set obtained in step S4 again to fine-tune the weights;
(5) the parameters of the model training process are set as follows:
the total number of model parameters is 33 million, of which 26 million are trainable and 7 million are frozen;
the loss function is the binary cross-entropy loss, calculated as
$L = -\frac{1}{N}\sum_{i=1}^{N}\left[t_i \log y_i + (1 - t_i)\log(1 - y_i)\right]$
where $i$ indexes the $N$ samples, $t_i$ is the true label of sample $i$, and $y_i$ is the predicted label of sample $i$;
the optimizer is RMSprop, and the learning rate is 2e-5The number of iterations is 30;
the evaluation index is the accuracy, calculated as:
$\mathrm{accuracy} = \frac{\mathrm{TruePositive} + \mathrm{TrueNegative}}{\mathrm{AllSamples}}$
where TruePositive is the number of samples predicted positive that are truly positive, TrueNegative is the number of samples predicted negative that are truly negative, and AllSamples is the total number of samples;
a ReduceLROnPlateau callback monitors the evaluation metric every 2 iterations and, if it has not improved, reduces the learning rate by 50%;
and if the evaluation accuracy does not improve within 10 iterations, training is stopped early.
10. The intron retention prediction model establishing method according to claim 9, wherein calculating the evaluation parameters of the neural network intron retention prediction model on the test set obtained in step S4 in step S6 specifically comprises calculating the AUC value of the model on that test set.
11. The intron retention prediction model establishing method according to claim 10, wherein acquiring the image test set of the intron sequence read-distribution patterns of the real data obtained in step S1 in step S7 specifically comprises: inputting the real data sequence file APP obtained in step S1 into the prediction tools iREAD and IRFinder, obtaining two intron retention prediction sets IR1 and IR2 respectively; mapping IR1 and IR2 onto the independent intron set intron_cluster by the rule of maximal coordinate-interval overlap and taking their intersection IC; and applying IGV visualization, image cropping and merging to each intron coordinate in the intersection IC, thereby obtaining the real-data intron sequence read-distribution image test set real_test.
12. The intron retention prediction model establishing method according to claim 11, wherein calculating the splice site strengths in step S11 from the 5' end sequences of W1+N1 bases obtained in step S9 and the 3' end sequences of W2+N2 bases obtained in step S10, to obtain the 5' end and 3' end average splice site strength values, specifically comprises: inputting the 5' end sequence set score5ss obtained in step S9 and the 3' end sequence set score3ss obtained in step S10 into the MaxEntScan model and scoring them with the maximum entropy model, thereby obtaining a strength value for each splice site; and averaging the splice site strengths of the 5' end sequences and of the 3' end sequences to obtain the final 5' end and 3' end average splice site strength values.
13. The intron retention prediction model establishing method according to claim 12, wherein the neural network intron retention prediction model established in step S5 is evaluated in step S12 according to the 5' end and 3' end average splice site strength values obtained in step S11; specifically, the smaller these two values are, the better the prediction effect of the model.
14. A prediction method comprising the method for establishing the intron retention prediction model according to any one of claims 1 to 13, and specifically comprising the steps of:
and S13, predicting the intron retention result by adopting the neural network intron retention prediction model obtained in the step S5.
CN202010146731.2A 2020-03-05 2020-03-05 Intron retention prediction model establishment method and prediction method thereof Active CN111370055B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010146731.2A CN111370055B (en) 2020-03-05 2020-03-05 Intron retention prediction model establishment method and prediction method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010146731.2A CN111370055B (en) 2020-03-05 2020-03-05 Intron retention prediction model establishment method and prediction method thereof

Publications (2)

Publication Number Publication Date
CN111370055A true CN111370055A (en) 2020-07-03
CN111370055B CN111370055B (en) 2023-05-23

Family

ID=71208615

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010146731.2A Active CN111370055B (en) 2020-03-05 2020-03-05 Intron retention prediction model establishment method and prediction method thereof

Country Status (1)

Country Link
CN (1) CN111370055B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20220082545A * 2020-12-10 2022-06-17 Chung-Ang University Industry-Academic Cooperation Foundation Method for diagnosing degenerative brain disease by detecting intron retention using transcriptome analysis
WO2023238973A1 * 2022-06-10 2023-12-14 Chung-Ang University Industry-Academic Cooperation Foundation Method for diagnosing degenerative brain disease by detecting intron retention using transcriptome analysis

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999066302A2 (en) * 1998-06-17 1999-12-23 Musc Foundation For Research Development Recognition of protein coding regions in genomic dna sequences
US20030077586A1 (en) * 2001-08-30 2003-04-24 Compaq Computer Corporation Method and apparatus for combining gene predictions using bayesian networks
WO2008097632A2 * 2007-02-08 2008-08-14 Jivan Biologics, Inc. Methods for determining splice variant types and amounts
US20120185172A1 (en) * 2011-01-18 2012-07-19 Barash Joseph Method, system and apparatus for data processing
CN105975809A * 2016-05-13 2016-09-28 Wankangyuan (Tianjin) Gene Technology Co., Ltd. SNV detection method affecting RNA splicing
CN107849547A * 2015-05-16 2018-03-27 Genzyme Corporation Gene editing of deep intronic mutations
CN110010201A * 2019-04-16 2019-07-12 Shandong Agricultural University RNA alternative splicing site recognition method and system
WO2019226804A1 (en) * 2018-05-23 2019-11-28 Envisagenics, Inc. Systems and methods for analysis of alternative splicing
CN110800062A * 2017-10-16 2020-02-14 Illumina, Inc. Deep convolutional neural network for variant classification

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999066302A2 (en) * 1998-06-17 1999-12-23 Musc Foundation For Research Development Recognition of protein coding regions in genomic dna sequences
US20030077586A1 (en) * 2001-08-30 2003-04-24 Compaq Computer Corporation Method and apparatus for combining gene predictions using bayesian networks
WO2008097632A2 * 2007-02-08 2008-08-14 Jivan Biologics, Inc. Methods for determining splice variant types and amounts
US20120185172A1 (en) * 2011-01-18 2012-07-19 Barash Joseph Method, system and apparatus for data processing
CN107849547A * 2015-05-16 2018-03-27 Genzyme Corporation Gene editing of deep intronic mutations
CN105975809A * 2016-05-13 2016-09-28 Wankangyuan (Tianjin) Gene Technology Co., Ltd. SNV detection method affecting RNA splicing
CN110800062A * 2017-10-16 2020-02-14 Illumina, Inc. Deep convolutional neural network for variant classification
WO2019226804A1 (en) * 2018-05-23 2019-11-28 Envisagenics, Inc. Systems and methods for analysis of alternative splicing
CN112912961A * 2018-05-23 2021-06-04 Envisagenics, Inc. Systems and methods for analysis of alternative splicing
CN110010201A * 2019-04-16 2019-07-12 Shandong Agricultural University RNA alternative splicing site recognition method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HONG-DONG LI et al.: "iREAD: a tool for intron retention detection from RNA-seq data" *
XING Yongqiang; ZHANG Lirong; LUO Liaofu; CHEN Wei: "Prediction of alternative splice sites for cassette exons and intron retention in the mouse genome" *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20220082545A * 2020-12-10 2022-06-17 Chung-Ang University Industry-Academic Cooperation Foundation Method for diagnosing degenerative brain disease by detecting intron retention using transcriptome analysis
KR102605084B1 2020-12-10 2023-11-24 Chung-Ang University Industry-Academic Cooperation Foundation Method for diagnosing degenerative brain disease by detecting intron retention using transcriptome analysis
WO2023238973A1 * 2022-06-10 2023-12-14 Chung-Ang University Industry-Academic Cooperation Foundation Method for diagnosing degenerative brain disease by detecting intron retention using transcriptome analysis

Also Published As

Publication number Publication date
CN111370055B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN110785814A (en) Predicting quality of sequencing results using deep neural networks
CN112232413B (en) High-dimensional data feature selection method based on graph neural network and spectral clustering
CN106909901A (en) The method and device of detection object from image
EP2320343A2 (en) System and process for validating, aligning and reordering one or more genetic sequence maps using at least one ordered restriction map
US20190180844A1 (en) Method for deep learning-based biomarker discovery with conversion data of genome sequences
CN112687327B (en) Cancer survival analysis system based on multitasking and multi-mode
US11461584B2 (en) Discrimination device and machine learning method
US20220277811A1 (en) Detecting False Positive Variant Calls In Next-Generation Sequencing
CN111370055A (en) Intron retention prediction model establishing method and prediction method thereof
JP2014505935A (en) DNA sequence data analysis method
CN110110663A (en) A kind of age recognition methods and system based on face character
CN113228191A (en) System and method for identifying chromosomal abnormalities in embryos
CN114822698B (en) Knowledge reasoning-based biological large sample data set analysis method and system
CN112669899A (en) 16S and metagenome sequencing data correlation analysis method, system and equipment
CN111180013B (en) Device for detecting blood disease fusion gene
EP4016533A1 (en) Method and apparatus for machine learning based identification of structural variants in cancer genomes
CN111045920B (en) Workload-aware multi-branch software change-level defect prediction method
CN116959585A (en) Deep learning-based whole genome prediction method
CN114446393B (en) Method, electronic device and computer storage medium for predicting liver cancer feature type
CN115831219A (en) Quality prediction method, device, equipment and storage medium
CN115167965A (en) Transaction progress bar processing method and device
CN113782092A (en) Method and device for generating life prediction model and storage medium
KR102072894B1 (en) Abnormal sequence identification method based on intron and exon
EP3971902A1 (en) Base mutation detection method and apparatus based on sequencing data, and storage medium
CN114705148B (en) Road bending point detection method and device based on secondary screening

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant