CN113450875B - BRNN model and statistical test-based RNA m6A modification site identification method - Google Patents

BRNN model and statistical test-based RNA m6A modification site identification method Download PDF

Info

Publication number
CN113450875B
CN113450875B CN202110704166.1A CN202110704166A CN113450875B CN 113450875 B CN113450875 B CN 113450875B CN 202110704166 A CN202110704166 A CN 202110704166A CN 113450875 B CN113450875 B CN 113450875B
Authority
CN
China
Prior art keywords
site
reads
modification
statistical test
identifying
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110704166.1A
Other languages
Chinese (zh)
Other versions
CN113450875A (en
Inventor
王帅
许志晖
王海彦
王舒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Agricultural University
Original Assignee
Nanjing Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Agricultural University filed Critical Nanjing Agricultural University
Priority to CN202110704166.1A priority Critical patent/CN113450875B/en
Publication of CN113450875A publication Critical patent/CN113450875A/en
Application granted granted Critical
Publication of CN113450875B publication Critical patent/CN113450875B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30Detection of binding sites or motifs
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50Mutagenesis
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/20Polymerase chain reaction [PCR]; Primer or probe design; Probe optimisation
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • Data Mining & Analysis (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Evolutionary Computation (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses an RNA m based on BRNN model and statistical test 6 A, an identification method of modification sites is combined with an eLIP experimental procedure to construct a mecLIP standard library and a library with simplified procedure; analysis of mecip data at m 6 After the truncation and mutation features upstream and downstream of the A site, related software was developed for identifying m at the single base level 6 A site, and more simply and rapidly prepares m 6 A single site identification library.

Description

RNA m based on BRNN model and statistical test 6 Identification method of A modification site
Technical Field
The present invention relates to an RNAm 6 A modification site identification method, in particular to an RNA m based on BRNN model and statistical test 6 A method for identifying A modification site.
Background
m 6 The A modification is widely distributed on mRNA and lncRNA, has various biological functions, and relates to multiple aspects of mRNA shearing, nuclear transport, stability, translation efficiency and the like. However, m 6 A is not as studied as 5mC or m 5 Like C, unmethylated cytosines can be converted to uracils using bisulfite treatment to accurately identify and quantify methylation modification levels by identifying base transitions in sequencing results. With respect to m 6 Development of A functional studies on how to correctly identify m at the single base level 6 The A site has become a major factor limiting intensive research.
M at single base level 6 The high throughput methods identified in a have three main types: MICLIP-seq (m) 6 Aindivididual-nucleotide resolution cross-linking and immunoprecipitation), mazter-seq (MazF digestion), and DART-seq (deamination adjacent to RNA modification targets). Wherein the Mazter-seq is limited to the MazF cleavage site ACA sequence, and the DART-seq is limited by the recognition of the m6A pattern sequence by the fusion protein. The MICLIP-seq is not limited by cleavage efficiency and fusion protein recognition site, and can theoretically recognize all possible m6A sites based on the position of truncations and mutations generated by amino acid residues. However, the complex procedure reduces the recovery of final RNA and library complexity, and requires an increase in the number of PCR amplification cycles to compensate, resulting in very little data available after sequencing.
Disclosure of Invention
The invention aims to: in order to overcome the defects in the prior art, the invention aims to provide an RNA m based on a BRNN model and a statistical test 6 The identification method of A modification site can accurately identify m at single base level aiming at the meclIP data 6 A modification site.
The technical scheme is as follows: RNA m based on BRNN (bidirectional recurrent neural network) model and statistical test 6 A method for identifying A modification site comprises the following steps,
(1) Selecting samples, and dividing the samples into a training set, a verification set and a test set;
(2) Constructing a bi-directional recurrent neural network for use with and without m 6 The sequence of A is classified;
(3) Training the constructed bidirectional circulating neural network by utilizing the sharp points and the mutation characteristics of sequences in the training set and the verification set, and optimizing the parameters of a neural network model;
(4) Testing the optimized bidirectional circulation network by using the base sequence samples in the test set, and counting the identification result;
(5) Will be identified as protection m 6 Combining the base sequences of A according to the difference of the coverage of the edges reads;
(6) Statistical test on the cut-off number of A and 1nt downstream thereof to obtain m 6 And A, identifying the unit point.
In step (1), m is extracted in RIP-seq (methyl-RNA immunoprecipitation) 6 CTKtools (CLIP tools kit) identified m in A modification interval 6 Sequence truncation and mutation information around the A site are taken as positive examples, sequences which do not contain A are randomly selected as negative examples,
negative samples were randomly extracted from the protein-encoding genes and met the following requirements: 1) Not in the identified m 6 A site of 10 nt; 2) m is m 6 A was not contained within 10nt upstream and downstream of the A site.
Wherein, m is used for composing positive examples and identified by CTKtools 6 The a site needs to meet the following two conditions: 1) M at the meRIP 6 The modification interval A is within; 2) Statistics of corresponding m of meRIP data 6 The coverage of IP reads and Input reads in the A modification interval is reserved, and m of IP reads/Input reads is more than or equal to 2 is reserved 6 A site A.
The identification method further comprises the steps of extracting and standardizing sample characteristics, counting the number of reads truncations and mutations of positive and negative samples in the MECLIP-gelfree data, and reserving if the number of reads truncations or mutations on the sample sequence is greater than 11.
In step (5), by calculating m 6 And merging the reads coverage on the left and right boundaries of the sequence A.
The two-way circulation network is constructed and trained by using samplesM identified from CTKtools 6 The sequences surrounding the A site and the sequences not containing A were used as positive and negative examples, respectively. The cut-off and mutation characteristics between the two sequences are relatively close, so that a neural network model with relatively high classification accuracy is relatively difficult to distinguish and train. Therefore, on the basis of the method, the statistical test is carried out on the truncated characteristics of the A and the downstream 1nt thereof to further improve the identification accuracy of the m6A unit.
The principle of the invention: the method for testing the meCLIP does not use radioactive element development, and reduces the requirement of the prior art on the test conditions. The process does not remove free RNA, and reduces the requirement on a sample thereof. However, these RNA fragments affect the identification of the m6A site, producing false positive results. In view of the characteristics of the library data, the invention uses a bidirectional circulation network to judge the sequence containing m6A, thereby reducing the influence of free RNA on the subsequent single-point judgment with high probability. Finally, the single base resolution is achieved by detecting the cut-off significance of A and 1nt downstream of A.
The beneficial effects are that: the invention combines the eLIP (enhanced cross-linking and immunoprecipitation) experimental flow to construct the meclIP (m) 6 A enhanced cross-linking and immunoprecipitation) standard library and a simplified procedure library; analysis of mecip data at m 6 After the truncation and mutation features upstream and downstream of the A site, related software was developed for identifying m at the single base level 6 A site, and more simply and rapidly prepares m 6 A single site identification library.
In view of the shortcomings of too complex preparation flow of the MICLIP library and small available data, the invention combines the reported eCLIP experimental flow to prepare the meCLIP-gel and meCLIP-gelfree libraries, and analyzes and reveals the cutting and mutation characteristics of m6A locus in the type of library. Aiming at the characteristics of the meclIP data, the invention develops a method combining BRNN and statistical test for identifying m 6 A site is the subsequent m 6 And A, a foundation is laid for related research.
Drawings
FIG. 1 is a flow chart of the single base m6A site identification principle of the present invention.
FIG. 2 is a graph of m 6 Class AA schematic diagram of the architecture of the device; wherein, the graph A is m 6 A working flow of the classifier is shown in a diagram B, and the classifying capability of the classifier under different architectures and parameters is shown in a diagram B.
FIG. 3 is an evaluation of the analysis results of recurrent neural network junctions and statistical tests; FIG. A is a comparative m 6 Reads coverage of cluster a results, light grey: no cluster of m6A was identified; deep gray: identifying a cluster of m 6A; FIG. B is m 6 Saturation curve of a site.
FIG. 4 is m 6 Distribution characteristics of the A site and a repetitive result among samples; wherein, the graph A is m 6 Distribution density of a over transcripts; the light curve is the library prepared by MaximaH, and the dark curve is the library prepared by Superscript III; FIG. B is m 6 Distribution characteristics of a at the transcriptome level; panel C is m common to the meCLIP-gelfree data 6 A site A; FIG. D is m 6 A site and m 6 Intersection of apeaks, MH represents MaximaH; SS denotes Superscript III.
Detailed Description
The present invention will be described in further detail with reference to examples.
RNA m based on BRNN model and statistical test 6 The method for identifying the A modification site comprises the following steps:
1. preparation of meCLIP library
(1) mRNA was extracted and incubated with antibody;
(2) Tiling mRNA incubated with the antibody into a 6-hole cell culture dish, and transferring to an ultraviolet crosslinking instrument;
(3) Ultraviolet light of 254mm wavelength was used at 300mJ/cm 2 After 2 times of energy crosslinking, flushing twice with fastpap buffer;
(4) Adding cross-linked mRNA fragments under a FastAP reaction system, incubating for 15min at 37 ℃, and removing phosphate groups at the 5 'and 3' ends;
preparation of FastAP reaction system:
(5) Adding the crosslinked mRNA fragments into a T4 PNK reaction system, and incubating for 20min at 37 ℃;
preparing a T4 PNK reaction system:
(6) After washing the cross-linked mRNA fragments 5 times with IP buffer, washing 2 times with 1x RNA ligase buffer, rapid centrifugation, removing residual buffer;
(7) Adding the mRNA fragment into an RNA ligase system, incubating for 75min at room temperature, and adding a 3' -terminal connector;
RNA linker sequence: ACAAGCCAGAUCGGAAAGGCGUCGUGUAG
The RNA-linked enzyme reaction system is as follows:
(8) Adding 7.5 μl of 4 XNuPAGE buffer, 3 μl of 1M DTT and 20 μl of 1 XRNA protective agent, shaking, mixing, and incubating at 70deg.C for 10min;
(9) Adding standard samples into holes 2, 4, 6 and 8 of the NuPAGE gel, adding 1/10 sample into hole 3 as a control, adding 9/10 sample into hole 7, and leaving the rest holes empty to prevent pollution;
(10) 150V voltage, electrophoresis for 75min or until the marking dye runs to the bottom of the gel;
(11) Taking out the NuPAGE gel, placing the NuPAGE gel into membrane transferring liquid, balancing for 10min, assembling the membrane transferring clamp with Whatman filter paper and NC membrane, and then carrying out 120V electrophoresis for 90min;
(12) Taking out NC membrane, cutting the membrane above 50kDa position with blade at 7 # hole position, cutting into slices of about 1-2mm size, and placing into a new enzyme-free 1.5ml EP tube;
(13) 200 μl proteinase K is added into the EP tube, and after sufficient shaking, the antibody is digested by incubation for 20min at 37deg.C;
(14) Adding 200 μl urea solution prepared from proteinase K buffer solution, shaking thoroughly, mixing well, incubating at 37deg.C for 20min, and dissociating mRNA;
(15) Sucking 400. Mu.l of the mixed solution of proteinase K and mRNA, transferring the mixed solution to a new EP tube, adding 400. Mu.l of the mixed solution of phenol, chloroform and isoamyl alcohol, shaking and mixing the mixed solution evenly, and incubating the mixed solution at 37 ℃ for 5min;
(16) The EP tube was placed in a centrifuge and centrifuged at 13000g for 15min to separate mRNA;
(17) The 80% supernatant was aspirated and transferred to a new 1.5ml EP tube;
(18) Adding 40 μl of 3M sodium acetate, 2.5 times of absolute ethanol and 2.5 μl of glycogen, shaking, mixing, and standing overnight at-80deg.C;
(19) Taking out a sample precipitated overnight, turning over and mixing uniformly, and centrifuging at 15,000g for 25min; the supernatant was discarded, 1ml of 75% ethanol was added for washing, and the mixture was centrifuged at 15,000g for 15min; discarding ethanol, standing at room temperature for 5min to volatilize the rest ethanol, and adding 20 μl of ultrapure water to dissolve and precipitate;
(20) Mixing 10 μl of mRNA and 0.5 μl of AR17 primer, and incubating in a metal bath at 65deg.C for 5min;
(21) Adding a mixture of mRNA and AR17 into the inversion system, and then placing the mixture into a PCR instrument for incubation at 55 ℃ for 45min, and inverting one strand of cDNA;
AR17 linker: ACACGAGGCTCCTTCCGA
cDNA one strand inversion system:
(22) Adding 2 μl RNaseA/H mixture, and incubating at 37deg.C for 15min; then adding 3 μl of 1M NaOH to react for 15min at 70 ℃ to remove template RNA;
(23) Adding 10 μl of MyONE beads, blowing and mixing, standing for 5min, and removing supernatant with a magnetic rack; after 3 washes with 80% ethanol, the cDNA on the beads was eluted with 5. Mu.l of Tris-HCl pH 7.5 at 5mM and transferred to a new 0.2ml PCR tube;
(24) Adding a random 10N connector into the PCR tube, and heating the PCR tube for 2min at 75 ℃ in a metal bath to break a secondary structure;
(25) Preparing a cDNA two-chain synthesis system, adding the cDNA two-chain synthesis system into a PCR tube, and incubating overnight at room temperature;
cDNA two-strand synthesis System:
(26) Mu.l MyONE beads were added to purify the cDNA, and 25. Mu.l Tris-HCl, pH 7.5, 5mM was used to elute the cDNA;
(27) Taking 2 μl of cDNA as a template, qPCR amplifying for 40 cycles to identify the concentration;
qPCR reaction system:
(28) PCR amplification library, the cycle number is determined according to Ct value of qPCR amplification;
PCR reaction system:
(29) Purifying the PCR amplification product by using Ampure XP magnetic beads;
(30) After agarose electrophoresis, a band of 175-350bp is cut off to recycle the product;
(31) Agarose electrophoresis was repeated to determine if the library contained linker contamination; if there is linker contamination, the PCR amplification to agarose recovery step needs to be repeated.
2. Analysis of sequencing data
The meCLIP library was transcriptome sequenced using NovaSeq, read at 2X 150bp in length. Trim Galore removes the linker and base sequences with a mass below 20. Reads greater than 20nt in length after removal of the 7nt tag sequence and 10N random sequence from the mecip library were used for alignment.
According to the eLIP article method, only read2 of the mecLIP library was aligned using HISAT2 (v2.1.0). HISAT2 software parameters were adjusted to-no-softlip, -score-min-L, -0.6, -0.6 and-max-intronlen 20000, allowing more mismatches while reducing chimeric mis-alignments. Reads aligned to unique positions of the genome were extracted and PCR repeats were filtered based on the 10N random sequence at the beginning of read2 and aligned start and stop sites. And counting the 5' end position and mutation site information of reads after filtering.
3. Single base level m 6 Identification of A site
(1) Extraction of training samples
The training required samples comprise positive examples and negative examples, wherein the positive examples consist of two parts: m identified by CTKtools 6 M identified by A site and MazF 6 A site A.
M identified by CTKtools for constituting a positive example 6 The a site needs to meet the following two conditions: 1) M at the meRIP 6 The modification interval A is within; 2) Statistics of corresponding m of meRIP data 6 The coverage of IP reads and Input reads of the A modification interval is reserved, and the IP reads/Input reads are reserved>M of=2 6 A site A. These m are then added 6 The A site was divided into three parts according to the enrichment degree of reads (IP reads/Input reads), 4000 were extracted per part. M identified by MazF 6 The shearing rate of the A site is required to be less than 0.7, and the A site is divided into four groups according to the shearing rate from high to low, and each group is randomly extracted2500, the remaining sites were extracted on average from the other groups if some group was less than 2500 sites. To increase the number of training samples, we use 1nt as step length, at m 6 Extracting sequences with length of 11nt in the interval of 10nt at the upstream and downstream of the A locus to obtain 242000 m-contained sequences 6 Positive samples of site a.
Negative samples were randomly extracted from the protein-encoding genes, requiring two conditions: 1) Not in the identified m 6 A site of 10 nt; 2) m is m 6 A was not contained within 10nt upstream and downstream of the A site.
5000 samples not included in the training set were extracted as the test set according to the selection positive method.
(2) Sample feature extraction and normalization
And counting the number of reads truncations and mutations of positive and negative samples in the meCLIP-gelfree data, and if the number of reads truncations or mutations on the sample sequence is greater than 11, reserving. To reduce the effect of the differences in gene expression of the sample on the data characteristics, reads truncation and mutation at each base of the sequence require normalization according to equation 3.1.
(3) BRNN training
To avoid the problem of gradient extinction during training, a gating loop unit is used to construct BRNN. The input data to the neural network is a two-dimensional vector of length 11 containing truncations and mutations at each base. Network structures with 2, 4 and 8 hidden layers and 16, 32 and 64 neuron combinations were used for test screening of the best model. To avoid gradient explosion, the gradient clipping value is set to 1. As a classifier we only input the feature variables of the neuron outputs of the last hidden layer of the sequence into the fully connected layer. It outputs two variables, and obtains classification results after transformation using Softmax activation function. In addition, since training sample data are less, we also test that the converted feature variable output by the last hidden layer neuron is classified by using SVM, so as to obtain a more stable classification result.
(4) m is contained in mecip 6 Identification and pooling of A modification intervals
Extracting the sequences of 10nt upstream and downstream of all points A of the meclIP data, inputting the sequences into an optimal model found through training to predict and distinguish m 6 A sequence and non-m 6 A sequence A.
Calculating m 6 The coverage of reads on the left and right boundaries of the A sequence is only such that the ratio of the number of reads on the left and right boundaries of adjacent or overlapping sequences, separated by less than 10nt, is above 0.9.
(5)m 6 Identification of A site
At m 6 Using scan statistics on a-sequence clusters to identify significant m 6 A site A.
First assume that at one m 6 Cluster A [ alpha, beta ]]The truncation of the upper reads obeys poisson distribution and the average value of unit intervals isIn any interval [ x, x+w ]]The probability of occurrence of truncation above can be obtained from equation 3.2:
where Yx (w) represents the number of truncations over the interval of length w. The number of truncations is independently distributed over each subinterval, then the scan statistic is defined as:
S(w)=max(Yx(w);α≤x≤β-w) (3.3)
the probability of more than k truncations occurring in a subinterval of length w is:
wherein L is m 6 The length of cluster a. The probability of an event being less than k in one subinterval is:
the probability in equation 3.5 can be calculated from the approximation equation 3.6 derived by Naus:
wherein,and->The following two formulas may be used for calculation:
in equation 3.6
In the above-mentioned formula(s),
when k is<At the time of 0, the temperature of the liquid,
the method specifically comprises the following steps:
1.m 6 identification of A sequences
m 6 The mutation and truncation caused by A are mainly in the range of 5nt above and below it, thus selecting 11nt long and must contain a known m 6 The sequence of a was used to construct positive examples samples. Negative examples the selection of sequences without any A prevents the inclusion of m in the sequence that is not identified 6 A, and interference is generated. Positive and negative samples require data normalization and filtering before they are input into BRNN for training, requiring at least more than 11 truncations or mutations in these sequences.
Then, the truncations and mutations at each base of the training sample are sequentially input to BRNN (m can be extracted simultaneously 6 A upstream and downstream base information). Although the 11nt interval of the sample in the positive example necessarily contains a known m 6 A site, but other 10 bases are m 6 Possibility of A (m not identified by previous methods 6 A site a). This implicit state of the positive example sample can make it difficult to train the BRNN to achieve m at single base level based on the output of neurons 6 The purpose of A site identification. Therefore, we just consider BRNN as a classifier, and use the output value of the last neural network to classify the input sequence (determine that the interval is m-containing 6 A is also absent). Due toThe training sample data set is smaller, and in order to improve the classification accuracy, we extract the BRNN to obtain m after conversion 6 The A features are input into the SVM for classification, and the difference between the effect of the BRNN-SVM combined classifier and the result of classification by the BRNN is compared (figure 2A).
As can be seen from fig. 2B, the excessive number of neurons and hidden layers of the BRNN model may instead result in a reduced classification ability of the model. In a network architecture with 2 or 4 hidden layers, the number of neurons does not have much impact on the classification ability of the model. This is probably because of m 6 The small amount of information contained in the input features of the a-site results in only a certain number of neurons being needed to fit the features. The test sample classification accuracy was highest (up to 85.6%) among the BRNN models trained with 4 hidden layers and 32 neurons. To lift the model pair m 6 The classification ability of the a-sequence, the classification features extracted by BRNN were input into the SVM classifier for training, but the results showed no improvement in the classification effect (fig. 2B). When comparing the performance of classifiers using different linear kernel and gaussian radial kernel parameters, it was found that the classification ability of the SVM was better only when outliers were less considered. This means that even m 6 The features of the a site are transformed by BRNN, but there are some positive and negative examples where the features are too similar, resulting in difficulty in finding support vectors to distinguish between these sites.
Although the trained optimal BRNN model was used to predict the 11nt long A-containing sequence on the genome, the m was identified 6 A sequences have a certain error rate and have not been identified to a single base level, but can be reduced in having a sequence identical to m 6 Non-m of A sequence mutation and truncation of similar features 6 The interference of the A sequence improves the accuracy of the subsequent statistical test.
Taking into account m 6 The coverage of the A sequence is positively correlated with its methylation level (characteristic of co-immunoprecipitation experiments), if an adjacent m is identified 6 The A sequence coverage is similar, and the characteristics of mutation and truncation should be similar. Therefore, the predicted m are less than 10nt spacing distance and the difference of reads coverage of the edge is small 6 The a sequences are combined into clusters. 44838 and 80233 m were identified in two replicates of mecip-gelfree, respectively 6 Cluster a. Subsequent significance testing of the A site on these clusters and the number of truncations and mutations of 1nt downstream using scanning statistics to determine m at the single base level 6 A site (q value less than 0.05 identified as m 6 A site a).
13890 and 35777 m were identified in library data prepared from MaximaH and Superscript III, respectively 6 A site A, about 30% -45% of m 6 Cluster a can identify m 6 A site A. In view of this, whether there is m in the cluster after checking according to statistics 6 A site was classified into two classes, and the coverage of reads on clusters was analyzed with m 6 And A, relation of detection rate. FIG. 3A shows that m is not detected 6 The reads coverage of clusters of A sites is overall lower and is comparable to that of clusters with m 6 There was a significant difference in the reads coverage of clusters detected at the a site (p<2.2e -16 Wilcoxon ranked sum test). Although m at single base level can be identified using scanning statistics 6 A site, but this also results in only m with a high relative methylation level 6 The A site can be detected. The available data of the mecip-gelfree library prepared in Superscript III is more, so that it is extracted in multiple portions according to different ratios, and m is detected 6 The A site was subjected to saturation analysis. As can be seen from FIG. 3B, M is the time when there are only 4-8M reads 6 The detected number of A sites slowly increases with the increase of data; after the data exceeds 8M, M 6 The detected number of A sites increases rapidly; by the time the data exceeded 30M, the increase slowed, indicating that the detected M6A site was approaching saturation.
2. M identified using model 6 Analysis of A site
(1) Analysis of m6A site in mecLIP-gelfree data
M identified herein by the model 6 The reproducibility of the a site across samples and distribution over transcriptomes were compared.
As can be seen from FIG. 4A, the model identifies m 6 The A locus is mainly concentrated in 3' UTR, which accords with the current literature reported Arabidopsis m 6 A distribution feature. Statistics of different Gene signaturesM of region 6 Site a shows that both libraries have about 79% m 6 The A site is located in the 3'UTR, which is 18% or more on the CDS, whereas the 5' UTR is only about 2.5% (FIG. 4B). M identified in the two-part mecip-gelfree data 6 9222A sites are shared and account for m identified in the MaximaH-prepared meCLIP library data 6 More than 66% of the A site, the reproducibility of the results was better (FIG. 4C). These m are set 6 A site and a known m 6 The overlap of the A modification regions was found to be over 75% of the A modification regions located at m 6 Within the a modification interval (fig. 4D).

Claims (6)

1. RNA m based on BRNN model and statistical test 6 The method for identifying the A modification site is characterized by comprising the following steps of: comprises the steps of,
(1) Selecting samples, and dividing the samples into a training set, a verification set and a test set;
(2) Constructing a bi-directional recurrent neural network for use with and without m 6 The sequence of A is classified;
(3) Training the constructed bidirectional circulating neural network by utilizing the sharp points and the mutation characteristics of sequences in the training set and the verification set, and optimizing the parameters of a neural network model;
(4) Testing the optimized bidirectional circulation network by using the base sequence samples in the test set, and counting the identification result;
(5) Will be identified as protection m 6 Combining the base sequences of A according to the difference of the coverage of the edges reads;
(6) Statistical test on the cut-off number of A and 1nt downstream thereof to obtain m 6 And A, identifying the unit point.
2. RNA m based on BRNN model and statistical test as claimed in claim 1 6 The method for identifying the A modification site is characterized by comprising the following steps of: in step (1), m is extracted in RIP-seq 6 M identified by CTKtools within the A modification interval 6 Sequence truncation and mutation information around the A site are taken as positive examples, and sequences without A are randomly selected as negative examples.
3. RNA m based on BRNN model and statistical test according to claim 2 6 The method for identifying the A modification site is characterized by comprising the following steps of: negative samples were randomly extracted from the protein-encoding genes and met the following requirements: 1) Not in the identified m 6 A site of 10 nt; 2) m is m 6 A was not contained within 10nt upstream and downstream of the A site.
4. RNA m based on BRNN model and statistical test according to claim 2 6 The method for identifying the A modification site is characterized by comprising the following steps of: m identified by CTKtools for constituting a positive example 6 The a site needs to meet the following requirements: 1) M at the meRIP 6 The modification interval A is within; 2) Statistics of corresponding m of meRIP data 6 The coverage of IP reads and Input reads in the A modification interval is reserved, and m of IP reads/Input reads is more than or equal to 2 is reserved 6 A site A.
5. RNA m based on BRNN model and statistical test according to claim 2 6 The method for identifying the A modification site is characterized by comprising the following steps of: and the method also comprises the steps of extracting and standardizing sample characteristics, counting the number of reads truncations and mutations of positive and negative samples in the meCLIP-gelfree data, and reserving if the number of reads truncations or mutations on the sample sequence is more than 11.
6. RNAm according to claim 1 based on BRNN model and statistical test 6 The method for identifying the A modification site is characterized by comprising the following steps of: in step (5), by calculating m 6 And merging the reads coverage on the left and right boundaries of the sequence A.
CN202110704166.1A 2021-06-24 2021-06-24 BRNN model and statistical test-based RNA m6A modification site identification method Active CN113450875B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110704166.1A CN113450875B (en) 2021-06-24 2021-06-24 BRNN model and statistical test-based RNA m6A modification site identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110704166.1A CN113450875B (en) 2021-06-24 2021-06-24 BRNN model and statistical test-based RNA m6A modification site identification method

Publications (2)

Publication Number Publication Date
CN113450875A CN113450875A (en) 2021-09-28
CN113450875B true CN113450875B (en) 2024-01-02

Family

ID=77812505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110704166.1A Active CN113450875B (en) 2021-06-24 2021-06-24 BRNN model and statistical test-based RNA m6A modification site identification method

Country Status (1)

Country Link
CN (1) CN113450875B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107385078A (en) * 2017-08-30 2017-11-24 浙江大学 Label of pig fat deposition description correlation PNPLA2 mRNA m6A methylate the authentication method and application of function of unit point
CN107419026A (en) * 2017-08-30 2017-12-01 浙江大学 Label of pig fat deposition description correlation UCP2 mRNA m6A methylate the authentication method and application of function of unit point
CN111161793A (en) * 2020-01-09 2020-05-15 青岛科技大学 Stacking integration based N in RNA6Method for predicting methyladenosine modification site

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3581711B2 (en) * 2002-07-30 2004-10-27 株式会社Tumジーン Base mutation detection method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107385078A (en) * 2017-08-30 2017-11-24 浙江大学 Label of pig fat deposition description correlation PNPLA2 mRNA m6A methylate the authentication method and application of function of unit point
CN107419026A (en) * 2017-08-30 2017-12-01 浙江大学 Label of pig fat deposition description correlation UCP2 mRNA m6A methylate the authentication method and application of function of unit point
CN111161793A (en) * 2020-01-09 2020-05-15 青岛科技大学 Stacking integration based N in RNA6Method for predicting methyladenosine modification site

Also Published As

Publication number Publication date
CN113450875A (en) 2021-09-28

Similar Documents

Publication Publication Date Title
CN107873054B (en) Droplet-based methods and apparatus for multiplexed single-cell nucleic acid analysis
CN103131754B (en) Method for detecting nucleic acid hydroxylmethylation modification, and application thereof
CN102682224B (en) Method and device for detecting copy number variations
CN105002567B (en) Simplify the construction method for the sequencing library that methylates without reference gene group high flux
CN105861710A (en) Sequencing joint and preparation method and application thereof in ultra-low frequency mutation detection
CN104480214A (en) Process for sequencing hydroxymethylation and methylation long-sequence tags
CN111755072B (en) Method and device for simultaneously detecting methylation level, genome variation and insertion fragment
CN112941201A (en) Mixed primer for multi-cell species identification and cross contamination detection and use method thereof
CN109477132A (en) Ribonucleic acid (RNA) interaction
CN115198023A (en) Hainan cattle liquid phase breeding chip and application thereof
CN113450875B (en) BRNN model and statistical test-based RNA m6A modification site identification method
CN113355459B (en) Method and kit for detecting and screening N501Y mutation of new coronavirus
CN113337590A (en) Second-generation sequencing method and library construction method
CN113755593B (en) PCR amplification primer, kit and method for detecting HLA-A gene SNP marker rs1136697-G
CN109797437A (en) A kind of construction method of sequencing library when detecting multiple samples and its application
CN113699253A (en) Laoshan milk goat low-density liquid-phase SNP chip and application thereof
CN109280699A (en) A kind of new varieties identification method based on ddRAD method
CN106520959B (en) Development method of orchid microsatellite marker locus and method for detecting length of microsatellite marker in microsatellite marker locus
WO2024104130A1 (en) Whole genome molecular marker development method utilizing degenerate primer amplification
CN112941151B (en) Method for constructing EST-SSR (expressed sequence tag-simple sequence repeat) marker by using RNA-seq technology and application
WO2005094291A2 (en) Computational selection of probes for localizing chromosome breakpoints
CN106755314B (en) Development method of wheat microsatellite marker locus and length detection method of microsatellite marker in microsatellite marker locus
CN106520954B (en) Development method of cotton microsatellite marker locus and length detection method of microsatellite marker in microsatellite marker locus
CN116121437B (en) SNP (single nucleotide polymorphism) marker combination of mangiferin fruit variety and application of SNP marker combination in mangiferin fruit breeding
CN109371166B (en) Method for detecting difference expression of plant circRNA allelic loci in high throughput manner

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 210043 Jiangsu Nanjing Qixia District Bagua Zhou street Jiangsu Qixia modern agriculture industrial park Nanjing Agricultural University modern horticulture industry science and Technology Innovation Center

Applicant after: NANJING AGRICULTURAL University

Address before: Weigang Xuanwu District of Nanjing Jiangsu province 210095 No. 1

Applicant before: NANJING AGRICULTURAL University

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant