CN113450875A - Identification method of RNA m6A modification site based on BRNN model and statistical test - Google Patents

Identification method of RNA m6A modification site based on BRNN model and statistical test Download PDF

Info

Publication number
CN113450875A
CN113450875A CN202110704166.1A CN202110704166A CN113450875A CN 113450875 A CN113450875 A CN 113450875A CN 202110704166 A CN202110704166 A CN 202110704166A CN 113450875 A CN113450875 A CN 113450875A
Authority
CN
China
Prior art keywords
site
reads
modification
identifying
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110704166.1A
Other languages
Chinese (zh)
Other versions
CN113450875B (en
Inventor
王帅
许志晖
王海彦
王舒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Agricultural University
Original Assignee
Nanjing Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Agricultural University filed Critical Nanjing Agricultural University
Priority to CN202110704166.1A priority Critical patent/CN113450875B/en
Publication of CN113450875A publication Critical patent/CN113450875A/en
Application granted granted Critical
Publication of CN113450875B publication Critical patent/CN113450875B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30Detection of binding sites or motifs
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50Mutagenesis
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/20Polymerase chain reaction [PCR]; Primer or probe design; Probe optimisation
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • Data Mining & Analysis (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Evolutionary Computation (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses an RNA m based on a BRNN model and statistical test6Method for identifying A modification site, and kitConstructing a mecLIP standard library and a library of a simplified process by combining an eCIPL experimental process; in analyzing mecLIP data at m6Following truncation and mutation characterization upstream and downstream of the A site, relevant software was developed for identifying m at the single base level6A site, and more convenient and rapid preparation of m6A library for single site identification.

Description

RNA m based on BRNN model and statistical test6Method for identifying A modification site
Technical Field
The invention relates to a RNAm6A modification site identification method, in particular to an RNA m based on BRNN model and statistical test6And (3) a method for identifying the A modification site.
Background
m6The A modification is widely distributed on mRNA and lncRNA, has various biological functions, and relates to multiple aspects of mRNA shearing, transferring out of nucleus, stability, translation efficiency and the like. However, m6Study of A is unlike 5mC or m5In that case, bisulfite treatment can be used to convert unmethylated cytosines to uracil, thereby accurately identifying and quantifying the level of methylation modification by identifying base transitions in the sequencing results. With m being paired6Development of A function Studies, how to correctly identify m at the Single base level6The A site has become a major factor limiting the intensive research.
M at the level of a single base6There are three main high-throughput methods for a identification: miclIP-seq (m)6Air-nucleotide resolution cross-linking and immunoproportion), Mazter-seq (MazF digest), and DART-seq (deletion access to RNA modification targets). Wherein Mazter-seq is limited to the sequence of the enzyme cutting site ACA of MazF, and DART-seq is subjected to the recognition of m6A pattern sequence by fusion protein. The miCLIP-seq is not limited by the cleavage efficiency and the recognition site of the fusion protein, and theoretically all possible m6A sites can be recognized according to the position of truncation and mutation generated by the amino acid residue. However, the complex procedure reduces the recovery of final RNA and library complexity, and requires an increase in the number of PCR amplification cycles to compensate, resulting in little data available after sequencing.
Disclosure of Invention
The purpose of the invention is as follows: in order to overcome the defects in the prior art, the invention aims to provide an RNA m based on a BRNN model and a statistical test6The identification method of the A modification site can accurately identify m at the single base level aiming at mecLIP data6A modification site.
The technical scheme is as follows: one kind of the invention is based onBRNN (bidirectional temporal neural network) model and statistically examined RNA m6A method for identifying a modification site, comprising the steps of,
(1) selecting samples, and dividing the samples into a training set, a verification set and a test set;
(2) constructing a bidirectional recurrent neural network for use with and without m6Classifying the sequence of A;
(3) training the constructed bidirectional cyclic neural network by using the tips and the mutation characteristics of the training set and the verification set sequences, and optimizing parameters of a neural network model;
(4) testing the optimized bidirectional circulation network by using the base sequence samples in the test set, and counting the recognition result;
(5) will be identified as protection m6The base sequences of A are merged according to the difference of the coverage of edge reads;
(6) performing statistical test on the truncation number of A and 1nt downstream thereof to obtain m6And (C) identifying the A single site.
In the step (1), m in RIP-seq (methyl-RNA immunology) is extracted6M identified by CTKtools (CLIP tools kit) within the A modification region6Sequence truncation and mutation information around the A site is taken as a positive example, a sequence without A is randomly selected as a negative example,
negative samples were randomly drawn from the protein-encoding gene and met the following requirements: 1) m is not in the identified6The A site is within 10 nt; 2) m is6No A is contained in 10nt upstream and downstream of the A site.
Among them, the CTKtools identified m constituting the prime case6The a site needs to satisfy the following two conditions: 1) m at merrip6A is in a modification interval; 2) counting m corresponding to MeRIP data6The coverage of IP reads and Input reads in the A modification interval is maintained, and m of which the IP reads/Input reads is more than or equal to 2 is reserved6And (3) A site.
The identification method also comprises the steps of extracting and standardizing sample characteristics, counting the numbers of reads truncation and mutation of positive and negative samples in the mecLIP-gelfree data, and keeping the samples if the number of reads truncation or mutation on the sample sequence is more than 11.
In step (5), m is calculated6And merging the reads coverage on the left and right boundaries of the A sequence.
The samples used for constructing and training the bidirectional circulation network are m identified by CTKtools6Sequences around the A site and sequences without A were used as positive and negative samples, respectively. The truncation and mutation characteristics between the two sequences are relatively close, so that a neural network model with high classification accuracy rate is difficult to distinguish and train. Therefore, on the basis, the invention carries out statistical test on the truncation characteristics of A and 1nt downstream thereof to further improve the identification accuracy of m6A single sites.
The invention principle is as follows: the mecLIP experimental method does not use radioactive element development, and reduces the requirements of the prior art on experimental conditions. The free RNA is not removed in the process, so that the requirement on a real sample is reduced. However, these RNA fragments affected the identification of the m6A site, resulting in false positive results. In view of the characteristics of the library data, the invention uses a bidirectional circular network to judge the sequence containing m6A, thereby reducing the influence of free RNA on the subsequent single-site judgment with high probability. And finally, the resolution of the single base is achieved by detecting the significance of the truncation at A and 1nt downstream of A.
Has the advantages that: the invention combines eCIPL (enhanced cross-linking and immunopropractionation) experimental process to construct mecLIP (m)6A enhanced cross-linking and immunoproportion) standard libraries and libraries that simplify the protocol; in analyzing mecLIP data at m6Following truncation and mutation characterization upstream and downstream of the A site, relevant software was developed for identifying m at the single base level6A site, and more convenient and rapid preparation of m6A library for single site identification.
In view of the disadvantages of complicated preparation process of the miCLIP library and small available data amount, the invention combines the reported eCLIP experimental process to prepare the mecLIP-gel and mecLIP-gelfree libraries, analyzes and reveals the truncation and mutation characteristics of the m6A site in the type library. Aiming at the characteristics of mecLIP data, the invention develops a method for identifying m by combining BRNN and statistical test6A site, subsequent m6And laying a foundation for related research of A.
Drawings
FIG. 1 is a flow chart showing the principle of identifying the single-nucleotide m6A site according to the present invention.
FIG. 2 is a drawing showing a value of m6A, a schematic architecture diagram of a classifier; wherein, the graph A is m6The work flow of the classifier is shown in the A, and the diagram B is a schematic diagram comparing the classification capability of the classifier under different architectures and parameters.
FIG. 3 is an evaluation of the results of analysis of the recurrent neural network nodes and statistical tests; graph A is comparative m6Reads coverage results for cluster a, light gray: no clusters of m6A were identified; deep ash: identifying a cluster of m 6A; graph B is m6Saturation curve at a site.
FIG. 4 is m6Distribution characteristics of A sites and repeatability results among samples; wherein, the graph A is m6The distribution density of A on the transcript; the light curve is a library prepared by MaximaH, and the dark curve is a library prepared by Superscript III; graph B is m6A profile at the transcriptome level; panel C shows m shared by mecLIP-gelfree data6A site A; graph D is m6A site and m6Intersection of A peaks, MH denotes MaximaH; SS represents Superscript III.
Detailed Description
The present invention will be described in further detail with reference to examples.
RNA m based on BRNN model and statistical test6A method for identifying a modification site, comprising the steps of:
1. preparation of mecLIP library
(1) Extracting mRNA and incubating with antibody;
(2) flatly spreading mRNA incubated with the antibody to a 6-hole cell culture dish, and transferring to an ultraviolet cross-linking instrument;
(3) using 254mm wavelength UV light at 300mJ/cm2After 2 crosslinkings with energy, washed twice with FastAP buffer;
(4) adding cross-linked mRNA fragments in a FastAP reaction system, incubating for 15min at 37 ℃, and removing phosphate groups at 5 'and 3' ends;
preparation of a FastAP reaction system:
Figure BDA0003130518650000031
Figure BDA0003130518650000041
(5) adding the cross-linked mRNA fragment into a T4 PNK reaction system, and incubating for 20min at 37 ℃;
preparation of T4 PNK reaction system:
Figure BDA0003130518650000042
(6) washing the cross-linked mRNA fragment with IP buffer solution for 5 times, washing with 1 XRNA ligase buffer solution for 2 times, and rapidly centrifuging to remove residual buffer solution;
(7) adding the mRNA fragment into an RNA ligase system, incubating for 75min at room temperature, and adding a 3' end connector;
RNA linker sequence: ACAAGCCAGAUCGGAAAGGCGUCGUGUAG
The reaction system of RNA ligase is as follows:
Figure BDA0003130518650000043
(8) adding 7.5 μ l 4 XNuPAGE buffer, 3 μ l 1M DTT and 20 μ l 1 XRNA protectant, shaking, mixing, and incubating at 70 deg.C for 10 min;
(9) adding a standard sample into No. 2, 4, 6 and 8 wells of the NuPAGE gel, adding 1/10 sample into No. 3 well as a control, adding 9/10 sample into No. 7 well, and leaving the rest wells empty to prevent pollution;
Figure BDA0003130518650000044
Figure BDA0003130518650000051
(10) electrophoresis at 150V for 75min or until the marker dye runs to the bottom of the gel;
(11) taking out NuPAGE gel, placing in the membrane transferring solution, balancing for 10min, assembling the membrane transferring clamp, Whatman filter paper and NC membrane, and performing 120V electrophoresis for 90 min;
(12) taking out the NC membrane, cutting off the membrane above the 50kDa position at the position of the hole No. 7 with a blade, cutting into slices of about 1-2mm size, and placing into a new enzyme-free 1.5ml EP tube;
(13) adding 200 μ l proteinase K into EP tube, shaking thoroughly, and incubating at 37 deg.C for 20min to digest antibody;
(14) adding 200 μ l urea solution prepared from protease K buffer solution, shaking thoroughly, mixing, incubating at 37 deg.C for 20min, and dissociating mRNA;
(15) sucking 400 mul of mixed solution of proteinase K and mRNA, transferring the mixed solution to a new EP tube, adding 400 mul of mixed solution of phenol/chloroform/isoamylol, uniformly mixing by oscillation, and incubating for 5min at 37 ℃;
(16) placing the EP tube into a centrifuge, centrifuging for 15min at 13000g, and separating mRNA;
(17) aspirate 80% of the supernatant and transfer to a new 1.5ml EP tube;
(18) adding 40 μ l of 3M sodium acetate, 2.5 times of anhydrous ethanol and 2.5 μ l of glycogen, shaking, mixing, and refrigerating at-80 deg.C overnight;
(19) taking out the overnight precipitated sample, turning and mixing uniformly, and centrifuging for 25min at 15,000 g; discarding the supernatant, adding 1ml of 75% ethanol for washing, and centrifuging 15,000g for 15 min; removing ethanol, standing at room temperature for 5min to volatilize residual ethanol, and adding 20 μ l ultrapure water to dissolve precipitate;
(20) mixing 10. mu.l of mRNA with 0.5. mu.l of AR17 primer, and incubating for 5min in a metal bath at 65 ℃;
(21) adding a mixture of mRNA and AR17 into the inversion system, placing the inversion system into a PCR instrument, incubating at 55 ℃ for 45min, and inverting one strand of cDNA;
AR17 linker: ACACGAGGCTCCTTCCGA
cDNA one strand inversion system:
Figure BDA0003130518650000052
Figure BDA0003130518650000061
(22) adding 2 μ l RNaseA/H mixed solution, and incubating at 37 deg.C for 15 min; adding 3 μ l of 1M NaOH, reacting at 70 deg.C for 15min to remove template RNA;
(23) adding 10 μ l MyONE beads, blowing, mixing, standing for 5min, and removing supernatant with magnetic frame; after 3 washes with 80% ethanol, the cDNA on the beads was eluted with 5. mu.l Tris-HCl pH 7.5 at 5mM and transferred to a new 0.2ml PCR tube;
(24) adding a random 10N joint into a PCR tube, heating for 2min at 75 ℃ in a metal bath to break a secondary structure;
(25) preparing a cDNA double-chain synthesis system, adding the cDNA double-chain synthesis system into a PCR tube, and incubating at room temperature overnight;
cDNA two-Strand Synthesis System:
Figure BDA0003130518650000062
(26) add 5. mu.l MyONE beads purified cDNA and elute the cDNA with 25. mu.l Tris-HCl pH 7.5, 5 mM;
(27) taking 2 mul cDNA as a template, and amplifying 40 circulating identification concentrations by qPCR;
qPCR reaction system:
Figure BDA0003130518650000063
(28) amplifying the library by PCR, wherein the cycle number is determined according to the Ct value amplified by qPCR;
and (3) PCR reaction system:
Figure BDA0003130518650000064
Figure BDA0003130518650000071
(29) purifying the PCR amplification product by using Ampure XP magnetic beads;
(30) after agarose electrophoresis, cutting a band of 350bp in 175-plus to recover a product;
(31) repeat the agarose electrophoresis to determine if the library contains linker contamination; repeat PCR amplification to agarose recovery steps are required if there is linker contamination.
2. Analysis of sequencing data
The meclid library was transcriptome sequenced using NovaSeq, read at2 × 150 bp. Trim Galore removes linkers and base sequences with masses below 20. The mecLIP library was used for alignment with reads greater than 20nt in length after removal of 7nt of the tag sequence and 10N of the random sequence.
According to the method of the eCLIP article, only read2 from the mecLIP library was aligned using HISAT2 (v2.1.0). The HISAT2 software parameters were adjusted to-no-softclip, -score-min-L, -0.6, -0.6 and-max-intronlen 20000 to allow more mismatches while reducing chimera misalignments. Reads aligned to unique positions in the genome were extracted and PCR repeats were filtered based on the 10N random sequence at the beginning of read2 and the alignment start and stop sites. And counting the 5' end position and mutation site information of the filtered reads.
3. Single base level m6Identification of A site
(1) Extraction of training samples
The samples required by training comprise positive example samples and negative example samples, wherein the positive example samples consist of two parts: CTKtools identified m6A site and MazF-identified m6And (3) A site.
Ctkools identified m for construction of the Positive case6The a site needs to satisfy the following two conditions: 1) m at merrip6A is in a modification interval; 2) counting m corresponding to MeRIP data6Modifying coverage of IP reads and Input reads in interval A, and reserving IP reads/Input reads>2m6And (3) A site. These m are then combined6The A sites were divided into three parts according to the enrichment degree of reads (IP reads/Input reads), and 4000 were extracted each. M identified by MazF6The cutting rate of the A site is required to be less than 0.7, the A site is divided into four groups according to the cutting rate from high to low, 2500 sites are randomly extracted from each group, and if the cutting rate of a certain group is less than 2500 sites, the rest sites are averagely extracted from other groups. To increase the number of training samples, we step in 1nt at m6Extracting sequences with the length of 11nt in an interval of 10nt upstream and downstream of the site A to finally obtain 242000 sequences containing m6A positive example of site a.
Negative samples were randomly drawn from the protein-encoding gene, and two conditions were satisfied: 1) m is not in the identified6The A site is within 10 nt; 2) m is6No A is contained in 10nt upstream and downstream of the A site.
5000 samples not included in the training set were extracted as the test set according to the choice paradigm method.
(2) Sample feature extraction and normalization
Counting the numbers of reads truncation and mutation of the positive and negative samples in the mecLIP-gelfree data, and keeping the samples if the reads truncation or mutation on the sample sequence is more than 11. In order to reduce the influence of the expression difference of the genes of the sample on the data characteristics, the truncation and mutation of reads on each base of the sequence need to be standardized according to the formula 3.1.
Figure BDA0003130518650000081
(3) Training of BRNN
To avoid the problem of gradient vanishing during training, the BRNN was constructed using gated-cyclic units. The input data to the neural network is a two-dimensional vector of length 11 containing truncations and mutations at each base. Network structures with 2, 4 and 8 hidden layers and 16, 32 and 64 neuron combinations were used for test screening of the best model. To avoid gradient explosion, the gradient clipping value is set to 1. As a two-classifier, we input only the feature variables output by the neuron of the last hidden layer of the sequence into the fully-connected layer. The method outputs two variables, and obtains a classification result after the transformation by using a Softmax activation function. In addition, because the training sample data is less, the transformed feature variables output by the neuron of the last hidden layer are tested to be classified by using the SVM, so that a more stable classification result is obtained.
(4) m in mecLIP6Identification and incorporation of A modification intervals
Extracting sequences of 10nt upstream and downstream of all A points of mecLIP data, inputting the sequences into an optimal model found through training to predict and distinguish m6A sequence and non-m6And (B) sequence A.
Calculate m6The reads coverage on the left and right boundaries of the A sequence is combined into a cluster only if the ratio of the reads number of the left and right boundaries of two adjacent or overlapping sequences with the interval less than 10nt is more than 0.9.
(5)m6Identification of A site
At m6Identification of significant m using scanning statistics on A sequence clusters6And (3) A site.
First assume at one m6Cluster A [ alpha, beta ]]Truncation of the upper reads follows Poisson distribution and the mean value of the unit interval is
Figure BDA0003130518650000082
In any interval [ x, x + w]The probability of occurrence of truncation above can be obtained from equation 3.2:
Figure BDA0003130518650000083
where Yx (w) represents the number of truncations over a length w interval. The number of truncations over each subinterval is independently distributed, then the scan statistic is defined as:
S(w)=max(Yx(w);α≤x≤β-w) (3.3)
the probability of more than k truncations occurring in a subinterval of length w is:
Figure BDA0003130518650000091
wherein L is m6Length of cluster a. Then the probability that the event is less than k in one subinterval is:
Figure BDA0003130518650000092
the probability in equation 3.5 can be calculated from Naus derived approximate equation 3.6:
Figure BDA0003130518650000093
wherein the content of the first and second substances,
Figure BDA0003130518650000094
and
Figure BDA0003130518650000095
the following two formulas can be used for calculation:
Figure BDA0003130518650000096
Figure BDA0003130518650000097
Figure BDA0003130518650000098
in equation 3.6
Figure BDA0003130518650000099
Figure BDA00031305186500000910
Figure BDA00031305186500000911
Figure BDA00031305186500000912
Figure BDA00031305186500000913
In the above-mentioned formula,
Figure BDA00031305186500000914
when k is<At the time of 0, the number of the first,
Figure BDA00031305186500000915
the method specifically comprises the following steps:
1.m6identification of A sequence
m6The mutation and truncation caused by A are mainly within 5nt above and below the A, so that the A is selected to be 11nt long and must contain a known m6The sequence of A was used to construct the positive examples. Negative examples sequences without any A were selected to prevent the inclusion of unidentified m in the sequences6A to generate interference. The positive and negative examples are normalized and filtered before being input into the BRNN, requiring at least more than 11 truncations or mutations in these sequences.
The truncation and mutation at each base of the training sample are then input into the BRNN in turn (m can be extracted simultaneously)6Base information upstream and downstream of a) was performed. Although there must be a known m in the 11nt interval of the positive example6A site, but m for the other 10 bases6Possibility of A (m not identified by previous methods)6A site). Such implicit states of positive examples may make it difficult to train a BRNN to depend on neuronsOutput of (4) to perform m at a single base level6Purpose of A site identification. Therefore, we only use BRNN as a two-classifier and classify the input sequence by using the output value of the last neural network (judge that this interval contains m)6Whether a is absent). Because the training sample data set is small, in order to improve the classification precision, the BRNN is extracted and converted into m6The A characteristics are input into SVM for classification, and the effect of the BRNN-SVM combined classifier is compared with the result of direct classification by BRNN (fig. 2A).
As can be seen from fig. 2B, the number of neurons and the number of hidden layers in the BRNN model are too large, which in turn reduces the classification capability of the BRNN model. In a network architecture with 2 or 4 hidden layers, the number of neurons does not have much impact on the classification capability of the model. This may be because m6The input features at the a-site contain a small amount of information, resulting in a certain number of neurons being required to fit the features. In the BRNN model trained with 4 hidden layers and 32 neurons, the test sample classification accuracy was highest (up to 85.6%). For lifting the model pair m6The classification capability of the A sequence, namely the classification characteristics extracted by BRNN, is input into an SVM classifier for training, but the result shows that the classification effect is not improved (fig. 2B). When comparing the performance of classifiers using different linear and gaussian radial kernel parameters, it was found that the classification capability of SVMs was better only when outliers were less considered. This means that even m6The characteristics of the A site are transformed by BRNN, but the characteristics of partial positive and negative samples are too similar, so that some support vectors are difficult to find to distinguish the sites.
Although the A-containing sequence of 11nt in length on the genome was predicted using the trained optimal BRNN model, m was identified6The A sequence has a certain error rate and is not identified to the single base level, but the error rate of the A sequence and the m sequence can be reduced by the step6Mutation of A sequence and truncation of non-m of similar characteristics6And the interference of the sequence A improves the accuracy of subsequent statistical test.
Considering m6The coverage of the A sequence being with its methyl groupThe level of methylation is positively correlated (characteristic of co-immunoprecipitation experiments), if an adjacent m is identified6The a sequence coverage is similar, then the characteristics of the mutation and truncation should be similar. Therefore, m with a plurality of predicted spacing distances smaller than 10nt and small difference of reads coverage of edges is obtained6The A sequences are combined into clusters. 44838 and 80233 m were identified in two replicates of mecLIP-gelfree, respectively6And (4) clustering A. The A site on these clusters and the number of 1nt truncation and mutations downstream were then tested for significance using scanning statistics to determine m at the single base level6A sites (identified as m with q values less than 0.05)6A site).
13890 and 35777 m were identified in the library data prepared by MaximaH and Superscript III, respectively6A site, about 30% -45% of m6Cluster A identifies m6And (3) A site. In view of this, it is statistically verified whether m is present in a cluster6A sites are divided into two types, and coverage of reads on clusters and m are analyzed6A relationship of detection rate. FIG. 3A shows that m is not detected6Clusters of A sites have overall low reads coverage and m6Significant differences in reads coverage of clusters detected at the A site (p)<2.2e-16Willloxon randsed sum test). Although m at the single base level can be identified using scanning statistics6A site, but this also results in m being only relatively high in methylation6The A site can be detected. The available data of the mecLIP-gelfree library prepared from Superscript III is more, so that it is extracted in multiple portions according to different proportions to detect m6Saturation analysis was performed at site a. As can be seen from FIG. 3B, M is at times when there are only 4-8M reads6The number of A site detected slowly increased with increasing data; after the data exceeds 8M, M6The detected number of A sites is rapidly increased; by the time the data exceeded 30M, the growth slowed, indicating that the detected M6A site approached saturation.
2. M identified using a model6Analysis of A site
(1) Analysis of m6A site in mecLIP-gelfree data
M identified herein by the model6Site A between samplesThe reproducibility of (A) and the distribution on the transcriptome were compared.
As can be seen from FIG. 4A, m is identified from the model6The A site is mainly concentrated on 3' UTR and accords with the arabidopsis m reported by the current literature6And (4) A distribution characteristic. Counting m of different gene characteristic regions6Site A shows approximately 79% m for both libraries6The A site is located in the 3 'UTR, and secondly over 18% of the CDS, while the 5' UTR is only around 2.5% (FIG. 4B). M identified in the two mecLIP-gelfree data69222A sites are common, accounting for m identified in MaximaH-generated mecLIP library data6More than 66% of the A sites gave better reproducibility of the results (FIG. 4C). A reaction of these m6A site and known m6The overlap of the A modification intervals is found that more than 75% of the A modification intervals are located in m6A within the modification interval (FIG. 4D).

Claims (6)

1. RNA m based on BRNN model and statistical test6A method for identifying a modification site, comprising: comprises the following steps of (a) carrying out,
(1) selecting samples, and dividing the samples into a training set, a verification set and a test set;
(2) constructing a bidirectional recurrent neural network for use with and without m6Classifying the sequence of A;
(3) training the constructed bidirectional cyclic neural network by using the tips and the mutation characteristics of the training set and the verification set sequences, and optimizing parameters of a neural network model;
(4) testing the optimized bidirectional circulation network by using the base sequence samples in the test set, and counting the recognition result;
(5) will be identified as protection m6The base sequences of A are merged according to the difference of the coverage of edge reads;
(6) performing statistical test on the truncation number of A and 1nt downstream thereof to obtain m6And (C) identifying the A single site.
2. The BRNN model and statistical test-based RNA m according to claim 16Identification of A modification sitesThe method is characterized in that: in the step (1), m in RIP-seq is extracted6M identified by CTKtools within A modification interval6Sequence truncation and mutation information around the a site was taken as a positive case and sequences without a were randomly selected as a negative case.
3. The BRNN model and statistical test-based RNA m according to claim 16A method for identifying a modification site, comprising: negative samples were randomly drawn from the protein-encoding gene and met the following requirements: 1) m is not in the identified6The A site is within 10 nt; 2) m is6No A is contained in 10nt upstream and downstream of the A site.
4. The BRNN model and statistical test-based RNA m according to claim 16A method for identifying a modification site, comprising: ctkools identified m for construction of the Positive case6The a site needs to meet the following requirements: 1) m at merrip6A is in a modification interval; 2) counting m corresponding to MeRIP data6The coverage of IP reads and Input reads in the A modification interval is maintained, and m of which the IP reads/Input reads is more than or equal to 2 is reserved6And (3) A site.
5. The BRNN model and statistical test-based RNA m according to claim 16A method for identifying a modification site, comprising: and extracting and standardizing sample characteristics, counting the numbers of reads truncations and mutations of positive and negative sample in the mecLIP-gelfree data, and keeping if the number of reads truncations or mutations in the sample sequence is more than 11.
6. The BRNN model and statistical test-based RNA m according to claim 16A method for identifying a modification site, comprising: in step (5), m is calculated6And merging the reads coverage on the left and right boundaries of the A sequence.
CN202110704166.1A 2021-06-24 2021-06-24 BRNN model and statistical test-based RNA m6A modification site identification method Active CN113450875B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110704166.1A CN113450875B (en) 2021-06-24 2021-06-24 BRNN model and statistical test-based RNA m6A modification site identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110704166.1A CN113450875B (en) 2021-06-24 2021-06-24 BRNN model and statistical test-based RNA m6A modification site identification method

Publications (2)

Publication Number Publication Date
CN113450875A true CN113450875A (en) 2021-09-28
CN113450875B CN113450875B (en) 2024-01-02

Family

ID=77812505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110704166.1A Active CN113450875B (en) 2021-06-24 2021-06-24 BRNN model and statistical test-based RNA m6A modification site identification method

Country Status (1)

Country Link
CN (1) CN113450875B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060003325A1 (en) * 2002-07-30 2006-01-05 Tum Gene, Inc. Method of detecting base mutation
CN107385078A (en) * 2017-08-30 2017-11-24 浙江大学 Label of pig fat deposition description correlation PNPLA2 mRNA m6A methylate the authentication method and application of function of unit point
CN107419026A (en) * 2017-08-30 2017-12-01 浙江大学 Label of pig fat deposition description correlation UCP2 mRNA m6A methylate the authentication method and application of function of unit point
CN111161793A (en) * 2020-01-09 2020-05-15 青岛科技大学 Stacking integration based N in RNA6Method for predicting methyladenosine modification site

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060003325A1 (en) * 2002-07-30 2006-01-05 Tum Gene, Inc. Method of detecting base mutation
CN107385078A (en) * 2017-08-30 2017-11-24 浙江大学 Label of pig fat deposition description correlation PNPLA2 mRNA m6A methylate the authentication method and application of function of unit point
CN107419026A (en) * 2017-08-30 2017-12-01 浙江大学 Label of pig fat deposition description correlation UCP2 mRNA m6A methylate the authentication method and application of function of unit point
CN111161793A (en) * 2020-01-09 2020-05-15 青岛科技大学 Stacking integration based N in RNA6Method for predicting methyladenosine modification site

Also Published As

Publication number Publication date
CN113450875B (en) 2024-01-02

Similar Documents

Publication Publication Date Title
US11286524B2 (en) Multi-position double-tag connector set for detecting gene mutation and preparation method therefor and application thereof
CN102682224B (en) Method and device for detecting copy number variations
CN111755072B (en) Method and device for simultaneously detecting methylation level, genome variation and insertion fragment
CN105861710A (en) Sequencing joint and preparation method and application thereof in ultra-low frequency mutation detection
CN101278058A (en) Improved strategies for sequencing complex genomes using high throughput sequencing technologies
WO2013176958A1 (en) Methods and compositions for analyzing nucleic acid
CN115198023B (en) Hainan cattle liquid-phase breeding chip and application thereof
CN104480214A (en) Process for sequencing hydroxymethylation and methylation long-sequence tags
CN114196761A (en) Method for manufacturing liquid chip for selecting reward of parent strain pig feed
CN106520958B (en) Method for developing microsatellite marker locus and method for detecting length of microsatellite marker in microsatellite marker locus
CN113337590A (en) Second-generation sequencing method and library construction method
CN113450875A (en) Identification method of RNA m6A modification site based on BRNN model and statistical test
CN115843318B (en) Plant species identification method based on whole genome analysis and genome editing and application
CN113564266B (en) SNP typing genetic marker combination, detection kit and application
US20230340609A1 (en) Cancer detection, monitoring, and reporting from sequencing cell-free dna
CN113699253A (en) Laoshan milk goat low-density liquid-phase SNP chip and application thereof
JP2008161056A (en) Dna sequence analyzer and method and program for analyzing dna sequence
CN108949945B (en) Sequencing library with single base resolution for detecting DNA methylation and single nucleotide variation and application
CN106520955B (en) Development method of rice microsatellite marker locus and length detection method of microsatellite marker in microsatellite marker locus
CN106636362B (en) Soybean microsatellite marker locus development method and microsatellite marker length detection method in microsatellite marker locus
CN109280699A (en) A kind of new varieties identification method based on ddRAD method
CN106520961B (en) Corn microsatellite marker locus development method and length detection method of microsatellite markers in microsatellite marker locus
CN112941151B (en) Method for constructing EST-SSR (expressed sequence tag-simple sequence repeat) marker by using RNA-seq technology and application
CN111944806A (en) Molecular tag group for high-throughput sequencing pollution detection and application thereof
CN106755314B (en) Development method of wheat microsatellite marker locus and length detection method of microsatellite marker in microsatellite marker locus

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 210043 Jiangsu Nanjing Qixia District Bagua Zhou street Jiangsu Qixia modern agriculture industrial park Nanjing Agricultural University modern horticulture industry science and Technology Innovation Center

Applicant after: NANJING AGRICULTURAL University

Address before: Weigang Xuanwu District of Nanjing Jiangsu province 210095 No. 1

Applicant before: NANJING AGRICULTURAL University

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant