CN101886114A - Method for analyzing high-throughput sequencing data based on RMI (Read Mass Index) - Google Patents

Method for analyzing high-throughput sequencing data based on RMI (Read Mass Index) Download PDF

Info

Publication number
CN101886114A
CN101886114A CN2009100512390A CN200910051239A CN101886114A CN 101886114 A CN101886114 A CN 101886114A CN 2009100512390 A CN2009100512390 A CN 2009100512390A CN 200910051239 A CN200910051239 A CN 200910051239A CN 101886114 A CN101886114 A CN 101886114A
Authority
CN
China
Prior art keywords
rmi
sequence
mappability
read
seq
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2009100512390A
Other languages
Chinese (zh)
Inventor
曾华宗
周杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI CLUSTER BIOTECH CO Ltd
Original Assignee
SHANGHAI CLUSTER BIOTECH CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI CLUSTER BIOTECH CO Ltd filed Critical SHANGHAI CLUSTER BIOTECH CO Ltd
Priority to CN2009100512390A priority Critical patent/CN101886114A/en
Publication of CN101886114A publication Critical patent/CN101886114A/en
Pending legal-status Critical Current

Links

Abstract

The invention relates to a method for rapidly and accurately analyzing high-throughput sequencing results based on a new RMI (Read Mass Index). The method is simultaneously suitable for analyzing data generated by using a transcriptome high-throughput sequencing (RNA-seq) method and a chromosome immune coprecipitation high-throughput sequencing (ChIP-seq) method.

Description

Based on RMI exponential high-flux sequence data analysing method
Technical field
The invention belongs to biological technical field, relate to high throughput sequencing technologies data analysis of new generation.
Background technology
The present invention is a kind of high throughput sequencing technologies that is applicable to--Solexa[Bennett, S., SolexaLtd.Pharmacogenomics, 2004.5 (4): new analytical procedure p.433-8].Present method is applicable to simultaneously transcribes group order-checking--RNA-seq[Wold, B.and R.M.Myers, Sequencecensus methods for functional genomics.Nat Methods,: p.19-21]) and co-immunoprecipitation order-checking--ChIP-seq[Johnson 2008.5 (1), D.S., et a1., Genome-wide mapping of in vivo protein-DNA interactions.Science, 2007.316 (5830): p.1497-502] two kinds of data that experimental technique produces.
High throughput sequencing technologies is the important breakthrough of biological technical field over the past two years, and the sequencing technologies of a new generation has improved hundreds of times with traditional Sanger order-checking efficient, and price also descends greatly simultaneously.The appearance of high throughput sequencing technologies makes many extremely promising biological medicines use becomes possibility: 1, and the cancer genome.2, personalized medicine and diagnosis.3, the drug targets screening.Can high throughput sequencing technologies the making progress of these fields, and its key is the innovation of analytical procedure and software.This patent has proposed a kind of new analytical procedure, can be widely used in the data analysis of high throughput sequencing technologies.
So far, the high-flux sequence platform of extensively quoting mainly contains three kinds: 1, and the Genome Analyzer of Illumina company (the being called Solexa again) platform that checks order.2, the SoLiD order-checking platform of ABI company.3,454 order-checking platforms of Roche company.Three kinds of order-checking platforms also respectively have characteristics based on different sequencing technologies respectively, but present widespread use the most is the Solexa platform.
The Solexa platform is fixed in solid surface with DNA to be amplified and order-checking, uses BridgePCR amplification amplification of DNA fragments, and uses reverse dye terminator technology to check order.About 8000 dollars of cost of Solexa platform operation can produce~sequence data of about 40,000,000 35-70bp.Solexa platform cost is far below 454 platforms (in every bp cost), and do not have the problem of the existing G/C deviation of SoLiD technology, therefore extensively quoted in the biological study field.
The Solexa technology mainly contains two portions application at present: 1, and RNA-Seq promptly transcribes the group order-checking.After mRNA reverse transcription in the cell or tissue is eDNA, increases and import the order-checking of Solexa platform, can obtain the expression amount of mRNA after the result who obtains analyzes.The RNA-seq technology is owing to have accurate quantification and highly sensitive characteristics, and being considered to will very fast replacement Microarray technology.2, ChIP-Seq, i.e. co-immunoprecipitation sequencing technologies.This technology can be located transcription factor (transcription factor) and is widely used in biomedical research with the binding site (binding site) of DNA.
Data analysis software at the Solexa technology platform has following a few class at present: 1, sequence contraposition software, the reads of Solexa order-checking is located [Langmead fast on genome, B., et al., Ultrafast and memory-efficient alignment of short DNA sequencesto the human genome.Genome Biol, 2009.10 (3): p.R25].2, the RNA-seq analysis software is determined each expression of gene amount [Mortazavi according to the data of RNA-seq, A., et al., Mapping and quantifying mammalian transcriptomes by RNA-Seq.Nat Methods, 2008.5 (7): p.621-8].3, the ChIP-seq analysis software, the result of ChIP-seq is resolved to transcription factor binding site point (transcription factor binding site) [Rozowsky, J., et al., PeakSeq enables systematic scoring of ChIP-seqexperiments relative to controls.Nat Biotechnol, 2009.27 (1): p.66-75].This patent is devoted to back two classes application, and has proposed brand-new analysis thinking to improve the quality of analytical results.
Summary of the invention
The present invention is based on present Solexa sequencing technologies, found a kind of new analytical procedure that can define differential expression and transcription factor binding site point, significantly improved analysis precision with respect to other analytical procedures.The step of present method is as follows:
(1) obtains the Solexa sequencing sequence, use ELAND software to carry out contraposition (Alignment) to the reference gene group all sequences.Sequence (sequence too low as sequencing quality) that can't contraposition abandons.The highest contraposition result the highest or arranged side by side for the sequence that multiple contraposition is arranged (multiple hits) retention score.
(2) the bit sequence file that obtains is changed into RMI (Read Mass Index) Score.The method of calculation of RMI are as follows:
RMI=(Read?Coverage/Mappability)*Adjustment
The number of times that checked order for this site of Read Coverage wherein, the Read Coverage that we can utilize the contraposition file directly to calculate to be accurate to every bp.
Mappability represents the theoretical value that this section is covered by stochastic sequence under null hypothesis.This theoretical value and distribution thereof are depended on reference to genome, can't calculate with theoretical formula, but we can utilize and carry out computer Simulation with reference to genome (Reference Genome) and draw.Its Calculation Method is: will be split as the 35bp segment of (perhaps 70bp depends on the length of Solexa order-checking) with reference to genome, be step-length with 1bp, and the section that each is possible in theory all takes out, then with all segments all to the contraposition of protogene group.The contraposition result who so obtains is the theoretical distribution of Mappability.Obviously, the Mappbility of the tumor-necrosis factor glycoproteins in the genome will be than unique sequence Mappability height, and this also is that we will carry out gauged reason to Mappability when calculating RMI.
Adjustment is the correction at this time order-checking.Deposit just sequencing error in the process of Solexa order-checking, therefore be not all sequence can perfect contraposition (perfect match) to the reference genome.Have some sequences that the error (1bp mismatch) of 1bp will be arranged, other has some sequences that the error (sequence more than the 2bp error will not considered) of 2bp is arranged.Present method has certain point penalty to the sequence that mismatch is arranged, and thinks that promptly the confidence level of these sequences is lower than the sequence of perfect contraposition (perfect match).Through overtesting, present method is made as 50% with the sequence confidence level of 1bp mismatch, and the sequence confidence level of 2bp mismatch is made as 25%.
(3) through after the above step, we have obtained the RMIindex in the full genome range.Following step is to calculate the theoretical distribution of RMI.Next will be divided into two kinds of situation discussion: A, RNA-seq analyzes.B, Chip-seq analyzes.
(A) RNA-seq analyzes.RNA-seq analyzes comparatively simple.In general, our experimental design is two samples of contrast, and perhaps a series of seasonal effect in time series samples compare mutually.We have obtained actual RMI by step (2) and have distributed, and now calculative is the theoretical distribution of RMI.In given section, this distribution will be a binominal distribution:
fx ( x ) = P [ X = k ] = n k p k ( 1 - p ) n - k fork = 0,1,2 , . . . , n
Under theoretical situation, each expression of gene amount known (being assumed to E), during this distributes n for the sum (Total number of reads) that is listed as of order-checking, p is the theoretical value of the sequence number of this section: (Mappability/ ∑ Mappability) * n*E.Yet gene has different separately expression amounts, can't learn before the experiment.Therefore we choose a sample in the experiment, calculate E and corresponding p, define it and are standard value.The numerical value that other sample calculation goes out then carries out statistical test with this standard value.If significant difference occurs, then show to have significant difference.In order to carry out above statistical test, we need know the standard deviation of this distribution.We know that the standard deviation of binominal distribution is p (1-p).Last parameter k is actual herein sequence number (number of observed reads) in the formula.For simplifying the calculating of back, when sequence number is big (more than or equal to 30), binominal distribution can be simulated with normal distribution:
f ( x ) = 1 σ 2 π e - ( x - μ ) 2 2 σ 2
The mean value of this distribution is identical with previous distribution with standard deviation, μ=(Mappability/ ∑ Mappability) * n*E, σ ^2=μ * (1-μ).
As previously mentioned, obtain after the theoretical distribution of RMI, suppose that we will carry out the differential gene expression analysis to the RNA-seq result of two samples, then we to set one of them sample be standard, variance known (theoretical value) so carry out standard z check with another sample, obtains p-value.If a plurality of biological repeat (biological replicates) are arranged, then can carry out the t check.
(4) be control false positive results (false positive results), when carrying out the genome range experiment, multiple correction is essential.Here we will carry out the correction of Bonferroni multiple check to the p-value that step (3) obtain.Suppose us with step-length x, interval y carries out the t check to whole genome, supposes that genome length is L.Then our check sum that carries out is: N=(L-y)/x.According to this amount of testing, we think that having only p-value is remarkable result less than the side as a result of 0.05/N.
(B) ChIP-seq analyzes.Analysis principle and the RNA-seq of ChIP-seq are basic identical.Unique different step is: the ChIP-seq experiment compares with Input sample and ChIP-ed sample.Therefore we are standard with the Input sample, and setting its expression amount at certain section is E, and the parameter in the corresponding binominal distribution is P, and its standard deviation is P* (1-P).Thereafter step is identical with RNA-seq.
The feature of present method:
Present method has designed a kind of new analytical procedure based on the RMI index, is applicable to the analysis of RNA-seq and ChIP-seq two big class high-flux sequence experiments.This method has more elasticity than existing additive method, and higher accuracy is arranged.Be embodied in: 1) analyzing RNA-seq as a result the time, is not that gene is a unit, but is unit with the section.The size of section can define arbitrarily, so present method can detect differential expression and variable shearing simultaneously, and this is the remarkable difference of present method and additive method.2) when analyzing the ChIP-seq sequence, as long as do a little optimization, (such as, it can be the minimized interval of p-value that binding site is defined as), then present method will provide the peak value that is accurate to base, will be more accurate than additive method.This is the difference place of RMI method than other ChIP-seq analytical procedures.
More than be the description of this invention and non-limiting, based on other embodiment of inventive concept, all among protection scope of the present invention.

Claims (3)

1. based on RMI exponential high-flux sequence analyzing novel methods, be characterized in the high-flux sequence result being carried out rapid and accurate analysis based on a new index (RMI:Read Mass Index).The method is characterized in that and have the following steps:
Step 1: obtain transcript high-flux sequence (RNA-seq) or karyomit(e) co-immunoprecipitation high-flux sequence (ChIP-seq) data.
Step 2: according to high-flux sequence information, and proofread and correct based on species gene group sequence information, the experience of estimation RMI distributes.
Step 3: utilize the RMI experience to distribute and sequencing data, identify that differential expression section or difference are in conjunction with the peak.
2.RMI method of calculation be:
RMI=(Read?Coverage/Mappability)*Adjustment
The number of times that checked order for this site of Read Coverage wherein
Mappability represents the theoretical value that this section is covered by stochastic sequence under null hypothesis.The method of calculation of RMI are as follows:
RMI=(Read?Coverage/Mappability)*Adjustment
The number of times that checked order for this site of Read Coverage wherein, the Read Coverage that we can utilize the contraposition file directly to calculate to be accurate to every bp.
Mappability represents the theoretical value that this section is covered by stochastic sequence under null hypothesis.Calculation Method is the segment that the reference genome is split as n base (the n value reads the length of sequence for different order-checking instruments), is step-length with 1 base, and the section that each is possible in theory all takes out, then with all segments all to the contraposition of protogene group.The contraposition result who so obtains is the theoretical distribution of Mappability.
Adjustment is the correction at this time order-checking.Promptly, think that promptly the confidence level of these sequences is lower than the sequence of perfect contraposition to there being unmatched sequence that certain point penalty is arranged.Through overtesting, present method is made as 50%, 2 unmatched sequence adjustment value of base with 1 unmatched sequence adjustment value of base and is made as 25%.
3. at high-flux sequence instrument Solexa, 454 (GS-FLX) all can use the RMI index to carry out data analysis among SOLiD and the Polonator.
CN2009100512390A 2009-05-14 2009-05-14 Method for analyzing high-throughput sequencing data based on RMI (Read Mass Index) Pending CN101886114A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100512390A CN101886114A (en) 2009-05-14 2009-05-14 Method for analyzing high-throughput sequencing data based on RMI (Read Mass Index)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009100512390A CN101886114A (en) 2009-05-14 2009-05-14 Method for analyzing high-throughput sequencing data based on RMI (Read Mass Index)

Publications (1)

Publication Number Publication Date
CN101886114A true CN101886114A (en) 2010-11-17

Family

ID=43072205

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009100512390A Pending CN101886114A (en) 2009-05-14 2009-05-14 Method for analyzing high-throughput sequencing data based on RMI (Read Mass Index)

Country Status (1)

Country Link
CN (1) CN101886114A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102676657A (en) * 2012-04-18 2012-09-19 盛司潼 Sequencing image recognition system and sequencing image recognition method
CN103810404A (en) * 2014-01-13 2014-05-21 哈尔滨工程大学 High-flux DNA sequencing data matching reinforcement method based on Bayes technology
CN105574365A (en) * 2016-01-22 2016-05-11 北京圣谷同创科技发展有限公司 Statistics verification method for high-throughput sequencing mutation detection results
CN107832585A (en) * 2017-11-23 2018-03-23 南宁科城汇信息科技有限公司 A kind of RNAseq data analysing methods
CN108959851A (en) * 2018-06-12 2018-12-07 哈尔滨工程大学 A kind of Illumina high-flux sequence data error correction method

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102676657A (en) * 2012-04-18 2012-09-19 盛司潼 Sequencing image recognition system and sequencing image recognition method
CN103810404A (en) * 2014-01-13 2014-05-21 哈尔滨工程大学 High-flux DNA sequencing data matching reinforcement method based on Bayes technology
CN105574365A (en) * 2016-01-22 2016-05-11 北京圣谷同创科技发展有限公司 Statistics verification method for high-throughput sequencing mutation detection results
CN105574365B (en) * 2016-01-22 2018-10-26 北京圣谷同创科技发展有限公司 The statistics verification method of high-flux sequence abrupt climatic change result
CN107832585A (en) * 2017-11-23 2018-03-23 南宁科城汇信息科技有限公司 A kind of RNAseq data analysing methods
CN108959851A (en) * 2018-06-12 2018-12-07 哈尔滨工程大学 A kind of Illumina high-flux sequence data error correction method
CN108959851B (en) * 2018-06-12 2022-03-18 哈尔滨工程大学 Illumina high-throughput sequencing data error correction method

Similar Documents

Publication Publication Date Title
Müller et al. Capturing the dynamics of genome replication on individual ultra-long nanopore sequence reads
Borisov et al. Quantitation of molecular pathway activation using RNA sequencing data
CN105821042B (en) The relevant miRNA of mesenchymal stem cells derived from human umbilical blood Genome stability and its application
CA2949622C (en) Methods for standardized sequencing of nucleic acids and uses thereof
Korpelainen et al. RNA-seq data analysis: a practical approach
Yu et al. Identification of an m6A-related lncRNA signature for predicting the prognosis in patients with kidney renal clear cell carcinoma
CN112382362B (en) Data analysis method and device for target drugs
Chen et al. A survey on identification and quantification of alternative polyadenylation sites from RNA-seq data
CN105986008A (en) CNV detection method and CNV detection apparatus
CN104711250A (en) Building method of long fragment nucleic acid library
CN101886114A (en) Method for analyzing high-throughput sequencing data based on RMI (Read Mass Index)
Haase et al. Differential expression analysis of human endogenous retroviruses based on ENCODE RNA-seq data
CN103984879A (en) Method and system for measuring regional RPKM of to-be-measured genome
CN104232760A (en) Method and device for determining sample source of reading segments in mixed sequencing data
CN109943626A (en) A kind of PCR method detecting absolute telomere length
CN103177197A (en) Differential expression detecting and alternative splicing analyzing method based on high throughput sequencing
Jagoda et al. Regulatory dissection of the severe COVID-19 risk locus introgressed by Neanderthals
Spaethling et al. Single-cell transcriptomics for drug target discovery
Ye et al. Discovery of alternative polyadenylation dynamics from single cell types
CN115948521A (en) Method for detecting aneuploid missing chromosome information
Scott et al. Assessment of acute myeloid leukemia molecular measurable residual disease testing in an interlaboratory study
Rutledge et al. Gene expression analysis of primordial shoot explants collected from mature white spruce (Picea glauca) trees that differ in their responsiveness to somatic embryogenesis induction
Hernandez-Lopez et al. Lossy compression of quality scores in differential gene expression: A first assessment and impact analysis
Warren et al. Development of Gene Expression-Based Biomarkers on the nCounter® Platform for Immuno-Oncology Applications
Li et al. Rfoot‐seq: Transcriptomic RNase Footprinting for Mapping Stable RNA‐Protein Complexes and Rapid Ribosome Profiling

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20101117