CN101886114A

CN101886114A - Method for analyzing high-throughput sequencing data based on RMI (Read Mass Index)

Info

Publication number: CN101886114A
Application number: CN2009100512390A
Authority: CN
Inventors: 曾华宗; 周杰
Original assignee: SHANGHAI CLUSTER BIOTECH CO Ltd
Current assignee: SHANGHAI CLUSTER BIOTECH CO Ltd
Priority date: 2009-05-14
Filing date: 2009-05-14
Publication date: 2010-11-17

Abstract

The invention relates to a method for rapidly and accurately analyzing high-throughput sequencing results based on a new RMI (Read Mass Index). The method is simultaneously suitable for analyzing data generated by using a transcriptome high-throughput sequencing (RNA-seq) method and a chromosome immune coprecipitation high-throughput sequencing (ChIP-seq) method.

Description

Based on RMI exponential high-flux sequence data analysing method

Technical field

The invention belongs to biological technical field, relate to high throughput sequencing technologies data analysis of new generation.

Background technology

The present invention is a kind of high throughput sequencing technologies that is applicable to--Solexa[Bennett, S., SolexaLtd.Pharmacogenomics, 2004.5 (4): new analytical procedure p.433-8].Present method is applicable to simultaneously transcribes group order-checking--RNA-seq[Wold, B.and R.M.Myers, Sequencecensus methods for functional genomics.Nat Methods,: p.19-21]) and co-immunoprecipitation order-checking--ChIP-seq[Johnson 2008.5 (1), D.S., et a1., Genome-wide mapping of in vivo protein-DNA interactions.Science, 2007.316 (5830): p.1497-502] two kinds of data that experimental technique produces.

High throughput sequencing technologies is the important breakthrough of biological technical field over the past two years, and the sequencing technologies of a new generation has improved hundreds of times with traditional Sanger order-checking efficient, and price also descends greatly simultaneously.The appearance of high throughput sequencing technologies makes many extremely promising biological medicines use becomes possibility: 1, and the cancer genome.2, personalized medicine and diagnosis.3, the drug targets screening.Can high throughput sequencing technologies the making progress of these fields, and its key is the innovation of analytical procedure and software.This patent has proposed a kind of new analytical procedure, can be widely used in the data analysis of high throughput sequencing technologies.

So far, the high-flux sequence platform of extensively quoting mainly contains three kinds: 1, and the Genome Analyzer of Illumina company (the being called Solexa again) platform that checks order.2, the SoLiD order-checking platform of ABI company.3,454 order-checking platforms of Roche company.Three kinds of order-checking platforms also respectively have characteristics based on different sequencing technologies respectively, but present widespread use the most is the Solexa platform.

The Solexa platform is fixed in solid surface with DNA to be amplified and order-checking, uses BridgePCR amplification amplification of DNA fragments, and uses reverse dye terminator technology to check order.About 8000 dollars of cost of Solexa platform operation can produce～sequence data of about 40,000,000 35-70bp.Solexa platform cost is far below 454 platforms (in every bp cost), and do not have the problem of the existing G/C deviation of SoLiD technology, therefore extensively quoted in the biological study field.

The Solexa technology mainly contains two portions application at present: 1, and RNA-Seq promptly transcribes the group order-checking.After mRNA reverse transcription in the cell or tissue is eDNA, increases and import the order-checking of Solexa platform, can obtain the expression amount of mRNA after the result who obtains analyzes.The RNA-seq technology is owing to have accurate quantification and highly sensitive characteristics, and being considered to will very fast replacement Microarray technology.2, ChIP-Seq, i.e. co-immunoprecipitation sequencing technologies.This technology can be located transcription factor (transcription factor) and is widely used in biomedical research with the binding site (binding site) of DNA.

Data analysis software at the Solexa technology platform has following a few class at present: 1, sequence contraposition software, the reads of Solexa order-checking is located [Langmead fast on genome, B., et al., Ultrafast and memory-efficient alignment of short DNA sequencesto the human genome.Genome Biol, 2009.10 (3): p.R25].2, the RNA-seq analysis software is determined each expression of gene amount [Mortazavi according to the data of RNA-seq, A., et al., Mapping and quantifying mammalian transcriptomes by RNA-Seq.Nat Methods, 2008.5 (7): p.621-8].3, the ChIP-seq analysis software, the result of ChIP-seq is resolved to transcription factor binding site point (transcription factor binding site) [Rozowsky, J., et al., PeakSeq enables systematic scoring of ChIP-seqexperiments relative to controls.Nat Biotechnol, 2009.27 (1): p.66-75].This patent is devoted to back two classes application, and has proposed brand-new analysis thinking to improve the quality of analytical results.

Summary of the invention

The present invention is based on present Solexa sequencing technologies, found a kind of new analytical procedure that can define differential expression and transcription factor binding site point, significantly improved analysis precision with respect to other analytical procedures.The step of present method is as follows:

(1) obtains the Solexa sequencing sequence, use ELAND software to carry out contraposition (Alignment) to the reference gene group all sequences.Sequence (sequence too low as sequencing quality) that can't contraposition abandons.The highest contraposition result the highest or arranged side by side for the sequence that multiple contraposition is arranged (multiple hits) retention score.

(2) the bit sequence file that obtains is changed into RMI (Read Mass Index) Score.The method of calculation of RMI are as follows:

RMI＝(Read?Coverage/Mappability)*Adjustment

The number of times that checked order for this site of Read Coverage wherein, the Read Coverage that we can utilize the contraposition file directly to calculate to be accurate to every bp.

Mappability represents the theoretical value that this section is covered by stochastic sequence under null hypothesis.This theoretical value and distribution thereof are depended on reference to genome, can't calculate with theoretical formula, but we can utilize and carry out computer Simulation with reference to genome (Reference Genome) and draw.Its Calculation Method is: will be split as the 35bp segment of (perhaps 70bp depends on the length of Solexa order-checking) with reference to genome, be step-length with 1bp, and the section that each is possible in theory all takes out, then with all segments all to the contraposition of protogene group.The contraposition result who so obtains is the theoretical distribution of Mappability.Obviously, the Mappbility of the tumor-necrosis factor glycoproteins in the genome will be than unique sequence Mappability height, and this also is that we will carry out gauged reason to Mappability when calculating RMI.

Adjustment is the correction at this time order-checking.Deposit just sequencing error in the process of Solexa order-checking, therefore be not all sequence can perfect contraposition (perfect match) to the reference genome.Have some sequences that the error (1bp mismatch) of 1bp will be arranged, other has some sequences that the error (sequence more than the 2bp error will not considered) of 2bp is arranged.Present method has certain point penalty to the sequence that mismatch is arranged, and thinks that promptly the confidence level of these sequences is lower than the sequence of perfect contraposition (perfect match).Through overtesting, present method is made as 50% with the sequence confidence level of 1bp mismatch, and the sequence confidence level of 2bp mismatch is made as 25%.

(3) through after the above step, we have obtained the RMIindex in the full genome range.Following step is to calculate the theoretical distribution of RMI.Next will be divided into two kinds of situation discussion: A, RNA-seq analyzes.B, Chip-seq analyzes.

(A) RNA-seq analyzes.RNA-seq analyzes comparatively simple.In general, our experimental design is two samples of contrast, and perhaps a series of seasonal effect in time series samples compare mutually.We have obtained actual RMI by step (2) and have distributed, and now calculative is the theoretical distribution of RMI.In given section, this distribution will be a binominal distribution:

fx (x) = P [X = k] = (\begin{matrix} n \\ k \end{matrix}) p^{k} {(1 - p)}^{n - k} fork = 0,1,2, . . ., n

Under theoretical situation, each expression of gene amount known (being assumed to E), during this distributes n for the sum (Total number of reads) that is listed as of order-checking, p is the theoretical value of the sequence number of this section: (Mappability/ ∑ Mappability) * n*E.Yet gene has different separately expression amounts, can't learn before the experiment.Therefore we choose a sample in the experiment, calculate E and corresponding p, define it and are standard value.The numerical value that other sample calculation goes out then carries out statistical test with this standard value.If significant difference occurs, then show to have significant difference.In order to carry out above statistical test, we need know the standard deviation of this distribution.We know that the standard deviation of binominal distribution is p (1-p).Last parameter k is actual herein sequence number (number of observed reads) in the formula.For simplifying the calculating of back, when sequence number is big (more than or equal to 30), binominal distribution can be simulated with normal distribution:

f (x) = \frac{1}{σ \sqrt{2 π}} e^{- \frac{{(x - μ)}^{2}}{2 σ^{2}}}

The mean value of this distribution is identical with previous distribution with standard deviation, μ=(Mappability/ ∑ Mappability) * n*E, σ ^2=μ * (1-μ).

As previously mentioned, obtain after the theoretical distribution of RMI, suppose that we will carry out the differential gene expression analysis to the RNA-seq result of two samples, then we to set one of them sample be standard, variance known (theoretical value) so carry out standard z check with another sample, obtains p-value.If a plurality of biological repeat (biological replicates) are arranged, then can carry out the t check.

(4) be control false positive results (false positive results), when carrying out the genome range experiment, multiple correction is essential.Here we will carry out the correction of Bonferroni multiple check to the p-value that step (3) obtain.Suppose us with step-length x, interval y carries out the t check to whole genome, supposes that genome length is L.Then our check sum that carries out is: N=(L-y)/x.According to this amount of testing, we think that having only p-value is remarkable result less than the side as a result of 0.05/N.

(B) ChIP-seq analyzes.Analysis principle and the RNA-seq of ChIP-seq are basic identical.Unique different step is: the ChIP-seq experiment compares with Input sample and ChIP-ed sample.Therefore we are standard with the Input sample, and setting its expression amount at certain section is E, and the parameter in the corresponding binominal distribution is P, and its standard deviation is P* (1-P).Thereafter step is identical with RNA-seq.

The feature of present method:

Present method has designed a kind of new analytical procedure based on the RMI index, is applicable to the analysis of RNA-seq and ChIP-seq two big class high-flux sequence experiments.This method has more elasticity than existing additive method, and higher accuracy is arranged.Be embodied in: 1) analyzing RNA-seq as a result the time, is not that gene is a unit, but is unit with the section.The size of section can define arbitrarily, so present method can detect differential expression and variable shearing simultaneously, and this is the remarkable difference of present method and additive method.2) when analyzing the ChIP-seq sequence, as long as do a little optimization, (such as, it can be the minimized interval of p-value that binding site is defined as), then present method will provide the peak value that is accurate to base, will be more accurate than additive method.This is the difference place of RMI method than other ChIP-seq analytical procedures.

More than be the description of this invention and non-limiting, based on other embodiment of inventive concept, all among protection scope of the present invention.

Claims

1. based on RMI exponential high-flux sequence analyzing novel methods, be characterized in the high-flux sequence result being carried out rapid and accurate analysis based on a new index (RMI:Read Mass Index).The method is characterized in that and have the following steps:

Step 1: obtain transcript high-flux sequence (RNA-seq) or karyomit(e) co-immunoprecipitation high-flux sequence (ChIP-seq) data.

Step 2: according to high-flux sequence information, and proofread and correct based on species gene group sequence information, the experience of estimation RMI distributes.

Step 3: utilize the RMI experience to distribute and sequencing data, identify that differential expression section or difference are in conjunction with the peak.

2.RMI method of calculation be:

RMI＝(Read?Coverage/Mappability)*Adjustment

The number of times that checked order for this site of Read Coverage wherein

Mappability represents the theoretical value that this section is covered by stochastic sequence under null hypothesis.The method of calculation of RMI are as follows:

RMI＝(Read?Coverage/Mappability)*Adjustment

Mappability represents the theoretical value that this section is covered by stochastic sequence under null hypothesis.Calculation Method is the segment that the reference genome is split as n base (the n value reads the length of sequence for different order-checking instruments), is step-length with 1 base, and the section that each is possible in theory all takes out, then with all segments all to the contraposition of protogene group.The contraposition result who so obtains is the theoretical distribution of Mappability.

Adjustment is the correction at this time order-checking.Promptly, think that promptly the confidence level of these sequences is lower than the sequence of perfect contraposition to there being unmatched sequence that certain point penalty is arranged.Through overtesting, present method is made as 50%, 2 unmatched sequence adjustment value of base with 1 unmatched sequence adjustment value of base and is made as 25%.

3. at high-flux sequence instrument Solexa, 454 (GS-FLX) all can use the RMI index to carry out data analysis among SOLiD and the Polonator.