A kind of analyzing gene is expressed quantitative method
Technical field
the present invention relates to the nucleic acid sequencing technical field, particularly the method for information analysis after RNA-seq technical field and order-checking.
Background technology
at present, the quantitative gene expression research field mainly contains two kinds of technology: traditional chip technology and sequencing technologies.Wherein, the chip technology flux is high, automatization, and cost is low, but chip technology depends on known, and signal noise is high, poor repeatability, detection threshold is narrow; Sequencing technologies is divided into again SAGE, digital gene express spectra (Digital Gene Expression, DGE) and digital gene express spectra upgrade version RNA-Seq (Quantification) technology, wherein, the order-checking of SAGE technology is accurate, but complex operation, the order-checking cost is high.DGE based on s-generation high-flux sequence platform and RNA-Seq technology have overcome the shortcoming of chip technology and SAGE technology, and their flux are high, automatization, and the order-checking cost is low, and noise is little, does not rely on known, and detection threshold is wide.
but DGE is due to the limitation of experiment itself, caused this technology the gene that does not contain the CATG site can not be detected, and the DGE technology is very strong to the dependency of reference gene during quantitative gene expression in research, for the quantitative analyses of some non-model animalss, also have some limitations.
the s-generation high throughput sequencing technologies that the illumina of take order-checking platform is representative has not only been saved a large amount of man power and materials, but also has the many merits that sequencing throughput is high, accuracy is high and cost is low.This platform is widely used at present: genome sequencing, and the new species order-checking, the order-checking of target gene group, transcribe the fields such as group and epigenetic analysis.
along with the widespread use of s-generation high-throughput illumina order-checking platform, carrying out on a large scale of the order-checking of many species genes group and full genome research, reduce the order-checking cost, reduces the order-checking flow process, raises labour efficiency and become an important research direction of sequencing technologies.And the gene expression analysis based on illumina order-checking platform RNA-seq exists step many, cost is high, and operating process is loaded down with trivial details, is not suitable for the shortcomings such as Automation workstation.
Summary of the invention
one aspect of the present invention provides a kind of analyzing gene to express quantitative method, comprising:
(1) purified mRNA from total RNA, prepare fragmentation mRNA;
(2) described fragmentation mRNA reverse transcription is prepared to cDNA, will be prepared as flat end DNA after described cDNA purifying, the flat end DNA of purifying;
(3) end of described flat end DNA is added to " A " base, obtain the DNA that end adds " A " base;
(4) the DNA two ends that add endways " A " base add joint sequence, and the DNA that the purifying two ends add joint sequence carries out the PCR reaction, purifying PCR reaction product;
(5) to described PCR reaction product order-checking;
(6) the defective sequence of data filter order-checking obtained obtains clean sequence, and described clean sequence and reference sequences are compared, and the comparison result is analyzed.
in one embodiment of the invention, the amount of choosing of described total RNA is 0.1 μ g~2 μ g.
in one embodiment of the invention, the Oligo(dT that uses Invitrogen company to produce)
25
(production number 610.06) magnetic bead purified mRNA from total RNA.
in one embodiment of the invention, the described cDNA of Ampure XP magnetic bead (production number A63882) purifying that uses Beckman company to produce, DNA fragmentation, the PCR reaction product that two ends add joint.
in one embodiment of the invention, use the reagent I to prepare fragmentation mRNA, described reagent I contains: the 10-400mM soluble salt, and the 200mM-300mM buffering salt, pH 8.0-8.5, solvent is water; Preferably, in the reagent I, buffering salt is selected from Tris-HCl, phosphoric acid salt.Preferably, in the reagent I, soluble salt is selected from sodium-chlor, magnesium chloride.Preferably, mRNA and reagent I mixing temperature are 65 ℃~94 ℃.
in one embodiment of the invention, use the reagent II to carry out the end reparation to cDNA, obtain flat end DNA, described reagent II contains: 1.2uLT4 DNA polysaccharase (3U/ μ L), 1.2uLT4 polynueleotide kinase (10U/ μ L), 0.2ulKlenow DNA polysaccharase (5U/ μ L), 0.4uL 25mM dNTP; T4 polynueleotide kinase damping fluid contains 700 mM Tris-HCl, 100 mM magnesium chlorides, 50 mM DTT.
in one embodiment of the invention, use the reagent III to add " A " base to the end of described flat end DNA, described reagent contains: 100 mM-500mM soluble salt, 100 mM buffering salts, 10mM-50mM dithiothreitol (DTT), 5mM dATP, 0.2 μ L Klenow (3 '-5 ' exo) enzyme (5U/ μ L), pH7.6-7.9, solvent is water.Preferably, in the reagent III, buffering salt is selected from Tris-HCl, phosphoric acid salt.Preferably, in the reagent III, soluble salt is sodium-chlor.Preferably, sample and reagent III mixing temperature are 16 ℃-37 ℃.
in one embodiment of the invention, the DNA two ends of using the reagent IV to add endways " A " base add joint sequence, described reagent IV contains:, 100 mM buffer salt solutions, 10mM~50mM dithiothreitol (DTT), 5~10mM ATP, 1.2 μ L T4 DNA ligase enzymes, the pH value is 7.6~7.9, and solvent is water.Preferably, buffer salt solution is Tris-HCl, phosphate buffer soln.
in one embodiment of the invention, also comprise step before to described PCR product order-checking: adopt Agilent Bioanalyzer 2100 and Q-PCR to detect DNA concentration and DNA fragmentation size.
in one embodiment of the invention, high throughput sequencing technologies is used in described order-checking.Preferably, use illumina solexa sequencing technologies.
in one embodiment of the invention, described defective sequence comprises: sequencing quality surpasses 50% sequence of whole piece sequence base number lower than the base number of predetermined threshold, and in sequence, the uncertain base number of sequencing result surpasses 10% sequence of whole piece sequence base number; The exogenous array of introducing except the sample joint sequence.
in one embodiment of the invention, SOAPaligner/soap2 is used in described comparison.
in one embodiment of the invention, described the comparison result is analyzed and comprised: the quality evaluation of high-flux sequence, the genetic expression quantitative statistics, differential gene expression screening, experimental repeatability is analyzed, and the differential gene expression pattern clustering is analyzed, Gene Ontology(GO) function significance enrichment analysis, the enrichment of path (Pathway) significance is analyzed, the protein-interacting network analysis.
be applied to quantitative gene expression research overcome the DGE technology to the CATG site and with reference to the gene complete dependency the very strong shortcoming shortcomings such as also to have overcome the chip technology detection threshold narrow, and noise pollution is large simultaneously.Thereby reach veritably the advantages such as quantitative standard, repeatability is high, expense is low, detection threshold is wide, signal noise is little.
The accompanying drawing explanation
fig. 1 illustrates library construction schema of the present invention;
fig. 2 illustrates information analysis schema of the present invention;
fig. 3 illustrates shown in Fig. 2 the distribution plan of sample one Reads on reference genome chr10 in application examples;
fig. 4 illustrates shown in Fig. 2 the results relevance analytical results figure of twice parallel laboratory test in application examples;
fig. 5 illustrates shown in Fig. 2 the distribution plan of sample one order-checking reads on gene in application examples.
Embodiment
in order to make purpose of the present invention, technical scheme and advantage clearer, below in conjunction with drawings and Examples, the present invention is further elaborated.Unreceipted actual conditions person in embodiment, carry out according to the condition of normal condition or manufacturers's suggestion.The unreceipted person of production firm of agents useful for same or instrument, being can be by the conventional products of commercial acquisition.
the RNA-seq of two human tissue samples of embodiment analyzes
tissue samples is provided by Peking University.The library construction process as shown in Figure 1.Get total RNA sample of 0.1 μ g~2 μ g, digested with deoxyribonuclease I (DnaseI), ethanol deposition and purification digestion after product, used Oligo(dT)
25
magnetic bead takes out the mRNA in the total RNA of gained purifying, gained mRNA is mixed and reacts with the reagent I, obtain the mRNA of fragmentation, gained mRNA mixes with the reagent I mRNA that reacts the fragmentation obtained, through the synthetic cDNA of reverse transcription, use Ampure XP magnetic beads for purifying product, gained cDNA mixes and reacts with the reagent II, form the DNA fragmentation of flat end, use Ampure XP magnetic beads for purifying product, the flat end DNA fragmentation of gained mixes and reacts with the reagent III, obtain the DNA fragmentation that 3 ' ends add " A " base, mix and react with the reagent IV, must arrive the DNA fragmentation that two ends add joint, use Ampure XP magnetic beads for purifying product, adopt polymerase chain reaction (PCR) amplification gained DNA fragmentation, Ampure XP magnetic beads for purifying PCR product, upper machine order-checking.Illumina Hiseq2000 is used in order-checking.
the reagent I is: the 10-400mM magnesium chloride, and 200mM-300mM Tris-HCl, pH 8.0-8.5, solvent is water.
the reagent II is: 1.2uLT4 DNA polysaccharase (3U/ μ L), 1.2uLT4 polynueleotide kinase (10U/ μ L), 0.2ulKlenow DNA polysaccharase (5U/ μ L), 0.4uL 25mM dNTP; T4 polynueleotide kinase damping fluid contains 700 mM Tris-HCl, 100 mM magnesium chlorides, 50 mM DTT.
the reagent III is: 100 mM-500mM sodium-chlor, and 100 mM Tris-HCl, 10mM-50mM dithiothreitol (DTT), 5mM dATP, 0.2 μ L Klenow (3 '-5 ' exo) enzyme (5U/ μ L), pH7.6-7.9, solvent is water.
the reagent IV is: 100 mM Tris-HCl, and 10mM~50mM dithiothreitol (DTT), 5~10mM ATP, 1.2 μ L T4 DNA ligase enzymes, the pH value is 7.6~7.9, solvent is water.
fig. 2 shows the realization flow of digital gene express spectra upgrade version RNA-Seq (Quantification) the bioinformatic analysis method that the embodiment of the present invention provides, and details are as follows:
in step S1, receive the order-checking fragment that high throughput sequencing technologies obtains.In embodiments of the present invention, adopt Illumina Hiseq2000 order-checking.After receiving the primitive sequencer sequence, the primitive sequencer sequence is filtered, removed underproof sequence.Defective sequence comprises: the sequencing quality value surpasses 50% of whole piece sequence base number lower than 5 base number and thinks defective sequence; In sequence in sequencing result the uncertain base number of sequencing result surpass 10% of whole piece sequence base number and think defective sequence; With the sequence measuring joints sequence library, compare, if exist the sequence measuring joints sequence to think defective sequence in sequence.
in step S2, sample joint sequence in each sequence and sample joint sequence storehouse are compared, realize a minute sample operations, the sample joint sequence is removed from sequence fragment simultaneously.The sequence that the base number that has sequencing quality lower than 5 in joint sequence (the present embodiment is 8bp) is greater than to 3 is removed.
in step S3, the embodiment of the present invention adopts SOAPaligner/soap2, and the order-checking fragment that high throughput sequencing technologies is obtained is compared with reference to genome sequence and listed.
in step S4, the embodiment of the present invention is mainly that the mode with figure briefly provides Reads in each position of genome distribution situation roughly, and the distribution situation of this position gene.As Fig. 3 draws the distribution plan of Reads on the longest 1 karyomit(e) (or Scaffold), the distribution of sample one Reads on reference genome chr10.Wherein Gene refers to the number of gene in each window, and Coverage refers to the ratio of the zone that covered by reads under each window and length of window, and Reads refers to the average order-checking degree of depth of each window, and numerical value has been got log
2
.
in step S5, for weighing the how many standard of order-checking amount of sample, along with increasing of order-checking amount (reads quantity), the gene number detected also rises thereupon, when the order-checking amount reaches certain value, its gene detected is counted rate of growth and is tended towards stability, and illustrates that the gene number detected is tending towards saturated.
in step S6, the present invention calculates the expression amount of gene by the RPKM method, and its calculation formula is:.
in formula, the expression amount that RPKM (A) is Gene A, C is the reads number of unique comparison to Gene A, N is that unique comparison is to the total reads number with reference to gene, the base number that L is Gene A.The RPKM method can be eliminated mrna length and order-checking amount difference to calculating the impact of genetic expression, and the gene expression amount calculated can be directly used in the gene expression difference of more different sample rooms.
then, the present invention describes the attribute of gene comprehensively according to the Gene Ontology Gene Ontology of International standardization, comprising the bioprocess (biological process) of the molecular function (molec μ Lar function) of gene, residing cell position (cell μ Lar component), participation.
in step S7, thereby the present invention filters out the gene of differential expression by the data between more different samples, differential gene expression pattern clustering in subsequent analysis is analyzed, the enrichment of Gene Ontology function significance is analyzed, the enrichment of Pathway significance is analyzed, and the interactions between protein network analysis all is based on difference expression gene.
be published in differential gene detection method (the Audic S. and Claverie J. The Significance of Digital Gene Expression Profiles. Genome Research based on order-checking on Genome Research with reference to people such as Audic S., 1997 7:986-995.), screen the difference expression gene between two samples.
the similar gene of expression pattern has similar function usually.We utilize cluster software, take Euclidean distance as apart from the matrix calculation formula, and difference expression gene and experiment condition are carried out to the grade cluster analysis simultaneously.
the enrichment of function significance analyze provide with the reference genetic comparison after, the GO function entry of significant enrichment in difference expression gene, and filter out difference expression gene and which biological function significant correlation.This analyzes at first each term mapping to Gene Ontology database (http://www.geneontology.org/) all differences expressing gene, calculate the number gene of each term, then apply the hypergeometry check, find out with whole genome background and compare, the GO entry of significant enrichment in difference expression gene.
in vivo, different genes coordinates to exercise its biological function mutually, and the analysis based on pathway contributes to further to understand the biological function of gene.KEGG is the main public database of relevant pathway, and the enrichment of pathway significance is analyzed and be take KEGG pathway as unit, and the application hypergeometry is checked, and finds out the pathway of significance enrichment in the rear difference expression gene of comparing with whole genome.
the interactions between protein network analysis has been integrated the information of the interactive network databases such as BIND, BioGrid, HPRD, and the network in destination file is had the genomic constitution of direct interaction by difference expression gene and the different expressing gene of the manservant of an official.
in step S8, the present invention can obtain the assessment to experimental result reliability and operational stability to the results relevance analysis of twice parallel laboratory test.As shown in Figure 4, if the dependency between twice parallel laboratory test of same sample more approaches 1, illustrate that repeatability is higher.
in step S9, the present invention with reads the distribution situation on the reference gene estimate the random degree that mRNA interrupts.Because difference has different length with reference to gene, we reads the location criteriaization on the reference gene to relative position (position of reads on gene and the ratio of mrna length), then add up the reads number that the different positions of gene is compared.If it is good to interrupt randomness, reads should distribute more evenly at each position of gene.What Fig. 5 provided is sample one distribution of order-checking reads on gene.
description of the invention provides for example with for the purpose of describing, and is not exhaustively or limit the invention to disclosed form.Many modifications and variations are obvious for the ordinary skill in the art.Selecting and describing embodiment is for better explanation principle of the present invention and practical application, thereby and makes those of ordinary skill in the art can understand the various embodiment with various modifications that the present invention's design is suitable for specific end use.