Method and device for processing sequencing data
Technical field
The present invention relates to the field of sequencing data processing, and in particular to a method and a device for processing sequencing data.
Background art
Chromosomal abnormalities may be numerical or structural. Numerical abnormalities include trisomy (one extra chromosome), monosomy (one missing chromosome) and polyploidy (an extra complete set of chromosomes). Structural abnormalities include structural rearrangements caused by chromosome breakage and the like, such as translocations, inversions, deletions and insertions.
Numerical chromosomal abnormalities, such as aneuploidy and polyploidy, are associated with a variety of diseases, including birth defects and cancer. China has nearly 20 million newborns each year, of whom about 4% to 6% have birth defects. Fetal chromosomal abnormality is one of the most common types of birth defect seen clinically; according to statistics, about 1 in 160 newborns has a chromosomal abnormality. Trisomy syndromes have the highest incidence among chromosomal disorders: when a cell carries three copies of a given chromosome instead of the normal two, so that the total chromosome number is 47, a trisomy syndrome may result. The most common trisomy syndromes are trisomy 21 syndrome (T21), Edwards syndrome (T18) and Patau syndrome (T13). To reduce the proportion of infants born with birth defects, rapid and accurate detection of chromosomal aneuploidy is necessary.
Non-invasive methods such as ultrasound scanning or biochemical marker screening have been used to assess the risk of chromosomal abnormality, but their accuracy is relatively low, only 60-80%, and they are affected by physiological factors such as maternal age at conception. Conventional prenatal diagnosis, by contrast, relies on invasive procedures such as amniocentesis or chorionic villus sampling, and therefore carries a risk of miscarriage and has a long turnaround time. In 1997, circulating cell-free fetal DNA was discovered in maternal plasma (Lo YM, Corbetta N, Chamberlain PF, Rai V, Sargent IL, Redman CW, Wainscoat JS. Presence of fetal DNA in maternal plasma and serum. Lancet. 1997 Aug 16;350(9076):485-7). In 1999, it was found that the concentration of circulating fetal DNA in the plasma of women carrying trisomy 21 fetuses was significantly higher than in the plasma of women carrying euploid fetuses (Lo, Y.M.D. et al., Clin Chem 45:1747-1751 (1999); Zhong, X.Y. et al., Prenat Diagn 20:795-798 (2000)). These findings opened new possibilities for non-invasive prenatal diagnosis. On this basis, the non-invasive prenatal field has made considerable progress, for example: enriching fetal DNA with methylation-sensitive enzymes to reduce maternal background interference (PCT/US2004/033175, 2004.10.08); screening for trisomy 21 by comparing the Ct values of gene-specific fragments by PCR (CN200610003103.9, 2006.02.10); and inferring fetal chromosomal aneuploidy by RNA-SNP-based allelic amplification detection (CN200680007354.2, 2006.03.17). However, enrichment of fetal DNA is time-consuming and laborious, and amplification techniques depend on sequence specificity or gene heterozygosity, which makes it difficult for them to become general-purpose techniques.
In 2008, Rossa W.K. Chiu et al. proposed that sequencing can yield a large amount of information about nucleic acid molecules in peripheral blood (Rossa W.K. Chiu, et al. Noninvasive prenatal diagnosis of fetal chromosomal aneuploidy by massively parallel genomic sequencing of DNA in maternal plasma. PNAS, 2008, 105:20458-20463). They found that, in samples with an abnormality on a clinically relevant chromosome, a parameter given by the ratio of the amount of nucleic acid molecules from the clinically relevant chromosome to the amount of nucleic acid molecules from background chromosomes differs from one or more normal control values constructed from normal samples. Methods based on high-throughput sequencing can therefore be used to detect chromosomal abnormalities, and they remove the dependence on amplification of specific sequences. However, existing genome sequencing detection methods require the sample to be tested to be compared with multiple samples or with standard normal samples, which is time-consuming and requires a large number of samples (e.g., Chinese patent application No. CN200880108377.1). They also impose strict requirements on the consistency of experimental conditions across sample batches, which limits their convenience and their high-throughput application.
Therefore, existing methods of processing sequencing data still need to be improved in order to increase the accuracy of data processing.
Summary of the invention
The main object of the present invention is to provide a method and a device for processing sequencing data, so as to improve the accuracy of sequencing data processing.
To achieve the above object, according to one aspect of the present invention, a method for processing sequencing data is provided. The processing method includes: obtaining, by high-throughput sequencing, nucleotide sequence information of all chromosomes from a maternal peripheral blood sample; dividing a reference genome into a plurality of specific regions such that the number of non-repeated sequences, NRSc, is equal in every specific region; assigning the nucleotide sequence information of all chromosomes from the maternal peripheral blood sample to the plurality of specific regions of the reference genome, and counting the NRSs value of the sample in each specific region; correcting the NRSs value of the sample in each specific region using the GC content, the corrected value being denoted NRSs'; based on the NRSs' values, calculating the mean of the NRSs' values of all specific regions on a target chromosome and on a control chromosome respectively, denoted the first mean and the second mean; and performing a difference test on the first mean and the second mean, and determining from the result of the difference test whether the chromosome is aneuploid.
Further, the step of correcting the NRSs value of the sample in each specific region using the GC content includes: correcting the NRSs value of the sample in each specific region using the correction formula NRSs' = NRSs × α, where $\alpha = \widetilde{NRSs} / NRSs''$, $\widetilde{NRSs}$ is the median of the NRSs values of all specific regions, and NRSs'' is the fitted value obtained by polynomial spline fitting of the GC content against the NRSs values of the specific regions of the sample.
Further, before polynomial spline fitting is performed on the GC content and NRSs values of the specific regions of the sample, the processing method also includes a step of removing specific regions with abnormal NRSs values from all specific regions of the sample; preferably, linear fitting or local polynomial regression fitting is used to remove the specific regions with abnormal NRSs values.
Further, the NRSc value is any integer from 10000 to 50000.
Further, the target chromosome is selected from any one or any combination of the following: chromosome 13, chromosome 18, chromosome 21, the X chromosome and the Y chromosome. The control chromosome is selected from any one or any combination of the following: chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12. Preferably, the control chromosome is selected from any one or any combination of the following: chromosomes 1, 2, 3, 6, 7, 11, 12 and 16.
To achieve the above object, according to another aspect of the present invention, a device for processing sequencing data is provided. The processing device includes: a sequencing module for obtaining, by high-throughput sequencing, nucleotide sequence information of all chromosomes from a maternal peripheral blood sample; a specific-region division module for dividing a reference genome into a plurality of specific regions according to the principle of equal NRSc values; an assignment module for assigning the nucleotide sequence information of all chromosomes from the maternal peripheral blood sample to the plurality of specific regions of the reference genome by sequence alignment with the reference genome; a first statistics module for counting the NRSs value of the sample in each specific region; a correction module for correcting the NRSs value of the sample in each specific region using the GC content, the corrected value being denoted NRSs'; a second statistics module for calculating, based on the NRSs' values, the mean of the NRSs' values of all specific regions on a target chromosome and on a control chromosome respectively, denoted the first mean and the second mean; a test module for performing a difference test on the first mean and the second mean; and a determination module for determining from the result of the difference test whether the chromosome is aneuploid.
Further, the correction module includes: a first calculation unit for calculating the median $\widetilde{NRSs}$ of the NRSs values of all specific regions; a fitting unit for performing polynomial spline fitting on the GC content and NRSs values of the specific regions of the sample to obtain a fitted curve; an acquisition unit for obtaining the fitted value NRSs'' of each specific region from the fitted curve; a second calculation unit for calculating the correction factor α according to the formula $\alpha = \widetilde{NRSs} / NRSs''$; and a correction unit for correcting the NRSs value of the sample in each specific region according to the correction formula NRSs' = NRSs × α.
Further, the fitting unit also includes a filtering subunit which, before the fitting unit performs polynomial spline fitting on the GC content and NRSs values of the specific regions of the sample to obtain the fitted curve, removes specific regions with abnormal NRSs values from all specific regions of the sample; preferably, the filtering subunit is a linear fitting subunit or a local polynomial regression fitting subunit.
Further, the NRSc value is any integer from 10000 to 50000.
Further, the target chromosome is selected from any one or any combination of the following: chromosome 13, chromosome 18, chromosome 21, the X chromosome and the Y chromosome. The control chromosome is selected from any one or any combination of the following: chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12. Preferably, the control chromosome is selected from any one or any combination of the following: chromosomes 1, 2, 3, 6, 7, 11, 12 and 16.
By applying the technical solution of the present invention, specific regions are divided, on the basis of the sequencing data, according to the principle of equal numbers of non-repeated sequences. This avoids the data fluctuation caused by uneven numbers of non-repeated sequences across specific regions and thereby improves the correlation of nucleic acid data parameters between chromosomes. The parameter of the clinically relevant chromosome in the biological sample is compared with the parameter of other, non-clinically-relevant chromosomal regions to determine whether chromosomal aneuploidy is present in the sample to be tested. The method achieves single-sample detection, does not require standard normal samples, removes the dependence on experimental conditions and speeds up analysis. It is a simple, fast and accurate means of detection: the accuracy of autosome detection is above 99% and the false positive rate is below 1%.
Brief description of the drawings
The accompanying drawings, which form a part of this application, are provided for further understanding of the present invention. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:
Fig. 1 shows the distribution of the non-repeated sequences in the sequencing reads of sample S001 (a negative sample) across the specific regions of the genome, according to preferred Embodiment 1 of the present invention;
Fig. 2 shows the distribution of the non-repeated sequences in the sequencing reads of sample S001 of Fig. 1 across the specific regions of the genome after outlier filtering;
Fig. 3 shows the spline curve fitted to the non-repeated sequences in the sequencing reads of sample S001 of Fig. 2 across the specific regions of the genome after outlier filtering;
Fig. 4a and Fig. 4b show the number of non-repeated sequences in the specific regions of each autosome of sample S001 in Embodiment 1, before correction (Fig. 4a) and after correction (Fig. 4b);
Fig. 5a and Fig. 5b show the number of non-repeated sequences in the specific regions of each autosome of sample S002 in another preferred embodiment, before correction (Fig. 5a) and after correction (Fig. 5b);
Fig. 6a and Fig. 6b show the number of non-repeated sequences in the specific regions of each autosome of sample S007 in another preferred embodiment, before correction (Fig. 6a) and after correction (Fig. 6b);
Fig. 7a and Fig. 7b show the number of non-repeated sequences in the specific regions of each autosome of sample S006 in another preferred embodiment, before correction (Fig. 7a) and after correction (Fig. 7b); and
Fig. 8a, Fig. 8b and Fig. 8c show the Z-value distributions of chromosome 13, chromosome 18 and chromosome 21, respectively, in the 384 online data samples of Embodiment 2 of this application; Fig. 8a shows chromosome 13, Fig. 8b shows chromosome 18 and Fig. 8c shows chromosome 21.
Detailed description of the embodiments
It should be noted that, where there is no conflict, the embodiments of this application and the features in the embodiments may be combined with one another. The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
Explanation of terms:
Sequencing data: the nucleotide sequence information obtained from the sample to be tested by high-throughput sequencing.
k-mer: a sequence is cut continuously by sliding one base at a time, and each resulting sequence of k nucleotides is a k-mer. For example, if the sequence ATCGTTGCTTAATGACGTCAGTCGAAT is analysed as 13-mers, the k-mers are ATCGTTGCTTAAT, TCGTTGCTTAATG, CGTTGCTTAATGA, GTTGCTTAATGAC, and so on.
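The sliding-window definition of a k-mer can be illustrated with a short Python sketch. The function name `kmers` is hypothetical and is introduced here only for illustration; the sequence and the value k = 13 are those of the example above.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return every length-k substring, sliding one base at a time."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length L yields L - k + 1 k-mers (none if L < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


seq = "ATCGTTGCTTAATGACGTCAGTCGAAT"
for kmer in kmers(seq, 13)[:4]:
    print(kmer)
```

The 27-base example sequence yields 15 overlapping 13-mers, of which the first four are the ones listed in the definition.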
Non-repeated sequence (NRS): the sequences obtained by sequencing the sample to be tested are aligned to the normal human genome, and the k-mers that are unique at the whole-genome level are the non-repeated sequences. In this application, when specific regions are divided so as to contain equal numbers of non-repeated sequences, the division is made according to the reference genome sequence. The number of non-repeated sequences in each specific region resulting from the division is therefore denoted NRSc, while the number of non-repeated sequences from the sequencing reads of the sample to be tested that actually fall in each such specific region is denoted NRSs.
Specific region (specified region, SR): a particular region on a chromosome of the genome obtained by the specific-region division method described in the present invention.
Chromosome: may refer either to a whole chromosome or to a part of a chromosome. The mathematical derivation for handling one chromosome fragment is the same as that for handling all chromosome fragments, and those skilled in the art know the corresponding adaptations. A control chromosome is a chromosome from a healthy individual or a chromosome presumed to be normal, including one presumed normal statistically. The chromosome here may be a single chromosome or a group of chromosomes (two or more chromosomes, that is, chromosomes other than 13, 18, 21, X and Y, or any combination thereof).
"Aneuploidy" and "polyploidy" refer to situations in which the number of chromosomes in a cell differs from the usual haploid number n or diploid number 2n. An aneuploid cell may be trisomic, that is, a cell with three copies of one chromosome, or monosomic, that is, a cell with a single copy of one chromosome. Chromosomal aneuploidy changes the relative amount of the affected homologous chromosome. Using a next-generation sequencing (NGS) platform in combination with bioinformatic analysis, the amount of each chromosome can be counted from the sequencing alignment results to determine whether the sample to be tested is aneuploid for that chromosome.
The sample is a cell, tissue or body fluid and may be selected from: maternal whole blood (peripheral blood), plasma, serum, urine, saliva and reproductive tract lavage fluid; fetal cells or fetal cell remnants, and pre-implantation embryo biopsy material; amniotic fluid, chorionic villus samples, and the like. The sample may come from any animal, preferably a mammal, more preferably a human.
Sequencing of the DNA sequencing library may be paired-end short-read, single-end long-read or single-end short-read sequencing. A paired-end short read consists of the sequence of fewer than 50 bp immediately following the 5' end adapter primer and the sequence of fewer than 50 bp immediately following the 3' end adapter primer. Preferably, a paired-end short read consists of the sequence of no more than 36 bp immediately following the 5' end adapter primer and the sequence of no more than 36 bp immediately following the 3' end adapter primer.
A single-end short read is the sequence of fewer than 50 bp immediately following the 5' end adapter primer or the sequence of fewer than 50 bp immediately following the 3' end adapter primer. Preferably, a single-end short read is the sequence of no more than 36 bp immediately following the 5' end adapter primer or the sequence of no more than 36 bp immediately following the 3' end adapter primer. A single-end long read is the sequence of more than 99 bp immediately following the 5' end adapter primer or the sequence of more than 99 bp immediately following the 3' end adapter primer. Paired-end sequencing means sequencing each of the two ends of a fragment separately. Single-end sequencing means sequencing one end of a fragment.
Because existing methods of detecting chromosomal aneuploidy still have shortcomings in accuracy and convenience, a typical embodiment of this application provides, in order to improve this situation, a method for processing sequencing data. The processing method includes: obtaining, by high-throughput sequencing, nucleotide sequence information of all chromosomes from a maternal peripheral blood sample; dividing a reference genome into a plurality of specific regions such that the number of non-repeated sequences (denoted NRSc) is equal in every specific region; assigning the nucleotide sequence information of all chromosomes from the maternal peripheral blood sample to the plurality of specific regions of the reference genome, and counting the NRSs value of the sample in each specific region; correcting the NRSs value of the sample in each specific region using the GC content, the corrected value being denoted NRSs'; based on the NRSs' values, calculating the mean of the NRSs' values of all specific regions on a target chromosome and on a control chromosome respectively, denoted the first mean and the second mean; and performing a difference test on the first mean and the second mean, and determining from the result of the difference test whether the chromosome is aneuploid.
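The step of counting the NRSs value of the sample in each specific region can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: the specific regions of one chromosome are assumed to be given by their sorted start coordinates, each aligned non-repeated sequence is assumed to be represented by a single coordinate, and the names `count_nrss`, `region_starts`, `region_end` and `read_positions` are hypothetical.

```python
from bisect import bisect_right


def count_nrss(region_starts, region_end, read_positions):
    """Count aligned non-repeated reads per specific region.

    Region i covers [region_starts[i], region_starts[i + 1]); the last
    region ends at region_end (exclusive). Reads outside all regions
    are ignored.
    """
    counts = [0] * len(region_starts)
    for pos in read_positions:
        if region_starts and region_starts[0] <= pos < region_end:
            # Index of the last region whose start is <= pos.
            counts[bisect_right(region_starts, pos) - 1] += 1
    return counts


starts = [0, 100, 250]                 # hypothetical region starts
reads = [5, 99, 100, 260, 399, 400]    # hypothetical read coordinates
print(count_nrss(starts, 400, reads))
```

A binary search keeps the assignment of each read logarithmic in the number of regions, which matters when a genome is divided into many specific regions.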
In the above processing method of this application, specific regions are divided, on the basis of the sequencing data, according to the principle of equal numbers of non-repeated sequences. This avoids the data fluctuation caused by uneven numbers of non-repeated sequences across specific regions and thereby improves the correlation of nucleic acid data parameters between chromosomes. The parameter of the clinically relevant chromosome in the biological sample is compared with the parameter of other, non-clinically-relevant chromosomal regions to determine whether chromosomal aneuploidy is present in the sample to be tested. The method achieves single-sample detection, does not require standard normal samples, removes the dependence on experimental conditions and speeds up analysis. It is a simple, fast and accurate means of detection: the accuracy of autosome detection is above 99% and the false positive rate is below 1%.
Specifically, the difference test may be any existing difference test, for example a Z-test, a u-test or a t-test. The Z-test is preferred in this application.
In the above processing method, the step of correcting the NRSs value of the sample in each specific region using the GC content may use an existing GC correction method, which can also improve detection accuracy. To make detection more accurate still, in a preferred embodiment of this application the correction method includes: correcting the NRSs value of the sample in each specific region using the correction formula NRSs' = NRSs × α, where $\alpha = \widetilde{NRSs} / NRSs''$, $\widetilde{NRSs}$ is the median of the NRSs values of all specific regions, and NRSs'' is the fitted value obtained by polynomial spline fitting of the GC content against the NRSs values of the specific regions of the sample. The corrected NRSs' values follow a normal distribution more closely, so that the result of the subsequent difference test is more accurate.
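The correction NRSs' = NRSs × α can be sketched in Python as follows. The sketch assumes that the correction factor is the ratio of the median NRSs value over all specific regions to the fitted value NRSs'' of the region, which is the reading of the formula adopted here; the fitted values are taken as input rather than computed, and the names `gc_correct`, `nrss` and `fitted` are hypothetical.

```python
from statistics import median


def gc_correct(nrss, fitted):
    """Return NRSs' = NRSs * alpha for each region.

    alpha = median(NRSs over all regions) / NRSs'' (assumed form of the
    correction factor); `fitted` holds the NRSs'' value of each region.
    """
    if len(nrss) != len(fitted):
        raise ValueError("nrss and fitted must have the same length")
    if any(f <= 0 for f in fitted):
        raise ValueError("fitted values must be positive")
    med = median(nrss)
    return [value * med / fit for value, fit in zip(nrss, fitted)]


# Hypothetical counts whose variation is fully explained by the GC fit:
# after correction every region sits at the median.
print(gc_correct([90, 100, 110], [90, 100, 110]))
```

Regions whose counts are inflated or deflated in line with the fitted GC trend are pulled back to the median level, while deviations from the trend are preserved.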
Fitting means that, given known discrete points {f1, f2, ..., fn} (with GC content as the X coordinate and the NRSs value as the Y coordinate), the undetermined coefficients (λ1, λ2, ..., λn) of a fitting function f are adjusted so that the difference between the function and the known point set is minimal in the least-squares sense. The known points $(x_i, Y_i)$, with $x_1 < x_2 < \dots < x_n$, $i \in \mathbb{Z}$, are a series of observations satisfying a relation $Y_i = \mu(x_i)$; a fitting function $\hat{\mu}(x)$ is constructed so that $\sum_{i=1}^{n} (Y_i - \hat{\mu}(x_i))^2$ is minimal. If the fitting function is a nonlinear function, the fit is called a nonlinear fit, also called a spline fit. Accordingly, if the fitting function is a polynomial, the fit may be called a polynomial spline fit. Polynomial spline fitting is preferred in the present invention, the spline curve being a smooth cubic curve.
For n given data points, a cubic spline curve has n−1 intervals, and the equation on each interval is $f_i(x) = a_i + b_i(x - x_i) + c_i(x - x_i)^2 + d_i(x - x_i)^3$. There are 4(n−1) unknown coefficients to determine. Continuity, together with equality of the first derivatives and of the second derivatives at the nodes, gives 4n−6 equations, and two boundary conditions are then added manually. The spline fit is performed with the function smooth.spline of the R software system (http://www.stat.wisc.edu~xie/smooth_splinetutorial.html).
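The coefficient count described above can be made concrete with a small Python sketch of a cubic spline through n points. Note that this is a swapped-in, simplified technique: it builds an interpolating spline with "natural" boundary conditions (second derivative zero at both ends, one common choice for the two added conditions), whereas smooth.spline in R fits a smoothing spline. The function names `natural_cubic_spline` and `evaluate` are hypothetical.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Return (a, b, c, d) for each of the n-1 intervals of a natural cubic spline."""
    n = len(xs)
    if n < 3 or any(x2 <= x1 for x1, x2 in zip(xs, xs[1:])):
        raise ValueError("need at least 3 strictly increasing x values")
    m = n - 1                                   # number of intervals
    h = [xs[i + 1] - xs[i] for i in range(m)]
    rhs = [0.0] * n
    for i in range(1, m):
        rhs[i] = 3 * (ys[i + 1] - ys[i]) / h[i] - 3 * (ys[i] - ys[i - 1]) / h[i - 1]
    # Forward sweep of the tridiagonal system for the c coefficients.
    mu, z = [0.0] * n, [0.0] * n
    for i in range(1, m):
        diag = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / diag
        z[i] = (rhs[i] - h[i - 1] * z[i - 1]) / diag
    # Back substitution; c[0] = c[n-1] = 0 are the two boundary conditions.
    c, b, d = [0.0] * n, [0.0] * m, [0.0] * m
    for j in range(m - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])
    return [(ys[j], b[j], c[j], d[j]) for j in range(m)]


def evaluate(xs, coeffs, x):
    """Evaluate a_j + b_j*t + c_j*t^2 + d_j*t^3 with t = x - xs[j]."""
    j = min(max(bisect_right(xs, x) - 1, 0), len(coeffs) - 1)
    a, b, c, d = coeffs[j]
    t = x - xs[j]
    return a + b * t + c * t ** 2 + d * t ** 3


xs, ys = [0.0, 1.0, 2.0], [0.0, 1.0, 0.0]   # hypothetical data points
coeffs = natural_cubic_spline(xs, ys)
print(len(coeffs) * 4, evaluate(xs, coeffs, 0.5))
```

For real GC-versus-NRSs data, which is noisy and contains repeated GC values, a smoothing spline or another regression-type fit would be needed instead of exact interpolation.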
Before polynomial spline fitting is performed on the GC content and NRSs values of the specific regions of the sample, the above processing method also includes a step of removing specific regions with abnormal NRSs values from all specific regions of the sample. Outliers may be removed by GC linear fitting or by manual screening, for example by deleting windows whose GC value is 0, whose number of non-repeated sequences is 0, or whose number of non-repeated sequences is markedly too large. In this application, local polynomial regression fitting is preferably used to remove the specific regions with abnormal NRSs values; this helps exclude those specific regions in which the number of non-repeated sequences is too high or too low owing to peculiarities of chromosome structure. Linear fitting may also be used. Fitting is a routine way of removing outliers in statistics and bioinformatics, and the specific methods are not repeated here.
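A linear-fit variant of the outlier removal can be sketched as follows. The text does not specify a cut-off, so the rule used here (drop windows whose residual from the least-squares line exceeds `max_sd` standard deviations of the residuals, default 3) is an assumption for illustration only, as are the names `filter_regions`, `gc`, `nrss` and `max_sd`.

```python
from statistics import mean, pstdev


def filter_regions(gc, nrss, max_sd=3.0):
    """Drop zero-valued windows, then windows far from a least-squares line.

    Returns the retained (gc, nrss) pairs. The max_sd threshold is an
    illustrative assumption, not a value taken from the text.
    """
    kept = [(g, n) for g, n in zip(gc, nrss) if g > 0 and n > 0]
    if len(kept) < 3:
        return kept
    g_mean = mean(g for g, _ in kept)
    n_mean = mean(n for _, n in kept)
    sxx = sum((g - g_mean) ** 2 for g, _ in kept)
    slope = sum((g - g_mean) * (n - n_mean) for g, n in kept) / sxx if sxx else 0.0
    intercept = n_mean - slope * g_mean
    residuals = [n - (slope * g + intercept) for g, n in kept]
    sd = pstdev(residuals)
    if sd == 0:
        return kept
    return [pair for pair, r in zip(kept, residuals) if abs(r) <= max_sd * sd]


# Hypothetical windows: a mild GC trend, one extreme count, one empty window.
gc_values = [0.30 + 0.01 * i for i in range(20)] + [0.0]
counts = [100 + i for i in range(20)] + [0]
counts[10] = 1000
print(len(filter_regions(gc_values, counts)))
```

Local polynomial regression, the preferred option in the text, would replace the straight line with a locally fitted curve but keep the same residual-based screening idea.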
In the above processing method, specific regions are divided according to the principle of equal NRSc values; the specific NRSc value may be chosen according to the genome size, sequence complexity and so on of the sample to be tested. Preferably, the NRSc value is any integer from 10000 to 50000.
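Division by equal NRSc can be sketched as follows, assuming the reference positions of the genome-wide unique k-mers on one chromosome are already known. The function name `partition_by_nrsc` is hypothetical, and discarding a trailing group with fewer than NRSc k-mers is an assumption of this sketch that the text does not state.

```python
def partition_by_nrsc(unique_kmer_positions, nrsc):
    """Group unique-k-mer positions into regions of exactly nrsc k-mers each.

    Returns (start, end) pairs, where start and end are the first and last
    unique-k-mer positions of the region. A trailing group with fewer than
    nrsc k-mers is discarded (assumption of this sketch).
    """
    if nrsc <= 0:
        raise ValueError("nrsc must be positive")
    positions = sorted(unique_kmer_positions)
    regions = []
    for i in range(0, len(positions) - nrsc + 1, nrsc):
        block = positions[i:i + nrsc]
        regions.append((block[0], block[-1]))
    return regions


# Toy example with NRSc = 3; the text prefers values of 10000 to 50000.
toy_positions = [1, 2, 3, 10, 20, 30, 31, 32, 40, 50, 60]
print(partition_by_nrsc(toy_positions, 3))
```

Because every region holds the same number of unique k-mers, regions in repeat-rich parts of the reference span more bases than regions in unique sequence.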
In the above processing method, the target chromosome and the control chromosome may be chosen reasonably according to the tissue or cell source of the sample to be tested, the species, or the actual detection requirements. When the sample to be tested is human, the target chromosome is preferably selected from any one or any combination of the following: chromosome 13, chromosome 18, chromosome 21, the X chromosome and the Y chromosome. The control chromosome is selected from any one or any combination of the following: chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12. More preferably, the control chromosome is selected from any one or any combination of the following: chromosomes 1, 2, 3, 6, 7, 11, 12 and 16.
Another typical embodiment of this application provides a device for processing sequencing data. The processing device includes: a sequencing module for obtaining, by high-throughput sequencing, nucleotide sequence information of all chromosomes from a maternal peripheral blood sample; a specific-region division module for dividing a reference genome into a plurality of specific regions according to the principle of equal NRSc values; an assignment module for assigning the nucleotide sequence information of all chromosomes from the maternal peripheral blood sample to the plurality of specific regions of the reference genome by sequence alignment with the reference genome; a first statistics module for counting the NRSs value of the sample in each specific region; a correction module for correcting the NRSs value of the sample in each specific region using the GC content, the corrected value being denoted NRSs'; a second statistics module for calculating, based on the NRSs' values, the mean of the NRSs' values of all specific regions on a target chromosome and on a control chromosome respectively, denoted the first mean and the second mean; a test module for performing a difference test on the first mean and the second mean; and a determination module for determining from the result of the difference test whether the chromosome is aneuploid.
The above processing device takes the sequencing data obtained by the sequencing module as its basis and uses the improved specific-region division module to divide specific regions according to the principle of equal numbers of non-repeated sequences, which improves the correlation of nucleic acid data parameters between chromosomes. The assignment module, first statistics module, correction module, second statistics module and test module are then executed in turn, comparing the parameter of the clinically relevant chromosome in the biological sample with the parameter of other, non-clinically-relevant chromosomal regions, and the determination module finally determines from the difference test result of the test module whether chromosomal aneuploidy is present in the sample to be tested. The device achieves detection from the sequencing data of a single sample, does not require standard normal samples, removes the dependence on experimental conditions, speeds up analysis and generally improves the assessment of chromosomal abnormalities. It is a simple, fast and accurate means of detecting chromosomal aneuploidy: the accuracy of autosome detection is above 99% and the false positive rate is below 1%.
Specifically, the test module may be any existing difference test module, for example a Z-test module, a u-test module or a t-test module. The Z-test module is preferred in this application.
The correction module may be an existing GC correction module, which can also improve detection accuracy. In a preferred embodiment of this application, the correction module includes: a first calculation unit for calculating the median $\widetilde{NRSs}$ of the NRSs values of all specific regions; a fitting unit for performing polynomial spline fitting on the GC content and NRSs values of the specific regions of the sample to obtain a fitted curve; an acquisition unit for obtaining the fitted value NRSs'' of each specific region from the fitted curve; a second calculation unit for calculating the correction factor α according to the formula $\alpha = \widetilde{NRSs} / NRSs''$; and a correction unit for correcting the NRSs value of the sample in each specific region according to the correction formula NRSs' = NRSs × α.
In the above preferred embodiment, a fitting unit that uses polynomial spline fitting has the advantage of high fitting accuracy, so that the fitted values are obtained more accurately. Accordingly, the correction factor calculated by the second calculation unit is also more accurate, and the correction unit can then correct the NRSs value of the sample to be tested in each specific region more accurately, that is, obtain NRSs' values of higher accuracy.
In the above processing device, the fitting unit also includes a filtering subunit which, before the fitting unit performs polynomial spline fitting on the GC content and NRSs values of the specific regions of the sample to obtain the fitted curve, removes specific regions with abnormal NRSs values from all specific regions of the sample. This further improves the accuracy of the fitting unit during polynomial spline fitting. Preferably, the filtering subunit is a conventional linear fitting subunit or local polynomial regression fitting subunit that filters out the outliers.
Preferably, in the above processing device the NRSc value is any integer from 10000 to 50000.
In the above processing device, the target chromosome and the control chromosome may be chosen reasonably according to the tissue or cell source of the sample to be tested, the species, or the actual detection requirements. When the sample to be tested is human, the target chromosome is preferably selected from any one or any combination of the following: chromosome 13, chromosome 18, chromosome 21, the X chromosome and the Y chromosome. The control chromosome is selected from any one or any combination of the following: chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12. More preferably, the control chromosome is selected from any one or any combination of the following: chromosomes 1, 2, 3, 6, 7, 11, 12 and 16.
The above method and device of the present application can be combined with other known methods, devices or compositions, preferably those that improve chromosome abnormality detection techniques, for example mathematical model analysis of maternal biochemical indicators.
The above method provided by the present application has the advantages of high throughput, low cost, simplicity, accuracy and high sensitivity. Existing methods need to compare the test sample with multiple samples or with standard normal samples, which is time-consuming and requires a large number of samples. The present application achieves single-sample detection that does not rely on standard normal samples, which avoids dependence on experimental conditions, speeds up analysis and improves detection accuracy.
The above scheme provided by the present application combines DNA sequencing with bioinformatic analysis and uses a difference test such as the Z test to judge whether a chromosome is abnormal. If the Z value lies outside ±4.5 (Z ≥ 4.5 or Z ≤ -4.5), chromosomal aneuploidy can be determined to be present. The chromosome abnormality is preferably trisomy 21, trisomy 13, or an abnormality of chromosome 18, the X chromosome or the Y chromosome.
The method of the present application is particularly suitable for detecting chromosome number abnormalities, preferably chromosomal aneuploidy, and more preferably autosomal aneuploidy.
The beneficial effects of the present application are further illustrated below with reference to specific embodiments.
Embodiment 1: Processing method for the sequencing data of a test sample
(1) High-throughput sequencing of the cell-free DNA fragments in the maternal blood of the test sample
(1) Collect whole blood from the pregnant woman and obtain plasma by pretreatment;
After informed consent was obtained, 5-10 ml of blood was drawn by venipuncture from a woman at 22 weeks of pregnancy (sample S001 in Table 2 below) and added to an ethylenediaminetetraacetic acid (EDTA) tube. After high-speed centrifugation, a plasma sample free of blood cells was obtained. The plasma volume of each sample was about 700 µl.
(2) Extract plasma DNA;
The DNA in the plasma was extracted using the HiPure Circulating DNA Kit produced by Magen (product number D3180-02).
(3) Prepare the DNA extracted from plasma into a library suitable for sequencing on a next-generation high-throughput sequencing platform
The plasma DNA was end-repaired and A-tailed using T4 DNA polymerase, T4 PNK and Klenow enzyme, and adapters were ligated using T4 DNA ligase and sequencing adapters. PCR was then performed using indexed library primers, followed by magnetic-bead purification and selection, to obtain the sequencing library ready for loading.
(4) Sequence the prepared library
The sequencing library was amplified on an Illumina cBot instrument, DNA clusters were generated from the single-end DNA sequencing library, and a large number of sequences with a read length of 36 bp were obtained.
(2) Determine the sequence information of the DNA fragments in the plasma
1. Divide the normal human reference genome into specific regions and compile statistics
(1) Screen non-repetitive sequences
The human reference genome (hg19, GRCh37, http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/) was cut into a large set of k-mers with a length of 35 bp and an offset of 1 bp. From this set, the unique k-mers, i.e. the sequences that are non-repetitive across the whole genome, were screened out and their position coordinates were recorded.
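The screening step above can be illustrated with a minimal Python sketch. It is not the implementation of the present application: the function name unique_kmers and the toy input are hypothetical, the reverse-complement strand is ignored, and all k-mers are held in memory, which is only practical for short sequences.

```python
from collections import Counter


def unique_kmers(sequence, k=35):
    """Return (start, kmer) pairs for k-mers that occur exactly once.

    The sequence is cut into k-mers of length k with an offset of 1 base,
    and only k-mers that appear a single time are kept.
    """
    seq = sequence.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    return [(i, kmer) for i, kmer in enumerate(kmers) if counts[kmer] == 1]


if __name__ == "__main__":
    # Toy example with k=3; "ACG" occurs twice and is therefore dropped.
    print(unique_kmers("ACGACGT", k=3))
```

On a real genome the same idea would need a disk-backed or succinct k-mer counter; the sketch only shows the uniqueness criterion.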
(2) Divide specific regions
Starting from the first non-repetitive sequence on chromosome 1, the start position is recorded. When 20000 non-repetitive sequences have been accumulated, the end position is recorded. The interval so obtained is defined as the first specific region on chromosome 1. Specific regions do not overlap one another.
The above steps are repeated for chromosome 1 through the Y chromosome to obtain the position information and GC content of the specific regions on all chromosomes (the normal human reference genome needs to be divided into specific regions only once; every subsequent test sample is processed according to the regions divided on the reference genome).
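The division rule can be sketched as follows. This is only an illustration under stated assumptions: the function name partition_regions is hypothetical, a region is taken to run from the start coordinate of its first non-repetitive sequence to the start coordinate of its last one, and a trailing group with fewer than the required number of sequences is discarded. None of these choices is specified in the text.

```python
def partition_regions(nrs_starts, nrs_per_region=20000):
    """Split sorted NRS start coordinates of one chromosome into regions.

    Each region contains exactly nrs_per_region non-repetitive sequences.
    Regions are consecutive and do not overlap. A trailing incomplete
    group is dropped (assumption).
    """
    regions = []
    last_full_start = len(nrs_starts) - nrs_per_region
    for i in range(0, last_full_start + 1, nrs_per_region):
        chunk = nrs_starts[i:i + nrs_per_region]
        regions.append((chunk[0], chunk[-1]))
    return regions


if __name__ == "__main__":
    # Toy example: 7 NRS start coordinates, 3 NRS per region.
    print(partition_regions([1, 2, 5, 9, 12, 20, 21], nrs_per_region=3))
```

Because the unit is a count of non-repetitive sequences rather than a fixed length, regions in repeat-rich parts of a chromosome come out physically longer than regions in unique sequence.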
(3) Specific region statistics
The number of specific regions on each chromosome and the GC content distribution of all non-repetitive sequences within each region were counted.
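The GC content of a region can be computed as the fraction of G and C bases over the non-repetitive sequences it contains. The sketch below assumes this simple definition; the function name gc_content is hypothetical.

```python
def gc_content(sequences):
    """Return the fraction of G/C bases across a collection of sequences."""
    total = sum(len(s) for s in sequences)
    if total == 0:
        return 0.0
    gc = sum(s.upper().count("G") + s.upper().count("C") for s in sequences)
    return gc / total


if __name__ == "__main__":
    print(gc_content(["ACGT", "GGCC"]))
```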
2. Alignment of the sample DNA sequences
Using the bioinformatic sequence alignment software BWA (Burrows-Wheeler Aligner), the DNA sequences obtained by sequencing were aligned against the normal human reference genome (hg19, GRCh37) with no tolerance for mismatches (exact matches only, no base mismatch allowed). This determines the detailed position of every sequenced DNA sequence on the genome, including its chromosome of origin, its coordinate on the chromosome and its distribution over the genome-specific regions (the distribution of the non-repetitive sequences of sample S001 of Table 2 over the specific regions of the genome is shown in Fig. 1).
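Once the reads are aligned, the NRSs value of each specific region is the number of reads falling into it. A minimal sketch of this counting step is given below. It assumes that the alignment has already been reduced to a list of start coordinates of exactly and uniquely matched reads on one chromosome; the function name count_reads_per_region is hypothetical.

```python
from bisect import bisect_right


def count_reads_per_region(regions, read_positions):
    """Count reads per region on one chromosome.

    regions: sorted, non-overlapping (start, end) tuples, both inclusive.
    read_positions: start coordinates of exactly matched reads.
    Returns a list with one count (NRSs) per region.
    """
    starts = [start for start, _ in regions]
    counts = [0] * len(regions)
    for pos in read_positions:
        idx = bisect_right(starts, pos) - 1  # last region starting at or before pos
        if idx >= 0 and pos <= regions[idx][1]:
            counts[idx] += 1
    return counts


if __name__ == "__main__":
    toy_regions = [(1, 5), (9, 20)]
    print(count_reads_per_region(toy_regions, [1, 3, 5, 7, 9, 20, 25, 0]))
```

Reads that fall between regions or beyond the last region are simply ignored in this sketch.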
(3) Determine the expression level of the chromosome under test
1. Filter outliers
For the test sample, the GC content of each genome-specific region and the number of non-repetitive sequences in that region (NRSs) were fitted by local polynomial regression using the loess function (linear fitting is also acceptable). Regions whose NRSs lay more than 3 standard deviations above or below the fitted value (p<0.005) were defined as outliers. The distribution after outlier filtering is shown in Fig. 2.
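The outlier filter can be sketched as follows. The sketch uses the linear-fit alternative mentioned above rather than loess, and it assumes that the standard deviation is taken over the residuals of the fit, which the text does not state explicitly. The function name filter_outliers and the toy data are hypothetical.

```python
from statistics import mean, pstdev


def filter_outliers(gc, nrs, n_sd=3.0):
    """Return indices of regions kept after outlier filtering.

    A least-squares line of NRSs on GC content is fitted, and regions
    whose residual exceeds n_sd standard deviations of the residuals
    are dropped.
    """
    gc_mean, nrs_mean = mean(gc), mean(nrs)
    sxx = sum((x - gc_mean) ** 2 for x in gc)
    sxy = sum((x - gc_mean) * (y - nrs_mean) for x, y in zip(gc, nrs))
    slope = sxy / sxx if sxx else 0.0
    intercept = nrs_mean - slope * gc_mean
    residuals = [y - (slope * x + intercept) for x, y in zip(gc, nrs)]
    sd = pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) <= n_sd * sd]


if __name__ == "__main__":
    # Made-up data: 20 regions with NRSs = 100, except region 10 with 200.
    toy_gc = [0.30 + 0.01 * i for i in range(20)]
    toy_nrs = [100] * 20
    toy_nrs[10] = 200
    print(filter_outliers(toy_gc, toy_nrs))
```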
2. Weighted correction
After all specific regions of the genome of the test sample were classified by GC content, spline fitting was performed to obtain the fitted NRSs value corresponding to each GC content, denoted NRSs". The corresponding distribution is shown in Fig. 3.
The specific procedure is as follows: taking the median of NRSs, NRSs_median, as the baseline, the correction factor α is obtained by comparing the fitted value NRSs" with the baseline value. The calculation formulas are as follows:
α = NRSs_median / NRSs" (1)
NRSs' = NRSs × α (2)
The above formulas are evaluated for each specific region on the genome of the test sample, where NRSs_median is the median of the NRS counts over all specific regions of the genome, NRSs" is the fitted value, and NRSs' is the corrected number of non-repetitive sequences.
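The weighted correction can be illustrated by the sketch below. Two points are assumptions and not taken from the text: the fitted value NRSs" is approximated by the median NRSs of all regions falling into the same GC-content class (a simple stand-in for the spline fit), and the correction factor is taken as the baseline divided by the fitted value. The function name gc_correct and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import median


def gc_correct(gc, nrs, gc_step=0.01):
    """Return GC-corrected NRSs' values, one per region.

    Regions are grouped into GC classes of width gc_step. The per-class
    median stands in for the fitted value NRSs'' (assumption). Fitted
    values are assumed to be positive.
    """
    baseline = median(nrs)  # median NRSs over all regions
    classes = defaultdict(list)
    for g, n in zip(gc, nrs):
        classes[round(g / gc_step)].append(n)
    fitted = {key: median(values) for key, values in classes.items()}
    corrected = []
    for g, n in zip(gc, nrs):
        alpha = baseline / fitted[round(g / gc_step)]  # correction factor
        corrected.append(n * alpha)                    # NRSs' = NRSs * alpha
    return corrected


if __name__ == "__main__":
    # Made-up data: regions at 50% GC collect twice the reads of regions at 40% GC.
    print(gc_correct([0.40, 0.40, 0.50, 0.50], [100, 120, 200, 240]))
```

After correction, regions that differ only in GC content end up with the same value, which is the purpose of the correction.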
Comparing the data before and after correction in Fig. 4a and Fig. 4b, Fig. 5a and Fig. 5b, Fig. 6a and Fig. 6b, and Fig. 7a and Fig. 7b below, it can be seen that the uncorrected data fluctuate more, so that directly testing for differences between chromosomes is more likely to produce false-negative or false-positive results. After correction, the distribution of the non-repetitive sequence counts over the specific regions of each chromosome becomes stable and the differences in the data are more pronounced, which makes outliers easier to identify. This shows that the method of the present application can eliminate GC structural differences and avoid the GC bias problem. The method can be used to detect chromosomal aneuploidy and reduces the occurrence of false-negative results. For example, in Fig. 7a and Fig. 7b below, the NRS count of chr21 is clearly higher than that of the other autosomes, and the corresponding detection result is that chromosome 21 of this sample has a high risk of aneuploidy.
(4) Z test to judge whether the chromosome expression levels differ significantly
Using the GC-corrected values NRSs', the mean NRSs' of all specific regions on the target chromosome (chr21, chr18, chr13, X or Y) is compared in a difference test with the mean of all NRSs' on the control chromosomes (chr1, chr2, ... chr12) to obtain the detection value Z (Z-score). Whether the target chromosome carries an aneuploid variation is judged from the Z value. When Z-score ≥ 4.5 or Z-score ≤ -4.5, the result is a high risk of trisomy or a high risk of monosomy, respectively. When -4.5 < Z-score < 4.5, the result is a low risk of aneuploidy.
Alternatively, the control chromosomes can be selected according to the distribution of housekeeping genes, and include chr1, chr2, chr3, chr6, chr7, chr11, chr12 and chr16.
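The decision step can be sketched as follows. The text does not give the formula of the Z statistic, so the two-sample form used here (difference of means divided by the combined standard error) is an assumption. The function names z_score and classify and the toy values are hypothetical. Only the ±4.5 thresholds come from the text.

```python
from math import sqrt
from statistics import mean, variance


def z_score(target, control):
    """Two-sample Z statistic for per-region NRSs' values (assumed form)."""
    se = sqrt(variance(target) / len(target) + variance(control) / len(control))
    return (mean(target) - mean(control)) / se


def classify(z, threshold=4.5):
    """Map a Z value to the risk call described above."""
    if z >= threshold:
        return "high risk of trisomy"
    if z <= -threshold:
        return "high risk of monosomy"
    return "low risk of aneuploidy"


if __name__ == "__main__":
    # Made-up per-region values for a target and a control chromosome set.
    z = z_score([12, 13, 11, 12], [10, 10.5, 9.5, 10])
    print(round(z, 2), classify(z))
```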
Embodiment 2: Validity evaluation
(1) Evaluation using online data samples
The steps of the processing method shown in Embodiment 1 can be implemented by a computing device in the form of modules or units. To evaluate the validity of the method of Embodiment 1, a processing device formed of modules or units capable of performing the above steps is tested below. The processing device includes:
A sequencing module, for obtaining the nucleotide sequence information of all chromosomes from a maternal peripheral blood sample by high-throughput sequencing;
Optionally, this module includes the module that performs the sequencing function in an Illumina cBot instrument and its companion sequencers, such as the Illumina Genome Analyzer, HiSeq2000/2500, Hiseq3000/4000 or NextseqCN500, or in a companion sequencer such as the SOLiD from Life Technologies.
A specific region division module, which calls the specific region division program to divide the reference genome into multiple specific regions on the principle of equal NRSc values. The division can use any integer number of non-repetitive sequences from 10000 to 50000 (preferably 20000) as the unit. This overcomes the defects of dividing regions by length, for example into 20 Kb or 50 Kb regions, where the number of non-repetitive sequences differs greatly between regions and data uniformity is poor.
A distribution module, which is run to align the output of the sequencing module against the reference genome and to assign the nucleotide sequence information of all chromosomes from the maternal peripheral blood sample to the specific regions produced by the specific region division module;
Optionally, a module capable of performing sequence alignment, such as a BWA module, a BOWTIE module or a NOVOALIGN module, is used to assign the sequencing data of the test sample;
A first statistics module, for counting the NRSs value of the sample in each specific region; an optional statistics module is the SAMTOOLS module;
A correction module, for correcting the NRSs value of the sample in each specific region using the GC content, the result being denoted the NRSs' value;
Preferably, the correction module includes: a first computing unit, for calculating the median of the NRSs values of all specific regions, NRSs_median; a fitting unit, for performing polynomial spline fitting on the GC content and NRSs values of the specific regions of the sample to obtain a fitted curve; an acquiring unit, for obtaining the fitted value NRSs" of each specific region from the fitted curve; a second computing unit, for calculating the correction factor α according to the formula α = NRSs_median / NRSs"; and a correcting unit, for correcting the NRSs value of the sample in each specific region according to the correction formula NRSs' = NRSs × α.
A second statistics module, for calculating, based on the NRSs' values, the mean of the NRSs' values of all specific regions on the target chromosome and on the control chromosomes, denoted the first mean and the second mean respectively;
A test module, for performing a difference test on the first mean and the second mean; optionally, a Z test module is used for the difference analysis;
A determining module, for determining from the result of the difference test whether the chromosome is aneuploid;
Preferably, when the target chromosome is an autosome and -4.5 < Z value < 4.5, the determining module determines that the target chromosome is not aneuploid; otherwise, it determines that aneuploidy is present.
To further demonstrate the validity and generality of the processing device of the present application, data from different laboratories and different NGS platforms were used as samples (high-throughput sequencing data of maternal peripheral blood from clinical studies of non-invasive prenatal genetic testing, uploaded by other institutions and downloaded from the NCBI SRA database, http://www.ncbi.nlm.nih.gov/sra/, comprising 384 samples).
The detection results for chromosomes 21, 18 and 13 in the 384 samples are shown in Table 1 below:
Table 1. Positive sample detection results for the 384 NCBI online data samples.
Note: In Table 1, "chr" denotes chromosome; "gc" denotes GC content; "ZV" denotes the Z value; "TEST" denotes the chromosomal aneuploidy detection result obtained by this method.
As can be seen from Table 1 and from Fig. 8a, Fig. 8b and Fig. 8c, 1 T13-positive sample was detected (SRR358477), and the Z values of chromosome 13 of the remaining samples were all stably distributed within the interval (-4.5, 4.5). 5 T18-positive samples were detected (SRR357943, SRR357972, SRR358089, SRR358257, SRR358325), and the Z values of chromosome 18 of the remaining samples were all stably distributed within the interval (-4.5, 4.5). 7 T21-positive samples were detected (SRR357843, SRR358020, SRR358126, SRR358144, SRR358322, SRR358352, SRR358353), and the Z values of chromosome 21 of the remaining samples were all stably distributed within the interval (-4.5, 4.5).
(2) Blood sample evaluation
The above processing method was evaluated using the detection results of 68 samples (provided by the clinical laboratory center of the Ministry of Health and Beijing People's Hospital) as an example. The evaluation results are shown in Table 2, which lists only the first 30 samples, and were verified against the karyotype results.
Table 2. Blood sample detection results.
Note: The uncorrected NRS value of a target chromosome is the mean NRS over all specific regions of that chromosome before correction. The corrected NRS value is the mean NRS over all specific regions of the chromosome after weighted correction for the GC content of the specific regions. ZVchri (i = 13, 18, 21) is the Z value obtained from the significance test between that chromosome and the control chromosomes. TEST is the chromosomal aneuploidy detection result obtained by this method: N (Negative) indicates a negative result with no obvious abnormality detected, and T13/T18/T21 indicates that the target chromosome shows an aneuploidy abnormality. Karyotype is the clinical karyotype analysis result, i.e. the gold-standard result (46,XN denotes the chromosome number and sex chromosomes of a sample with a normal karyotype; 47,XN,+21 denotes a sample whose karyotype analysis shows 47 chromosomes, with one more chromosome 21 than the normal karyotype, i.e. Down syndrome).
The data in Table 2 show, according to the significance test results, that samples S0002 and S0013 both have ZVchr13 greater than or equal to 4.5, so chromosome 13 is judged to be at high risk of aneuploidy; samples S0007 and S0012 both have ZVchr18 greater than or equal to 4.5, so chromosome 18 is judged to be at high risk of aneuploidy; and samples S0003, S0006 and S0011 all have ZVchr21 greater than or equal to 4.5, so chromosome 21 is judged to be at high risk of aneuploidy. For chromosomes 21, 18 and 13, the detection results of the present application are consistent with the karyotype analysis results, and the samples judged by this method to be low risk, i.e. the samples with ZV values between -4.5 and 4.5, also have normal karyotypes. This shows that the method detects chromosomal aneuploidy with high accuracy.
Embodiment 3: Stability and data volume study
(1) Sample stability
Using the above method, the four samples s002, s006, s007 and s008 (whose karyotype results are T13 positive, T21 positive, T18 positive and normal, respectively) were each measured 8 times. The fluctuation of the chromosome relative expression level (denoted CR) and of the Z test value (denoted ZV) was recorded to evaluate the stability of the detection method. The evaluation results are shown in Table 3.
Table 3. Summary of repeatability detection data for s002, s006, s007 and s008
In Table 3, Mean denotes the average, SD the standard deviation and CV the coefficient of variation. As Table 3 shows, the CV of the CR values over the 8 repeated measurements of each of the 4 samples is less than 0.01, and the fluctuation of ZV (SD value) is within ±1.1. The data fluctuate little, which indicates that the method is stable.
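For reference, the coefficient of variation used in this evaluation is the standard deviation divided by the mean. The sketch below assumes the sample standard deviation, which the text does not specify. The function name coefficient_of_variation is hypothetical and the input values are made up for illustration, not taken from Table 3.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return SD / Mean for a series of repeated measurements."""
    return stdev(values) / mean(values)


if __name__ == "__main__":
    # Made-up CR values for 8 repeated measurements of one sample.
    cr_values = [1.00, 1.01, 0.99, 1.00, 1.00, 1.01, 0.99, 1.00]
    print(coefficient_of_variation(cr_values))
```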
(2) Data volume study
The fluctuation of the NRS counts over the genome-specific regions was studied for sequencing data volumes ranging from 0.25M (raw reads) to 15M. From the sequencing data of the four samples s002, s006, s007 and s008 (whose karyotype results are T13 positive, T21 positive, T18 positive and normal, respectively), data volumes of 2M to 15M were randomly sampled. Genome alignment was performed on each subset, and the ZV and the CV (coefficient of variation) of the non-repetitive sequence counts of the sample over all specific regions of the genome were calculated. The results are shown in Table 4.
Table 4. CV values (coefficient of variation) and ZV values (Z values) for different sequencing data volumes
As Table 4 shows, the method is suitable for chromosome detection over a wide range of data volumes. In particular, at data volumes of 1M and above, both the stability of the data and the results of the Z test are good.
As can be seen from the above description, the above embodiments of the present invention achieve the following technical effects. Specific regions are divided on the basis of sequencing data and on the principle of an equal number of non-repetitive sequences, which avoids the data fluctuations caused by uneven numbers of non-repetitive sequences in the specific regions and in turn optimizes the correlation of the nucleic acid data parameters between chromosomes. The parameter of the clinically relevant chromosome in the biological sample is compared with the parameters of the other, non-clinically relevant chromosomal regions to determine whether chromosomal aneuploidy is present in the test sample. The method achieves single-sample detection without standard normal samples, eliminates dependence on experimental conditions, speeds up analysis and generally improves the assessment of chromosome abnormalities. It provides a simple, fast and accurate means of detecting chromosomal aneuploidy, with an autosome detection accuracy above 99% and a false positive rate below 1%. Compared with multi-sample methods, the method reduces the false negative rate; compared with existing single-sample methods, it requires a smaller sequencing data volume.
Obviously, those skilled in the art should understand that some of the modules, elements or steps of the present application described above can be implemented by a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented in program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can each be made into an individual integrated circuit module; or several of the modules or steps can be made into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the scope of protection of the present invention.