CN105844116B

CN105844116B - Method and device for processing sequencing data

Info

Publication number: CN105844116B
Application number: CN201610161767.1A
Authority: CN
Inventors: 张必良; 曹亮; 叶奕栋
Original assignee: Guangzhou Rui Kang Medical Laboratory Co Ltd; GUANGZHOU RIBOBIO CO Ltd
Current assignee: Guangzhou Ribobio Co ltd
Priority date: 2016-03-18
Filing date: 2016-03-18
Publication date: 2018-02-27
Anticipated expiration: 2036-03-18
Also published as: CN105844116A

Abstract

The invention provides a method and a device for processing sequencing data. The processing method comprises: obtaining nucleotide sequence information from a maternal peripheral blood sample by high-throughput sequencing; dividing a reference genome into multiple specific regions such that the NRSc value is equal in every specific region; assigning the nucleotide sequence information of all chromosomes from the maternal peripheral blood sample to the specific regions of the reference genome, and counting the NRSs value of the sample in each specific region; correcting the NRSs value of the sample in each specific region using GC content, the corrected value being denoted NRSs'; based on the NRSs' values, calculating the mean NRSs' value over all specific regions on a target chromosome and on a control chromosome, denoted the first mean and the second mean respectively; and performing a difference test on the first mean and the second mean, and determining from the test result whether the chromosome is aneuploid. The processing method improves the accuracy of sequencing data processing.

Description

Method and device for processing sequencing data

Technical field

The present invention relates to the field of sequencing data processing, and in particular to a method and a device for processing sequencing data.

Background art

Chromosomal abnormalities may be numerical or structural. Numerical abnormalities include trisomy (one extra chromosome), monosomy (one missing chromosome) and polyploidy (an extra complete set of chromosomes). Structural abnormalities include structural rearrangements caused by chromosome breakage and the like, such as translocations, inversions, deletions and insertions.

Numerical chromosomal abnormalities, such as aneuploidy and polyploidy, are associated with a variety of diseases, including birth defects and cancer. China has nearly 20 million newborns each year, of whom about 4% to 6% have birth defects. Fetal chromosomal abnormality is one of the most common types of birth defect seen clinically; according to statistics, about 1 in 160 newborns has a chromosomal abnormality. Trisomy syndromes have the highest incidence among chromosomal diseases: when a cell carries three copies of a chromosome rather than the normal two, so that the total chromosome number is 47, a trisomy syndrome results. The most common trisomy syndromes are trisomy 21 syndrome (T21), Edwards syndrome (T18) and Patau syndrome (T13). To reduce the proportion of infants born with birth defects, fast and accurate detection of chromosomal aneuploidy is necessary.

Non-invasive methods such as ultrasound scanning or biochemical marker screening have been used to assess the risk of chromosomal abnormality, but their accuracy is relatively low, only 60-80%, and they are affected by physiological factors such as maternal age at conception. Conventional prenatal diagnosis relies on invasive procedures such as amniocentesis or chorionic villus sampling, which carry a risk of miscarriage and have a long turnaround time. In 1997, circulating cell-free fetal DNA was discovered in maternal plasma (Lo YM, Corbetta N, Chamberlain PF, Rai V, Sargent IL, Redman CW, Wainscoat JS. Presence of fetal DNA in maternal plasma and serum. Lancet. 1997 Aug 16; 350(9076): 485-7). In 1999, the concentration of circulating fetal DNA in the plasma of women carrying trisomy 21 fetuses was found to be significantly higher than in the plasma of women carrying euploid fetuses (Lo, Y.M.D. et al., Clin Chem 45: 1747-1751 (1999); Zhong, X.Y. et al., Prenat Diagn 20: 795-798 (2000)). These findings opened new possibilities for non-invasive prenatal diagnosis. On this basis, the field has made considerable progress, for example: enriching fetal DNA with methylation-sensitive enzymes to reduce maternal background interference (PCT/US2004/033175, 2004.10.08); screening for trisomy 21 by using PCR to compare the Ct values of gene-specific fragments (CN200610003103.9, 2006.02.10); and inferring fetal chromosomal aneuploidy by RNA-SNP-based allelic amplification detection (CN200680007354.2, 2006.03.17). However, enriching fetal DNA is time-consuming and laborious, and amplification techniques depend on sequence specificity or gene heterozygosity, which makes it difficult for them to become general-purpose techniques.

In 2008, Rossa W.K. Chiu et al. proposed that sequencing can provide a large amount of information about the nucleic acid molecules in peripheral blood (Rossa W.K. Chiu, et al. Noninvasive prenatal diagnosis of fetal chromosomal aneuploidy by massively parallel genomic sequencing of DNA in maternal plasma. PNAS, 2008, 105: 20458-20463). They found that, in samples with an abnormality in a clinically relevant chromosome, a parameter given by the ratio of the amount of nucleic acid molecules from the clinically relevant chromosome to the amount from background chromosomes differs from one or more normal control values established from normal samples. High-throughput sequencing can therefore be used to detect chromosomal abnormalities without depending on the amplification of specific sequences. However, existing genome-sequencing detection methods require the sample under test to be compared against multiple samples or standard normal samples, which is time-consuming and demands a large number of samples (e.g., Chinese patent application No. CN200880108377.1). They also place strict requirements on the consistency of experimental conditions across sample batches, which limits their convenient, high-throughput application.

Therefore, existing methods for processing sequencing data still need to be improved in order to increase the accuracy of data processing.

Summary of the invention

The main object of the present invention is to provide a method and a device for processing sequencing data, so as to improve the accuracy of sequencing data processing.

To achieve this object, according to one aspect of the invention, a method for processing sequencing data is provided. The processing method comprises: obtaining, by high-throughput sequencing, the nucleotide sequence information of all chromosomes from a maternal peripheral blood sample; dividing a reference genome into multiple specific regions such that the number of non-repetitive sequences, NRSc, is equal in every specific region; assigning the nucleotide sequence information of all chromosomes from the maternal peripheral blood sample to the specific regions of the reference genome, and counting the NRSs value of the sample in each specific region; correcting the NRSs value of the sample in each specific region using GC content, the corrected value being denoted NRSs'; based on the NRSs' values, calculating the mean NRSs' value over all specific regions on the target chromosome and on the control chromosome, denoted the first mean and the second mean respectively; and performing a difference test on the first mean and the second mean, and determining from the test result whether the chromosome is aneuploid.

Further, the step of correcting the NRSs value of the sample in each specific region using GC content comprises: correcting the NRSs value of the sample in each specific region with the correction formula NRSs' = NRSs × α, where α = Md / NRSs", Md is the median of the NRSs values of all specific regions, and NRSs" is the fitted value obtained by polynomial spline fitting of the GC content against the NRSs value of each specific region of the sample.

Further, before the polynomial spline fitting of the GC content against the NRSs value of each specific region of the sample, the processing method also includes a step of removing specific regions with abnormal NRSs values from all specific regions of the sample; preferably, the specific regions with abnormal NRSs values are removed by linear fitting or by local polynomial regression fitting.

Further, the NRSc value is any integer from 10000 to 50000.

Further, the target chromosome is any one of, or any combination of several of, the following: chromosome 13, chromosome 18, chromosome 21, the X chromosome and the Y chromosome. The control chromosome is any one of, or any combination of several of, the following: chromosome 1, chromosome 2, chromosome 3, chromosome 4, chromosome 5, chromosome 6, chromosome 7, chromosome 8, chromosome 9, chromosome 10, chromosome 11 and chromosome 12. Preferably, the control chromosome is any one of, or a combination of several of, the following: chromosome 1, chromosome 2, chromosome 3, chromosome 6, chromosome 7, chromosome 11, chromosome 12 and chromosome 16.

To achieve this object, according to another aspect of the invention, a device for processing sequencing data is provided. The processing device comprises: a sequencing module, for obtaining by high-throughput sequencing the nucleotide sequence information of all chromosomes from a maternal peripheral blood sample; a specific region division module, for dividing a reference genome into multiple specific regions on the principle that the NRSc values are equal; an allocation module, for assigning the nucleotide sequence information of all chromosomes from the maternal peripheral blood sample to the specific regions of the reference genome by sequence alignment against the reference genome; a first statistics module, for counting the NRSs value of the sample in each specific region; a correction module, for correcting the NRSs value of the sample in each specific region using GC content, the corrected value being denoted NRSs'; a second statistics module, for calculating, based on the NRSs' values, the mean NRSs' value over all specific regions on the target chromosome and on the control chromosome, denoted the first mean and the second mean; a test module, for performing a difference test on the first mean and the second mean; and a determination module, for determining from the difference test result whether the chromosome is aneuploid.

Further, the correction module comprises: a first computing unit, for calculating the median Md of the NRSs values of all specific regions; a fitting unit, for performing polynomial spline fitting of the GC content against the NRSs value of each specific region of the sample to obtain a fitted curve; an acquiring unit, for obtaining the fitted value NRSs" of each specific region from the fitted curve; a second computing unit, for calculating the correction factor α according to the formula α = Md / NRSs"; and a correction unit, for correcting the NRSs value of the sample in each specific region according to the correction formula NRSs' = NRSs × α.

Further, the fitting unit also includes a filtering subunit which, before the fitting unit performs the polynomial spline fitting of the GC content against the NRSs value of each specific region of the sample to obtain the fitted curve, removes specific regions with abnormal NRSs values from all specific regions of the sample. Preferably, the filtering subunit is a linear fitting subunit or a local polynomial regression fitting subunit.

Further, the NRSc value is any integer from 10000 to 50000.

With the technical scheme of the invention, the sequencing data are taken as the basis and the specific regions are divided on the principle of equal numbers of non-repetitive sequences. This avoids the data fluctuation caused by uneven numbers of non-repetitive sequences across specific regions and so optimizes the correlation of nucleic acid data parameters between chromosomes. The parameter of the clinically relevant chromosome in the biological sample is compared with the parameter of other, non-clinically-relevant chromosomal regions to determine whether chromosomal aneuploidy is present in the sample under test. The method achieves single-sample detection, does not need standard normal samples, removes the dependence on experimental conditions and speeds up analysis. It is a simple, fast and accurate means of detection: the accuracy of autosome detection is above 99% and the false positive rate is below 1%.

Brief description of the drawings

The accompanying drawings, which form a part of this application, are provided for further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 shows the distribution of the non-repetitive sequences in the sequencing reads of sample S001 (a negative sample) across the specific regions of the genome, in preferred embodiment 1 of the invention;

Fig. 2 shows the distribution of the non-repetitive sequences in the sequencing reads of sample S001 from Fig. 1 across the specific regions of the genome after outlier filtering;

Fig. 3 shows the spline curve fitted to the non-repetitive sequences in the sequencing reads of sample S001 from Fig. 2 across the specific regions of the genome after outlier filtering;

Fig. 4a and Fig. 4b show the number of non-repetitive sequences in the specific regions of each autosome of sample S001 in embodiment 1 before and after correction, respectively; Fig. 4a shows the values before correction and Fig. 4b the values after correction;

Fig. 5a and Fig. 5b show the number of non-repetitive sequences in the specific regions of each autosome of sample S002 in another preferred embodiment before and after correction, respectively; Fig. 5a shows the values before correction and Fig. 5b the values after correction;

Fig. 6a and Fig. 6b show the number of non-repetitive sequences in the specific regions of each autosome of sample S007 in another preferred embodiment before and after correction, respectively; Fig. 6a shows the values before correction and Fig. 6b the values after correction;

Fig. 7a and Fig. 7b show the number of non-repetitive sequences in the specific regions of each autosome of sample S006 in another preferred embodiment before and after correction, respectively; Fig. 7a shows the values before correction and Fig. 7b the values after correction;

Fig. 8a, Fig. 8b and Fig. 8c show the Z-value distributions of chromosome 13, chromosome 18 and chromosome 21, respectively, in the 384 sequenced data samples of embodiment 2 of this application; Fig. 8a shows chromosome 13, Fig. 8b shows chromosome 18, and Fig. 8c shows chromosome 21.

Detailed description of the embodiments

It should be noted that, provided there is no conflict, the embodiments of this application and the features in the embodiments may be combined with one another. The invention is described in detail below with reference to the drawings and in conjunction with the embodiments.

Explanation of terms:

Sequencing data: the nucleotide sequence information obtained from a sample under test by high-throughput sequencing.

k-mer: a nucleotide sequence of length k obtained by cutting a sequence consecutively with a window that advances one base at a time. For example, for the sequence ATCGTTGCTTAATGACGTCAGTCGAAT, a 13-mer analysis yields the k-mers ATCGTTGCTTAAT, TCGTTGCTTAATG, CGTTGCTTAATGA, GTTGCTTAATGAC, and so on.
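The sliding-window definition above can be illustrated with a minimal Python sketch. The function name `kmers` is a hypothetical helper chosen for illustration; only the example sequence and k = 13 come from the text.

```python
def kmers(sequence: str, k: int) -> list[str]:
    # Slide a window of length k along the sequence, one base at a time.
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


seq = "ATCGTTGCTTAATGACGTCAGTCGAAT"
first_four = kmers(seq, 13)[:4]  # the four 13-mers listed in the definition
```

A sequence of length L yields L - k + 1 k-mers, so the 27-base example produces 15 13-mers.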

Non-repetitive sequence (non-repeated sequence, abbreviated NRS): the sequences obtained by sequencing the sample under test are compared with the normal human genome, and the k-mers that are unique at the whole-genome level are the non-repetitive sequences. In this application, when the specific regions are divided so that each contains an equal number of non-repetitive sequences, the division is based on the reference genome sequence. The number of non-repetitive sequences in each specific region obtained by the division is therefore denoted NRSc, while the number of non-repetitive sequences that the sequencing reads of the sample under test actually contribute to each specific region is denoted NRSs.
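As a toy illustration of "unique at the whole-genome level", the sketch below keeps only the k-mers that occur exactly once in a reference string. It is a simplification under stated assumptions: the function `unique_kmers` and the short reference string are hypothetical, a real genome would need an indexed aligner or k-mer counter, and reverse-complement matches are ignored here.

```python
from collections import Counter


def unique_kmers(reference: str, k: int) -> dict[str, int]:
    # Count every k-mer, then keep those seen exactly once,
    # mapped to their (0-based) start position in the reference.
    n = len(reference) - k + 1
    counts = Counter(reference[i:i + k] for i in range(n))
    return {
        reference[i:i + k]: i
        for i in range(n)
        if counts[reference[i:i + k]] == 1
    }


toy_reference = "ACGTACGTTT"          # hypothetical toy sequence
nrs = unique_kmers(toy_reference, 4)  # "ACGT" occurs twice and is excluded
```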

Specific region (specified region, abbreviated SR): a specific region on each chromosome of the genome, obtained by the region division method described in the present invention.

Chromosome: may refer either to a whole chromosome or to a part of a chromosome. The mathematical derivation for processing a single chromosome fragment is the same as that for processing all chromosome fragments, and those skilled in the art know how to adapt it accordingly. A control chromosome is a chromosome from a healthy individual or a chromosome presumed to be normal, including one presumed normal on statistical grounds. Here the control may be a single chromosome or a group of chromosomes (two or more chromosomes; in other words, chromosomes other than 13, 18, 21, X and Y, or any combination thereof).

"Aneuploidy" and "polyploidy" refer to situations in which the number of chromosomes in a cell differs from the usual haploid number n or diploid number 2n. An aneuploid cell may be trisomic, i.e., carry three copies of a chromosome, or monosomic, i.e., carry a single copy of a chromosome. Chromosomal aneuploidy changes the amount of the affected chromosome present in the sample. Using a next-generation sequencing (NGS) platform combined with bioinformatic analysis, the amount of each chromosome can be counted from the sequence alignment results to determine whether the sample under test is aneuploid for that chromosome.

The sample is a cell, tissue or body fluid and may be selected from: maternal whole blood (peripheral blood), plasma, serum, urine, saliva and reproductive tract lavage fluid; fetal cells or fetal cell remnants, and pre-implantation embryo biopsy material; amniotic fluid, chorionic villus samples, and the like. The sample may come from any animal, preferably a mammal, more preferably a human.

Sequencing of the DNA sequencing library may be paired-end short-read sequencing, single-end long-read sequencing or single-end short-read sequencing. A paired-end short read refers to the sequence of fewer than 50 bp immediately following the 5'-end linker primer together with the sequence of fewer than 50 bp immediately following the 3'-end linker primer. Preferably, a paired-end short read refers to the sequence of no more than 36 bp immediately following the 5'-end linker primer together with the sequence of no more than 36 bp immediately following the 3'-end linker primer.

A single-end short read refers to the sequence of fewer than 50 bp immediately following the 5'-end linker primer, or the sequence of fewer than 50 bp immediately following the 3'-end linker primer. Preferably, a single-end short read refers to the sequence of no more than 36 bp immediately following the 5'-end linker primer, or the sequence of no more than 36 bp immediately following the 3'-end linker primer. A single-end long read refers to the sequence of more than 99 bp immediately following the 5'-end linker primer, or the sequence of more than 99 bp immediately following the 3'-end linker primer. Paired-end sequencing means sequencing both ends of a fragment separately. Single-end sequencing means sequencing only one end of a fragment.

Existing methods for detecting chromosomal aneuploidy still have shortcomings in accuracy and convenience. To improve this situation, a typical embodiment of this application provides a method for processing sequencing data. The processing method comprises: obtaining, by high-throughput sequencing, the nucleotide sequence information of all chromosomes from a maternal peripheral blood sample; dividing a reference genome into multiple specific regions such that the number of non-repetitive sequences (denoted NRSc) is equal in every specific region; assigning the nucleotide sequence information of all chromosomes from the maternal peripheral blood sample to the specific regions of the reference genome, and counting the NRSs value of the sample in each specific region; correcting the NRSs value of the sample in each specific region using GC content, the corrected value being denoted NRSs'; based on the NRSs' values, calculating the mean NRSs' value over all specific regions on the target chromosome and on the control chromosome, denoted the first mean and the second mean respectively; and performing a difference test on the first mean and the second mean, and determining from the test result whether the chromosome is aneuploid.
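The assignment-and-counting step can be sketched as follows. This is a minimal illustration under assumptions not stated in the text: the regions of one chromosome are represented only by their sorted start coordinates (`region_starts`), the last region is treated as open-ended, and `hit_positions` stands for the aligned start positions of the sample's non-repetitive sequences. All names are hypothetical.

```python
from bisect import bisect_right


def count_per_region(region_starts: list[int], hit_positions: list[int]) -> list[int]:
    # Region i covers [region_starts[i], region_starts[i + 1]);
    # the last region extends to the end of the chromosome.
    counts = [0] * len(region_starts)
    for pos in hit_positions:
        idx = bisect_right(region_starts, pos) - 1
        if idx >= 0:  # positions before the first region are ignored
            counts[idx] += 1
    return counts


# NRSs per region for three hypothetical regions on one chromosome
nrss = count_per_region([0, 100, 250], [5, 99, 100, 260, 300, 10])
```

Binary search keeps the lookup at O(log n) per read, which matters when the genome is split into many regions.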

The above processing method of this application takes the sequencing data as its basis and divides the specific regions on the principle of equal numbers of non-repetitive sequences. This avoids the data fluctuation caused by uneven numbers of non-repetitive sequences across specific regions and so optimizes the correlation of nucleic acid data parameters between chromosomes. The parameter of the clinically relevant chromosome in the biological sample is compared with the parameter of other, non-clinically-relevant chromosomal regions to determine whether chromosomal aneuploidy is present in the sample under test. The method achieves single-sample detection, does not need standard normal samples, removes the dependence on experimental conditions and speeds up analysis. It is a simple, fast and accurate means of detection: the accuracy of autosome detection is above 99% and the false positive rate is below 1%.

Specifically, the method for above-mentioned test of difference can be existing various test of difference, such as, Z test (Z- Test), u-test or t inspections etc..The preferred Z test of the application.

In the above processing method, the step of correcting the NRSs value of the sample in each specific region using GC content may use an existing GC correction method, which also improves detection accuracy. To achieve higher detection accuracy, in a preferred embodiment of this application the correction comprises: correcting the NRSs value of the sample in each specific region with the correction formula NRSs' = NRSs × α, where α = Md / NRSs", Md is the median of the NRSs values of all specific regions, and NRSs" is the fitted value obtained by polynomial spline fitting of the GC content against the NRSs value of each specific region of the sample. The corrected NRSs' values follow a normal distribution more closely, which makes the subsequent difference test more accurate.
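A minimal sketch of the correction step follows, assuming the correction factor α is the median Md of all NRSs values divided by the fitted value NRSs" of the region. The fitted values are taken as an input here (for instance, produced by the R function smooth.spline mentioned in the text); the function name `gc_correct` is hypothetical.

```python
from statistics import median


def gc_correct(nrss: list[float], fitted: list[float]) -> list[float]:
    # NRSs' = NRSs * alpha, with alpha = Md / NRSs'' (assumed form),
    # where Md is the median of all NRSs values and NRSs'' is the
    # spline-fitted value for the region's GC content.
    if len(nrss) != len(fitted):
        raise ValueError("nrss and fitted must have the same length")
    md = median(nrss)
    corrected = []
    for raw, fit in zip(nrss, fitted):
        if fit <= 0:
            raise ValueError("fitted values must be positive")
        corrected.append(raw * (md / fit))
    return corrected


# If the fit explains the counts exactly, every region is pulled to the median.
flat = gc_correct([90.0, 100.0, 110.0], [90.0, 100.0, 110.0])
```

Regions whose count sits above the GC trend stay above the median after correction, and regions below it stay below, so only the GC-driven component is removed.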

Fitting: given a set of known discrete points {f1, f2, ..., fn} (with GC content on the X axis and NRSs value on the Y axis), the undetermined coefficients of a fitting function $f(\lambda_1, \lambda_2, \dots, \lambda_n)$ are adjusted so that the difference between the function and the known point set is minimized in the least-squares sense. The known points $(x_i, Y_i)$, with $x_1 < x_2 < \dots < x_n$, $i \in \mathbb{Z}$, are a series of observations satisfying a relation $Y_i = \mu(x_i)$. A fitting function $\hat{\mu}$ is constructed such that $\sum_{i=1}^{n} \left(Y_i - \hat{\mu}(x_i)\right)^2$ is minimized. If the fitting function is a nonlinear function, the fit is called a nonlinear fit, also called a spline fit. Accordingly, if the fitting function is a polynomial, the fit can be called a polynomial spline fit. The present invention preferably uses polynomial spline fitting, in which the spline is a smooth cubic curve.

A cubic spline through n given data points has n-1 intervals, and the equation on each interval is $f_i(x) = a_i + b_i(x - x_i) + c_i(x - x_i)^2 + d_i(x - x_i)^3$, so 4(n-1) unknown coefficients must be determined. Continuity at the nodes, together with equal first derivatives and equal second derivatives at the nodes, gives 4n-6 equations; 2 boundary conditions are then added by hand. The spline fit is carried out with the function smooth.spline of the R software system (http://www.stat.wisc.edu~xie/smooth_ splinetutorial.html).
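The equation count can be checked explicitly. The block below writes out the constraints for the pieces $f_1, \dots, f_{n-1}$; the two "natural" boundary conditions in the last line are only one common choice, assumed here for illustration, since the text does not say which boundary conditions are used.

```latex
\begin{align*}
f_i(x_i) = y_i,\quad f_i(x_{i+1}) &= y_{i+1}, & i &= 1,\dots,n-1 && \text{($2(n-1)$ equations)}\\
f_i'(x_{i+1}) &= f_{i+1}'(x_{i+1}), & i &= 1,\dots,n-2 && \text{($n-2$ equations)}\\
f_i''(x_{i+1}) &= f_{i+1}''(x_{i+1}), & i &= 1,\dots,n-2 && \text{($n-2$ equations)}\\
2(n-1) + 2(n-2) &= 4n-6 \\
f_1''(x_1) = 0,\quad f_{n-1}''(x_n) &= 0 &&&& \text{(2 boundary conditions, assumed natural)}
\end{align*}
```

Adding the 2 boundary conditions to the 4n-6 interior equations gives 4n-4 = 4(n-1) equations, which matches the number of unknown coefficients.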

Before the polynomial spline fitting of the GC content against the NRSs value of each specific region of the sample, the processing method also includes a step of removing specific regions with abnormal NRSs values from all specific regions of the sample. Outliers may be removed by a GC linear fit or by manual screening, for example by deleting windows whose GC value is 0, whose number of non-repetitive sequences is 0, or whose number of non-repetitive sequences is clearly excessive. In this application, local polynomial regression fitting is preferred for removing the specific regions with abnormal NRSs values; this helps exclude those specific regions in which, owing to particular features of chromosome structure, the number of non-repetitive sequences is too high or too low. A linear fit may also be used. Fitting is a routine way of removing outliers in statistics and bioinformatics, and the specific methods are not repeated here.
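The simple screening rules named above (GC value of 0, count of 0, clearly excessive count) can be sketched as below. This is not the preferred local polynomial regression; it is a plain threshold filter, and the cut-off `max_ratio` (a multiple of the median count) is a hypothetical parameter chosen for illustration, since the text does not define "clearly excessive".

```python
from statistics import median


def filter_regions(gc: list[float], nrss: list[float], max_ratio: float = 3.0) -> list[int]:
    # Step 1: drop windows whose GC value or NRS count is 0.
    nonzero = [i for i, (g, c) in enumerate(zip(gc, nrss)) if g > 0 and c > 0]
    if not nonzero:
        return []
    # Step 2: drop windows whose count exceeds max_ratio times the median
    # (assumed stand-in for "clearly excessive").
    md = median(nrss[i] for i in nonzero)
    return [i for i in nonzero if nrss[i] <= max_ratio * md]


# Indices of the regions kept for the subsequent spline fit
kept = filter_regions([0.40, 0.0, 0.45, 0.50, 0.42], [100, 80, 0, 1000, 110])
```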

In the above processing method, the specific regions are divided on the principle that the NRSc values are equal. The specific NRSc value can be determined according to factors such as the genome size and sequence complexity of the sample under test. Preferably, the NRSc value is any integer from 10000 to 50000.
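One way to picture the equal-NRSc division is to walk along the sorted positions of the unique k-mers on a chromosome and close a region every NRSc positions. The sketch below does this under assumptions of its own: the function name `partition_regions` is hypothetical, region ends are inclusive, and a trailing group with fewer than NRSc k-mers is discarded (the text does not say how a remainder is handled).

```python
def partition_regions(unique_positions: list[int], nrsc: int) -> list[tuple[int, int]]:
    # Group sorted unique-k-mer start positions into consecutive regions,
    # each containing exactly nrsc non-repetitive sequences.
    if nrsc <= 0:
        raise ValueError("nrsc must be positive")
    positions = sorted(unique_positions)
    regions = []
    for i in range(0, len(positions) - nrsc + 1, nrsc):
        chunk = positions[i:i + nrsc]
        regions.append((chunk[0], chunk[-1]))  # (start, inclusive end)
    return regions


# Toy example with NRSc = 3; the text suggests 10000 to 50000 in practice.
regions = partition_regions([1, 2, 5, 9, 10, 20, 21], 3)
```

Because regions are defined by k-mer count rather than by length, regions in repeat-rich parts of the genome come out physically longer than those in unique sequence.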

In the above processing method, the target chromosome and the control chromosome can be chosen sensibly according to the tissue or cell source of the sample under test, the species, or the actual detection requirement. When the sample under test is human, the target chromosome is preferably any one of, or any combination of several of, the following: chromosome 13, chromosome 18, chromosome 21, the X chromosome and the Y chromosome. The control chromosome is any one of, or any combination of several of, the following: chromosome 1, chromosome 2, chromosome 3, chromosome 4, chromosome 5, chromosome 6, chromosome 7, chromosome 8, chromosome 9, chromosome 10, chromosome 11 and chromosome 12. More preferably, the control chromosome is any one of, or any combination of several of, the following: chromosome 1, chromosome 2, chromosome 3, chromosome 6, chromosome 7, chromosome 11, chromosome 12 and chromosome 16.

Another typical embodiment of this application provides a device for processing sequencing data. The processing device comprises: a sequencing module, for obtaining by high-throughput sequencing the nucleotide sequence information of all chromosomes from a maternal peripheral blood sample; a specific region division module, for dividing a reference genome into multiple specific regions on the principle that the NRSc values are equal; an allocation module, for assigning the nucleotide sequence information of all chromosomes from the maternal peripheral blood sample to the specific regions of the reference genome by sequence alignment against the reference genome; a first statistics module, for counting the NRSs value of the sample in each specific region; a correction module, for correcting the NRSs value of the sample in each specific region using GC content, the corrected value being denoted NRSs'; a second statistics module, for calculating, based on the NRSs' values, the mean NRSs' value over all specific regions on the target chromosome and on the control chromosome, denoted the first mean and the second mean; a test module, for performing a difference test on the first mean and the second mean; and a determination module, for determining from the difference test result whether the chromosome is aneuploid.

The above processing device takes the sequencing data obtained by the sequencing module as its basis and uses an improved specific region division module that divides the specific regions on the principle of equal numbers of non-repetitive sequences, which optimizes the correlation of nucleic acid data parameters between chromosomes. The allocation module, the first statistics module, the correction module, the second statistics module and the test module are then executed in turn, so that the parameter of the clinically relevant chromosome in the biological sample is compared with the parameter of other, non-clinically-relevant chromosomal regions. The determination module finally determines, from the difference test result of the test module, whether chromosomal aneuploidy is present in the sample under test. The device performs detection on the sequencing data of a single sample, does not need standard normal samples, removes the dependence on experimental conditions, speeds up analysis and generally improves the assessment of chromosomal abnormalities. It is a simple, fast and accurate means of detecting chromosomal aneuploidy: the accuracy of autosome detection is above 99% and the false positive rate is below 1%.

Specifically, the test module may be any existing difference test module, such as a Z test (Z-test) module, a u test module or a t test module. The Z test module is preferred in this application.

The correction module may use an existing GC correction module, which also improves detection accuracy. In a preferred embodiment of this application, the correction module comprises: a first computing unit, for calculating the median Md of the NRSs values of all specific regions; a fitting unit, for performing polynomial spline fitting of the GC content against the NRSs value of each specific region of the sample to obtain a fitted curve; an acquiring unit, for obtaining the fitted value NRSs" of each specific region from the fitted curve; a second computing unit, for calculating the correction factor α according to the formula α = Md / NRSs"; and a correction unit, for correcting the NRSs value of the sample in each specific region according to the correction formula NRSs' = NRSs × α.

In the above preferred embodiment, a fitting unit based on polynomial spline fitting has the advantage of high fitting accuracy, so the fitted values are obtained more accurately. Correspondingly, the correction factor calculated by the second computing unit is also more accurate, and the correction unit can in turn correct the NRSs value of the sample under test in each specific region more accurately, yielding NRSs' values of higher accuracy.

In the above processing device, the fitting unit also includes a filtering subunit which, before the fitting unit performs the polynomial spline fitting of the GC content against the NRSs value of each specific region of the sample to obtain the fitted curve, removes specific regions with abnormal NRSs values from all specific regions of the sample. This further improves the accuracy of the polynomial spline fitting performed by the fitting unit. Preferably, the filtering subunit filters outliers using a conventional linear fitting subunit or a local polynomial regression fitting subunit.

Preferably, NRSc values are the arbitrary integer in 10000~50000 in above-mentioned processing unit.

In the above processing unit, the target chromosome and the control chromosome can be reasonably selected according to the tissue or cell source of the sample to be tested, the species, or the actual detection requirements. When the sample to be tested is from a human, the target chromosome is preferably selected from any one or any combination of the following: chromosome 13, chromosome 18, chromosome 21, the X chromosome, and the Y chromosome; the control chromosome is selected from any one or any combination of the following: chromosome 1, chromosome 2, chromosome 3, chromosome 4, chromosome 5, chromosome 6, chromosome 7, chromosome 8, chromosome 9, chromosome 10, chromosome 11, and chromosome 12; more preferably, the control chromosome is selected from any one or any combination of the following: chromosome 1, chromosome 2, chromosome 3, chromosome 6, chromosome 7, chromosome 11, chromosome 12, and chromosome 16.

The above method and device of the application can be combined with other known methods, devices, or compositions to further improve chromosome abnormality detection, for example, mathematical model analysis of maternal biochemical markers.

The above method provided by the application has the advantages of high throughput, low cost, simplicity, accuracy, and high sensitivity. Existing methods need to compare the sample to be tested against multiple samples or a standard normal sample, which is time-consuming and requires many samples. The application achieves single-sample detection without relying on a standard normal sample, which avoids dependence on experimental conditions, speeds up analysis, and improves detection accuracy.

The above scheme provided by the application combines DNA sequencing with bioinformatic analysis and uses a difference test such as the Z test to determine whether a chromosome is abnormal. If the Z value falls outside the interval (-4.5, 4.5), chromosomal aneuploidy is determined to be present. The chromosome abnormality is preferably trisomy 21, trisomy 13, or an abnormality of chromosome 18, the X chromosome, or the Y chromosome.

The method of the application is particularly suitable for detecting chromosome number abnormalities, preferably chromosomal aneuploidy, and more preferably autosomal aneuploidy.

The beneficial effects of the application are further illustrated below with reference to specific embodiments.

Embodiment 1: Processing method for the sequencing data of a sample to be tested

(1) High-throughput sequencing of the cell-free DNA fragments in the maternal blood of the sample to be tested

(1) Collect whole blood from the pregnant woman and obtain plasma by pretreatment;

After informed consent was obtained, 5-10 ml of blood was drawn by venipuncture from a woman at 22 weeks of pregnancy (sample S001 in Table 2 below) and added to an ethylenediaminetetraacetic acid (EDTA) tube. The blood sample was centrifuged at high speed to obtain a plasma sample with the blood cells removed; the plasma volume of each sample was about 700 µl.

(2) Extract plasma DNA;

The DNA in the plasma was extracted using the DNA extraction kit HiPure Circulating DNA Kits produced by Magen (product number D3180-02).

(3) Prepare the DNA extracted from plasma into a library suitable for sequencing on a next-generation high-throughput sequencing platform

The plasma DNA was end-repaired and A-tailed using T4 DNA polymerase, T4 PNK, and Klenow enzyme, and sequencing adapters were ligated using T4 DNA ligase. PCR was then performed with indexed library primers, and the product was purified and size-selected with magnetic beads to give the sequencing library ready for loading.

(4) DNA sequencing of the prepared library

The sequencing library was amplified on an Illumina cBot instrument, where the single-end DNA sequencing library was made into DNA clusters, and a large number of sequences with a read length of 36 bp were obtained.

(2) Determine the sequence information of the DNA fragments in plasma

1. Specific-region division and statistics on the normal human reference genome

(1) Screening of non-repetitive sequences

The human reference genome (hg19, GRCh37; http://www.ncbi.nlm.nih.gov/projects/Genome/assembly/grc/) was cut into a massive set of k-mers with a length of 35 bp and an offset of 1 bp. Unique k-mers, that is, sequences that are non-repetitive across the whole genome, were screened from this set, and their position coordinates were recorded.
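The screening step above can be sketched on a toy string as follows. This is a minimal illustration, not the patent's implementation: the function name `unique_kmers`, the 0-based coordinates, the exclusion of windows containing `N`, and the choice to ignore the reverse strand are all assumptions, and counting every k-mer in memory this way would not scale to a whole genome.

```python
from collections import Counter

def unique_kmers(sequence, k=35):
    """Return (start, kmer) pairs for k-mers occurring exactly once in `sequence`.

    Hypothetical helper: start positions are 0-based and only the forward
    strand is considered (both are assumptions, not stated in the source).
    """
    seq = sequence.upper()
    # Slide a window of length k along the sequence with an offset of 1 base.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    # Keep k-mers seen once; skip windows containing an ambiguous base (N).
    return [(i, km) for i, km in enumerate(kmers)
            if counts[km] == 1 and "N" not in km]

# Toy example with k=4: "ACGT" occurs twice, so positions 0 and 4 are dropped.
toy = "ACGTACGTTT"
positions = [start for start, _ in unique_kmers(toy, k=4)]
```

The recorded start positions are the input to the region division described next.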

(2) Specific-region division

Starting from the first non-repetitive sequence on chromosome 1, the start position is recorded; when 20000 non-repetitive sequences have been accumulated, the end position is recorded. This interval is defined as the first specific region on chromosome 1. The specific regions do not overlap one another.

The above steps are repeated for chromosome 1 through the Y chromosome to obtain the position information and GC content of the specific regions of all chromosomes (the division of the normal human reference genome into specific regions needs to be performed only once; each subsequent sample to be tested is processed using the specific regions divided on the reference genome).
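The equal-count division described above could look roughly like the following sketch for a single chromosome. The function name `divide_regions` is hypothetical, and two details are assumptions because the source does not specify them: the region end is taken as the start coordinate of the last k-mer in the block, and a trailing block with fewer than `per_region` k-mers is discarded.

```python
def divide_regions(unique_starts, per_region=20000):
    """Group sorted unique-k-mer start positions on one chromosome into
    consecutive, non-overlapping regions of `per_region` k-mers each.

    Assumption: an incomplete trailing block is dropped.
    """
    regions = []
    for i in range(0, len(unique_starts) - per_region + 1, per_region):
        block = unique_starts[i:i + per_region]
        # Record (start of first k-mer, start of last k-mer) for this region.
        regions.append((block[0], block[-1]))
    return regions

# Toy example with 3 k-mers per region; the leftover position 12 is dropped.
toy_regions = divide_regions([1, 2, 3, 5, 6, 9, 12], per_region=3)
```

Because every region holds the same number of non-repetitive sequences, region lengths in base pairs vary, which is the intended contrast with fixed-length windows.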

(3) Specific-region statistics

Count the number of specific regions on each chromosome and the GC-content distribution of all non-repetitive sequences within each region.

2. Sample DNA sequence alignment

Using the bioinformatic sequence alignment software BWA (Burrows-Wheeler Aligner), the DNA sequences obtained by sequencing are aligned to the normal human reference genome (hg19, GRCh37) with no tolerance for errors (perfect match, no base mismatches allowed). This determines the detailed position of every sequenced DNA sequence on the genome, including the chromosome of origin, the coordinates on the chromosome, and the distribution across the genome specific regions (Fig. 1 shows the distribution of the non-repetitive sequences of sample S001 in Table 2 across the specific regions of the genome).
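Once reads have been aligned, assigning them to specific regions and counting NRSs per region could be sketched as below. This assumes the alignment has already been done elsewhere and that each read is represented only by its mapped start coordinate on one chromosome; the function name `count_reads_per_region` and that representation are illustrative assumptions.

```python
from bisect import bisect_right

def count_reads_per_region(regions, read_starts):
    """Count reads per region (the NRSs value of each region).

    regions: sorted, non-overlapping (start, end) tuples on one chromosome.
    read_starts: mapped start coordinates of perfectly matched reads
                 (hypothetical input format).
    """
    starts = [s for s, _ in regions]
    counts = [0] * len(regions)
    for pos in read_starts:
        # Index of the last region that starts at or before this position.
        idx = bisect_right(starts, pos) - 1
        # Count the read only if it also falls before the region end.
        if idx >= 0 and pos <= regions[idx][1]:
            counts[idx] += 1
    return counts

# Reads at 4, 10, and 0 fall outside every region and are not counted.
toy_counts = count_reads_per_region([(1, 3), (5, 9)], [1, 2, 4, 5, 9, 10, 0])
```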

(3) Determine the expression level of the chromosome to be tested

1. Filter outliers

The GC content of the genome specific regions of the sample to be tested and the number of non-repetitive sequences in each region (NRSs) are fitted by local polynomial regression using the loess function (a linear fit is also acceptable). Regions whose NRSs lies more than 3 standard deviations above or below the fitted value (p < 0.005) are defined as outliers; the distribution after outlier filtering is shown in Fig. 2.
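A minimal sketch of the outlier filter follows, using the linear fit that the text names as an acceptable alternative to loess. The function name `filter_outliers` is hypothetical, and measuring the standard deviation on the residuals of the fit is an assumption about how the 3-standard-deviation band is defined.

```python
from statistics import mean, pstdev

def filter_outliers(gc, nrs, n_sd=3.0):
    """Return indices of regions kept after outlier filtering.

    Fits nrs ~ a + b * gc by ordinary least squares and keeps regions whose
    residual lies within n_sd standard deviations of the fitted value.
    """
    gx, gy = mean(gc), mean(nrs)
    sxx = sum((x - gx) ** 2 for x in gc)
    b = sum((x - gx) * (y - gy) for x, y in zip(gc, nrs)) / sxx
    a = gy - b * gx
    residuals = [y - (a + b * x) for x, y in zip(gc, nrs)]
    sd = pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) <= n_sd * sd]

# Toy data: 21 regions with flat counts and one inflated region at index 10.
toy_gc = [0.30 + 0.01 * i for i in range(21)]
toy_nrs = [100.0] * 21
toy_nrs[10] = 200.0
kept = filter_outliers(toy_gc, toy_nrs)
```

A loess fit would follow the same pattern, with the fitted value per region taken from the local regression instead of the straight line.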

2. Weighted correction

After all specific regions of the genome of the sample to be tested are classified by GC content, spline curve fitting is performed to obtain the fitted NRSs value for each GC content, denoted NRSs''; the corresponding distribution is shown in Fig. 3.

Specifically, the median $\widetilde{NRSs}$ of the NRSs values is taken as the baseline, and the fitted value NRSs'' is compared with this baseline to obtain the correction factor α. The calculation formulas are as follows:

$\alpha = \widetilde{NRSs} / NRSs''$ (1)

$NRSs' = NRSs \times \alpha$ (2)

The above formulas are evaluated for each specific region of the genome of the sample to be tested, where $\widetilde{NRSs}$ is the median of the NRS counts over all specific regions of the genome, NRSs'' is the fitted value, and NRSs' is the corrected number of non-repetitive sequences.
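The correction can be illustrated with the sketch below, which rests on two assumptions that should be read as such. First, the correction factor is taken to be the ratio of the genome-wide median to the fitted value, as the description of the baseline suggests. Second, the spline fit is replaced by a simpler stand-in, the median NRSs within each GC bin, so that the example stays self-contained; the function name `gc_correct` and the bin width are hypothetical.

```python
from collections import defaultdict
from statistics import median

def gc_correct(gc, nrs, bin_width=0.01):
    """Return GC-corrected counts NRSs' = NRSs * (median(NRSs) / fitted).

    Stand-in technique: the fitted value NRSs'' of a region is the median
    count of all regions in the same GC bin, not a spline fit.
    Assumes every fitted value is non-zero.
    """
    baseline = median(nrs)
    bins = defaultdict(list)
    for g, n in zip(gc, nrs):
        bins[round(g / bin_width)].append(n)
    fitted = {b: median(v) for b, v in bins.items()}
    return [n * baseline / fitted[round(g / bin_width)]
            for g, n in zip(gc, nrs)]

# Toy data: regions at 50% GC have twice the counts of regions at 40% GC.
corrected = gc_correct([0.40, 0.40, 0.50, 0.50], [90, 110, 180, 220])
```

In this toy case the baseline is 145 and the fitted values are 100 and 200, so the two GC groups end up on the same scale after correction while the spread within each group is preserved.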

As can be seen from the before-and-after-correction comparisons in Fig. 4a and Fig. 4b, Fig. 5a and Fig. 5b, Fig. 6a and Fig. 6b, and Fig. 7a and Fig. 7b below, the uncorrected data fluctuate more, so directly testing for differences between chromosomes is more likely to produce false negative or false positive results. After correction, the distribution of non-repetitive sequence counts across the specific regions of each chromosome tends to be stable and differences in the data are more pronounced, which makes outliers easier to identify. This shows that the method of the application can eliminate differences in GC composition and avoid the GC bias problem. The method can be used to detect chromosomal aneuploidy and reduces the occurrence of false negative results. For example, in Fig. 7a and Fig. 7b the NRS count for chr21 is clearly higher than that of the other autosomes, and the corresponding detection result is that chromosome 21 of the sample has a high risk of aneuploidy.

(4) Z-value test to determine whether the chromosome expression level differs significantly

Using the GC-corrected values NRSs', the mean of NRSs' over all specific regions of the target chromosome (chr21, chr18, chr13, X, or Y) is compared in a difference test with the mean of NRSs' over all specific regions of the control chromosomes (chr1, chr2 ... chr12), giving the detection value Z (Z-score). Whether the current target chromosome has an aneuploid variation is judged from the Z value. When Z-score ≥ 4.5 or Z-score ≤ -4.5, the detection result is high risk of trisomy or high risk of monosomy, respectively; when -4.5 < Z-score < 4.5, the detection result is low risk of aneuploidy.

Alternatively, the control chromosomes are screened according to the distribution of housekeeping genes, and include chr1, chr2, chr3, chr6, chr7, chr11, chr12, and chr16.
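The comparison and the ±4.5 decision rule can be sketched as follows. The source does not give the formula for Z, so the standard two-sample Z statistic used here (difference of means divided by the combined standard error) is an assumption, as are the function names `z_score` and `classify` and the wording of the returned labels.

```python
from math import sqrt
from statistics import mean, variance

def z_score(target, control):
    """Two-sample Z statistic comparing the mean NRSs' of target-chromosome
    regions with the mean NRSs' of control-chromosome regions (assumed form)."""
    se = sqrt(variance(target) / len(target) + variance(control) / len(control))
    return (mean(target) - mean(control)) / se

def classify(z, threshold=4.5):
    """Apply the thresholds stated in the text: >= 4.5, <= -4.5, otherwise."""
    if z >= threshold:
        return "high risk: trisomy"
    if z <= -threshold:
        return "high risk: monosomy"
    return "low risk"

# Hypothetical corrected counts, invented for illustration only.
toy_target = [110, 112, 108, 111, 109]
toy_control = [100, 101, 99, 100, 102, 98]
toy_z = z_score(toy_target, toy_control)
toy_call = classify(toy_z)
```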

Embodiment 2: Validity evaluation

(1) Evaluation using online data samples

The steps of the processing method shown in Embodiment 1 can be implemented by a computing device in the form of modules or units. To evaluate the validity of the method of Embodiment 1, a processing unit formed from modules or units capable of performing the above steps is tested below. The processing unit includes:

Sequencing module, for obtaining the nucleotide sequence information of all chromosomes from the maternal peripheral blood sample by high-throughput sequencing;

Optionally, the above module includes the module that performs the sequencing function in sequencers such as the Illumina cBot instrument with the Illumina Genome Analyzer, HiSeq2000/2500, HiSeq3000/4000, or NextSeq CN500, or in sequencers such as the SOLiD from Life Technologies.

Specific-region division module: the specific-region division module program is called to divide the reference genome into multiple specific regions according to the principle of equal NRSc values. The division can use any integer number of non-repetitive sequences from 10000 to 50000 (preferably 20000) as the unit, which overcomes the defect of existing fixed-length division (for example, specific regions of 20 kb or 50 kb), in which the number of non-repetitive sequences differs greatly between regions and data homogeneity is poor.

Distribution module: the distribution module is run to align the output of the sequencing module against the reference genome and to distribute the nucleotide sequence information of all chromosomes from the maternal peripheral blood sample to the specific regions produced by the specific-region division module;

Optionally, a module capable of performing sequence alignment, such as a BWA module, a BOWTIE module, or a NOVOALIGN module, is used to distribute the sequencing data of the sample to be tested;

First statistical module, for counting the NRSs value of the sample in each specific region; an optional statistical module is the SAMTOOLS module;

Correction module, for correcting the NRSs value of the sample in each specific region using the GC content, the corrected value being denoted NRSs';

Preferably, the correction module includes: a first computing unit, for calculating the median $\widetilde{NRSs}$ of the NRSs values of all specific regions; a fitting unit, for performing polynomial spline fitting on the GC content and NRSs values of each specific region of the sample to obtain a fitted curve; an acquiring unit, for obtaining the fitted value NRSs'' of each specific region from the fitted curve; a second computing unit, for calculating the correction factor α according to the formula $\alpha = \widetilde{NRSs} / NRSs''$; and a correction unit, for correcting the NRSs value of the sample in each specific region according to the correction formula NRSs' = NRSs × α.

Second statistical module, for calculating, based on the NRSs' values, the mean of the NRSs' values of all specific regions on the target chromosome and on the control chromosome, denoted the first mean and the second mean, respectively;

Test module, for performing a difference test on the first mean and the second mean; optionally, a Z-test module is used for the difference analysis;

Determination module, for determining whether the chromosome has aneuploidy according to the result of the difference test;

Preferably, when the target chromosome is an autosome and -4.5 ≤ Z value ≤ 4.5, the determination module determines that the target chromosome has no aneuploidy; otherwise, it determines that aneuploidy is present.

Data from different laboratories and different NGS platforms (high-throughput sequencing data of maternal peripheral blood from clinical studies of non-invasive prenatal genetic testing projects, uploaded by other institutions and downloaded from the NCBI SRA database, http://www.ncbi.nlm.nih.gov/sra/, comprising 384 samples) are used as samples to further illustrate the validity and versatility of the processing unit of the application.

The detection results for chromosomes 21, 18, and 13 in the 384 samples are shown in Table 1 below:

Table 1. Detection results for the positive samples among the 384 NCBI online data samples.

Note: In Table 1 above, "chr" denotes chromosome; "gc" denotes GC content; "ZV" denotes the Z value; "TEST" denotes the chromosomal aneuploidy detection result obtained by this method.

As can be seen from Table 1 above and from Fig. 8a, Fig. 8b, and Fig. 8c, 1 T13-positive sample (SRR358477) was detected, and the Z values of chromosome 13 of the remaining samples were all stably distributed within the interval (-4.5, 4.5); 5 T18-positive samples (SRR357943, SRR357972, SRR358089, SRR358257, SRR358325) were detected, and the Z values of chromosome 18 of the remaining samples were all stably distributed within the interval (-4.5, 4.5); 7 T21-positive samples (SRR357843, SRR358020, SRR358126, SRR358144, SRR358322, SRR358352, SRR358353) were detected, and the Z values of chromosome 21 of the remaining samples were all stably distributed within the interval (-4.5, 4.5).

(2) Evaluation using blood samples

The above processing method is evaluated using the detection results of 68 samples (provided by the clinical testing center of the Ministry of Health and Beijing People's Hospital) as an example. The evaluation results are shown in Table 2, which shows only the results of the first 30 samples, verified against the karyotyping results.

Table 2. Blood sample detection results.

Note: The uncorrected value is the mean NRS over all specific regions of the target chromosome before correction. The corrected value is the mean NRS over all specific regions of that chromosome after the weighted correction by the GC content of the specific regions. ZV_chri (i = 13, 18, 21) is the Z value obtained from the significant difference analysis between that chromosome and the control chromosomes. TEST is the chromosomal aneuploidy detection result obtained by this method: N (Negative) indicates a negative result with no obvious abnormality detected, and T13/T18/T21 indicates that the result shows an aneuploidy abnormality in the target chromosome. Karyotype is the clinical karyotyping result, that is, the gold standard result (46, XN denotes the chromosome number and sex chromosome status of a sample with a normal karyotype; 47, XN, +21 denotes that karyotyping of the sample shows 47 chromosomes, with one more chromosome 21 than the normal karyotype, that is, Down syndrome).

The data in Table 2 show that, according to the significant difference test, ZV_chr13 of samples S0002 and S0013 is greater than or equal to 4.5, so chromosome 13 is judged to have a high risk of aneuploidy; ZV_chr18 of samples S0007 and S0012 is greater than or equal to 4.5, so chromosome 18 is judged to have a high risk of aneuploidy; and ZV_chr21 of samples S0003, S0006, and S0011 is greater than or equal to 4.5, so chromosome 21 is judged to have a high risk of aneuploidy. For chromosomes 21, 18, and 13, the detection results of the application are consistent with the karyotyping results, and the samples judged to be low risk by this method, that is, samples with ZV values between -4.5 and 4.5, also have normal karyotyping results. This shows that the method has high accuracy in detecting chromosomal aneuploidy.

Embodiment 3: Stability and data volume study

(1) Sample stability

Using the above method, the four samples s002, s006, s007, and s008 (whose karyotyping results are T13 positive, T21 positive, T18 positive, and normal, respectively) were each sequenced 8 times. The fluctuation of the relative chromosome expression level (denoted CR) and of the Z-test value (denoted ZV) was analyzed to evaluate the stability of the detection method. The evaluation results are shown in Table 3.

Table 3. Summary of the repeatability detection data for s002, s006, s007, and s008

In Table 3 above, Mean denotes the average, SD the standard deviation, and CV the coefficient of variation. As Table 3 shows, across the 8 repeated detections of the 4 samples, the CV of the CR values is below 0.01 in every case and the fluctuation of ZV (SD value) is within ±1.1. The data fluctuate little, which indicates that the method is stable.
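For reference, the three summary statistics named above relate as in the short sketch below. The function name `summarize` is hypothetical, the use of the sample standard deviation is an assumption, and the input values are invented purely to show the arithmetic; they are not taken from Table 3.

```python
from statistics import mean, stdev

def summarize(values):
    """Return the mean, sample standard deviation, and coefficient of
    variation (SD divided by mean) of repeated measurements."""
    m = mean(values)
    sd = stdev(values)
    return {"mean": m, "sd": sd, "cv": sd / m}

# Hypothetical CR values from four repeats (illustration only).
toy_summary = summarize([1.00, 1.01, 0.99, 1.00])
```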

(2) Data volume study

The fluctuation of NRS counts in the genome specific regions was studied for sequencing data volumes ranging from 0.25M (raw reads) to 15M. From the sequencing data of the four samples s002, s006, s007, and s008 (whose karyotyping results are T13 positive, T21 positive, T18 positive, and normal, respectively), data volumes of 2M to 15M were randomly extracted, aligned to the genome, and used to calculate ZV and the CV (coefficient of variation) of the sample's non-repetitive sequence counts over all specific regions of the genome. The statistical results are shown in Table 4.

Table 4. CV values (coefficient of variation) and ZV values (Z values) for different sequencing data volumes

As Table 4 shows, the method is suitable for chromosome detection over a wide range of data volumes; in particular, when the data volume is 1M or more, both the stability of the data and the Z-test results are good.
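The random extraction of a fixed data volume used in this study could be sketched as follows. The function name `subsample_reads`, the in-memory list of reads, and the fixed seed are assumptions for illustration; the source does not describe how the extraction was performed.

```python
import random

def subsample_reads(reads, n, seed=0):
    """Draw n reads at random without replacement (all reads if fewer than n).

    A fixed seed makes the draw reproducible across runs.
    """
    rng = random.Random(seed)
    return rng.sample(reads, min(n, len(reads)))

# Toy example: draw 10 of 100 placeholder read identifiers.
toy_reads = list(range(100))
toy_subset = subsample_reads(toy_reads, 10)
```

Each subset would then go through the same alignment, counting, correction, and Z-test steps as a full data set.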

As can be seen from the above description, the above embodiments of the present invention achieve the following technical effects. Specific regions are divided on the basis of the sequencing data according to the principle of an equal number of non-repetitive sequences, which avoids the data fluctuation caused by uneven numbers of non-repetitive sequences in the specific regions and thereby optimizes the correlation of nucleic acid data parameters between chromosomes. The parameters of the clinically relevant chromosomes in the biological sample are compared with those of other, non-clinically-relevant chromosomal regions to determine whether chromosomal aneuploidy is present in the sample to be tested. The method achieves single-sample detection without a standard normal sample, eliminates the dependence on experimental conditions, speeds up analysis, and generally improves the assessment of chromosome abnormalities. It provides a simple, fast, and accurate means of detecting chromosomal aneuploidy, with an accuracy for autosome detection above 99% and a false positive rate below 1%. Compared with multi-sample methods, the method reduces the false negative rate; compared with existing single-sample methods, it requires less sequencing data.

Obviously, those skilled in the art should understand that the modules, units, or steps of the application described above can be implemented with a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be made into individual integrated circuit modules, or several of the modules or steps can be made into a single integrated circuit module. The application is therefore not restricted to any specific combination of hardware and software.

The foregoing describes only the preferred embodiments of the present invention and is not intended to limit the invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A processing method for sequencing data, characterized in that the processing method comprises:

obtaining the nucleotide sequence information of all chromosomes from a maternal peripheral blood sample by high-throughput sequencing;

dividing a reference genome into multiple specific regions, the number of non-repetitive sequences (the NRSc value) being equal in each specific region;

distributing the nucleotide sequence information of all chromosomes from the maternal peripheral blood sample to the multiple specific regions of the reference genome, and counting the NRSs value of the sample in each specific region;

correcting the NRSs value of the sample in each specific region using the GC content, the corrected value being denoted NRSs';

calculating, based on the NRSs' values, the mean of the NRSs' values of all specific regions on the target chromosome and on the control chromosome, denoted the first mean and the second mean, respectively;

performing a difference test on the first mean and the second mean, and determining whether the chromosome has aneuploidy according to the result of the difference test.

2. The processing method according to claim 1, characterized in that the step of correcting the NRSs value of the sample in each specific region using the GC content comprises:

correcting the NRSs value of the sample in each specific region using the correction formula NRSs' = NRSs × α, wherein $\alpha = \widetilde{NRSs} / NRSs''$, $\widetilde{NRSs}$ is the median of the NRSs values of all specific regions, and NRSs'' is the fitted value obtained by performing polynomial spline fitting on the GC content and NRSs values of each specific region of the sample.

3. The processing method according to claim 2, characterized in that, before polynomial spline fitting is performed on the GC content and NRSs values of each specific region of the sample, the processing method further comprises a step of removing the specific regions with abnormal NRSs values from all specific regions of the sample.

4. The processing method according to claim 3, characterized in that the specific regions with abnormal NRSs values are removed using linear fitting or local polynomial regression fitting.

5. The processing method according to claim 1, characterized in that the NRSc value is any integer from 10000 to 50000.

6. The processing method according to any one of claims 1 to 5, characterized in that

the target chromosome is selected from any one or any combination of the following: chromosome 13, chromosome 18, chromosome 21, the X chromosome, and the Y chromosome;

the control chromosome is selected from any one or any combination of the following: chromosome 1, chromosome 2, chromosome 3, chromosome 4, chromosome 5, chromosome 6, chromosome 7, chromosome 8, chromosome 9, chromosome 10, chromosome 11, chromosome 12, and chromosome 16.

7. A processing unit for sequencing data, characterized in that the processing unit comprises:

a sequencing module, for obtaining the nucleotide sequence information of all chromosomes from a maternal peripheral blood sample by high-throughput sequencing;

a specific-region division module, for dividing a reference genome into multiple specific regions according to the principle of equal NRSc values;

a distribution module, for distributing the nucleotide sequence information of all chromosomes from the maternal peripheral blood sample to the multiple specific regions of the reference genome by sequence alignment against the reference genome;

a first statistical module, for counting the NRSs value of the sample in each specific region;

a correction module, for correcting the NRSs value of the sample in each specific region using the GC content, the corrected value being denoted NRSs';

a second statistical module, for calculating, based on the NRSs' values, the mean of the NRSs' values of all specific regions on the target chromosome and on the control chromosome, denoted the first mean and the second mean, respectively;

a test module, for performing a difference test on the first mean and the second mean;

a determination module, for determining whether the chromosome has aneuploidy according to the result of the difference test.

8. The processing unit according to claim 7, characterized in that the correction module comprises:

a first computing unit, for calculating the median $\widetilde{NRSs}$ of the NRSs values of all specific regions;

a fitting unit, for performing polynomial spline fitting on the GC content and NRSs values of each specific region of the sample to obtain a fitted curve;

an acquiring unit, for obtaining the fitted value NRSs'' of each specific region from the fitted curve;

a second computing unit, for calculating the correction factor α according to the formula $\alpha = \widetilde{NRSs} / NRSs''$;

a correction unit, for correcting the NRSs value of the sample in each specific region according to the correction formula NRSs' = NRSs × α.

9. The processing unit according to claim 8, characterized in that the fitting unit further comprises a filtering subunit, and before the fitting unit performs the step of polynomial spline fitting on the GC content and NRSs values of each specific region of the sample to obtain the fitted curve, the filtering subunit performs the step of removing the specific regions with abnormal NRSs values from all specific regions of the sample.

10. The processing unit according to claim 9, characterized in that the filtering subunit is a linear fitting subunit or a local polynomial regression fitting subunit.

11. The processing unit according to claim 7, characterized in that the NRSc value is any integer from 10000 to 50000.

12. The processing unit according to any one of claims 7 to 11, characterized in that