CN108268752A

CN108268752A - Chromosome abnormality detection device

Info

Publication number: CN108268752A
Application number: CN201810047686.8A
Authority: CN
Inventors: 糜庆丰; 彭春方; 张娟; 赵宇; 陈样宜; 饶兴蔷; 罗东红; 黄铨飞; 刘丽菲
Original assignee: CapitalBio Genomics Co Ltd
Current assignee: CapitalBio Genomics Co Ltd
Priority date: 2018-01-18
Filing date: 2018-01-18
Publication date: 2018-07-10
Anticipated expiration: 2038-01-18
Also published as: CN108268752B

Abstract

The invention discloses a chromosome abnormality detection device. Existing chromosome abnormality analysis is based on a read-count statistical model, which can only remove duplicate reads aligned to the same start position in the genome and cannot remove reads that have different start positions but overlap one another. By introducing a sequence coverage statistical model, the device of the invention effectively removes the duplicate reads caused by single-cell whole-genome amplification bias together with their overlapping regions, markedly improves data uniformity, and thereby reduces noise, raises the detection rate for positive samples, and lowers the false-positive rate.

Description

A chromosome abnormality detection device

Technical field

The present invention relates to data processing technology, and in particular to a chromosome abnormality detection device.

Background

In recent years, more and more patients have turned to assisted reproduction to achieve pregnancy. Extensive clinical experience shows that, during assisted reproduction, embryos from some high-risk couples are prone to repeated implantation failure or unexplained spontaneous abortion. The overall live birth rate of in vitro fertilization (IVF) is below 30%, and studies have found that embryonic chromosome abnormality is the main cause of IVF failure. Performing pre-implantation chromosome abnormality detection on embryos and then selecting healthy embryos for implantation can therefore significantly improve the pregnancy rate and live birth rate of IVF.

Pre-implantation chromosome abnormality detection requires single-cell amplification of blastocyst trophectoderm cells or blastomeres so that the DNA reaches the input amount required by high-throughput sequencing platforms, that is, raising picogram-level (pg) DNA to the microgram (μg) level. Current mainstream single-cell amplification methods fall into three classes by principle: PCR-based single-cell amplification (such as DOP-PCR)^[1], multiple displacement amplification (MDA)^[2], and multiple annealing and looping-based amplification cycles (MALBAC)^[3]. Because all of these methods use dozens of rounds of exponential amplification, amplification bias at certain genomic positions is greatly magnified. This produces a large number of duplicate reads, markedly reduces the uniformity of sequencing depth, and ultimately leads to many outliers and false-positive results in the sample analysis. Removing the duplicate reads caused by amplification bias is therefore very important for pre-implantation chromosome abnormality detection based on single-cell amplification.

At present, analysis of embryonic chromosome abnormality is based on read counts: the reads produced by sequencing are aligned to the reference genome; duplicate reads aligned to the same start position in the genome are filtered out; the reference genome is divided into N fixed-length statistical windows and the number of reads in each window is counted; the read counts are GC-corrected; the read counts are normalized and converted into read ratios; finally, the read ratios across the genome are analyzed statistically to judge whether the embryo under test has a chromosome abnormality. When removing duplicate reads, this workflow can only remove duplicates aligned to the same start position in the genome; reads that have different start positions but overlap one another cannot be removed effectively. A more effective duplicate-removal method is therefore needed to improve the accuracy of chromosome abnormality detection based on single-cell whole-genome amplification.
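The limitation described above can be illustrated with a minimal Python sketch. The function name `dedup_by_start` and the `(chromosome, start, length)` tuple layout are assumptions made for illustration; they do not come from the patent or from any particular alignment tool.

```python
def dedup_by_start(reads):
    """Keep one read per (chromosome, start) pair.

    Hypothetical stand-in for start-position duplicate filtering;
    reads are (chromosome, start, length) tuples.
    """
    seen = set()
    kept = []
    for chrom, start, length in reads:
        key = (chrom, start)
        if key not in seen:
            seen.add(key)
            kept.append((chrom, start, length))
    return kept


reads = [
    ("chr1", 100, 50),  # first read
    ("chr1", 100, 50),  # same start position: removed as a duplicate
    ("chr1", 110, 50),  # different start, overlaps the first read: kept
]
kept = dedup_by_start(reads)
```

After filtering, two reads remain and both are counted, even though 40 of their bases cover the same positions.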

References

[1] Telenius H, Carter NP, Bebb CE, et al. Degenerate oligonucleotide-primed PCR: general amplification of target DNA by a single degenerate primer [J]. Genomics, 1992, 13(3): 718-725.

[2] Dean FB, Nelson JR, Giesler TL, et al. Rapid amplification of plasmid and phage DNA using Phi 29 DNA polymerase and multiply-primed rolling circle amplification [J]. Genome Research, 2001, 11(6): 1095-1099.

[3] Zong C, Lu S, Chapman AR, et al. Genome-wide detection of single-nucleotide and copy-number variations of a single human cell [J]. Science, 2012, 338(6114): 1622-1626.

[4] Olshen AB, Venkatraman ES, Lucito R, et al. Circular binary segmentation for the analysis of array-based DNA copy number data [J]. Biostatistics, 2004, 5(4): 557-572.

[5] Venkatraman ES, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data [J]. Bioinformatics, 2007, 23(6): 657-663.

Summary of the invention

To solve the above technical problem, the object of the present invention is to provide a chromosome abnormality detection device.

The technical solution adopted by the present invention is:

A chromosome abnormality detection device, comprising:

Sequencing data acquisition unit: for obtaining reads produced by high-throughput sequencing;

Alignment unit: for aligning the reads to the human genome reference sequence to obtain the position information and length information of the reads;

Coverage calculation unit: for dividing the human genome reference sequence into a number of first windows; calculating the coverage of each first window from the position information and length information of the reads; performing Loess correction based on the coverage and GC content of the first windows; merging several consecutive first windows into a second window; and calculating the Loess-corrected coverage of the second window and its coverage ratio;

Candidate CNV identification unit: for identifying chromosome breakpoint positions using the circular binary segmentation (CBS) algorithm, calculating the CBS ratio between adjacent breakpoints, and identifying candidate CNV regions according to a CBS ratio threshold;

False-positive filtering unit: for calculating the significance P-value of the CBS ratio of each candidate CNV region and filtering out false-positive regions according to the P-value, to obtain the CNV regions and karyotype analysis result of the sample under test.

Specifically, coverage = total number of bases covered in a region / length of the region; coverage ratio = coverage of the region / coverage of all autosomes.
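The two formulas above can be sketched in Python as follows. The sketch assumes that "bases covered" means each reference position is counted once no matter how many reads overlap it, which matches the stated aim of removing overlapping regions. The function names and the 0-based half-open coordinates are illustrative choices, and the autosomal coverage is passed in as a precomputed number because the text does not say how it is aggregated.

```python
def covered_bases(reads, win_start, win_end):
    """Count positions in [win_start, win_end) covered by at least one read.

    reads: iterable of (start, length) pairs on one chromosome.
    """
    # Clip each read to the window and sort the resulting intervals.
    intervals = sorted(
        (max(start, win_start), min(start + length, win_end))
        for start, length in reads
        if start < win_end and start + length > win_start
    )
    total = 0
    cur_start = cur_end = None
    for a, b in intervals:
        if cur_end is None or a > cur_end:
            # Gap before this interval: close the previous merged run.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = a, b
        else:
            # Overlapping or adjacent: extend the current merged run.
            cur_end = max(cur_end, b)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def coverage(reads, win_start, win_end):
    """coverage = covered bases / region length."""
    return covered_bases(reads, win_start, win_end) / (win_end - win_start)


def coverage_ratio(region_coverage, autosomal_coverage):
    """coverage ratio = region coverage / coverage of all autosomes."""
    return region_coverage / autosomal_coverage


# Two overlapping reads plus one read that runs past the window end.
example_reads = [(100, 50), (110, 50), (190, 50)]
example_cov = coverage(example_reads, 0, 200)
```

In the example, the two overlapping reads contribute 60 covered bases rather than 100, so overlap no longer inflates the signal.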

Specifically, the CBS ratio is the mean of the coverage ratios of all second windows between adjacent breakpoints identified by the circular binary segmentation algorithm.
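As a sketch of this definition, the following Python function computes one CBS ratio per segment, assuming the breakpoints have already been found by a segmentation routine. Representing breakpoints as indices into the list of second windows is an assumption for illustration, and the example values are made up.

```python
def cbs_ratios(window_ratios, breakpoints):
    """Mean coverage ratio of the second windows in each segment.

    window_ratios: coverage ratios of consecutive second windows.
    breakpoints: sorted window indices at which a new segment starts.
    Returns (start_index, end_index, mean_ratio) per segment, end exclusive.
    """
    edges = [0] + list(breakpoints) + [len(window_ratios)]
    segments = []
    for start, end in zip(edges[:-1], edges[1:]):
        values = window_ratios[start:end]
        segments.append((start, end, sum(values) / len(values)))
    return segments


# Illustrative values only: six windows with breakpoints before windows 2 and 5.
segments = cbs_ratios([2.0, 2.0, 3.0, 3.0, 3.0, 2.0], [2, 5])
```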

In the coverage calculation unit, the first window is a non-repeating segment of 10 to 50 kb; preferably, the first window is a non-repeating segment of 20 kb.

In the coverage calculation unit, the length of the second window is 0.1 to 2 Mb; preferably, the second window length is selected from 100 kb, 500 kb and 1 Mb.

Preferably, in the candidate CNV identification unit, the CBS ratio threshold is [1.4, 2.6], and a region whose CBS ratio falls outside this range is judged to be a candidate CNV region.
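The threshold rule can be written as a one-line check. In this sketch the interval is treated as closed, so a value exactly at 1.4 or 2.6 is not flagged; the function and constant names are illustrative.

```python
LOWER, UPPER = 1.4, 2.6  # preferred CBS ratio threshold range from the text


def is_candidate_cnv(cbs_ratio, lower=LOWER, upper=UPPER):
    """Return True when the CBS ratio lies outside [lower, upper]."""
    return cbs_ratio < lower or cbs_ratio > upper
```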

Preferably, in the false-positive filtering unit, calculating the P-value comprises:

A random sampling database is built from the results of normal reference samples; simulated CBS segments of the same length as the candidate CNV region are drawn from it at least 100,000 times to obtain the density distribution of simulated CBS ratio values, from which the significance P-value of the CBS ratio of the candidate CNV region is calculated.

Preferably, in the false-positive filtering unit, a candidate CNV region with P-value < 0.001 is judged to be a CNV region; otherwise it is filtered out as a false-positive region.
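The text does not specify how the P-value is derived from the simulated density distribution. One plausible reading is an empirical two-sided tail probability, sketched below in Python. Everything in this sketch is an assumption for illustration: the reference database is modeled as a flat list of second-window coverage ratios from normal samples, simulated segments are contiguous runs of the same number of windows as the candidate, and the add-one correction keeps the estimate above zero.

```python
import random


def empirical_p_value(observed, reference_ratios, segment_len,
                      n_draws=100_000, seed=0):
    """Empirical two-sided P-value of an observed CBS ratio.

    reference_ratios: coverage ratios of consecutive second windows
        from normal reference samples (hypothetical layout).
    segment_len: number of second windows in the candidate region.
    """
    rng = random.Random(seed)
    max_start = len(reference_ratios) - segment_len
    simulated = []
    for _ in range(n_draws):
        start = rng.randint(0, max_start)  # inclusive on both ends
        segment = reference_ratios[start:start + segment_len]
        simulated.append(sum(segment) / segment_len)
    center = sum(simulated) / n_draws
    observed_dev = abs(observed - center)
    # Count simulated segments at least as far from the center as the observation.
    extreme = sum(1 for s in simulated if abs(s - center) >= observed_dev)
    return (extreme + 1) / (n_draws + 1)
```

A candidate region would then be kept when the returned value is below 0.001. With the add-one correction, a P-value below 0.001 requires more than 999 draws, which is consistent with using a large number of draws such as 100,000.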

Further, the device also comprises a sequencing unit:

connected to the sequencing data acquisition unit and used for high-throughput sequencing of a library constructed from a sample, the sample including samples that have undergone single-cell amplification, samples that have undergone PCR pre-amplification, and samples without PCR pre-amplification.

Further, the device also comprises a filtering unit:

connected to the alignment unit and used, according to the alignment result, for rejecting reads located at tandem repeat and transposon repeat positions as well as low-quality reads, multi-mapped reads, and reads not fully matched to a chromosome.

The beneficial effects of the invention are as follows：

Existing chromosome abnormality analysis is based on a read-count statistical model, which can only remove duplicate reads aligned to the same start position in the genome and cannot remove reads that have different start positions but overlap one another. By introducing a sequence coverage statistical model, the device of the invention effectively removes the duplicate reads caused by single-cell whole-genome amplification bias together with their overlapping regions, markedly improves data uniformity, and thereby reduces noise, raises the detection rate for positive samples, and lowers the false-positive rate.

Brief description of the drawings

Fig. 1 is a schematic flow diagram of chromosome abnormality detection;

Fig. 2 shows the distribution of copy-number values for the 24 chromosomes of sample T1 at 1 Mb resolution; panel A shows the result of the conventional read-count method, and panel B shows the result of the coverage-based method provided by the invention;

Fig. 3 shows the distribution of copy-number values for the 24 chromosomes of sample T8 at 1 Mb resolution; panel A shows the result of the conventional read-count method, and panel B shows the result of the coverage-based method provided by the invention;

Fig. 4 shows the distribution of copy-number values for the 24 chromosomes of sample T19 at 1 Mb resolution; panel A shows the result of the conventional read-count method, and panel B shows the result of the coverage-based method provided by the invention;

Fig. 5 shows the distribution of copy-number values for the 24 chromosomes of sample T2 at 1 Mb resolution; panel A shows the result of the conventional read-count method, and panel B shows the result of the coverage-based method provided by the invention.

Detailed description

The idea behind the invention is as follows. For samples with a low starting DNA content (such as single-cell samples), raising the DNA amount from the pg level to the μg level by exponential single-cell amplification tends to magnify amplification bias greatly and produce a large number of duplicate reads, so sample uniformity is poor. When removing duplicate reads, conventional read-count-based chromosome abnormality analysis can only remove duplicates aligned to the same start position in the genome; reads with different start positions that overlap one another cannot be removed effectively. As a result, the number of reads counted in genomic regions subject to amplification bias differs from that in regions without such bias, and the sequence ratio of some regions in the analysis result ends up markedly higher (or lower) than normal, producing false-positive results. To keep single-cell amplification from affecting the detection result, the device of the invention detects chromosome abnormality with a coverage-based statistical model. Because this model effectively removes the overlapping regions between adjacent reads, it reduces the false positives caused by single-cell amplification bias and achieves highly accurate chromosome abnormality detection. It follows from this concept that the device is suitable not only for pre-implantation screening of trace samples that require single-cell amplification, but also for chromosome abnormality detection in samples that require PCR pre-amplification, such as abortion tissue, and all the more for conventional-input samples without PCR pre-amplification. The device is a general-purpose chromosome abnormality detection device; in particular, it addresses the detection of samples affected by PCR amplification bias, where it shows superior detection performance.

The chromosome abnormality detection device provided by the invention comprises the units set out in the summary above.

In the coverage calculation unit, the length of the second window is 0.1 to 2 Mb; preferably, the second window length is selected from 100 kb, 500 kb and 1 Mb.

Further, the device may also comprise the sequencing unit and the filtering unit described above.

The sequencing unit, sequencing data acquisition unit, alignment unit, filtering unit, coverage calculation unit, candidate CNV identification unit, and false-positive filtering unit described above may each be a program module or a hardware module.

The invention is further explained below with reference to a specific embodiment; the scope of protection of the invention is not limited thereto.

Embodiment 1

The chromosome abnormality detection device provided by the invention is applied to chromosome abnormality detection based on single-cell amplification. It comprises the following processing steps; the flow is shown in Fig. 1.

1. Obtaining whole-genome sequencing data

A batch of cell lines with known karyotypes was purchased from Coriell. A total of 25 samples, numbered T1 to T25, took part in this test: 2 negative samples; 3 samples with sex chromosome abnormalities; 7 autosomal aneuploidy samples; 1 sample with a sex-chromosome microduplication or microdeletion; and 12 samples with autosomal microduplications or microdeletions. The samples underwent single-cell whole-genome amplification, library construction, and high-throughput sequencing to obtain reads.

2. Alignment

The reads are aligned to the human reference genome hg19 so that each read is mapped to its corresponding chromosomal position, yielding alignment information for each read, including its position, length, and quality-control information.

3. Filtering

Based on the quality-control information in the alignment result, reads located at tandem repeat and transposon repeat positions are rejected, as are low-quality reads, multi-mapped reads, and reads not fully matched to a chromosome.

4. Coverage calculation

The human genome reference sequence is divided into a number of first windows, each a non-overlapping 20 kb region. The coverage of each first window is calculated from the position and length information of the filtered reads, and Loess correction for GC bias is performed based on the coverage and GC content of the first windows. Several consecutive first windows are then merged into second windows, each 1 Mb long, and the Loess-corrected coverage of each second window and its coverage ratio (CR) are calculated. Here, coverage = total number of bases covered in a region / length of the region; coverage ratio = coverage of the region / coverage of all autosomes.
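The GC correction and window merging in this step can be sketched in pure Python. The patent names Loess but gives no parameters, so the sketch below makes several assumptions: a simple local linear fit with tricube weights stands in for a full Loess implementation, the smoothing fraction `frac` is arbitrary, the correction rescales each window by the median coverage divided by the fitted value at its GC content, and second-window coverage is taken as the mean of its equal-length first windows (50 windows of 20 kb per 1 Mb window).

```python
from statistics import median


def loess_fit(x, y, frac=0.5):
    """Local linear regression with tricube weights, evaluated at each x.

    Simplified stand-in for Loess; cost grows quadratically with len(x).
    """
    n = len(x)
    k = max(2, int(round(frac * n)))
    fitted = []
    for x0 in x:
        # Indices of the k points nearest to x0 on the x axis.
        nearest = sorted(range(n), key=lambda i: abs(x[i] - x0))[:k]
        h = max(abs(x[i] - x0) for i in nearest) or 1.0
        w = [(1 - (abs(x[i] - x0) / h) ** 3) ** 3 for i in nearest]
        sw = sum(w)
        mx = sum(wi * x[i] for wi, i in zip(w, nearest)) / sw
        my = sum(wi * y[i] for wi, i in zip(w, nearest)) / sw
        sxx = sum(wi * (x[i] - mx) ** 2 for wi, i in zip(w, nearest))
        sxy = sum(wi * (x[i] - mx) * (y[i] - my) for wi, i in zip(w, nearest))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def gc_correct(cov, gc, frac=0.5):
    """Rescale first-window coverage by median / fitted GC trend (assumed rule)."""
    fitted = loess_fit(gc, cov, frac)
    med = median(cov)
    return [c * med / f if f > 0 else c for c, f in zip(cov, fitted)]


def merge_windows(values, per_bin):
    """Average each run of per_bin consecutive first windows into one second window."""
    merged = []
    for i in range(0, len(values), per_bin):
        chunk = values[i:i + per_bin]
        merged.append(sum(chunk) / len(chunk))
    return merged
```

With 20 kb first windows, `merge_windows(corrected, 50)` would give 1 Mb second windows; a trailing partial window is averaged over the first windows it actually contains.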

5. Candidate CNV identification

Chromosome breakpoint positions are identified with the circular binary segmentation (CBS) algorithm^[4][5]. The CBS ratio threshold is set to [1.4, 2.6]; a region whose CBS ratio falls outside this range is judged to be a candidate CNV region, and otherwise the region is judged to have no chromosome abnormality. The CBS ratio is the mean of the coverage ratios of all second windows between adjacent breakpoints identified by CBS.

6. False-positive filtering

A random sampling database is built from the results of normal reference samples. Simulated CBS segments of the same length as the candidate CNV region are drawn from it 100,000 times to obtain the density distribution of simulated CBS ratio values, from which the significance P-value of the CBS ratio of the candidate CNV region is calculated. False-positive regions are filtered according to this P-value: a candidate CNV region with P-value < 0.001 is judged to be a CNV region; otherwise it is filtered out as a false-positive region. This finally yields the CNV regions and karyotype analysis result of the sample under test.

The inventors compared this embodiment (hereinafter the "coverage method") with the conventional read-count-based chromosome abnormality detection method (hereinafter the "read-count method"), and also analyzed the same batch of samples with a chip (microarray) method.

Table 1 gives the chromosome abnormality detection results for the 25 cell samples of known karyotype: 24 samples gave identical results with the read-count method and the coverage method, consistent with the chip karyotype results; 1 sample (T2) gave different results with the two methods, and the chip karyotype result agreed with the result of this embodiment. The chromosome abnormality detection device provided by the invention is therefore reliable and accurate.

Table 1. Chromosome abnormality detection results for cells of known karyotype

Table 2 gives the CV values of the above samples at 1 Mb resolution under the read-count method and the coverage method. The CV value represents the dispersion of the data and reflects how uniformly the sequenced reads are distributed across the reference genome, and hence the quality of amplification uniformity. The CV values obtained with the coverage method are clearly lower, showing that the chromosome abnormality detection device provided by the invention can mitigate poor amplification uniformity.
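The text does not define CV explicitly. Assuming it denotes the coefficient of variation (standard deviation divided by mean) of the per-window values at 1 Mb resolution, it could be computed as in the short Python sketch below. The choice of population standard deviation over sample standard deviation is also an assumption.

```python
from statistics import mean, pstdev


def cv(values):
    """Coefficient of variation: population standard deviation / mean."""
    return pstdev(values) / mean(values)
```

Perfectly uniform window values give a CV of 0, and the value grows as the windows scatter further from their mean.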

Table 2. CV values of all samples at 1 Mb resolution under the two detection methods

From the above samples, T1, T2, T8 and T19 are selected as examples to illustrate the results further.

Fig. 2 shows the distribution of copy-number values for the 24 chromosomes of sample T1 at 1 Mb resolution, where Fig. 2A is the result of the read-count method and Fig. 2B is the result of the coverage method. T1 is a negative sample (46,XX). The distribution of points in Fig. 2A and Fig. 2B shows more intuitively that the chromosome abnormality detection device provided by the invention improves amplification uniformity.

Fig. 3 and Fig. 4 take sample T8 (47,XY,+15) and sample T19 (46,XX,del(8)(pter-p12)) as examples of a chromosome aneuploidy sample and a segmental CNV sample, respectively, and show the distribution of copy-number values for the 24 chromosomes at 1 Mb resolution. Fig. 3A and 4A are the results of the read-count method, and Fig. 3B and 4B are the results of the coverage method. Fig. 3, Fig. 4 and Table 3 together show that the device improves amplification uniformity without affecting the detected values of positive results.

Table 3. Detected values of the CNV regions of samples T2, T8 and T19 under the two detection methods

Fig. 5 shows the distribution of copy-number values for the 24 chromosomes of sample T2 (46,XY) at 1 Mb resolution, where panel A is the result of the read-count method and panel B is the result of the coverage method. The distribution of points in Fig. 5A shows that the amplification uniformity of sample T2 is poor: with the read-count method its CV value at 1 Mb resolution is 0.123, higher than that of the other samples tested, and a false-positive CNV (located in region q11.21 of chromosome 7, about 5 Mb long) was detected. With the coverage-based detection device provided by the invention, the uniformity of sample T2 is effectively improved (Fig. 5B), the CV value at 1 Mb resolution falls to 0.073, and the final result is consistent with the chip karyotype, with no false-positive CNV region. This shows that when amplification uniformity is poor, the conventional read-count method may introduce false-positive results, whereas analysis with the device provided by the invention improves amplification uniformity and reduces the probability of false-positive results.

The above describes a preferred embodiment of the present invention, but the invention is not limited to that embodiment. Those skilled in the art can make various equivalent variations or substitutions without departing from the spirit of the invention, and all such equivalent variations or substitutions fall within the scope defined by the claims of this application.

Claims

1. A chromosome abnormality detection device, comprising:

Alignment unit: for aligning the reads to the human genome reference sequence to obtain the position information and length information of the reads;

Coverage calculation unit: for dividing the human genome reference sequence into a number of first windows; calculating the coverage of each first window from the position information and length information of the reads; performing Loess correction based on the coverage and GC content of the first windows; merging several consecutive first windows into a second window; and calculating the Loess-corrected coverage of the second window and its coverage ratio;

Candidate CNV identification unit: for identifying chromosome breakpoint positions using the circular binary segmentation algorithm, calculating the CBS ratio between adjacent breakpoints, and identifying candidate CNV regions according to a CBS ratio threshold;

False-positive filtering unit: for calculating the significance P-value of the CBS ratio of each candidate CNV region and filtering out false-positive regions according to the P-value, to obtain the CNV regions and karyotype analysis result of the sample under test.

2. The device according to claim 1, characterized in that: coverage = total number of bases covered in a region / length of the region; coverage ratio = coverage of the region / coverage of all autosomes.

3. The device according to claim 1, characterized in that: the CBS ratio is the mean of the coverage ratios of all second windows between adjacent breakpoints identified by the circular binary segmentation algorithm.

4. The device according to claim 1, characterized in that: in the coverage calculation unit, the first window is a non-repeating segment of 10 to 50 kb; preferably, the first window is a non-repeating segment of 20 kb.

5. The device according to claim 1, characterized in that: in the coverage calculation unit, the length of the second window is 0.1 to 2 Mb; preferably, the second window length is selected from 100 kb, 500 kb and 1 Mb.

6. The device according to claim 1, characterized in that: in the candidate CNV identification unit, the CBS ratio threshold is [1.4, 2.6], and a region whose CBS ratio falls outside this range is judged to be a candidate CNV region.

7. The device according to claim 1, characterized in that: in the false-positive filtering unit, calculating the P-value comprises:

building a random sampling database from the results of normal reference samples; drawing from it, at least 100,000 times, simulated CBS segments of the same length as the candidate CNV region to obtain the density distribution of simulated CBS ratio values; and calculating the significance P-value of the CBS ratio of the candidate CNV region.

8. The device according to claim 1, characterized in that: in the false-positive filtering unit, a candidate CNV region with P-value < 0.001 is judged to be a CNV region; otherwise it is filtered out as a false-positive region.

9. The device according to any one of claims 1 to 8, characterized in that: the device further comprises a sequencing unit:

connected to the sequencing data acquisition unit and used for high-throughput sequencing of a library constructed from a sample, the sample including samples that have undergone single-cell amplification, samples that have undergone PCR pre-amplification, and samples without PCR pre-amplification.

10. The device according to any one of claims 1 to 8, characterized in that: the device further comprises a filtering unit:

connected to the alignment unit and used, according to the alignment result, for rejecting reads located at tandem repeat and transposon repeat positions as well as low-quality reads, multi-mapped reads, and reads not fully matched to a chromosome.