WO2018161245A1 - 一种染色体变异的检测方法及装置 - Google Patents

一种染色体变异的检测方法及装置 Download PDF

Info

Publication number
WO2018161245A1
WO2018161245A1 PCT/CN2017/075858 CN2017075858W WO2018161245A1 WO 2018161245 A1 WO2018161245 A1 WO 2018161245A1 CN 2017075858 W CN2017075858 W CN 2017075858W WO 2018161245 A1 WO2018161245 A1 WO 2018161245A1
Authority
WO
WIPO (PCT)
Prior art keywords
window
depth
sample
unit
normal
Prior art date
Application number
PCT/CN2017/075858
Other languages
English (en)
French (fr)
Inventor
庄雪寒
高雅
陈芳
殷旭阳
Original Assignee
深圳华大基因研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大基因研究院 filed Critical 深圳华大基因研究院
Priority to PCT/CN2017/075858 priority Critical patent/WO2018161245A1/zh
Priority to CN201780085820.7A priority patent/CN110268044B/zh
Publication of WO2018161245A1 publication Critical patent/WO2018161245A1/zh

Links

Images

Classifications

    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids

Definitions

  • the invention relates to the field of chromosome detection.
  • Non-invasive prenatal testing is a prenatal screening technique that has emerged in recent years to screen fetal fetuses with 21-trisomy, 18-trisomy, 13-trisomy, etc. during early or mid-pregnancy weeks.
  • the risk of chromosome aneuploidy is based on large-scale parallel sequencing of fetal free DNA in the peripheral blood of pregnant women, and analysis of abnormal growth of DNA sequencing signals on specific chromosomes to estimate the risk of fetal disease.
  • non-invasive prenatal testing Compared with traditional methods such as serological screening and ultrasound detection of fetal neck zona pellucida, non-invasive prenatal testing has extremely high sensitivity (>99%) and extremely low false positive rate ( ⁇ 0.5%), which can reduce unnecessary
  • the number of invasive prenatal diagnosis and the number of missed tests, reducing the rate of birth defects, its clinical effectiveness has been proved by a large number of international and international clinical studies, so it is rapidly applied in clinical.
  • this detection technique has its limitations. First, it only has a good detection effect on the 21-trisomy, 18-trisomy, and 13-trisomy. The second is only for chromosome aneuploidy. Chromosomal abnormalities have a good detection effect. Therefore, this detection technique lacks a good detection effect for other types of chromosomal abnormalities, especially small regional chromosome abnormalities such as copy number variation such as chromosomal deletion. And copy number variation such as chromosomal deletions can lead to serious clinical manifestations such as miscarriage, stillbirth, fetal malformation, neonatal developmental delay, and mental retardation. More than 1% of pregnancies have clinically significant deletions/repetitions. DECIPHER, which contains more than 70 microdeletion/repetition-related syndromes, is important for prenatal testing of chromosome copy number variation.
  • the existing technical solutions for detecting chromosomal variation based on low-depth whole genome data generally include three steps: one is a data correction step, the other is a segmentation step, and the third is a micro-deletion/repetition region step, which are respectively described below.
  • the data correction step it is mainly the correction of the sequence alignment ability and the correction of the GC content.
  • the reference genome sequence is interrupted into sequences identical to the sequence of the sequenced samples, and the sequences are re-aligned back into the reference genome; the whole gene is continuously divided into several sliding or non-sliding, fixed or For an unfixed window, the number of sequences falling in each window is counted, and a reference value for each window sequence matching ability is obtained; this reference value is used to correct the number of sequences of each window of the sample to be tested.
  • Window Depth Correction Counts the GC content of the reference genomic sequence interrupted in all windows, obtaining depth and GC content Direct correlation, and using the regression model, each window depth of the sample to be tested is corrected for its GC content.
  • the above-mentioned corrected data is segmented by the binary segmentation algorithm, and the windows of the same copy number are successively divided into the same segment, so that the micro-deleted/repetitive segments can be separately and continuously divided.
  • the sequence depth of the segment obtained after the segmentation is calculated, and compared with the depth of all the windows of the sample, and the segment having an absolute value greater than 3 is determined as the microdeletion/repetition region by calculating the t value.
  • chromosome copy number variation for small data (for example, 10 Mb or less) is not strong: According to the simulation data, the above detection accuracy of chromosome copy number variation is above 10 Mb, and a high ratio of free nucleic acid is required. (10%) (Chen S, Lau TK, Zhang C, et al. A method for noninvasive detection of fetal large deletions/duplications by low coverage massively parallel sequencing. Prenat Diagn. 2013 Jun; 33(6): 584-90.) For the chromosome copy number variation of less than 10Mb or lower, the detection rate of the free nucleic acid ratio is greatly reduced.
  • the present invention provides a method for chromosomal mutation detection based on low depth whole genome data.
  • the present invention provides a method for detecting chromosomal variation, comprising:
  • sequencing the sample to be tested containing the nucleic acid to obtain a sequencing result composed of several sequencing data
  • the sample to be tested is peripheral blood.
  • the peripheral blood is peripheral blood from a pregnant woman.
  • the sequencing is high throughput sequencing.
  • the nucleic acid is DNA.
  • the copy number variation is a microdeletion, a microrepetition or a combination thereof.
  • the normal data set is established using sequencing data for several normal samples.
  • the establishing the normal data set using the sequencing data of several normal samples comprises:
  • the reference genome is successively divided into a plurality of fourth windows, a matrix is established according to the depth of each of the fourth windows, and the depth of each fourth window is corrected according to the matrix.
  • step (0-1) comprises:
  • (0-1-1) interrupting the reference genome into a number of reads of the same length, and then comparing the reads to the reference genome;
  • step (0-2) comprises:
  • the GC content and depth of the second window of all normal samples are counted, and the correlation between the overall GC content of all normal samples and the window depth is obtained;
  • the inter-sample correction is performed on the depth of the second window by using a regression model according to the correlation and the GC content of the second window.
  • the regression model of step (0-2-2) is a LOESS regression model.
  • step (0-3) comprises:
  • the CV value of the third window of each of the same positions is equal to the variance of the third window depth divided by an average value.
  • step (0-4) comprises:
  • step (2) comprises:
  • step (2-3) reading each third window depth of the sample to be tested corrected by step (2-3), and comparing the average depth value of the third window of the normal sample to each third window of the sample to be tested Depth correction;
  • the second window, the third window, and the fourth window described in steps 2-2, 2-3, 2-4, and 2-5 are all obtained by successively dividing according to the reference genome.
  • the second window, the third window, and the fourth window divided by the normal data set construction can be directly used without re-dividing the window.
  • the regression model of step (2-2) is a LOESS regression model.
  • step (2-5) comprises:
  • (2-5-1) establishing a matrix according to each fourth window depth of the normal sample, and performing principal component analysis on the matrix to obtain a feature vector matrix of the matrix;
  • step (3) comprises:
  • a fragment having an absolute value of z value greater than a predetermined value is marked as a potential copy number variation fragment.
  • the predetermined value is 3.
  • step (4) comprises:
  • (4-1) calculating, for each of said potential copy number variant fragments, a log-to-number ratio of said potential copy number variation fragment and a log ratio of a chromosome of said potential copy number variation fragment;
  • the predetermined value is zero.
  • the present invention provides a device for detecting chromosomal variation, comprising:
  • sample sequencing unit to be tested, the sample sequencing unit to be used for sequencing a sample to be tested containing nucleic acid, and obtaining a sequencing result composed of several sequencing data;
  • sample correcting unit to be tested is connected to the sample sequencing unit to be tested, and used to correct the sequencing result using a normal data set;
  • the dividing unit is connected to the sample correcting unit to be tested, and is configured to segment the corrected sequencing result to obtain a plurality of data segments;
  • the detecting unit is connected to the dividing unit, and configured to detect whether the plurality of data segments are copy number variation segments.
  • a normal data set construction unit the normal data set construction unit being connected to the sample correction unit to be tested for establishing a normal data set with the sequencing data of the plurality of normal samples.
  • the sample to be tested is peripheral blood.
  • the peripheral blood is peripheral blood from a pregnant woman.
  • the sequencing is high throughput sequencing.
  • the nucleic acid is DNA.
  • the copy number variation is a microdeletion, a microrepetition or a combination thereof.
  • the normal data set construction unit comprises:
  • a reference gene contrast capability determining unit configured to divide the reference genome into a plurality of first windows, and determine a comparison capability value of each first window
  • the normal sample correlation unit is connected to the reference gene comparison capability determining unit, configured to divide the reference genome into a plurality of second windows, and determine a GC content and a second window depth in each normal sample. Correlation, for each of the second windows, performing intra-sample and inter-sample correction on a depth of the second window by using a GC content of the second window;
  • the group area correcting unit is connected to the normal sample correlation unit, and is configured to divide the reference genome into a plurality of third windows according to an average depth value of the third window at the same position between the normal samples. Correction of the population area by the depth of each third window;
  • the matrix unit is connected to the group area correction unit, configured to divide the reference genome into a plurality of fourth windows, and establish a matrix according to depths of the fourth windows, and fourth according to the matrix The depth of the window is corrected.
  • the reference gene comparison capability determining unit comprises:
  • Interrupting unit for interrupting the reference genome into a plurality of reads of the same length, and comparing the reads to the reference genome
  • first window unit the first window unit being connected to the interrupting unit, configured to divide the reference genome into a plurality of the first windows, wherein a length of the first window is greater than a length of the read length;
  • the first deleting unit is connected to the first window unit, configured to count the number of readings located in each first window, and delete the number of readings less than a predetermined number of first windows And/or, calculating a ratio of the repeating regions in each of the first windows, and deleting the first window in which the proportion of the repeating regions is greater than a predetermined ratio;
  • a first comparison capability correction unit wherein the first comparison capability correction unit is connected to the first deletion unit, configured to calculate the first deletion that is not deleted for each first window that is not deleted in the reference genome The average number of reads of the window, and the average number of reads is divided by the number of reads of each of the first windows that are not deleted, to obtain the comparison capability values of the first windows that are not deleted, respectively.
  • the normal sample correlation unit comprises:
  • a second comparison capability correction unit configured to compare the sequencing data of the normal samples into the reference genome, and perform correction of the comparison capability values on the reads of the normal samples
  • a normal intra-sample window depth correction unit connected to the second comparison capability correction unit for continuously dividing the reference genome into a plurality of the second windows, for each normal sample, counting a depth of each second window and a GC content, obtaining a correlation between the GC content in each normal sample and the window depth; and using the regression model to the second according to the correlation and the GC content of the second window The depth of the window is corrected within the sample;
  • a normal inter-sample overall window depth correction unit wherein the normal inter-sample overall window depth correction unit is connected to the normal intra-sample window depth correction unit for counting all normal samples for all normal samples after performing intra-sample window depth correction
  • the GC content and depth of the second window obtaining the correlation between the overall GC content of all normal samples and the window depth; and for each of the second windows, according to the correlation and the GC content of the second window Using the regression model, the depth of the second window is corrected between samples.
  • the group area correction unit comprises:
  • a second deleting unit configured to divide the reference genome into a plurality of the third windows, calculate an average value and a variance of the third window depth of each same position of all normal samples, and calculate each same position
  • the CV value of the third window is deleted, the third window having the CV value greater than a predetermined value is deleted, wherein a CV value of the third window of each of the same positions is equal to a variance of the third window depth Divide by the average;
  • a first depth correction unit configured to calculate an average depth value of all the third windows that are not deleted, and use the average depth value to each The depth of the deleted third window is corrected.
  • the matrix unit comprises:
  • a first principal component analyzing unit configured to divide the reference genome into a plurality of the fourth windows, establish a matrix according to depths of the fourth windows, and perform principal component analysis on the matrix to obtain the a eigenvector matrix of the matrix;
  • the second depth correction unit being coupled to the first principal component analysis unit for each
  • the normal sample is subjected to principal component analysis, and a predetermined number of principal components of each normal sample are deleted, and then multiplied by an inverse matrix of the feature vector matrix to obtain a depth of each of the windows corrected by the principal component analysis.
  • the sample correcting unit to be tested includes:
  • a third comparison capability correction unit configured to compare the sequencing data of the sample to be tested into a reference genome, and perform correction of a comparison capability value on each read segment of the sample to be tested;
  • the window depth correction unit in the sample to be tested is connected to the third comparison capability correction unit, and configured to count the depth of each of the second windows and the GC content, and obtain the Measuring a correlation between a GC content in the sample and a window depth; and for each of the second windows, using a regression model to determine the depth of the second window based on the correlation and the GC content of the second window Perform calibration within the sample;
  • An inter-sample correction unit which is connected to the window depth correction unit in the sample to be tested, and is configured to use the regression model to measure the sample according to the correlation between the GC content and the second window depth in the normal sample. Determining the inter-samples between the second window depths corrected by the window depth correction unit in the sample to be tested;
  • the third depth correction unit is connected to the inter-sample correction unit, configured to read each third window depth of the sample to be tested corrected by the inter-sample correction unit, according to the third of the normal sample The average depth value of the window is corrected for the depth of each third window of the sample to be tested;
  • the fourth depth correction unit is connected to the third depth correction unit, configured to read each fourth window depth of the sample to be tested corrected by the third depth correction unit, according to each normal sample
  • the matrix established by the depth of the fourth window corrects the depth of each fourth window of the sample to be tested.
  • the regression model of the inter-sample correction unit is a LOESS regression model.
  • the fourth depth correction unit comprises:
  • a matrix establishing unit configured to establish a matrix according to a depth of each fourth window of the normal sample, and perform principal component analysis on the matrix to obtain a feature vector matrix of the matrix
  • the principal component correction depth unit is connected to the matrix establishment unit, and configured to read each fourth window depth of the sample to be tested corrected by the third depth correction unit, and the sample to be tested Multiplying the depth of each fourth window by the eigenvector matrix, obtaining the principal component of the sample to be tested, deleting the pre-preset number of principal components of the sample to be tested, and multiplying the inverse matrix of the eigenvector matrix to obtain
  • the principal component analyzes the depth of each window after correction.
  • the dividing unit comprises:
  • the same copy number unit is configured to segment the sequencing result corrected by the sample correcting unit to be tested to obtain a plurality of segments having the same copy number;
  • a potential copy number variation fragment labeling unit the potential copy number variation fragment labeling unit being coupled to the z value calculation unit for marking a fragment having an absolute value of the z value greater than a predetermined value as a potential copy number variation fragment.
  • the predetermined value is 3.
  • the detecting unit comprises:
  • a logarithmic generation ratio calculation unit for calculating the potential copy number variation fragment for each potential copy number variation fragment Log-ratio ratio and the logarithm of the chromosome of the potential copy number variant fragment;
  • a copy number variation fragment determining unit configured to mark the potential copy number variation fragment as a copy number when a logarithm of a potential copy number variation fragment occurs less than a predetermined value and a logarithm generation ratio of the chromosome is greater than a predetermined value Variation fragment.
  • the predetermined value is zero.
  • the method and apparatus for detecting chromosomal variation since a normal data set is established by using a plurality of normal samples, and the sequencing data of the sample to be tested is corrected using the normal data set, the probability of chromosomal abnormality detection is reduced, and the probability is reduced. False positives and false negatives have higher detection accuracy for chromosome non-integral and chromosome copy number variation, and can detect smaller chromosome copy number variation at low fetal depth.
  • FIG. 1 is a schematic flow chart of a method for detecting chromosomal variation according to an embodiment of the present application
  • step S100 is a schematic flow chart of step S100 in the method for detecting chromosomal variation according to an embodiment of the present application
  • step S110 in the method for detecting chromosomal variation according to an embodiment of the present application
  • step S130 is a schematic flow chart of step S130 in the method for detecting chromosomal variation according to an embodiment of the present application
  • step S150 is a schematic flow chart of step S150 in the method for detecting chromosomal variation according to an embodiment of the present application
  • step S170 is a schematic flow chart of step S170 in the method for detecting chromosomal variation according to an embodiment of the present application
  • step S300 is a schematic flow chart of step S300 in the method for detecting chromosomal variation according to an embodiment of the present application
  • FIG. 8 is a schematic flowchart of step S390 in the method for detecting chromosomal variation according to an embodiment of the present application
  • step S500 is a schematic flow chart of step S500 in the method for detecting chromosomal variation according to an embodiment of the present application.
  • step S700 is a schematic flow chart of step S700 in the method for detecting chromosomal variation according to an embodiment of the present application
  • FIG. 11 is a schematic structural diagram of a device for detecting chromosomal variation according to an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a normal data set construction unit in a chromosomal variation detecting apparatus according to an embodiment of the present application
  • FIG. 13 is a schematic structural diagram of a reference gene comparison capability determining unit in a chromosomal variation detecting apparatus according to an embodiment of the present application
  • FIG. 14 is a schematic structural diagram of a normal sample correlation unit in a chromosomal variation detecting apparatus according to an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of a population area correcting unit in a chromosomal variation detecting apparatus according to an embodiment of the present application.
  • 16 is a schematic structural diagram of a matrix unit in a chromosomal variation detecting apparatus according to an embodiment of the present application.
  • FIG. 17 is a schematic structural diagram of a sample correcting unit to be tested in a chromosomal variation detecting apparatus according to an embodiment of the present application.
  • FIG. 18 is a schematic structural diagram of a fourth depth correction unit in a chromosomal variation detecting apparatus according to an embodiment of the present application.
  • FIG. 19 is a schematic structural diagram of a dividing unit in a chromosomal variation detecting apparatus according to an embodiment of the present application.
  • FIG. 20 is a schematic structural diagram of a detecting unit in a chromosomal variation detecting apparatus according to an embodiment of the present application
  • 21 is an image curve of a logarithmic ratio of chromosomes of a sample to be tested in an example
  • Figure 22 is an image curve of the logarithmic ratio of chromosome 9 in an example
  • Figure 23 is an image curve of the logarithmic ratio of chromosome 21 in an example
  • Figure 24 is an image curve of the logarithmic ratio of chromosome 18 in an example
  • Fig. 25 is an image curve showing the logarithmic ratio of chromosome 10 in an example.
  • the present invention overcomes the deficiencies in the existing data correction methods, and reduces the probability of chromosomal abnormality detection caused by using the same batch of samples for comparison in the prior art; False-positive and false-negative results of detection results caused by bias; resolution of simultaneous detection of non-holographic detection of chromosomes (including autosomal abnormalities and sex chromosome abnormalities) and detection of chromosome copy number variation; improvement of chromosome aneuploidy The detection effect of the chimera; the detection effect of chromosome copy number variation below 10 Mb and at low free nucleic acid ratio; the bias of the data and the false positive and false negative rates of the resulting test results are reduced.
  • the sample of the present invention is a biological sample containing nucleic acid.
  • the normal sample of the present invention is a sample which is normal in karyotype by amniocentesis or chorionic villus sampling, and is determined by the prior art to have no chromosome number variation and copy number variation.
  • the reads of the present invention are nucleic acid sequencing sequences obtained by one reaction in high throughput sequencing, also referred to as reads.
  • the window of the present invention is a number of segments having fixed size values that are divided on the reference genome as needed. For example, a 500 bp window, a 2 kbp window, and the like.
  • the window depth of the present invention is the number of reads compared to the window multiplied by the length of the read, divided by the length of the window.
  • the above formula for calculating the depth of the window can be preset in the computer, and the window depth value can be directly obtained according to the calculation formula during the statistics.
  • the window of the same position as described in the present invention is a window in which the different samples are aligned to the same segment on the reference genome.
  • Fragments of the invention are nucleic acid sequences of varying lengths on a chromosome.
  • the segment depth of the present invention is the number of reads in the segment multiplied by the length of the segment and divided by the length of the segment.
  • the above formula for calculating the depth of the segment may be preset in a computer, and the segment depth value may be directly obtained according to the calculation formula in statistics.
  • the repeat region of the present invention is a region in which a tandem repeat sequence exists in a nucleic acid sequence.
  • the correction within the sample of the invention is a correction of all nucleic acid sequencing data within a sample.
  • correction between samples is the correction of all nucleic acid sequencing data between different samples.
  • correction of population regions of the invention is the correction of nucleic acid sequencing data for population samples on the same reference genome segment.
  • the normal data set of the present invention is a collection of nucleic acid sequencing data for samples without chromosome number variation and copy number variation.
  • the embodiment discloses a method for detecting chromosomal variation, which includes a sample sequencing step S000 to be tested, a sample sequencing data correction step S300 to be tested, a segmentation step S500, and a detection step S700. The details are described below.
  • Step S000 Sequencing the sample to be tested containing the nucleic acid to obtain a sequencing result composed of several sequencing data.
  • the sample to be tested is peripheral blood from a pregnant woman.
  • the nucleic acid is DNA.
  • the sequencing is a second generation high throughput sequencing, such as using the BGISEQ-50 sequencing platform.
  • the normal data set may be constructed in advance before the sample to be tested is sequenced in step S000. Preferably, it is preset in the computer system, and can be directly called when used; the normal data set may also be constructed after the sample to be tested is sequenced in step S000.
  • the method for detecting chromosomal variation may further include step S100, which is specifically described below.
  • Step S100 Establish a normal data set using nucleic acid sequencing data of several normal samples.
  • nucleic acid sequencing data for 200 normal samples can be used to establish a normal data set.
  • step S100 includes steps S110-S170.
  • Step S110 The reference genome is successively divided into a plurality of first windows having a fixed length, and the comparison capability values of the respective first windows are determined. Specifically, referring to FIG. 3, in an embodiment, step S110 may include steps S111-S117.
  • Step S111 Breaking the reference genome into a plurality of reads having the same length and comparing the reads back to the reference genome. Different read lengths are selected according to different sequencing platforms. The length of the read is usually 25-200 bp. For example, the reference genome is broken into reads of 35 bp size and these reads are compared back to the reference genome.
  • Step S113 The above reference genome is successively divided into a plurality of first windows having a fixed length, wherein the length of the first window is greater than the length of the read segment.
  • each window is 500 bp in length, that is, the reference genome is continuously divided into a number of 500 bp non-overlapping first windows.
  • Step S115 Counting the number of read segments located in each first window, and deleting the number of read segments smaller than the predetermined number of first windows; and/or calculating the proportion of the repeated regions in each first window, and repeating the regions
  • the first window with a ratio greater than a predetermined ratio is deleted.
  • the predetermined number is usually a value obtained by multiplying the number of normal samples by 0.01.
  • Step S117 Calculate, for each first window that is not deleted in the reference genome, the average number of reads of all the first windows that are not deleted, and divide the average number of read segments by the reading of each of the first windows that are not deleted. The number of segments to obtain the comparison capability value (ie, the ratio value) of each of the first windows that are not deleted.
  • Step S130 The reference genome is continuously divided into a plurality of second windows having a fixed length, and the correlation between the GC content in each normal sample and the second window depth is determined. For each second window, the GC content of the second window is utilized. The depth of the second window is corrected between the sample and the sample.
  • step S130 may include steps S131-S135.
  • Step S131 Aligning the sequencing data of each of the normal samples into the reference genome, and correcting the reading ability values of the readings of the normal samples. For example, the sequencing data of 200 normal samples are aligned into the reference genome for correction of the comparison ability value. In one embodiment, the correction of the comparison ability value may be assigned to each read of the normal sample. The comparison ability value of the reference genome corresponding to the window.
  • Step S133 The reference genome is continuously divided into a plurality of second windows having a fixed length, and for each normal sample, the depth of each of the second windows and the GC content are counted, and the GC content and the window depth in each normal sample are obtained. Correlation; and for each second window, based on the correlation and the GC content of the second window, the depth of the second window is corrected within the sample using a regression model.
  • the reference genome is continuously divided into a number of non-overlapping 500kbp lengths.
  • a second window that counts the depth and CG content of each second window of each normal sample to obtain a correlation between GC content and depth in each normal sample; using a LOESS regression model, based on each second window
  • the GC content is related to the correlation, and the depth of each second window is corrected within the sample; in an embodiment, the depth of each second window is corrected in the sample in step S133, that is, the corrected depth is equal to The depth before correction is divided by the correction factor.
  • the correction coefficient is derived from the correlation between the GC content and the depth in each normal sample by the LOESS regression model.
  • Step S135 For all normal samples after intra-sample correction, the GC content and depth of the second window of all normal samples are counted, and the correlation between the overall GC content and the window depth of all normal samples is obtained;
  • the second window performs the inter-sample correction on the depth of the second window by using the regression model according to the correlation and the GC content of the second window.
  • the GC content and the depth of all the second windows corrected by the step S133 are determined by the 200 normal samples, and the correlation between the GC content and the depth of the 200 normal samples is obtained; and the LOESS regression model is used again for each The depth of each second window of the sample is corrected between samples.
  • the depth between each of the second windows is corrected in the step S135, that is, the corrected depth is equal to the depth before the correction divided by the correction coefficient, wherein the correction coefficient is determined by the LOESS regression model for 200 normal samples.
  • the correlation between the overall GC content and depth is derived.
  • Step S150 The reference genome is successively divided into a plurality of third windows having a fixed length, and the depths of the third windows are corrected according to the average depth values of the third windows.
  • step S150 may include steps S151 and S153.
  • Step S151 The reference genome is successively divided into a plurality of third windows having a fixed length, the average value and the variance of the third window depths of the same positions of all normal samples are counted, and the third window of each of the same positions of all normal samples is calculated.
  • the CV value is deleted by a third window having a CV value greater than a predetermined value, wherein a CV value of the third window of each same position is equal to the window depth variance divided by the window average depth value.
  • the reference genome is successively divided into a plurality of non-overlapping third windows of length 100 kbp, and the average and variance of the third window depth of each of the same positions of the 200 normal samples are counted, thereby obtaining the CV of each third window.
  • a CV value of any one of the third windows is equal to a variance of a third window depth of the same location in the 200 normal samples divided by an average depth value of the window; and a third value of the CV value greater than a predetermined value (eg, 0.25)
  • a predetermined value eg, 0.25
  • Step S153 Correcting the depth of each of the third windows that are not deleted by using all the third window depth average values that are not deleted.
  • the depth of any third window is corrected in step S153, and the third window average depth value of the same position is divided by the depth of the third window to obtain the corrected depth of the third window. .
  • Step S170 The reference genome is successively divided into a plurality of fourth windows having a fixed length, and a matrix is established according to the depth of each fourth window, and the depth of each fourth window is corrected according to the matrix.
  • step S170 may include steps S171 and S173.
  • Step S171 The reference genome is successively divided into a plurality of fourth windows having a fixed length, a matrix is established according to the depth of each fourth window, and principal component analysis is performed on the matrix to obtain a feature vector matrix of the matrix.
  • the reference genome is successively divided into a plurality of non-overlapping fourth windows having a length of 500 kbp, and a principal component analysis is performed on a matrix composed of the depths corrected by the step S153 for each of the 200 normal samples, that is, the calculation is obtained.
  • Eigenvector matrix Eigenvector matrix.
  • Step S173 Perform principal component analysis on each normal sample, and delete a preset number of main components of each normal sample. Divide and multiply the inverse matrix of the eigenvector matrix to obtain the depth of each window corrected by the principal component analysis. For example, after performing principal component analysis on each normal sample, the top ten principal components are deleted, which can remove many influencing factors, including bias between different batches of samples, different environments from sample sources, and others. Noise, etc.; after that, a depth file for each fourth window corrected by PCA (Principal Component Analysis) can be obtained.
  • PCA Principal Component Analysis
  • Step S300 Correcting the sample to be tested using the normal data set described above.
  • step S300 employs a 5-step correction, which includes steps S310-S390.
  • Step S310 Compare the sequencing data of the sample to be tested into the reference genome, and perform correction of the comparison ability value for each read segment of the sample to be tested.
  • the correction of the comparison ability value may be that each read of the sample to be tested is given a comparison capability value of the corresponding reference genome of the reference genome.
  • Step S330 Statistics the depth of each second window and the GC content, and obtain the correlation between the GC content in the sample to be tested and the window depth; and for each second window, use the regression according to the correlation and the GC content of the window.
  • the model performs a correction within the sample for the depth of the second window.
  • Step S330 is used to perform intra-sample correction on the second window depth of the sample to be tested.
  • step S330 may be: using a second window of 500 kbp, and calculating the depth of all second windows in the genome-wide range of the sample to be tested and its GC. The content is obtained for correlation; the LOESS regression model is used to correct the depth of each second window with the correlation.
  • Step S350 Perform correction between the samples according to the correlation between the GC content of the normal sample and the second window depth by using the regression model to the second window depth corrected by the step S330 of the sample to be tested.
  • Step S350 is to perform inter-sample correction for the sample to be tested.
  • step S350 may be: an overall window depth and GC content correlation file obtained by using 200 normal sample data, and the corrected sample is subjected to step S330. Each second window depth is corrected between samples, and the LOESS regression model is still used.
  • Step S370 Read each third window depth of the sample to be tested corrected in step S350, and correct each third window depth of the sample to be tested according to the average depth value of the third window of the normal sample. For example, an area information file having a stable depth obtained by using 200 normal sample data, and a correction of each third window depth corrected by step S350 of the sample to be tested, that is, each of the normal samples obtained in step S153 is not deleted.
  • the average depth of the third window is divided by the depth of each corresponding third window after the sample to be tested is corrected by step S350, and the depth of each third window corresponding to the corrected sample to be tested is obtained.
  • Step S390 Read each fourth window depth of the sample to be tested corrected in step S370, and correct each fourth window depth of the matrix to be tested according to the depth of the fourth window of each normal sample.
  • step S390 may include steps S391 and S393.
  • Step S391 The reference genome is successively divided into a plurality of non-overlapping fourth windows having a fixed length, a matrix is established according to the depth of each fourth window, and principal component analysis is performed on the matrix to obtain a feature vector matrix of the matrix.
  • step S171 is present, step S391 can be omitted.
  • Step S393 reading each fourth window depth of the sample to be tested corrected in step S370, multiplying the depth of each fourth window of the sample to be tested by the feature vector matrix of the matrix, and obtaining the principal component of the sample to be tested, which is to be The pre-preset number of principal components of the sample is deleted, and then multiplied by the inverse matrix of the eigenvector matrix to obtain the depth of each fourth window corrected by the principal component analysis of the sample to be tested.
  • Step S500 Segmenting the corrected sequencing data of the sample to be tested to obtain a plurality of data segments.
  • step S500 includes steps S510-S550.
  • Step S510 Segmenting the sequenced data of the sample to be tested corrected in step S393 to obtain a plurality of segments having the same copy number.
  • a binary segmentation algorithm is used (for a specific procedure, please refer to Olshen AB, Venkatraman ES, Lucito R, Wigler M (2004) Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5: 557-572.)
  • the sample data to be tested corrected in step S393 is segmented to obtain segments having the same copy number.
  • Step S550 Marking a segment whose absolute value of the z value is greater than a predetermined value as a potential copy number variation segment.
  • Step S700 Detect whether each data segment is a copy number variation segment.
  • step S700 includes steps S710 and S730.
  • Step S710 Calculate, for each potential copy number variation segment, a log-to-number ratio of the potential copy number variation segment and a logarithm ratio of the chromosome of the potential copy number variation segment.
  • Step S730 When the logarithm of a potential copy number variation segment occurs less than a predetermined value and the logarithm generation ratio of the chromosome in which it is located is greater than a predetermined value, the potential copy number variation segment is marked as a copy number variation segment.
  • the copy number variation (CNVs) fragment is a microdeletion fragment, or a microrepeat fragment, or a combination thereof.
  • LOG Odds RATIO use the statistical method of LOG Odds RATIO to test whether the potential copy number variation fragment is true or false: calculate the logarithmic occurrence value (LOG Odds RATIO value) of each potential copy number variation fragment, and calculate the chromosome of the fragment.
  • Logarithmic occurrence value (LOG Odds RATIO value) when the LOG Odds RATIO value of the chromosome is greater than 0, and the LOG Odds RATIO value of the fragment is less than 0, the potential copy number variation fragment is considered to be a copy number variation fragment.
  • the logarithmic hair growth value is calculated as follows:
  • f is the ratio of free nucleic acid of the sample to be tested, and the ratio of free nucleic acid is calculated according to the method disclosed in the patent "Method for determining the ratio of free nucleic acid in biological sample, device and its use" (Application No.: PCT/CN2015/085109);
  • the z value is calculated by referring to the z value calculation formula disclosed in the above step S530, wherein when the "log ratio of the chromosome of the potential copy number variation fragment is calculated" is calculated, the chromosome is regarded as a fragment in the z value calculation formula.
  • Z,f) are the posterior probabilities of CNVs and normal regions under a certain Z value and free nucleic acid ratio, respectively.
  • P(affected) and P(euploid) are prior probabilities that the fragment is a CNV or a normal region, respectively.
  • euploid, f) are conditional probabilities that the fragment is a CNV or a normal region at a certain free nucleic acid ratio.
  • the present application also discloses a detecting device for chromosomal variation.
  • the sample sequencing unit 000 to be tested the sample correcting unit 300 to be tested, the dividing unit 500 and the detecting unit 700 are included.
  • the sample to be tested sequencing unit 000 is used to sequence the sample to be tested containing the nucleic acid to obtain a sequencing result composed of several sequencing data.
  • the sample to be tested is peripheral blood from a pregnant woman.
  • the nucleic acid is DNA.
  • the sequencing is a second generation high throughput sequencing, such as using the BGISEQ-50 sequencing platform.
  • the normal data set may be the sample to be tested in the sample sequencing unit to be tested. It has been constructed in advance and preset in the computer system, and can be directly called when used; the normal data set can also be constructed after the sample to be tested by the sample sequencing unit 000 is finished sequencing.
  • the detection device for chromosomal variation may further include a normal data set construction unit 100, which is specifically described below.
  • the normal data set construction unit 100 is for establishing a normal data set with nucleic acid sequencing data for several normal samples.
  • the normal data set construction unit 100 can use the nucleic acid sequencing data of 200 normal samples to establish a normal data set.
  • the normal data set construction unit 100 includes a reference gene comparison capability determination unit 110, a normal sample correlation unit 130, a population region correction unit 150, and a matrix unit 170.
  • the reference gene comparison ability determining unit 110 is configured to successively divide the reference genome into a plurality of first windows having a fixed length, and determine the matching ability values of the respective first windows. Specifically, referring to FIG. 13, in an embodiment, the reference gene contrast capability determining unit 110 may interrupt the unit 111, the first window unit 113, the first deleting unit 115, and the first matching capability correcting unit 117.
  • the interrupting unit 111 is configured to break the reference genome into a plurality of reads having the same length, and then compare the reads back to the reference genome. Different read lengths are selected according to different sequencing platforms. The length of the read is usually 25-200 bp. For example, the disruption unit 111 breaks the reference genome into reads of 35 bp size and compares the reads back to the reference genome.
  • the first window unit 113 is connected to the interrupting unit 111, and configured to divide the reference genome into a plurality of the first windows having a predetermined length, wherein the length of the first window is greater than Read the length of the segment.
  • each window in the first window unit 113 has a length of 500 bp, that is, the reference genome is continuously divided into a number of 500 bp non-overlapping windows.
  • the first deleting unit 115 is connected to the first window unit 113, for counting the number of readings located in each first window, and deleting the number of readings less than a predetermined number of first windows And/or, calculating the ratio of the repeating regions in each of the first windows, and deleting the first window in which the ratio of the repeating regions is greater than a predetermined ratio (for example, 20%).
  • the predetermined number is usually a value obtained by multiplying the number of normal samples by 0.01.
  • the first comparison capability correction unit 117 is connected to the first deletion unit 115, and configured to calculate all the first windows that are not deleted for each of the first windows that are not deleted in the reference genome.
  • the average number of read segments is divided by the number of read segments of each of the first windows that are not deleted, respectively, to obtain the comparison capability values (ie, ratio values) of the first windows that are not deleted.
  • the normal sample correlation unit 130 is connected to the reference gene comparison capability determining unit 110 for continuously dividing the reference genome into a plurality of second windows having a fixed length to determine the GC content and the second in each normal sample. Correlation of window depth, for each second window, utilizing the GC content of the second window for the second window Depth correction between the sample and the sample.
  • the normal sample correlation unit 130 may include a second comparison capability correction unit 131, a normal intra-sample window depth correction unit 133, and a normal inter-sample overall window depth correction unit 135.
  • the second comparison capability correcting unit 131 is configured to compare the sequencing data of each of the normal samples to the reference genome, and perform correction of the comparison capability value for the read of each normal sample.
  • the second comparison capability correction unit 131 compares the sequencing data of 200 normal samples into the reference genome for correction of the comparison ability value.
  • the correction of the comparison ability value may be normal.
  • Each read of the sample gives the alignment ability value of the corresponding window of the reference genome in which it is located.
  • a normal intra-sample window depth correction unit 133 which is connected to the second comparison capability correction unit 131 for continuously dividing the reference genome into a plurality of the second windows, for each normal sample, The depth of each of the second windows and the GC content are counted, and the correlation between the GC content in each normal sample and the window depth is obtained; and for each second window, based on the correlation and the GC content of the second window, The regression model performs intra-sample corrections on the depth of the second window.
  • the normal intra-sample window depth correcting unit 133 divides the reference genome into a plurality of non-overlapping second windows having a length of 500 kbp, and counts the depth and CG content of each second window of each normal sample, thereby obtaining each normal. Correlation between GC content and depth in the sample; using the LOESS regression model, the intra-sample correction is performed on the depth of each second window according to the GC content of each second window and the correlation; in an embodiment
  • the normal intra-sample window depth correcting unit 133 performs intra-sample correction on the depth of each second window, that is, the corrected depth is equal to the depth before the correction divided by the correction coefficient, and the correction coefficient is determined by the LOESS regression model for each normal sample.
  • the correlation between GC content and depth is derived by regression.
  • the normal inter-sample overall window depth correction unit 135 is connected to the normal intra-sample window depth correction unit 133 for counting all normal samples for all normal samples after performing intra-sample window depth correction.
  • the GC content and depth of the second window the correlation between the overall GC content of all normal samples and the window depth is obtained; and for each of the second windows, based on the correlation and the GC content of the second window, The depth of the second window is corrected between samples using a regression model.
  • the normal inter-sample overall window depth correcting unit 135 counts the GC content and depth of all the second windows corrected by the 200 normal samples by the window depth correcting unit 133, and obtains the correlation between the GC content and the depth of the 200 normal samples as a whole.
  • the inter-sample correction is performed for the depth of each second window of each sample.
  • the normal inter-sample overall window depth correction unit 135 performs inter-sample correction on the depth of each second window, that is, the corrected depth is equal to the depth before the correction divided by the correction coefficient, wherein the correction coefficient is determined by LOESS.
  • the regression model was derived by regressing the correlation between GC content and depth of 200 normal samples.
  • the group area correcting unit 150 is configured to divide the reference genome into a plurality of third windows having a fixed length, and correct the depth of each third window according to the average depth value of each third window.
  • the group area correction unit 150 may include a second deletion unit 151 and a first depth correction unit 153.
  • the second deleting unit 151 is configured to continuously divide the reference genome into a plurality of the third windows having a fixed length, calculate an average value and a variance of the third window depths of the same positions of all the normal samples, and calculate all the normal samples.
  • the CV value of the third window of each identical position is deleted by a third window having a CV value greater than a predetermined value, wherein a CV value of the third window of each same position is equal to the window depth variance divided by the window average depth value.
  • the second deletion unit 151 will reference base Since the group is continuously divided into a plurality of non-overlapping third windows having a length of 100 kbp, the average value and the variance of the third window depth of each of the same positions of the 200 normal samples are counted, thereby obtaining the CV value of each third window, wherein The CV value of any one of the third windows is equal to the variance of the third window depth of the same position in the 200 normal samples divided by the window average; the third window having a CV value greater than a predetermined value (eg, 0.25) is deleted because This shows that the third window is highly volatile and unstable.
  • a predetermined value eg, 0.25
  • the first depth correction unit 153 is configured to correct the depth of each of the third windows that are not deleted, using the average depth values of all the third windows that are not deleted.
  • the depth of any third window is corrected in the first depth correction unit 153, and the same position third window average depth value is divided by the depth of the third window to obtain the third window. The corrected depth.
  • the matrix unit 170 is configured to divide the reference genome into a plurality of fourth windows having a fixed length, and establish a matrix according to the depth of each fourth window, and correct the depth of each fourth window according to the matrix.
  • the matrix unit 170 may include a first principal component analyzing unit 171 and a second depth correcting unit 173.
  • the first principal component analyzing unit 171 is configured to divide the reference genome into a plurality of the fourth windows, establish a matrix according to the depth of each fourth window, and perform principal component analysis on the matrix to obtain a feature vector of the matrix. matrix. For example, the first principal component analyzing unit 171 successively divides the above-mentioned reference genome into a plurality of non-overlapping fourth windows having a length of 500 kbp, and constructs a depth corrected by the first depth correcting unit 153 for each of the 200 normal samples.
  • the matrix is subjected to principal component analysis, that is, the eigenvector matrix is obtained by calculation.
  • the second depth correction unit 173 is configured to perform principal component analysis on each normal sample, delete a preset number of principal components of each normal sample, and multiply the inverse matrix of the eigenvector matrix to obtain principal component analysis correction. The depth of each window afterwards. For example, the second depth correcting unit 173 deletes the first ten principal components after performing principal component analysis on each normal sample, so that many influencing factors can be removed, including the bias between different batch samples, and the sample source. Different environments, and other noises; after that, the depth file of each fourth window corrected by PCA (Principal Component Analysis) can be obtained.
  • PCA Principal Component Analysis
  • the sample-correcting unit 300 to be tested is used to correct the sample to be tested using the normal data set described above.
  • the sample correcting unit 300 to be tested adopts a 5-unit correction, which includes a third comparing capability correcting unit 310, an intra-sample in-process window depth correcting unit 330, and an inter-sample correcting unit 350.
  • the three depth correction unit 370 and the fourth depth correction unit 390 are included in the sample correcting unit 300 to be tested.
  • the third comparison capability correction unit 310 is configured to compare the sequencing data of the sample to be tested into the reference genome, and perform correction of the comparison capability value for each read segment of the sample to be tested.
  • the correction of the comparison ability value may be that each read of the sample to be tested is given a comparison capability value of the corresponding reference genome of the reference genome.
  • the intra-sample in-sample window depth correction unit 330 is configured to calculate the depth of each second window and the GC content, and obtain the correlation between the GC content in the sample to be tested and the window depth; and for each second window, according to the correlation and The GC content of the window is corrected for the depth of the second window using a regression model.
  • the intra-sample in-process window depth correction unit 330 is configured to perform intra-sample correction for the second window depth of the sample to be tested.
  • the intra-sample intra-sample depth correction unit 330 may be: using a second window of 500 kbp, the statistical The depth of all the second windows in the whole genome of the sample and its GC content were measured to obtain the correlation; using the LOESS regression model and the correlation, the depth of each second window was sampled. Correction.
  • the inter-sample correction unit 350 is configured to perform a sample of each second window depth corrected by the window depth correction unit 330 of the sample to be tested according to the correlation between the GC content of the normal sample and the second window depth. Correction between.
  • the inter-sample correction unit 350 is configured to perform calibration of the GC content between samples for the sample to be tested.
  • the inter-sample correction unit 350 may be: an overall window depth and GC content correlation file obtained by using 200 normal sample data, to be tested. The correction of the samples is performed for each second window depth corrected by the window depth correction unit 330 in the sample to be tested, and the LOESS regression model is still used.
  • the third depth correction unit 370 is configured to read each third window depth of the sample to be tested corrected by the inter-sample correction unit 350, and perform the third window depth of the sample to be tested according to the average depth value of the third window of the normal sample. Correction.
  • the third depth correction unit 370 uses a region information file having a stable depth obtained from 200 normal sample data, and corrects each third window depth corrected by the inter-sample correction unit 350 of the sample to be tested, that is, the first depth.
  • the average depth of each third window that is not deleted in the normal sample obtained by the correcting unit 153 is divided by the depth of each corresponding third window corrected by the inter-sample correcting unit 350, and the corrected sample to be tested is obtained. Each corresponds to the depth of the third window.
  • the fourth depth correction unit 390 is configured to read each fourth window depth of the sample to be tested corrected by the third depth correction unit 370, and establish the matrix to be tested according to the depth of the fourth window of each normal sample. Each fourth window depth is corrected.
  • the fourth depth correction unit 390 may include a matrix establishing unit 391 and a principal component correction depth unit 393.
  • the matrix establishing unit 391 is configured to divide the reference genome into a plurality of non-overlapping fourth windows having a fixed length, establish a matrix according to the depth of each fourth window, and perform principal component analysis on the matrix to obtain a feature vector of the matrix. matrix.
  • the matrix establishing unit 391 can be omitted.
  • the principal component correction depth unit 393 is configured to read each fourth window depth of the sample to be tested corrected by the third depth correction unit 370, and multiply the depth of each fourth window of the sample to be tested by the feature vector matrix to obtain The principal component of the sample to be tested is deleted from the pre-preset number (for example, ten) principal components of the sample to be tested, and then multiplied by the inverse matrix of the feature vector matrix to obtain the depth of each window after the principal component analysis is corrected.
  • the segmentation unit 500 is configured to segment the sequenced data of the sample to be tested to obtain a plurality of data segments.
  • the segmentation unit 500 includes the same copy number unit 510, the z value calculation unit 530, and the potential copy number variation segment label unit 550.
  • the same copy number unit 510 is configured to segment the sequenced data of the sample to be tested corrected by the sample correcting unit 300 to obtain a plurality of segments having the same copy number.
  • the same copy number unit 510 utilizes a binary segmentation algorithm (for a specific procedure, please refer to Olshen AB, Venkatraman ES, Lucito R, Wigler M (2004) Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5: 557-572.)
  • the sample data to be tested corrected by the principal component corrected depth unit 393 is segmented to obtain segments having the same copy number.
  • the latent copy number variation fragment labeling unit 550 is for marking a fragment having an absolute value of the z value greater than a predetermined value as a potential copy number variation fragment.
  • the detecting unit 700 is configured to detect whether each data segment is a copy number variation segment.
  • the detecting unit 700 includes a logarithm generation ratio calculating unit 710 and a copy number variation segment determining unit 730.
  • the logarithm generation ratio calculation unit 710 is configured to calculate, for each potential copy number variation segment, a log generation ratio of the copy number variation segment and a logarithmic ratio of the chromosome of the potential copy number variation segment.
  • the copy number variation segment determining unit 730 is configured to mark the potential copy number variation segment as a copy number when the logarithm of a potential copy number variation segment occurs less than a predetermined value and the logarithm generation ratio of the chromosome in which the chromosome is greater than a predetermined value Variation fragment.
  • the copy number variation (CNVs) fragment is a microdeletion fragment, or a microrepeat fragment, or a combination thereof.
  • the copy number variation segment determining unit 730 uses the statistical method of LOG Odds RATIO to check whether the potential copy number variation segment is true or false: calculating the log generation value (LOG Odds RATIO value) of each potential copy number variation segment, while Calculating the logarithmic occurrence value (LOG Odds RATIO value) of the chromosome in which the fragment is located.
  • LOG Odds RATIO value of the chromosome is greater than 0 and the LOG Odds RATIO value of the fragment is less than 0, the potential copy number variation fragment is considered to be a copy number variation fragment;
  • the calculation of the logarithmic hair growth value is as follows:
  • f is the ratio of free nucleic acid of the sample to be tested, and the ratio of free nucleic acid is calculated according to the method disclosed in the patent "Method for determining the ratio of free nucleic acid in biological sample, device and its use" (Application No.: PCT/CN2015/085109);
  • the z value is calculated by referring to the z value calculation formula disclosed in the above step S530, wherein when the "log ratio of the chromosome of the potential copy number variation fragment is calculated" is calculated, the chromosome is regarded as a fragment in the z value calculation formula.
  • Z,f) are the posterior probabilities of CNVs and normal regions under a certain Z value and free nucleic acid ratio, respectively.
  • P(affected) and P(euploid) are prior probabilities that the fragment is a CNV or a normal region, respectively.
  • euploid, f) are conditional probabilities that the fragment is a CNV or a normal region at a certain free nucleic acid ratio.
  • the above is the detection method and device for chromosomal variation disclosed in the present application, which uses the same batch of samples and a certain number of normal samples as a control to reduce the possibility of chromosomal abnormality detection; the 5-step calibration method used in the sample to be tested, In particular, the principal component correction method can effectively remove the bias between different batches of data; the combined fragment test method is adopted (calculating the logarithm of the latent copy number variation fragment and the logarithm of the chromosome in which it is located, when the latent copy When the logarithm of the number of fragments is less than a predetermined value and the logarithm of the chromosome is greater than a predetermined value, the potential copy number variation fragment is marked as a copy number variation fragment), which can effectively reduce false positives and false negatives;
  • the existing technology expands the scope of detection, and has higher detection accuracy for chromosome aneuploidy and chromosome copy number variation, and can be detected under low free nucleic acid ratio conditions. Smaller chro
  • the first is to build a normal data set with normal samples.
  • the depth of each window that has not been deleted is corrected for the depth of each window of each sample that has not been deleted.
  • PCA principal component analysis
  • the second is to correct the sample to be tested.
  • correction of alignment ability The sequencing data of the sample to be tested is compared to the reference genome, and the correction ability of each reading is corrected. That is, each read of the sample to be tested is given the comparison ability value of the corresponding window of the reference genome.
  • PCA correction using the 500 kbp window, reading the window depth information of the sample to be tested corrected by step (4), and the feature vector matrix obtained by 200 normal samples (method of constructing a normal data set using normal samples) The obtained information is multiplied to obtain the principal component of the sample to be tested, and the first ten principal components are deleted, and then multiplied by the inverse matrix of the feature vector matrix to obtain the depth information file of each window corrected by the PCA.
  • Specific steps can be found in the literature: Chen Zhao, John Tynan, Mathias Ehrich et al. Detection of Fetal Subchromosomal Abnormalities by Sequencing Circulating Cell-Free DNA from Maternal Plasma. Clinical Chemistry 61:4 608-616, 2015.
  • the logarithmic hair growth value is calculated as follows:
  • f is the ratio of free nucleic acid of the sample to be tested, and the ratio of free nucleic acid is calculated according to the method disclosed in the patent "Method for determining the ratio of free nucleic acid in biological sample, device and its use" (Application No.: PCT/CN2015/085109); z value, reference The calculation is performed according to the z-value calculation formula disclosed in the above step S530, in which the chromosome is regarded as a segment in the z-value calculation formula when the "log ratio of the chromosome in which the potential copy number variation fragment is located" is calculated.
  • Z,f) are the posterior probabilities of CNVs and normal regions under a certain Z value and free nucleic acid ratio, respectively.
  • P(affected) and P(euploid) are prior probabilities that the fragment is a CNV or a normal region, respectively.
  • euploid, f) are conditional probabilities that the fragment is a CNV or a normal region at a certain free nucleic acid ratio.
  • FIG. 21 is an image of logarithm ratio (logRatio) of the sample to be tested, that is, the number of readings per window of each chromosome after the data of the sample to be tested is corrected, and the average number of readings in the whole genome of the sample.
  • logRatio logarithm ratio
  • FIG. 22 is a logRatio curve of chromosome 9, wherein the abscissa is the index value of chromosome 9, the ordinate is the logRaito value of the sample to be tested; the point in the figure indicates that the sample to be tested is on chromosome 9.
  • the logRaito value of each window; the black line is the segment obtained by the binary segmentation algorithm, wherein a black segment below the 0 reference line is an area where a micro-deletion occurs.
  • FIG. 23 is a logRatio curve of chromosome 21, wherein the abscissa is the index value of chromosome 21, the ordinate is the logRaito value of the sample to be tested; the point in the figure indicates that the sample to be tested is on each window of chromosome 21
  • the logRaito value; the black line is the segment obtained by the binary segmentation algorithm, wherein a black segment above the 0 reference line is a region where microrepetition occurs.
  • FIG. 24 is a logRatio curve of chromosome 18, wherein the abscissa is the index value of chromosome 18, the ordinate is the logRaito value of the sample to be tested; the point in the figure indicates that the sample to be tested is on chromosome 18
  • the logRaito value of the window; the black line is the segment obtained by the binary segmentation algorithm, wherein the black segment above the 0 reference line is the region where the microrepetition occurs, and the sample is seen to be the 18 chromosome 3 body.
  • FIG. 25 it is a logRatio curve of chromosome 10, wherein the abscissa is the index value of chromosome 10, the ordinate is the logRaito value of the sample to be tested; the point in the figure indicates that the sample to be tested is on chromosome 10
  • the logRaito value of the window; the black line is the segment obtained by the binary segmentation algorithm, wherein the black segment above the 0 reference line is the region where the microrepetition occurs, and the copy number of the sample on chromosome 10 is abnormally increased, but the non-completeness is not reached.
  • the threshold value of the ploidy is the chimera of the chromosome 10 trisomy.
  • chromosome 18 was detected in the above three cases; two cases of chromosome 16 trisomy; one case of XO; three cases of chromosome trisomy chimerism; and 8 cases of chromosomal microdeletions/repetitions, of which 6 cases of microdeletions/repeats were smaller than 10M, the minimum is 1.1M.
  • the above test results were all verified by amniotic fluid or cord blood sequencing, which is completely consistent with the test results of this application.
  • the present application can detect higher precision copy number variation, such as copy number variation below 1 M; and detect copy number variation at a lower ratio of free nucleic acids, such as less than 5% free nucleic acid ratio.
  • the method and apparatus for detecting chromosomal variation disclosed in the present application may include diagnostic and non-diagnostic uses for human or animal diseases; and for non-diagnostic purposes, the method and apparatus for detecting chromosomal variation disclosed in the present application may be applied to scientific research. In addition, it can also be applied to the detection of plant chromosomal variation, in which plant chromosomal variation can be expressed as a change in the genetic trait of the plant.

Landscapes

  • Chemical & Material Sciences (AREA)
  • Organic Chemistry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Biochemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

一种染色体变异的检测方法及装置,所述装置包括待测样本测序单元,用于对含有核酸的待测样本进行测序,获得由若干测序数据构成的测序结果;待测样本校正单元,用于使用正常数据集对待测样本进行校正;分割单元,用于对校正后的测序结果进行分割获得若干数据片段;以及检测单元,用于检测各数据片段是否为拷贝数变异片段。本申请降低了染色体异常漏检的概率,降低了假阳性和假阴性,对染色体非整体和染色体拷贝数变异具有更高的检测准确性,且能在低胎儿深度条件下检测出片段更小的染色体拷贝数变异。

Description

一种染色体变异的检测方法及装置 技术领域
本发明涉及染色体检测领域。
背景技术
无创产前检测(NIPT)是一项近几年才出现的产前筛查技术,用于在早孕周或者中孕周筛查胎儿罹患21-三体、18-三体、13-三体等染色体非整倍体的风险,其基本原理是对孕妇外周血中的胎儿游离DNA进行大规模平行测序,分析特定染色体上DNA测序信号是否发生不正常增多的现象,用以估算胎儿患病风险。相对于血清学唐筛和超声检测胎儿颈部透明带等传统方法,无创产前检测具有极高的灵敏性(>99%)和极低的假阳性率(<0.5%),能够降低不必要的侵入性产前诊断数量和漏检数量,降低出生缺陷率,其临床有效性已经得到国际和国际大量临床研究证明,因此在临床上得到快速应用。
然而该项检测技术有其局限性,一是仅针对21-三体、18-三体、13-三体这三种染色体有较好的检测效果,二是仅针对染色体非整倍体这种染色体异常有较好的检测效果。因此,该项检测技术针对其他种类的染色体异常,尤其是染色体缺失重复等拷贝数变异等微小的区域性染色体异常缺乏较好的检测效果。而染色体缺失重复等拷贝数变异能够导致流产、死产、胎儿畸形、新生儿发育迟缓及智力障碍等严重临床表现,超过1%的妊娠存在具有临床意义的缺失/重复,目前国际上权威的数据库DECIPHER,收录了70多种微缺失/重复相关的综合症,因此开展对染色体拷贝数变异的产前检测十分重要。
近年来针对染色体拷贝数变异的产前检测的技术的改进,大数都是通过增加测序深度以获取更多的测序数据来实现,这是因为相比非整倍体这种变异,染色体拷贝数变异涉及的区域相对较小,因此为了提高检出率,手段之一就是增加测序深度。沿着增加测序深度的方向来对染色体拷贝数变异的产前检测的技术进行改进,虽然能检出部分染色体拷贝数变异,但同时也大幅增加了检测成本,不具备临床实用价值。
因此,基于低深度全基因组测序数据进行染色体拷贝数变异检测,检测的难度非常大。目前有一些文献报道基于低深度全基因组数据检测小数据量(例如10Mb)以上的染色体拷贝数变异,而能检测小数据量(例如10Mb)以下的染色体缺失重复的方法及临床验证数据几乎没有看过相关报道。
现有的基于低深度全基因组数据进行染色体变异检测的技术方案,大体包括三个步骤:一是数据校正步骤,二是片段分割步骤,三是确定微缺失/重复的区域步骤,下面分别说明。
一、数据校正步骤:
在数据校正步骤中,主要是对序列比对能力的校正和GC含量的校正。
序列比对能力的校正:将参考基因组序列打断成与测序样本相同读段的序列,将这些序列又重新比对回参考基因组中;将全基因连续划分成若干个滑动或不滑动,固定或不固定的窗口,对落在每个窗口的序列数目进行统计,获得每个窗口序列比对能力的参考值;用这个参考值对待测样本每个窗口的序列数目做校正。
窗口深度校正:统计所有窗口中打断的参考基因组序列的GC含量,获得深度和GC含量 直接的相关性,并利用回归模型,对待测样本的每个窗口深度依据其GC含量进行校正。
二、片段分割步骤:
利用二元分割算法对以上校正后的数据进行片段分割,将相同拷贝数的窗口连续划分到同一片段中,从而可以将微缺失/重复的片段单独连续划分出来。
三、确定微缺失/重复的区域步骤:
计算分割后获得的片段的序列深度,与样本所有窗口的深度进行比较,通过计算t值,将绝对值大于3的片段确定为微缺失/重复的区域。
以上的基于低深度全基因组数据进行染色体变异检测的技术方案存在如下缺陷:
(1)数据校正存在缺陷:上述方案在对数据进行校正时,采用同一批次样本校正的策略,默认同一批次的样本为正常基线样本,对单个待测样本采用同批次其他样本进行数据校正。这样做的缺陷是,一旦同一批次的样本中存在相同的染色体缺失/重复,就将会造成数据校正错误,导致该位置的染色体异常信号漏检。
(2)无法解决不同批次测序样本之间由于实验环境、试剂和样本特性等造成的数据偏向性:由于上述方案采用同一批次样本校正的策略,忽略了不同批次样本间的差异,因此会导致校正后的数据依然存在偏向性,即在基因组某些区域出现假性数据富集或缺失的现象,从而产生假阳性或假阴性结果。
(3)对性染色体异常和嵌合体的检出效果不明显:上述只针对检测染色体拷贝数变异设计,仅评估了对染色体拷贝数变异的检出表现,对于性染色体异常、嵌合体检测并没有专门的设计和评估。
(4)针对小数据(例如10Mb以下)的染色体拷贝数变异的检出能力不强:根据模拟实验数据,上述方案对染色体拷贝数变异的检测精度在10Mb以上,且需要较高的游离核酸比例(10%)(Chen S,Lau TK,Zhang C,et al.A method for noninvasive detection of fetal large deletions/duplications by low coverage massively parallel sequencing.Prenat Diagn.2013Jun;33(6):584-90.),对于小于10Mb的染色体拷贝数变异或者更低的游离核酸比例检出率会大大降低。
发明内容
针对现有技术中存在的缺陷,本发明提供一种基于低深度全基因组数据进行染色体变异检测的方法。
第一方面,本发明提供一种染色体变异的检测方法,包括:
(1)对含有核酸的待测样本进行测序,获得由若干测序数据构成的测序结果;
(2)使用正常数据集对所述测序结果进行校正;
(3)对校正后的测序结果进行分割,获得若干数据片段;以及
(4)检测所述若干数据片段是否为拷贝数变异片段。
根据本发明的实施例,所述待测样本为外周血。
根据本发明的实施例,所述外周血为来自于孕妇的外周血。
根据本发明的实施例,所述测序为高通量测序。
根据本发明的实施例,所述核酸为DNA。
根据本发明的实施例,所述拷贝数变异为微缺失、微重复或其组合。
根据本发明的实施例,使用若干正常样本的测序数据建立所述正常数据集。
根据本发明的实施例,所述使用若干正常样本的测序数据建立所述正常数据集包括:
(0-1)将参考基因组连续划分成若干第一窗口,并确定各第一窗口的比对能力值;
(0-2)将参考基因组连续划分成若干第二窗口,确定各正常样本内GC含量与第二窗口深度的相关性,对于每一个所述第二窗口,利用所述第二窗口的GC含量对所述第二窗口的深度进行样本内与样本间的校正;
(0-3)将参考基因组连续划分成若干第三窗口,根据各正常样本间相同位置的第三窗口的平均深度值对各第三窗口的深度进行群体区域的校正;以及
(0-4)将参考基因组连续划分成若干第四窗口,根据各所述第四窗口的深度建立一矩阵,根据所述矩阵对各第四窗口的深度进行校正。
根据本发明的实施例,优选地,步骤(0-1)包括:
(0-1-1)将参考基因组打断成若干相同长度的读段,再将所述读段比对回所述参考基因组;
(0-1-2)将所述参考基因组连续划分成若干所述第一窗口,其中所述第一窗口的长度大于所述读段的长度;
(0-1-3)统计位于各第一窗口中的读段的数量,并将读段的数量小于一预定数量的第一窗口删除;和/或,计算各第一窗口中的重复区域比例,并将重复区域比例大于一预定比例的第一窗口删除;以及
(0-1-4)对于参考基因组中各未被删除的第一窗口,计算所述未被删除的第一窗口的平均读段数量,并将平均读段数量分别除以各未被删除的第一窗口的读段数量,以分别获得各未被删除的第一窗口的比对能力值。
根据本发明的实施例,优选地,步骤(0-2)包括:
(0-2-1)将所述各正常样本的测序数据比对到所述参考基因组中,并对各正常样本的读段进行比对能力值的校正;
(0-2-2)将所述参考基因组连续划分成若干所述第二窗口,对于每个正常样本,统计其各第二窗口的深度以及GC含量,获得每个正常样本内的GC含量与窗口深度的相关性;并对每个所述第二窗口,根据所述相关性与所述第二窗口的GC含量,利用回归模型,对所述第二窗口的深度进行的样本内的校正;以及
(0-2-3)对于进行样本内校正后的所有正常样本,统计所有正常样本的第二窗口的GC含量及深度,获得所有正常样本的整体的GC含量与窗口深度的相关性;并对每个所述第二窗口,根据所述相关性与所述第二窗口的GC含量,利用回归模型,对所述第二窗口的深度进行样本间的校正。
根据本发明的实施例,优选地,步骤(0-2-2)所述回归模型为LOESS回归模型。
根据本发明的实施例,优选地,步骤(0-3)包括:
(0-3-1)将所述参考基因组连续划分成若干所述第三窗口,统计所有正常样本的每个相同位置的所述第三窗口深度的平均值及方差,并计算每个相同位置的所述第三窗口的CV值,将 所述CV值大于一预定值的第三窗口删除;以及
(0-3-2)计算所有未被删除的第三窗口的平均深度值,并用所述平均深度值对各未被删除的第三窗口的深度进行校正。
根据本发明的实施例,优选地,所述每个相同位置的所述第三窗口的CV值等于所述第三窗口深度的方差除以平均值。
根据本发明的实施例,优选地,步骤(0-4)包括:
(0-4-1)将所述参考基因组连续划分成若干所述第四窗口,根据各所述第四窗口的深度建立一矩阵,并对所述矩阵进行主成分分析,获得所述矩阵的特征向量矩阵;以及
(0-4-2)对每个所述正常样本进行主成分分析,将每个正常样本的前预设数量个主成分删除,再乘以所述特征向量矩阵的逆矩阵,获得主成分分析校正后的各所述窗口的深度。
根据本发明的实施例,步骤(2)包括:
(2-1)将所述待测样本的测序数据比对到参考基因组中,对所述待测样本的各读段进行比对能力值的校正;
(2-2)统计各所述第二窗口的深度以及GC含量,获得所述待测样本内的GC含量与窗口深度的相关性;并对每个所述第二窗口,根据所述相关性与所述第二窗口的GC含量,利用回归模型对所述第二窗口的深度进行样本内的校正;
(2-3)根据正常样本中GC含量与第二窗口深度的相关性,利用回归模型对所述待测样本的经步骤(2-2)校正后的各第二窗口深度进行样本间的校正;
(2-4)读取经步骤(2-3)校正后的待测样本的各第三窗口深度,根据正常样本的第三窗口的平均深度值对所述待测样本的各第三窗口的深度进行校正;以及
(2-5)读取经步骤(2-4)校正后的待测样本的各第四窗口深度,根据各正常样本的第四窗口的深度建立的所述矩阵对所述待测样本的各第四窗口的深度进行校正。
需要说明的是,步骤2-2、2-3、2-4和2-5所述第二窗口、第三窗口和第四窗口均依据参考基因组进行连续划分而获得。可以直接使用正常数据集构建时所划分的第二窗口、第三窗口和第四窗口,而不需要再重新划分窗口。
根据本发明的实施例,优选地,步骤(2-2)所述回归模型为LOESS回归模型。
根据本发明的实施例,优选地,步骤(2-5)包括:
(2-5-1)根据正常样本的各第四窗口深度建立一矩阵,并对所述矩阵进行主成分分析,获得所述矩阵的特征向量矩阵;以及
(2-5-2)读取经步骤(2-4)校正后的待测样本的各第四窗口深度,将所述各第四窗口的深度乘以所述特征向量矩阵,获得待测样本的主成分,将待测样本的前预设数量个主成分删除,再乘以所述特征向量矩阵的逆矩阵,获得主成分分析校正后的各窗口的深度。
根据本发明的实施例,步骤(3)包括:
(3-1)对校正后的测序结果进行分割获得若干具有相同拷贝数的片段;
(3-2)对于每一个所述片段,计算所述片段的z值,其中z值=(待测样本所述片段的深度-正常样本在所述片段对应片段的平均深度)/正常样本在所述片段对应片段的方差;以及
(3-3)将z值的绝对值大于一预定值的片段标记为潜在拷贝数变异片段。
根据本发明的实施例,优选地,所述预定值为3。
根据本发明的实施例,步骤(4)包括:
(4-1)对于每一个所述潜在拷贝数变异片段,计算所述潜在拷贝数变异片段的对数发生比以及所述潜在拷贝数变异片段所在染色体的对数发生比;以及
(4-2)当一潜在拷贝数变异片段的对数发生小于一预定值且其所在染色体的对数发生比大于一预定值时,将所述潜在拷贝数变异片段标记为拷贝数变异片段。
根据本发明的实施例,优选地,所述预定值为0。第二方面,本发明提供一种染色体变异的检测装置,包括:
待测样本测序单元,所述待测样本测序单元用于对含有核酸的待测样本进行测序,获得由若干测序数据构成的测序结果;
待测样本校正单元,所述待测样本校正单元与待测样本测序单元相连,并且用于使用正常数据集对所述测序结果进行校正;
分割单元,所述分割单元与所述待测样本校正单元相连,并用于对校正后的测序结果进行分割,获得若干数据片段;以及
检测单元,所述检测单元与所述分割单元相连,并用于检测所述若干数据片段是否为拷贝数变异片段。
根据本发明的实施例,还进一步包括正常数据集构建单元,所述正常数据集构建单元与待测样本校正单元相连,用于用若干正常样本的测序数据建立正常数据集。
根据本发明的实施例,所述待测样本为外周血。
根据本发明的实施例,所述外周血为来自于孕妇的外周血。
根据本发明的实施例,所述测序为高通量测序。
根据本发明的实施例,所述核酸为DNA。
根据本发明的实施例,所述拷贝数变异为微缺失、微重复或其组合。
根据本发明的实施例,所述正常数据集构建单元包括:
参考基因对比能力确定单元,用于将参考基因组连续划分成若干第一窗口,并确定各第一窗口的比对能力值;
正常样本相关性单元,所述正常样本相关性单元与所述参考基因对比能力确定单元相连,用于将参考基因组连续划分成若干第二窗口,确定各正常样本内GC含量与第二窗口深度的相关性,对于每一个所述第二窗口,利用所述第二窗口的GC含量对所述第二窗口的深度进行样本内与样本间的校正;
群体区域校正单元,所述群体区域校正单元与所述正常样本相关性单元相连,用于将参考基因组连续划分成若干第三窗口,根据各正常样本间相同位置的第三窗口的平均深度值对各第三窗口的深度进行群体区域的校正;以及
矩阵单元,所述矩阵单元与所述群体区域校正单元相连,用于将参考基因组连续划分成若干第四窗口,根据各所述第四窗口的深度建立一矩阵,根据所述矩阵对各第四窗口的深度进行校正。
根据本发明的实施例,优选地,所述参考基因对比能力确定单元包括:
打断单元,用于将参考基因组打断成若干相同长度的读段,再将所述读段比对回所述参考基因组;
第一窗口单元,所述第一窗口单元与所述打断单元相连,用于将所述参考基因组连续划分成若干所述第一窗口,其中所述第一窗口的长度大于所述读段的长度;
第一删除单元,所述第一删除单元与所述第一窗口单元相连,用于统计位于各第一窗口中的读段的数量,并将读段的数量小于一预定数量的第一窗口删除;和/或,计算各第一窗口中的重复区域比例,并将重复区域比例大于一预定比例的第一窗口删除;以及
第一比对能力校正单元,所述第一比对能力校正单元与所述第一删除单元相连,用于对于参考基因组中各未被删除的第一窗口,计算所述未被删除的第一窗口的平均读段数量,并将平均读段数量分别除以各未被删除的第一窗口的读段数量,以分别获得各未被删除的第一窗口的比对能力值。
根据本发明的实施例,优选地,所述正常样本相关性单元包括:
第二比对能力校正单元,用于将所述各正常样本的测序数据比对到所述参考基因组中,并对各正常样本的读段进行比对能力值的校正;
正常样本内窗口深度校正单元,所述正常样本内窗口深度校正单元与第二比对能力校正单元相连,用于将参考基因组连续划分成若干所述第二窗口,对于每个正常样本,统计其各第二窗口的深度以及GC含量,获得每个正常样本内的GC含量与窗口深度的相关性;并根据所述相关性与所述第二窗口的GC含量,利用回归模型对所述第二窗口的深度进行样本内的校正;以及
正常样本间整体窗口深度校正单元,所述正常样本间整体窗口深度校正单元与所述正常样本内窗口深度校正单元相连,用于对于进行样本内窗口深度校正后的所有正常样本,统计所有正常样本的第二窗口的GC含量及深度,获得所有正常样本的整体的GC含量与窗口深度的相关性;并对每个所述第二窗口,根据所述相关性与所述第二窗口的GC含量,利用回归模型,对所述第二窗口的深度进行样本间的校正。
根据本发明的实施例,优选地,所述群体区域校正单元包括:
第二删除单元,用于将所述参考基因组连续划分成若干所述第三窗口,统计所有正常样本的每个相同位置的所述第三窗口深度的平均值及方差,并计算每个相同位置的所述第三窗口的CV值,将所述CV值大于一预定值的第三窗口删除,其中所述每个相同位置的所述第三窗口的CV值等于所述第三窗口深度的方差除以平均值;以及
第一深度校正单元,所述第一深度校正单元与所述第二删除单元相连,用于计算所有未被删除的所述第三窗口的平均深度值,并用所述平均深度值对各未被删除的所述第三窗口的深度进行校正。
根据本发明的实施例,优选地,所述矩阵单元包括:
第一主成分分析单元,用于将所述参考基因组连续划分成若干所述第四窗口,根据各所述第四窗口的深度建立一矩阵,并对所述矩阵进行主成分分析,获得所述矩阵的特征向量矩阵;以及
第二深度校正单元,所述第二深度校正单元与所述第一主成分分析单元相连,用于对每个 所述正常样本进行主成分分析,将每个正常样本的前预设数量个主成分删除,再乘以所述特征向量矩阵的逆矩阵,获得主成分分析校正后的各所述窗口的深度。
根据本发明的实施例,所述待测样本校正单元,包括:
第三比对能力校正单元,用于将所述待测样本的测序数据比对到参考基因组中,对所述待测样本的各读段进行比对能力值的校正;
待测样本内窗口深度校正单元,所述待测样本内窗口深度校正单元与所述第三比对能力校正单元相连,用于统计各所述第二窗口的深度以及GC含量,获得所述待测样本内的GC含量与窗口深度的相关性;并对每个所述第二窗口,根据所述相关性与所述第二窗口的GC含量,利用回归模型,对所述第二窗口的深度进行样本内的校正;
样本间校正单元,所述样本间校正单元与所述待测样本内窗口深度校正单元相连,用于根据正常样本中GC含量与第二窗口深度的相关性,利用回归模型对待测样本的经所述待测样本内窗口深度校正单元校正后的各第二窗口深度进行样本间的校正;
第三深度校正单元,所述第三深度校正单元与所述样本间校正单元相连,用于读取经样本间校正单元校正后的待测样本的各第三窗口深度,根据正常样本的第三窗口的平均深度值对所述待测样本的各第三窗口的深度进行校正;
第四深度校正单元,所述第四深度校正单元与所述第三深度校正单元相连,用于读取经第三深度矫正单元校正后的待测样本的各第四窗口深度,根据各正常样本的第四窗口的深度建立的所述矩阵对所述待测样本的各第四窗口的深度进行校正。
根据本发明的实施例,优选地,样本间校正单元所述回归模型为LOESS回归模型。
根据本发明的实施例,优选地,所述第四深度校正单元包括:
矩阵建立单元,用于根据正常样本的各第四窗口的深度建立一矩阵,并对该矩阵进行主成分分析,获得该矩阵的特征向量矩阵;以及
主成分校正深度单元,所述主成分校正深度单元与所述矩阵建立单元相连,用于读取经第三深度校正单元校正后的待测样本的各第四窗口深度,将所述待测样本的各第四窗口的深度乘以所述特征向量矩阵,获得待测样本的主成分,将待测样本的前预设数量个主成分删除,再乘以所述特征向量矩阵的逆矩阵,获得主成分分析校正后的各窗口的深度。
根据本发明的实施例,所述分割单元包括:
相同拷贝数单元,用于对经待测样本校正单元校正后的测序结果进行分割获得若干具有相同拷贝数的片段;
z值计算单元,所述z值计算单元与所述相同拷贝数单元相连,用于对于每一个所述片段,计算所述片段的z值,其中z值=(待测样本所述片段的深度-正常样本在所述片段对应片段的平均深度)/正常样本在所述片段对应片段的方差;以及
潜在拷贝数变异片段标记单元,所述潜在拷贝数变异片段标记单元与所述z值计算单元相连,用于将z值的绝对值大于一预定值的片段标记为潜在拷贝数变异片段。
根据本发明的实施例,优选地,所述预定值为3。
根据本发明的实施例,所述检测单元包括:
对数发生比计算单元,用于对于每一个潜在拷贝数变异片段,计算该潜在拷贝数变异片段 的对数发生比以及该该潜在拷贝数变异片段所在染色体的对数发生比;以及
拷贝数变异片段确定单元,用于当一潜在拷贝数变异片段的对数发生小于一预定值且其所在染色体的对数发生比大于一预定值时,将该潜在拷贝数变异片段标记为拷贝数变异片段。
根据本发明的实施例,优选地,所述预定值为0。
依据上述实施例的染色体变异的检测方法及装置,由于用若干正常样本建立正常数据集,并使用所述正常数据集对待测样本的测序数据进行校正,从而降低了染色体异常漏检的概率,降低了假阳性和假阴性,对染色体非整体和染色体拷贝数变异具有更高的检测准确性,且能在低胎儿深度条件下检测出片段更小的染色体拷贝数变异。
附图说明
图1是本申请一实施例的染色体变异的检测方法的流程示意图;
图2是本申请一实施例的染色体变异的检测方法中步骤S100的流程示意图;
图3是本申请一实施例的染色体变异的检测方法中步骤S110的流程示意图;
图4是本申请一实施例的染色体变异的检测方法中步骤S130的流程示意图;
图5是本申请一实施例的染色体变异的检测方法中步骤S150的流程示意图;
图6是本申请一实施例的染色体变异的检测方法中步骤S170的流程示意图;
图7是本申请一实施例的染色体变异的检测方法中步骤S300的流程示意图;
图8是本申请一实施例的染色体变异的检测方法中步骤S390的流程示意图;
图9是本申请一实施例的染色体变异的检测方法中步骤S500的流程示意图;
图10是本申请一实施例的染色体变异的检测方法中步骤S700的流程示意图;
图11是本申请一实施例的染色体变异的检测装置的结构示意图;
图12是本申请一实施例的染色体变异的检测装置中正常数据集构建单元的结构示意图;
图13是本申请一实施例的染色体变异的检测装置中参考基因对比能力确定单元的结构示意图;
图14是本申请一实施例的染色体变异的检测装置中正常样本相关性单元的结构示意图;
图15是本申请一实施例的染色体变异的检测装置中群体区域校正单元的结构示意图;
图16是本申请一实施例的染色体变异的检测装置中矩阵单元的结构示意图;
图17是本申请一实施例的染色体变异的检测装置中待测样本校正单元的结构示意图;
图18是本申请一实施例的染色体变异的检测装置中第四深度校正单元的结构示意图;
图19是本申请一实施例的染色体变异的检测装置中分割单元的结构示意图;
图20是本申请一实施例的染色体变异的检测装置中检测单元的结构示意图;
图21为一实例中待测样本各染色体的对数发生比的图像曲线;
图22为一实例中9号染色体的对数发生比的图像曲线;
图23为一实例中21号染色体的对数发生比的图像曲线;
图24为一实例中18号染色体的对数发生比的图像曲线;
图25为一实例中10号染色体的对数发生比的图像曲线。
具体实施方式
针对现有技术方案存在的问题,本发明克服现有数据校正方法中的缺陷,降低了现有技术中由于采用同一批次样本进行对照而导致染色体异常漏检的几率;解决不同批次样本之间的偏向性所导致的检测结果的假阳性和假阴性结果;解决同时检测染色体非整体性检测(包括常染色体异常和性染色体异常)和染色体拷贝数变异检测的问题;提高染色体非整倍性嵌合体的检出效果;提高对10Mb以下以及在低游离核酸比例时的染色体拷贝数变异的检出效果;降低了数据的偏向性以及由此产生的检测结果的假阳性和假阴性率。
术语
样本:本发明所述样本为含有核酸的生物样本。
正常样本:本发明所述正常样本为经羊水穿刺或绒毛膜取样检测发现其核型正常,并用现有技术判定其不存在染色体数目变异与拷贝数变异的样本。
读段:本发明所述读段为高通量测序中一个反应所获得的核酸测序序列,也称为reads。
窗口:本发明所述窗口为根据需要在参考基因组上划分的具有固定大小值的若干区段。比如,500bp的窗口、2kbp的窗口等。
窗口深度:本发明所述窗口深度为比对到该窗口的读段数目乘以读段长度,再除以窗口的长度。上述计算窗口深度的公式可以预设于计算机中,统计时可以根据该计算公式直接获得窗口深度数值。
相同位置的窗口:本发明所述相同位置的窗口为各个不同样本比对到参考基因组上的同一区段所在的窗口。
片段:本发明所述片段为染色体上长度不等的核酸序列。
片段深度:本发明所述片段深度为片段内的读段数目乘以读段长度,再除以片段的长度。上述计算片段深度的公式可以预设于计算机中,统计时可以根据该计算公式直接获得片段深度数值。
重复区域:本发明所述重复区域为核酸序列中存在串联重复序列的区域。
样本内的校正:本发明所述样本内的校正为一个样本内部所有核酸测序数据的校正。
样本间的校正:本发明所述样本间的校正为不同样本之间所有核酸测序数据的校正。
群体区域的校正:本发明所述群体区域的校正为在同一参考基因组区段上的群体样本的核酸测序数据的校正。
正常数据集:本发明所述正常数据集为无染色体数目变异与拷贝数变异的样本的核酸测序数据的集合。
下面通过若干实施例并结合附图对本申请作进一步地说明。
请参照图1,本实施例公开了一种染色体变异的检测方法,包括待测样本测序步骤S000、待测样本测序数据校正步骤S300、分割步骤S500和检测步骤S700。下面具体说明。
步骤S000:对含有核酸的待测样本进行测序,获得由若干测序数据构成的测序结果。在一实施例中,所述待测样本为来自于孕妇的外周血。在一实施例中,所述核酸为DNA。在一 实施例中,所述测序为第二代高通量测序,比如采用BGISEQ-50测序平台。
在待测样本测序数据校正步骤S300中需要使用正常数据集来对步骤S000中的测序结果进行校正,需要说明的是,该正常数据集可以是在步骤S000对待测样本进行测序之前就被提前构建好并预设在计算机系统中,使用时直接调用即可;该正常数据集也可以是在步骤S000对待测样本完成测序后,再进行构建。在一实施例中,染色体变异的检测方法还可以包括步骤S100,下面具体说明。
步骤S100:使用若干正常样本的核酸测序数据建立正常数据集。在一实施例中,可以用200个正常样本的核酸测序数据来建立正常数据集。请参照图2,在一实施例中,步骤S100包括步骤S110~S170。
步骤S110:将参考基因组连续划分成若干具有固定长度的第一窗口,并确定各第一窗口的比对能力值。具体地,请参照图3,在一实施例中,步骤S110可以包括步骤S111~S117。
步骤S111:将参考基因组打断成若干具有相同长度的读段再将这些读段比对回上述参考基因组。根据不同的测序平台,选择不同的读段长度。读段的长度通常为25-200bp。例如,将参考基因组打断成35bp大小的读段,再将这些读段比对回参考基因组。
步骤S113:将上述参考基因组连续划分成若干具有固定长度的第一窗口,其中第一窗口的长度大于所述读段的长度。例如,每一窗口的长度为500bp,即将参考基因组连续划分成若干500bp的不重叠的第一窗口。
步骤S115:统计位于各第一窗口中的读段的数量,并将读段的数量小于预定数量的第一窗口删除;和/或,计算各第一窗口中的重复区域比例,并将重复区域比例大于预定比例(例如20%)的第一窗口删除。其中,预定数量通常为正常样本数目乘以0.01所得的数值。
步骤S117:对于参考基因组中各未被删除的第一窗口,计算所有未被删除的第一窗口的平均读段数量,并将平均读段数量分别除以各未被删除的第一窗口的读段数量,以分别获得各未被删除的第一窗口的比对能力值(即ratio值)。
步骤S130:将参考基因组连续划分成若干具有固定长度的第二窗口,确定各正常样本内GC含量与第二窗口深度的相关性,对于每一个第二窗口,利用该第二窗口的GC含量对该第二窗口的深度进行样本内与样本间的校正。具体地,请参照图4,在一实施例中,步骤S130可以包括步骤S131~S135。
步骤S131:将上述各正常样本的测序数据比对到上述参考基因组中,并对各正常样本的读段进行比对能力值的校正。例如,将200个正常样本的测序数据比对到参考基因组中,进行比对能力值的校正,在一实施例中,比对能力值的校正,可以是对正常样本的每条读段都赋予其所在参考基因组对应窗口的比对能力值。
步骤S133:将参考基因组连续划分成若干具有固定长度的第二窗口,对于每个正常样本,统计它的各个第二窗口的深度以及GC含量,获得每个正常样本内的GC含量与窗口深度的相关性;并对每个第二窗口,根据该相关性与第二窗口的GC含量,利用回归模型,对该第二窗口的深度进行样本内的校正。例如,将参考基因组连续划分成若干长度为500kbp的不重叠的 第二窗口,统计每个正常样本的每个第二窗口的深度及CG含量,从而获得每个正常样本内GC含量与深度之间的相关性;利用LOESS回归模型,依据每个第二窗口的GC含量与该相关性,对每个第二窗口的深度进行样本内的校正;在一实施例中,步骤S133中对每个第二窗口的深度进行样本内的校正,即校正后的深度等于校正前的深度除以校正系数,校正系数由LOESS回归模型对每个正常样本内GC含量与深度之间的相关性进行回归得出。
步骤S135:对于进行样本内校正后的所有正常样本,统计所有正常样本的第二窗口的GC含量及深度,获得所有正常样本的整体的GC含量与窗口深度的相关性;并对每个所述第二窗口,根据该相关性与所述第二窗口的GC含量,利用回归模型,对该第二窗口的深度进行样本间的校正。例如,统计200个正常样本经过步骤S133校正后的所有第二窗口的GC含量及深度,获得200个正常样本整体的GC含量和深度之间的相关性文件;再次利用LOESS回归模型,对每个样本每个第二窗口的深度进行样本间的校正。在一实施例中,步骤S135中对每个第二窗口的深度进行样本间的校正,即校正后的深度等于校正前的深度除以校正系数,其中校正系数由LOESS回归模型对200个正常样本整体的GC含量和深度之间的相关性进行回归得出。
步骤S150:将参考基因组连续划分成若干具有固定长度的第三窗口,根据各第三窗口的平均深度值对各第三窗口的深度进行校正。具体地,请参照图5,在一实施例中,步骤S150可以包括步骤S151和S153。
步骤S151:将参考基因组连续划分成若干具有固定长度的第三窗口,统计所有正常样本的各相同位置的第三窗口深度的平均值及方差,并计算所有正常样本的各相同位置的第三窗口的CV值,将CV值大于一预定值的第三窗口删除,其中各相同位置的第三窗口的CV值等于该窗口深度方差除以该窗口平均深度值。例如,将参考基因组连续划分成若干长度为100kbp的不重叠的第三窗口,统计200个正常样本的每个相同位置的第三窗口深度的平均值及方差,从而获得每个第三窗口的CV值,其中任意一第三窗口的CV值等于这200个正常样本中该相同位置的第三窗口深度的方差除以该窗口平均深度值;将CV值大于一预定值(例如0.25)的第三窗口删除,因为这说明该第三窗口波动性很大,不稳定。
步骤S153:利用所有未被删除的第三窗口深度平均值,对各未被删除的第三窗口的深度进行校正。在一实施例中,步骤S153中对任一第三窗口的深度进行校正,可以是将该相同位置第三窗口平均深度值除以该第三窗口的深度,获得该第三窗口校正后的深度。
步骤S170:将参考基因组连续划分成若干具有固定长度的第四窗口,根据各第四窗口的深度建立一矩阵,根据该矩阵对各第四窗口的深度进行校正。具体地,请参照图6,在一实施例中,步骤S170可以包括步骤S171和S173。
步骤S171:将参考基因组连续划分成若干具有固定长度的第四窗口,根据各第四窗口的深度建立一矩阵,并对该矩阵进行主成分分析,获得该矩阵的特征向量矩阵。例如,将上述参考基因组连续划分成若干长度为500kbp的不重叠的第四窗口,对200个正常样本每个第四窗口经步骤S153校正后的深度构成的矩阵进行主成分分析,即计算获得其特征向量矩阵。
步骤S173:对每个正常样本进行主成分分析,将每个正常样本的前预设数量个主成分删 除,再乘以所述特征向量矩阵的逆矩阵,获得主成分分析校正后的各窗口的深度。例如,对每个正常样本本进行主成分分析后,将前十个主成分删除,从而可以去除很多影响因素,这些影响因素包括不同批次样本间的偏向性,样本来源的不同环境,及其它噪音等;这样之后,可以获得PCA(主成分分析,Principal Component Analysis)校正后的每个第四窗口的深度文件。
步骤S300:使用上述正常数据集对待测样本进行校正。请参照图7,在一实施例中,步骤S300采用5步骤校正,其包括步骤S310~S390。
步骤S310:将待测样本的测序数据比对到参考基因组中,对待测样本的各读段进行比对能力值的校正。在一实施例中,比对能力值的校正,可以是对待测样本的每条读段都赋予其所在参考基因组对应窗口的比对能力值。
步骤S330:统计各第二窗口的深度以及GC含量,获得待测样本内的GC含量与窗口深度的相关性;并对每个第二窗口,根据该相关性与该窗口的GC含量,利用回归模型对该第二窗口的深度进行样本内的校正。步骤S330用于对待测样本的第二窗口深度进行样本内的校正,具体地,步骤S330可以为:采用500kbp的第二窗口,统计待测样本全基因组范围内所有第二窗口的深度及其GC含量,获得其相关性;利用LOESS回归模型与该相关性,对每个第二窗口的深度进行样本内的校正。
步骤S350:根据正常样本整体的GC含量与第二窗口深度的相关性,利用回归模型对待测样本的经步骤S330校正后的各第二窗口深度进行样本间的校正。步骤S350是用于对待测样本进行样本间的校正,具体地,步骤S350可以为:利用200个正常样本数据获得的整体窗口深度与GC含量相关性文件,对待测样本的经步骤S330校正后的每个第二窗口深度进行样本间的校正,依然使用LOESS回归模型。
步骤S370:读取经步骤S350校正后的待测样本的各第三窗口深度,根据正常样本的第三窗口的平均深度值对待测样本的各第三窗口深度进行校正。例如,利用200个正常样本数据获得的具有稳定深度的区域信息文件,对待测样本的经步骤S350校正后的每个第三窗口深度进行校正,即将步骤S153获得的正常样本中各未被删除的第三窗口的平均深度除以该待测样本经步骤S350校正后的每个对应第三窗口的深度,获得校正后的待测样本每个对应第三窗口的深度。
步骤S390:读取经步骤S370校正后的待测样本的各第四窗口深度,并根据各正常样本的第四窗口的深度建立的所述矩阵对待测样本的各第四窗口深度进行校正。具体地,请参照图8,在一实施例中,步骤S390可以包括步骤S391和S393。
步骤S391:将参考基因组连续划分成若干具有固定长度的不重叠的第四窗口,根据各第四窗口的深度建立一矩阵,并对该矩阵进行主成分分析,获得该矩阵的特征向量矩阵。当步骤S171存在时,则步骤S391可以省略。
步骤S393:读取经步骤S370校正后的待测样本的各第四窗口深度,将待测样本各第四窗口的深度乘以上述矩阵的特征向量矩阵,获得待测样本的主成分,将待测样本的前预设数量个主成分删除,再乘以上述特征向量矩阵的逆矩阵,获得待测样本的主成分分析校正后的各第四窗口的深度。
步骤S500:对校正后的待测样本的测序数据进行分割获得若干数据片段。请参照图9,在一实施例中,步骤S500包括步骤S510~S550。
步骤S510:对经步骤S393校正后的待测样本的测序数据进行分割获得若干具有相同拷贝数的片段。例如,利用二元分割算法(具体过程请参考文献Olshen AB,Venkatraman ES,Lucito R,Wigler M(2004)Circular binary segmentation for the analysis of array-based DNA copy number data.Biostatistics 5:557–572.)对经步骤S393校正后的待测样本数据进行分割,获得具有相同拷贝数的片段。
步骤S530:对于每一个片段,计算该片段的z值,其中z值=(待测样本该片段的深度-正常样本在该片段对应片段的平均深度)/正常样本在该片段对应片段的方差。
步骤S550:将z值的绝对值大于一预定值的片段标记为潜在拷贝数变异片段。
步骤S700:检测各数据片段是否为拷贝数变异片段。请参照图10,在一实施例中,步骤S700包括步骤S710和S730。
步骤S710:对于每一个潜在拷贝数变异片段,计算该潜在拷贝数变异片段的对数发生比以及该潜在拷贝数变异片段所在染色体的对数发生比。
步骤S730:当一潜在拷贝数变异片段的对数发生小于一预定值且其所在染色体的对数发生比大于一预定值时,将该潜在拷贝数变异片段标记为拷贝数变异片段。在一实施例中,所述拷贝数变异(CNVs)片段为微缺失片段、或微重复片段、或其组合。例如,利用LOG Odds RATIO的统计学方法检验潜在的拷贝数变异片段是否为真假:计算每个潜在拷贝数变异片段的对数发生值(LOG Odds RATIO值),同时计算该片段所在的染色体的对数发生值(LOG Odds RATIO值),当染色体的LOG Odds RATIO值大于0,片段的LOG Odds RATIO值小于0时,认为潜在拷贝数变异片段为拷贝数变异片段。对数生发值的计算如下:
Figure PCTCN2017075858-appb-000001
其中f为待测样本的游离核酸比例,参照专利“确定生物样本中游离核酸比例的方法,装置及其用途”(申请号:PCT/CN2015/085109)所公开的方法计算游离核酸比例;Z为z值,参照上述步骤S530所公开的z值计算公式来计算,其中计算“潜在拷贝数变异片段所在染色体的对数发生比”时,将所述染色体看作z值计算公式中的片段。P(affected|Z,f)和P(euploid|Z,f)分别为一定Z值和游离核酸比例下,该片段为CNVs和正常区域的后验概率。P(affected)和P(euploid)分别为该片段为CNVs或正常区域的先验概率。P(Z|affected,f)和P(Z|euploid,f)为在一定游离核酸比例下,该片段为CNV或正常区域的条件概率。
本申请还公开了一种染色体变异的检测装置,请参照图11,其包括待测样本测序单元000、待测样本校正单元300、分割单元500和检测单元700。
待测样本测序单元000用于对含有核酸的待测样本进行测序,获得由若干测序数据构成的测序结果。在一实施例中,所述待测样本为来自于孕妇的外周血。在一实施例中,所述核酸为DNA。在一实施例中,所述测序为第二代高通量测序,比如采用BGISEQ-50测序平台。
在待测样本校正单元300中需要使用正常数据集来对待测样本测序单元000中的测序结果进行校正,需要说明的是,该正常数据集可以是在待测样本测序单元000对待测样本进行测序之前就被提前构建好并预设在计算机系统中,使用时直接调用即可;该正常数据集也可以是在待测样本测序单元000对待测样本完成测序后,再进行构建。在一实施例中,染色体变异的检测装置还可以包括正常数据集构建单元100,下面具体说明。正常数据集构建单元100用于用若干正常样本的核酸测序数据建立正常数据集。在一实施例中,正常数据集构建单元100可以用200个正常样本的核酸测序数据来建立正常数据集。请参照图12,在一实施例中,正常数据集构建单元100包括参考基因对比能力确定单元110、正常样本相关性单元130、群体区域校正单元150和矩阵单元170。
参考基因对比能力确定单元110用于将参考基因组连续划分成若干具有固定长度的第一窗口,并确定各第一窗口的比对能力值。具体地,请参照图13,在一实施例中,参考基因对比能力确定单元110可以打断单元111、第一窗口单元113、第一删除单元115和第一比对能力校正单元117。
打断单元111,用于将参考基因组打断成若干具有相同长度的读段,再将这些读段比对回所述参考基因组。根据不同的测序平台,选择不同的读段长度。读段的长度通常为25-200bp。例如,打断单元111将参考基因组打断成35bp大小的读段,再将这些读段比对回参考基因组。
第一窗口单元113,该第一窗口单元113与打断单元111相连,用于将所述参考基因组连续划分成若干具有规定长度的所述第一窗口,其中所述第一窗口的长度大于所述读段的长度。例如,第一窗口单元113中每一窗口的长度为500bp,即将参考基因组连续划分成若干500bp的不重叠的窗口。
第一删除单元115,该第一删除单元115与第一窗口单元113相连,用于统计位于各第一窗口中的读段的数量,并将读段的数量小于一预定数量的第一窗口删除;和/或,计算各第一窗口中的重复区域比例,并将重复区域比例大于一预定比例(例如20%)的第一窗口删除。其中,预定数量通常为正常样本数目乘以0.01所得的数值。
第一比对能力校正单元117,该第一比对能力校正单元117与第一删除单元115相连,用于对于参考基因组中各未被删除的第一窗口,计算所有未被删除的第一窗口的平均读段数量,并将平均读段数量分别除以各未被删除的第一窗口的读段数量,以分别获得各未被删除的第一窗口的比对能力值(即ratio值)。
正常样本相关性单元130,正常样本相关性单元130与参考基因对比能力确定单元110相连,用于将参考基因组连续划分成若干具有固定长度的第二窗口,确定各正常样本内GC含量与第二窗口深度的相关性,对于每一个第二窗口,利用该第二窗口的GC含量对该第二窗口的 深度进行样本内与样本间的校正。具体地,请参照图14,在一实施例中,正常样本相关性单元130可以包括第二比对能力校正单元131、正常样本内窗口深度校正单元133和正常样本间整体窗口深度校正单元135。
第二比对能力校正单元131,用于将上述各正常样本的测序数据比对到上述参考基因组中,并对各正常样本的读段进行比对能力值的校正。例如,第二比对能力校正单元131将200个正常样本的测序数据比对到参考基因组中,进行比对能力值的校正,在一实施例中,比对能力值的校正,可以是对正常样本的每条读段都赋予其所在参考基因组对应窗口的比对能力值。
正常样本内窗口深度校正单元133,该正常样本内窗口深度校正单元133与第二比对能力校正单元131相连,用于将参考基因组连续划分成若干所述第二窗口,对于每个正常样本,统计它的各个第二窗口的深度以及GC含量,获得每个正常样本内的GC含量与窗口深度的相关性;并对每个第二窗口,根据该相关性与第二窗口的GC含量,利用回归模型对该第二窗口的深度进行样本内的校正。例如,正常样本内窗口深度校正单元133将参考基因组连续划分成若干长度为500kbp的不重叠的第二窗口,统计每个正常样本的每个第二窗口的深度及CG含量,从而获得每个正常样本内GC含量与深度之间的相关性;利用LOESS回归模型,依据每个第二窗口的GC含量与该相关性,对每个第二窗口的深度进行样本内的校正;在一实施例中,正常样本内窗口深度校正单元133对每个第二窗口的深度进行样本内的校正,即校正后的深度等于校正前的深度除以校正系数,校正系数由LOESS回归模型对每个正常样本内GC含量与深度之间的相关性进行回归得出。
正常样本间整体窗口深度校正单元135,该正常样本间整体窗口深度校正单元135与正常样本内窗口深度校正单元133相连,用于对于进行样本内窗口深度校正后的所有正常样本,统计所有正常样本的第二窗口的GC含量及深度,获得所有正常样本的整体的GC含量与窗口深度的相关性;并对每个所述第二窗口,根据该相关性与所述第二窗口的GC含量,利用回归模型,对该第二窗口的深度进行样本间的校正。例如,正常样本间整体窗口深度校正单元135统计200个正常样本经过窗口深度校正单元133校正后的所有第二窗口的GC含量及深度,获得200个正常样本整体的GC含量和深度之间的相关性文件;再次利用LOESS回归模型,对每个样本每个第二窗口的深度进行样本间的校正。在一实施例中,正常样本间整体窗口深度校正单元135中对每个第二窗口的深度进行样本间的校正,即校正后的深度等于校正前的深度除以校正系数,其中校正系数由LOESS回归模型对200个正常样本整体的GC含量和深度之间的相关性进行回归得出。
群体区域校正单元150,用于将参考基因组连续划分成若干具有固定长度的第三窗口,根据各第三窗口的平均深度值对各第三窗口的深度进行校正。具体地,请参照图15,在一实施例中,群体区域校正单元150可以包括第二删除单元151和第一深度校正单元153。
第二删除单元151用于将上述参考基因组连续划分成若干具有固定长度的所述第三窗口,统计所有正常样本的各相同位置的第三窗口深度的平均值及方差,并计算所有正常样本的各相同位置的第三窗口的CV值,将CV值大于一预定值的第三窗口删除,其中各相同位置的第三窗口的CV值等于该窗口深度方差除以该窗口平均深度值。例如,第二删除单元151将参考基 因组连续划分成若干长度为100kbp的不重叠的第三窗口,统计200个正常样本的每个相同位置的第三窗口深度的平均值及方差,从而获得每个第三窗口的CV值,其中任意一第三窗口的CV值等于这200个正常样本中该相同位置的第三窗口深度的方差除以该窗口平均值;将CV值大于一预定值(例如0.25)的第三窗口删除,因为这说明该第三窗口波动性很大,不稳定。
第一深度校正单元153用于利用所有未被删除的第三窗口的平均深度值,对各未被删除的第三窗口的深度进行校正。在一实施例中,第一深度校正单元153中对任一第三窗口的深度进行校正,可以是将该相同位置第三窗口平均深度值除以该第三窗口的深度,获得该第三窗口校正后的深度。
矩阵单元170用于将参考基因组连续划分成若干具有固定长度的第四窗口,根据各第四窗口的深度建立一矩阵,根据该矩阵对各第四窗口的深度进行校正。具体地,请参照图16,在一实施例中,矩阵单元170可以包括第一主成分分析单元171和第二深度校正单元173。
第一主成分分析单元171用于将所述参考基因组连续划分成若干所述第四窗口,根据各第四窗口的深度建立一矩阵,并对该矩阵进行主成分分析,获得该矩阵的特征向量矩阵。例如,第一主成分分析单元171将上述参考基因组连续划分成若干长度为500kbp的不重叠的第四窗口,对200个正常样本每个第四窗口经第一深度校正单元153校正后的深度构成的矩阵进行主成分分析,即计算获得其特征向量矩阵。
第二深度校正单元173用于对每个正常样本进行主成分分析,将每个正常样本的前预设数量个主成分删除,再乘以所述特征向量矩阵的逆矩阵,获得主成分分析校正后的各窗口的深度。例如,第二深度校正单元173对每个正常样本本进行主成分分析后,将前十个主成分删除,从而可以去除很多影响因素,这些影响因素包括不同批次样本间的偏向性,样本来源的不同环境,及其它噪音等;这样之后,可以获得PCA(主成分分析,Principal Component Analysis)校正后的每个第四窗口的深度文件。
待测样本校正单元300用于使用上述正常数据集对待测样本进行校正。请参照图17,在一实施例中,待测样本校正单元300采用5单元校正,其包括第三比对能力校正单元310、待测样本内窗口深度校正单元330、样本间校正单元350、第三深度校正单元370和第四深度校正单元390。
第三比对能力校正单元310用于将待测样本的测序数据比对到参考基因组中,对待测样本的各读段进行比对能力值的校正。在一实施例中,比对能力值的校正,可以是对待测样本的每条读段都赋予其所在参考基因组对应窗口的比对能力值。
待测样本内窗口深度校正单元330用于统计各第二窗口的深度以及GC含量,获得待测样本内的GC含量与窗口深度的相关性;并对每个第二窗口,根据该相关性与该窗口的GC含量,利用回归模型对该第二窗口的深度进行样本内的校正。待测样本内窗口深度校正单元330是用于对待测样本的第二窗口深度进行样本内的校正,具体地,待测样本内窗口深度校正单元330可以为:采用500kbp的第二窗口,统计待测样本全基因组范围内所有第二窗口的深度及其GC含量,获得其相关性;利用LOESS回归模型与该相关性,对每个第二窗口的深度进行样本内 的校正。
样本间校正单元350用于根据正常样本整体的GC含量与第二窗口深度的相关性,利用回归模型对待测样本的经待测样本内窗口深度校正单元330校正后的各第二窗口深度进行样本间的校正。样本间校正单元350是用于对待测样本进行样本间GC含量的校正,具体地,样本间校正单元350可以为:利用200个正常样本数据获得的整体窗口深度与GC含量相关性文件,对待测样本的经待测样本内窗口深度校正单元330校正后的每个第二窗口深度进行样本间的校正,依然使用LOESS回归模型。
第三深度校正单元370用于读取经样本间校正单元350校正后的待测样本的各第三窗口深度,根据正常样本的第三窗口的平均深度值对待测样本的各第三窗口深度进行校正。例如,第三深度校正单元370利用200个正常样本数据得到的具有稳定深度的区域信息文件,对待测样本的经样本间校正单元350校正后的每个第三窗口深度进行校正,即将第一深度校正单元153获得的正常样本中各未被删除的第三窗口的平均深度除以该待测样本经样本间校正单元350校正后的每个对应第三窗口的深度,获得校正后的待测样本每个对应第三窗口的深度。
第四深度校正单元390用于读取经第三深度校正单元370校正后的待测样本的各第四窗口深度,并根据各正常样本的第四窗口的深度建立的所述矩阵对待测样本的各第四窗口深度进行校正。具体地,请参照图18,在一实施例中,第四深度校正单元390可以包括矩阵建立单元391和主成分校正深度单元393。
矩阵建立单元391用于将参考基因组连续划分成若干具有固定长度的不重叠的第四窗口,根据各第四窗口的深度建立一矩阵,并对该矩阵进行主成分分析,获得该矩阵的特征向量矩阵。当第一主成分分析单元171存在时,则矩阵建立单元391可以省略。
主成分校正深度单元393用于读取经第三深度校正单元370校正后的待测样本的各第四窗口深度,并将待测样本各第四窗口的深度乘以所述特征向量矩阵,获得待测样本的主成分,将待测样本的前预设数量个(例如十个)主成分删除,再乘以所述特征向量矩阵的逆矩阵,获得主成分分析校正后的各窗口的深度。
分割单元500用于对校正后的待测样本的测序数据进行分割获得若干数据片段。请参照图19,在一实施例中,分割单元500包括相同拷贝数单元510、z值计算单元530和潜在拷贝数变异片段标记单元550。
相同拷贝数单元510用于对经待测样本校正单元300校正后的待测样本的测序数据进行分割获得若干具有相同拷贝数的片段。例如,相同拷贝数单元510利用二元分割算法(具体过程请参考文献Olshen AB,Venkatraman ES,Lucito R,Wigler M(2004)Circular binary segmentation for the analysis of array-based DNA copy number data.Biostatistics 5:557–572.)对经主成分校正深度单元393校正后的待测样本数据进行分割,获得具有相同拷贝数的片段。
z值计算单元530用于对于每一个片段,计算该片段的z值,其中z值=(待测样本该片段的深度-正常样本在该片段对应片段的平均深度)/正常样本在该片段对应片段的方差;
潜在拷贝数变异片段标记单元550用于将z值的绝对值大于一预定值的片段标记为潜在拷贝数变异片段。
检测单元700用于检测各数据片段是否为拷贝数变异片段。请参照图20,在一实施例中,检测单元700包括对数发生比计算单元710和拷贝数变异片段确定单元730。
对数发生比计算单元710用于对于每一个潜在拷贝数变异片段,计算该拷贝数变异片段的对数发生比以及该该潜在拷贝数变异片段所在染色体的对数发生比。
拷贝数变异片段确定单元730用于当一潜在拷贝数变异片段的对数发生小于一预定值且其所在染色体的对数发生比大于一预定值时,将该潜在拷贝数变异片段标记为拷贝数变异片段。在一实施例中,所述拷贝数变异(CNVs)片段为微缺失片段、或微重复片段、或其组合。例如,拷贝数变异片段确定单元730利用LOG Odds RATIO的统计学方法检验潜在的拷贝数变异片段是否为真假:计算每个潜在拷贝数变异片段的对数发生值(LOG Odds RATIO值),同时计算该片段所在的染色体的对数发生值(LOG Odds RATIO值),当染色体的LOG Odds RATIO值大于0,片段的LOG Odds RATIO值小于0时,认为潜在拷贝数变异片段为拷贝数变异片段;其中对数生发值的计算如下:
Figure PCTCN2017075858-appb-000002
其中f为待测样本的游离核酸比例,参照专利“确定生物样本中游离核酸比例的方法,装置及其用途”(申请号:PCT/CN2015/085109)所公开的方法计算游离核酸比例;Z为z值,参照上述步骤S530所公开的z值计算公式来计算,其中计算“潜在拷贝数变异片段所在染色体的对数发生比”时,将所述染色体看作z值计算公式中的片段。P(affected|Z,f)和P(euploid|Z,f)分别为一定Z值和游离核酸比例下,该片段为CNVs和正常区域的后验概率。P(affected)和P(euploid)分别为该片段为CNVs或正常区域的先验概率。P(Z|affected,f)和P(Z|euploid,f)为在一定游离核酸比例下,该片段为CNV或正常区域的条件概率。
以上就是本申请公开的染色体变异的检测方法及装置,其采用同一批次样本和一定数量的正常样本作为对照,降低染色体异常漏检的可能性;在对待测样本所采用的5步校正法,尤其是主成分校正方法,能有效去除不同批次数据之间存在的偏向性;采用的联合片段检验方法(计算潜拷贝数变异片段的对数发生和其所在染色体的对数发生,当潜拷贝数变异片段的对数发生小于一预定值且其所在染色体的对数发生比大于一预定值时,将该潜在拷贝数变异片段标记为拷贝数变异片段)能有效降低假阳性和假阴性;与现有的技术相比拓展了检测的适用范围,对染色体非整倍性和染色体拷贝数变异具有更高的检测准确性,且能在低游离核酸比例条件下检 测出片段更小的染色体拷贝数变异。
为了更好地理解本申请,下面再以一个例子进行说明。
取200例孕妇血浆正常样本()用于构建正常数据集,每例样本测序数据量5M,读段35bp。15例待检阳性临床孕妇血浆样本,按照BGISEQ-500测序仪的操作说明书进行文库构建与测序,获得每例样本测序数据量5M,读段35bp(依据羊水穿刺或绒毛膜取样发现核型异常,且依据现有技术判断其染色体存在拷贝数变异)。
首先就是用正常样本构建正常数据集。
(1)将参考基因组打断成35bp大小的读段,再用软件(例如BWA,Burrows-Wheeler Aligner)比对回参考基因组;将全基因组连续划分成500bp的窗口,统计每个窗口的唯一比对的读段数目,将比对率很低(如低于0.01)的窗口删除;分析每个窗口重复序列的覆盖情况(重复序列文件参考repeatMasker),将重复区域大于20%的窗口删除。
(2)对未被删除的窗口,将所有未被删除的窗口的平均读段数目除以每个未被删除的窗口的读段数目获得衡量每个未被删除的窗口的比对能力的ratio值。
(3)将200个正常样本的测序数据比对到参考基因组中,进行比对能力值的校正,即对正常样本的每条读段都赋予其所在参考基因组对应窗口的比对能力值。
(4)计算每条读段的GC含量,采用500kbp的窗口,统计每个样本每个窗口的深度及GC含量,从而获得每个样本内GC含量与深度之间的相关性;利用LOESS回归模型,依据每个窗口的GC含量,对每个窗口的深度进行样本内的校正。即校正后的深度等于校正前的深度除以校正系数,校正系数由LOESS回归模型对每个正常样本内GC含量与深度之间的相关性进行回归得出。
(5)统计200个样本经校正后的所有窗口的GC含量及深度,获得200个样本的群体GC含量和深度之间的相关性文件;再次利用LOESS回归模型,对每个样本每个窗口的深度进行GC含量的校正。
(6)采用100kbp的窗口,统计200个样本每个相同位置窗口深度的平均值及方差,从而获得所有样本的每个相同位置窗口的CV值,将CV值大于0.25的窗口,即波动性很大的不稳定的窗口删除。
(7)对于未被删除的窗口,利用每个未被删除的窗口深度的平均值对未被删除的每个样本每个窗口的深度进行校正。
(8)采用500kbp的窗口,对步骤(7)校正后的200个样本每个窗口的深度构成的矩阵进行主成分分析(PCA)获得其特征向量矩阵;对每个样本进行主成分分析,将前十个主成分删除,然后获得PCA校正后的每个窗口的深度文件。
其次就是对待测样本进行校正。
(1)比对能力的校正:将待测样本的测序数据比对到参考基因组,对每条读段进行比对能力的校正。即对待测样本的每条读段都赋予其所在参考基因组对应窗口的比对能力值。
(2)样本内窗口深度的校正:采用500kbp的窗口,统计待测样本全基因组范围内所有窗 口的深度及其GC含量,获得其相关性;利用LOESS回归模型与该相关性,对每个窗口的深度进行样本内的校正。
(3)样本间窗口深度的校正:利用200个正常样本数据获得的群体窗口深度与GC含量相关性文件(即利用正常样本构建正常数据集的方法步骤5获得的文件),对待测样本的经步骤(2)校正后的每个窗口深度进行校正,依然使用LOESS回归模型。
(4)群体区域校正:采用100kbp的窗口,利用200个正常样本数据获得的具有稳定深度的区域信息文件(即利用正常样本构建正常数据集的方法步骤7获得的文件),对待测样本的经步骤(3)校正后的每个窗口深度进行校正,即将步骤7获得的正常样本中各未被删除的窗口的平均深度除以该待测样本每个对应窗口的深度。
(5)PCA校正:采用500kbp窗口,读取经步骤(4)校正后的待测样本的窗口深度信息,与由200个正常样本获得的特征向量矩阵(利用正常样本构建正常数据集的方法步骤8获得的信息)相乘,获得待测样本的主成分,将前十个主成分删除后,再乘以特征向量矩阵的逆矩阵获得PCA校正后的每个窗口的深度信息文件。具体步骤可参考文献:Chen Zhao,John Tynan,Mathias Ehrich et al.Detection of Fetal Subchromosomal Abnormalities by Sequencing Circulating Cell-Free DNA from Maternal Plasma.Clinical Chemistry 61:4 608-616,2015。
最后就是对校正后的待测样本进行拷贝数变异检测。
(1)利用二元分割算法对校正后的数据进行片段分割,获得具有相同拷贝数的片段。二元分割算法的具体方法可参考文献:Olshen AB,Venkatraman ES,Lucito R,Wigler M(2004)Circular binary segmentation for the analysis of array-based DNA copy number data.Biostatistics 5:557–572。
(2)计算每个片段的深度,并与200个正常样本在该片段的平均深度和方差进行计算,获得该片段的z值,即z值=(待测样本该片段的深度-正常样本在该片段对应片段的平均深度)/正常样本在该片段对应片段的方差。z值的绝对值大于3的片段将是潜在的拷贝数变异片段,将进行进一步的分析。
(3)利用LOG Odds RATIO的统计学方法检验潜在的拷贝数变异片段是否为真假:计算每个片段的LOG Odds RATIO值,同时计算该片段所在的染色体的LOG Odds RATIO值,当染色体的LOG Odds RATIO值大于0,片段的LOG Odds RATIO值小于0,且片段的z值大于3时,认为该片段属于拷贝数变异,具体地,该拷贝数变异类型为微缺失或微重复。
对数生发值的计算如下:
Figure PCTCN2017075858-appb-000003
其中f为待测样本的游离核酸比例,参照专利“确定生物样本中游离核酸比例的方法,装置及其用途”(申请号:PCT/CN2015/085109)所公开的方法计算游离核酸比例;Z为z值,参 照上述步骤S530所公开的z值计算公式来计算,其中计算“潜在拷贝数变异片段所在染色体的对数发生比”时,将所述染色体看作z值计算公式中的片段。P(affected|Z,f)和P(euploid|Z,f)分别为一定Z值和游离核酸比例下,该片段为CNVs和正常区域的后验概率。P(affected)和P(euploid)分别为该片段为CNVs或正常区域的先验概率。P(Z|affected,f)和P(Z|euploid,f)为在一定游离核酸比例下,该片段为CNV或正常区域的条件概率。
检测结果如下:
请参照图21,为待测样本的对数发生比(logRatio)的图像,即待测样本经过数据校正后,每条染色体每个窗口的读段数目与该样本全基因组范围内的平均读段数目的比值的logRatio值。
请参照图22,为9号染色体的logRatio曲线,其中横坐标为染色体9的索引值(index),纵坐标为该待测样本的logRaito值;图中的点表示该待测样本在9号染色体每个窗口的logRaito值;黑色线为经过二元分割算法获得的片段,其中位于0参考线下方的一条黑色片段为一个微缺失发生的区域。
请参照图23,为21号染色体的logRatio曲线,其中横坐标为染色体21的index值,纵坐标为该待测样本的logRaito值;图中的点表示该待测样本在21号染色体每个窗口的logRaito值;黑色线为经过二元分割算法获得的片段,其中位于0参考线上方的一条黑色片段为一个微重复发生的区域。
请参照图24,为18号染色体的logRatio曲线图,其中横坐标为染色体18的index值,纵坐标为该待测样本的logRaito值;图中的点表示该待测样本在18号染色体每个窗口的logRaito值;黑色线为经过二元分割算法获得的片段,其中位于0参考线上方的黑色片段为微重复发生的区域,可见该样本为18染色体3体。
请参照图25,为10号染色体的logRatio曲线图,其中横坐标为染色体10的index值,纵坐标为该待测样本的logRaito值;图中的点表示该待测样本在10号染色体每个窗口的logRaito值;黑色线为经过二元分割算法获得的片段,其中位于0参考线上方的黑色片段为微重复发生的区域,该样本在染色体10的拷贝数异常升高,但未达到非整倍体的阈值,检测结果为染色体10三体的嵌合体。
因此,上述检测出1例18号染色体三体;2例16号染色体三体;一例XO;3例染色体三体嵌合;8例染色体微缺失/重复,其中6例的微缺失/重复片段小于10M,最低为1.1M。以上检测结果均进行了羊水或脐带血测序验证,与本申请检测结果完全一致。
通过以上实例可知,本申请能检测更高精度的拷贝数变异,如1M以下的拷贝数变异;在更低的游离核酸比例下,如小于5%的游离核酸比例,准确检测出拷贝数变异。
本申请公开的染色体变异的检测方法及装置,其可以包括人类或动物疾病诊断用途和非诊断用途;以非诊断用途为例,本申请公开的染色体变异的检测方法及装置可以应用于科学研究, 此外还可以应用于植物染色体变异检测,其中植物染色体变异可以表现为植物的遗传性状变化。
以上应用了具体个例对本发明进行阐述,只是用于帮助理解本发明,并不用以限制本发明。对于本领域的一般技术人员,依据本发明的思想,可以对上述具体实施方式进行变化。

Claims (39)

  1. 一种染色体变异的检测方法,其特征在于,包括:
    (1)对含有核酸的待测样本进行测序,获得由若干测序数据构成的测序结果;
    (2)使用正常数据集对所述测序结果进行校正;
    (3)对校正后的测序结果进行分割,获得若干数据片段;以及
    (4)检测所述若干数据片段是否为拷贝数变异片段。
  2. 如权利要求1所述的方法,其特征在于,使用若干正常样本的测序数据建立所述正常数据集。
  3. 如权利要求1或2所述的方法,其特征在于,所述待测样本为外周血。
  4. 如权利要求3所述的方法,其特征在于,所述外周血为来自于孕妇的外周血。
  5. 如权利要求1所述的方法,其特征在于,所述测序为高通量测序。
  6. 如权利要求1所述的方法,其特征在于,所述核酸为DNA。
  7. 如权利要求1所述的方法,其特征在于,所述拷贝数变异为微缺失、微重复或其组合。
  8. 如权利要求2所述的方法,其特征在于,所述使用若干正常样本的测序数据建立正常数据集包括:
    (0-1)将参考基因组连续划分成若干第一窗口,并确定各第一窗口的比对能力值;
    (0-2)将参考基因组连续划分成若干第二窗口,确定各正常样本内GC含量与第二窗口深度的相关性,对于每一个所述第二窗口,利用所述第二窗口的GC含量对所述第二窗口的深度进行样本内样本间的校正;
    (0-3)将参考基因组连续划分成若干第三窗口,根据各正常样本间相同位置的第三窗口的平均深度值对各第三窗口的深度进行群体区域的校正;以及
    (0-4)将参考基因组连续划分成若干第四窗口,根据各所述第四窗口的深度建立一矩阵,根据所述矩阵对各第四窗口的深度进行校正。
  9. 如权利要求8所述的方法,其特征在于,步骤(0-1)包括:
    (0-1-1)将参考基因组打断成若干相同长度的读段,再将所述读段比对回所述参考基因组;
    (0-1-2)将所述参考基因组连续划分成若干所述第一窗口,其中所述第一窗口的长度大于所述读段的长度;
    (0-1-3)统计位于各第一窗口中的读段的数量,并将读段的数量小于一预定数量的第一窗口删除;和/或,计算各第一窗口中的重复区域比例,并将重复区域比例大于一预定比例的第一窗口删除;以及
    (0-1-4)对于参考基因组中各未被删除的第一窗口,计算所述未被删除的第一窗口的平均读段数量,并将平均读段数量分别除以各未被删除的第一窗口的读段数量,以分别获得各未被删除的第一窗口的比对能力值。
  10. 如权利要求8所述的方法,其特征在于,步骤(0-2)包括:
    (0-2-1)将所述各正常样本的测序数据比对到所述参考基因组中,并对各正常样本的读段进行比对能力值的校正;
    (0-2-2)将所述参考基因组连续划分成若干所述第二窗口,对于每个正常样本,统计其各 第二窗口的深度以及GC含量,获得每个正常样本内的GC含量与窗口深度的相关性;并对每个所述第二窗口,根据所述相关性与所述第二窗口的GC含量,利用回归模型,对所述第二窗口的深度进行样本内的校正;以及
    (0-2-3)对于进行样本内校正后的所有正常样本,统计所有正常样本的第二窗口的GC含量及深度,获得所有正常样本的整体的GC含量与窗口深度的相关性;并对每个所述第二窗口,根据所述相关性与所述第二窗口的GC含量,利用回归模型,对所述第二窗口的深度进行样本间的校正。
  11. 如权利要求10所述的方法,其特征在于,步骤(0-2-2)所述回归模型为LOESS回归模型。
  12. 如权利要求8所述的方法,其特征在于,步骤(0-3)包括:
    (0-3-1)将所述参考基因组连续划分成若干所述第三窗口,统计所有正常样本的每个相同位置的所述第三窗口深度的平均值及方差,并计算每个相同位置的所述第三窗口的CV值,将所述CV值大于一预定值的第三窗口删除;以及
    (0-3-2)计算所有未被删除的第三窗口的平均深度值,并用所述平均深度值对各未被删除的第三窗口的深度进行校正。
  13. 如权利要求12所述的方法,其特征在于,所述每个相同位置的所述第三窗口的CV值等于所述第三窗口深度的方差除以平均值。
  14. 如权利要求8所述的方法,其特征在于,步骤(0-4)包括:
    (0-4-1)将所述参考基因组连续划分成若干所述第四窗口,根据各所述第四窗口的深度建立一矩阵,并对所述矩阵进行主成分分析,获得所述矩阵的特征向量矩阵;以及
    (0-4-2)对每个所述正常样本进行主成分分析,将每个正常样本的前预设数量个主成分删除,再乘以所述特征向量矩阵的逆矩阵,获得主成分分析校正后的各所述窗口的深度。
  15. 如权利要求1或8所述的方法,其特征在于,步骤(2)包括:
    (2-1)将所述待测样本的测序数据比对到参考基因组中,对所述待测样本的各读段进行比对能力值的校正;
    (2-2)统计各所述第二窗口的深度以及GC含量,获得所述待测样本内的GC含量与窗口深度的相关性;并对每个所述第二窗口,根据所述相关性与所述第二窗口的GC含量,利用回归模型对所述第二窗口的深度进行样本内的校正;
    (2-3)根据正常样本中GC含量与第二窗口深度的相关性,利用回归模型对所述待测样本的经步骤(2-2)校正后的各第二窗口深度进行样本间的校正;
    (2-4)读取经步骤(2-3)校正后的待测样本的各第三窗口深度,根据正常样本的第三窗口的平均深度值对待测样本的各第三窗口的深度进行校正;以及
    (2-5)读取经步骤(2-4)校正后的待测样本的各第四窗口深度,根据各正常样本的第四窗口的深度建立的所述矩阵对待测样本的各第四窗口的深度进行校正。
  16. 如权利要求15所述的方法,其特征在于,步骤(2-2)所述回归模型为LOESS回归模型。
  17. 如权利要求15所述的方法,其特征在于,步骤(2-5)包括:
    (2-5-1)根据正常样本的各所述第四窗口的深度建立一矩阵,并对所述矩阵进行主成分分析,获得所述矩阵的特征向量矩阵;以及
    (2-5-2)读取经步骤(2-4)校正后的待测样本的各第四窗口深度,并将待测样本各第四窗口的深度乘以所述特征向量矩阵,获得待测样本的主成分,将待测样本的前预设数量个主成分删除,再乘以所述特征向量矩阵的逆矩阵,获得主成分分析校正后的各窗口的深度。
  18. 如权利要求1或8所述的方法,其特征在于,步骤(3)包括:
    (3-1)对校正后的测序结果进行分割获得若干具有相同拷贝数的片段;
    (3-2)对于每一个所述片段,计算所述片段的z值,其中z值=(待测样本所述片段的深度-正常样本在所述片段对应片段的平均深度)/正常样本在所述片段对应片段的方差;以及
    (3-3)将z值的绝对值大于一预定值的片段标记为潜在拷贝数变异片段。
  19. 如权利要求18所述的方法,其特征在于,步骤(3-3)所述预定值为3。
  20. 如权利要求1或8所述的方法,其特征在于,步骤(4)包括:
    (4-1)对于每一个所述潜在拷贝数变异片段,计算所述潜在拷贝数变异片段的对数发生比以及所述潜在拷贝数变异片段所在染色体的对数发生比;以及
    (4-2)当一潜在拷贝数变异片段的对数发生小于一预定值且其所在染色体的对数发生比大于一预定值时,将所述潜在拷贝数变异片段标记为拷贝数变异片段。
  21. 如权利要求20所述的方法,其特征在于,步骤(4-2)所述预定值为0。
  22. 一种染色体变异的检测装置,其特征在于,包括:
    待测样本测序单元,所述待测样本测序单元用于对含有核酸的待测样本进行测序,获得由若干测序数据构成的测序结果;
    待测样本校正单元,所述待测样本校正单元与待测样本测序单元相连,并且用于使用正常数据集对所述测序结果进行校正;
    分割单元,所述分割单元与所述待测样本校正单元相连,并用于对校正后的测序结果进行分割,获得若干数据片段;以及
    检测单元,所述检测单元与所述分割单元相连,并用于检测所述若干数据片段是否为拷贝数变异片段。
  23. 如权利要求22所述的装置,其特征在于,还包括正常数据集构建单元,所述正常数据集构建单元与待测样本校正单元相连,用于用若干正常样本的测序数据建立正常数据集。
  24. 如权利要求22所述的装置,其特征在于,所述待测样本为外周血。
  25. 如权利要求24所述的装置,其特征在于,所述外周血为来自于孕妇的外周血。
  26. 如权利要求22所述的装置,其特征在于,所述测序为高通量测序。
  27. 如权利要求22所述的装置,其特征在于,所述核酸为DNA。
  28. 如权利要求22所述的装置,其特征在于,所述拷贝数变异为微缺失、微重复或其组合。
  29. 如权利要求23所述的装置,其特征在于,所述正常数据集构建单元包括:
    参考基因对比能力确定单元,用于将参考基因组连续划分成若干第一窗口,并确定各第一窗口的比对能力值;
    正常样本相关性单元,所述正常样本相关性单元与所述参考基因对比能力确定单元相连,用于将参考基因组连续划分成若干第二窗口,确定各正常样本内GC含量与第二窗口深度的相关性,对于每一个所述第二窗口,利用所述第二窗口的GC含量对所述第二窗口的深度进行样本内与样本间的校正;
    群体区域校正单元,所述群体区域校正单元与所述正常样本相关性单元相连,用于将参考基因组连续划分成若干第三窗口,根据各正常样本间相同位置的第三窗口的平均深度值对各第三窗口的深度进行群体区域的校正;以及
    矩阵单元,所述矩阵单元与所述群体区域校正单元相连,用于将参考基因组连续划分成若干第四窗口,根据各所述第四窗口的深度建立一矩阵,根据所述矩阵对各第四窗口的深度进行校正。
  30. 如权利要求29所述的装置,其特征在于,所述参考基因对比能力确定单元包括:
    打断单元,用于将参考基因组打断成若干相同长度的读段,再将所述读段比对回所述参考基因组;
    第一窗口单元,所述第一窗口单元与所述打断单元相连,用于将所述参考基因组连续划分成若干所述第一窗口,其中所述第一窗口的长度大于所述读段的长度;
    第一删除单元,所述第一删除单元与所述第一窗口单元相连,用于统计位于各第一窗口中的读段的数量,并将读段的数量小于一预定数量的第一窗口删除;和/或,计算各第一窗口中的重复区域比例,并将重复区域比例大于一预定比例的第一窗口删除;以及
    第一比对能力校正单元,所述第一比对能力校正单元与所述第一删除单元相连,用于对于参考基因组中各未被删除的第一窗口,计算所述未被删除的第一窗口的平均读段数量,并将平均读段数量分别除以各未被删除的第一窗口的读段数量,以分别获得各未被删除的第一窗口的比对能力值。
  31. 如权利要求29所述的装置,其特征在于,所述正常样本相关性单元包括:
    第二比对能力校正单元,用于将所述各正常样本的测序数据比对到所述参考基因组中,并对各正常样本的读段进行比对能力值的校正;
    正常样本内窗口深度校正单元,所述正常样本内窗口深度校正单元与第二比对能力校正单元相连,用于将参考基因组连续划分成若干所述第二窗口,对于每个正常样本,统计其各第二窗口的深度以及GC含量,获得每个正常样本内的GC含量与窗口深度的相关性;并根据所述相关性与所述第二窗口的GC含量,利用回归模型对所述第二窗口的深度进行校正;以及
    正常样本间整体窗口深度校正单元,所述正常样本整体窗口深度校正单元与所述正常样本内窗口深度校正单元相连,用于对于进行样本内窗口深度校正后的所有正常样本,统计所有正常样本的第二窗口的GC含量及深度,获得所有正常样本的整体的GC含量与窗口深度的相关性;并对每个所述第二窗口,根据所述相关性与所述第二窗口的GC含量,利用回归模型,对所述第二窗口的深度进行样本间的校正。
  32. 如权利要求29所述的染色体变异的检测装置,其特征在于,所述群体区域校正单元包括:
    第二删除单元,用于将所述参考基因组连续划分成若干所述第三窗口,统计所有正常样本 的每个相同位置的所述第三窗口深度的平均值及方差,并计算每个相同位置的所述第三窗口的CV值,将所述CV值大于一预定值的第三窗口删除,其中所述每个相同位置的所述第三窗口的CV值等于所述第三窗口深度的方差除以平均值;以及
    第一深度校正单元,所述第一深度校正单元与所述第二删除单元相连,用于计算所有未被删除的所述第三窗口的平均深度值,并用所述平均深度值对各未被删除的所述第三窗口的深度进行校正。
  33. 如权利要求29所述的装置,其特征在于,所述矩阵单元包括:
    第一主成分分析单元,用于将所述参考基因组连续划分成若干所述第四窗口,根据各所述第四窗口的深度建立一矩阵,并对所述矩阵进行主成分分析,获得所述矩阵的特征向量矩阵;以及
    第二深度校正单元,所述第二深度校正单元与所述第一主成分分析单元相连,用于对每个所述正常样本进行主成分分析,将每个正常样本的前预设数量个主成分删除,再乘以所述特征向量矩阵的逆矩阵,获得主成分分析校正后的各所述窗口的深度。
  34. 如权利要求22或23所述的装置,其特征在于,所述待测样本校正单元,包括:
    第三比对能力校正单元,用于将所述待测样本的测序数据比对到参考基因组中,对所述待测样本的各读段进行比对能力值的校正;
    待测样本内窗口深度校正单元,所述待测样本内窗口深度校正单元与所述第三比对能力校正单元相连,用于统计各所述第二窗口的深度以及GC含量,获得所述待测样本内的GC含量与窗口深度的相关性;并对每个所述第二窗口,根据所述相关性与所述第二窗口的GC含量,利用回归模型,对所述第二窗口的深度进行样本内的校正;
    样本间校正单元,所述样本间校正单元与所述待测样本内窗口深度校正单元相连,用于根据正常样本中GC含量与第二窗口深度的相关性,利用回归模型对待测样本的经所述待测样本内窗口深度校正单元校正后的各第二窗口深度进行样本间的校正;
    第三深度校正单元,所述第三深度校正单元与所述样本间校正单元相连,用于读取经样本间校正单元校正后的待测样本的各第三窗口深度,根据正常样本的第三窗口的平均深度值对所述待测样本的各所述第三窗口的深度进行校正;以及
    第四深度校正单元,所述第四深度校正单元与所述第三深度校正单元相连,用于读取经第三深度矫正单元校正后的待测样本的各第四窗口深度,根据各正常样本的第四窗口的深度建立的所述矩阵对所述待测样本的各所述第四窗口的深度进行校正。
  35. 如权利要求34所述的装置,其特征在于,所述第四深度校正单元包括:
    矩阵建立单元,用于根据所述正常样本的各第四窗口的深度建立一矩阵,并对所述矩阵进行主成分分析,获得所述矩阵的特征向量矩阵;以及
    主成分校正深度单元,所述主成分校正深度单元与所述矩阵建立单元相连,用于读取经所述第三深度校正单元校正后的待测样本的各第四窗口深度,将所述待测样本各第四窗口的深度乘以所述特征向量矩阵,获得待测样本的主成分,将待测样本的前预设数量个主成分删除,再乘以所述特征向量矩阵的逆矩阵,获得主成分分析校正后的各窗口的深度。
  36. 如权利要求22或23所述的装置,其特征在于,所述分割单元包括:
    相同拷贝数单元,用于对经待测样本校正单元校正后的测序结果进行数据分割获得若干具有相同拷贝数的片段;
    z值计算单元,所述z值计算单元与所述相同拷贝数单元相连,用于对于每一个所述片段,计算所述片段的z值,其中z值=(待测样本所述片段的深度-正常样本在所述片段对应片段的平均深度)/正常样本在所述片段对应片段的方差;以及
    潜在拷贝数变异片段标记单元,所述潜在拷贝数变异片段标记单元与所述z值计算单元相连,用于将z值的绝对值大于一预定值的片段标记为潜在拷贝数变异片段。
  37. 如权利要求36所述的装置,其特征在于,所述预定值为3。
  38. 如权利要求22或23所述的装置,其特征在于,所述检测单元包括:
    对数发生比计算单元,用于对于每一个潜在拷贝数变异片段,计算该潜在拷贝数变异片段的对数发生比以及该该潜在拷贝数变异片段所在染色体的对数发生比;以及
    拷贝数变异片段确定单元,用于当一潜在拷贝数变异片段的对数发生小于一预定值且其所在染色体的对数发生比大于一预定值时,将该潜在拷贝数变异片段标记为拷贝数变异片段。
  39. 如权利要求38所述的装置,其特征在于,所述预定值为0。
PCT/CN2017/075858 2017-03-07 2017-03-07 一种染色体变异的检测方法及装置 WO2018161245A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2017/075858 WO2018161245A1 (zh) 2017-03-07 2017-03-07 一种染色体变异的检测方法及装置
CN201780085820.7A CN110268044B (zh) 2017-03-07 2017-03-07 一种染色体变异的检测方法及装置

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/075858 WO2018161245A1 (zh) 2017-03-07 2017-03-07 一种染色体变异的检测方法及装置

Publications (1)

Publication Number Publication Date
WO2018161245A1 true WO2018161245A1 (zh) 2018-09-13

Family

ID=63447180

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/075858 WO2018161245A1 (zh) 2017-03-07 2017-03-07 一种染色体变异的检测方法及装置

Country Status (2)

Country Link
CN (1) CN110268044B (zh)
WO (1) WO2018161245A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111916150A (zh) * 2019-05-10 2020-11-10 北京贝瑞和康生物技术有限公司 一种基因组拷贝数变异的检测方法和装置
CN112712853A (zh) * 2020-12-31 2021-04-27 北京优迅医学检验实验室有限公司 一种无创产前检测装置
CN113046430A (zh) * 2021-03-15 2021-06-29 北京阅微基因技术股份有限公司 一种染色体非整倍体数目异常的扩增组合物及其应用
CN114220481A (zh) * 2021-11-25 2022-03-22 深圳思勤医疗科技有限公司 基于全基因组测序完成待测样本的核型分析的方法、系统和计算机可读介质
CN115132271A (zh) * 2022-09-01 2022-09-30 北京中仪康卫医疗器械有限公司 一种基于批次内校正的cnv检测方法
CN115762633A (zh) * 2022-11-23 2023-03-07 哈尔滨工业大学 一种基于三代测序的基因组结构变异基因型校正方法

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994647A (zh) * 2022-04-25 2023-11-03 天津华大基因科技有限公司 用于分析变异检测结果的模型的构建方法
CN114792548B (zh) * 2022-06-14 2022-09-09 北京贝瑞和康生物技术有限公司 校正测序数据、检测拷贝数变异的方法、设备和介质

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102952877A (zh) * 2012-08-06 2013-03-06 深圳华大基因研究院 检测α珠蛋白基因拷贝数的方法和系统
CN104789686A (zh) * 2015-05-06 2015-07-22 安诺优达基因科技(北京)有限公司 检测染色体非整倍性的试剂盒和装置
CN105408496A (zh) * 2013-03-15 2016-03-16 夸登特健康公司 检测稀有突变和拷贝数变异的系统和方法

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102682224B (zh) * 2011-03-18 2015-01-21 深圳华大基因科技服务有限公司 检测拷贝数变异的方法和装置
CN104136628A (zh) * 2011-10-28 2014-11-05 深圳华大基因医学有限公司 一种检测染色体微缺失和微重复的方法
CN104221022B (zh) * 2012-04-05 2017-11-21 深圳华大基因股份有限公司 一种拷贝数变异检测方法和系统
CN104603284B (zh) * 2012-09-12 2016-08-24 深圳华大基因研究院 利用基因组测序片段检测拷贝数变异的方法
CN104745718B (zh) * 2015-04-23 2018-02-16 北京中仪康卫医疗器械有限公司 一种检测人类胚胎染色体微缺失和微重复的方法
CN105574361B (zh) * 2015-11-05 2018-11-02 上海序康医疗科技有限公司 一种检测基因组拷贝数变异的方法
CN105349678A (zh) * 2015-12-03 2016-02-24 上海美吉生物医药科技有限公司 一种染色体拷贝数变异的检测方法
CN106520940A (zh) * 2016-11-04 2017-03-22 深圳华大基因研究院 一种染色体非整倍体和拷贝数变异检测方法及其应用
CN108268752B (zh) * 2018-01-18 2019-02-01 东莞博奥木华基因科技有限公司 一种染色体异常检测装置

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102952877A (zh) * 2012-08-06 2013-03-06 深圳华大基因研究院 检测α珠蛋白基因拷贝数的方法和系统
CN105408496A (zh) * 2013-03-15 2016-03-16 夸登特健康公司 检测稀有突变和拷贝数变异的系统和方法
CN104789686A (zh) * 2015-05-06 2015-07-22 安诺优达基因科技(北京)有限公司 检测染色体非整倍性的试剂盒和装置

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111916150A (zh) * 2019-05-10 2020-11-10 北京贝瑞和康生物技术有限公司 一种基因组拷贝数变异的检测方法和装置
CN112712853A (zh) * 2020-12-31 2021-04-27 北京优迅医学检验实验室有限公司 一种无创产前检测装置
CN112712853B (zh) * 2020-12-31 2023-11-21 北京优迅医学检验实验室有限公司 一种无创产前检测装置
CN113046430A (zh) * 2021-03-15 2021-06-29 北京阅微基因技术股份有限公司 一种染色体非整倍体数目异常的扩增组合物及其应用
CN113046430B (zh) * 2021-03-15 2022-02-01 北京阅微基因技术股份有限公司 一种染色体非整倍体数目异常的扩增组合物及其应用
CN114220481A (zh) * 2021-11-25 2022-03-22 深圳思勤医疗科技有限公司 基于全基因组测序完成待测样本的核型分析的方法、系统和计算机可读介质
CN114220481B (zh) * 2021-11-25 2023-09-08 深圳思勤医疗科技有限公司 基于全基因组测序完成待测样本的核型分析的方法、系统和计算机可读介质
CN115132271A (zh) * 2022-09-01 2022-09-30 北京中仪康卫医疗器械有限公司 一种基于批次内校正的cnv检测方法
CN115762633A (zh) * 2022-11-23 2023-03-07 哈尔滨工业大学 一种基于三代测序的基因组结构变异基因型校正方法
CN115762633B (zh) * 2022-11-23 2024-01-23 哈尔滨工业大学 一种基于三代测序的基因组结构变异基因型校正方法

Also Published As

Publication number Publication date
CN110268044B (zh) 2022-08-02
CN110268044A (zh) 2019-09-20

Similar Documents

Publication Publication Date Title
WO2018161245A1 (zh) 一种染色体变异的检测方法及装置
CN108573125B (zh) 一种基因组拷贝数变异的检测方法及包含该方法的装置
IL249095B1 (en) Detection of subchromosomal aneuploidy in the fetus and variations in the number of copies
CN112669901A (zh) 基于低深度高通量基因组测序的染色体拷贝数变异检测装置
CN108920899B (zh) 一种基于目标区域测序的单个外显子拷贝数变异预测方法
CN108256296B (zh) 数据处理装置
Concordet et al. A new approach for the determination of reference intervals from hospital-based data
JP6623400B2 (ja) 染色体異数性を測定するためのキット、装置及び方法
JP2008511058A (ja) コンピュータシステムを用いるデータ品質および/または部分異数染色体の決定
CN107133491B (zh) 一种获取胎儿游离dna浓度的方法
WO2021232388A1 (zh) 确定胚胎细胞染色体中预定位点碱基类型的方法及其应用
CN112365927B (zh) Cnv检测装置
JP7467504B2 (ja) 染色体異数性を判定するためおよび分類モデルを構築するための方法およびデバイス
WO2022110039A1 (zh) 一种胎儿染色体异常的检测方法与系统
KR101678962B1 (ko) 대규모 병렬형 게놈서열분석 방법을 이용한 비침습적 산전검사 장치 및 방법
JP6929778B2 (ja) 着床前遺伝子スクリーニングにおける一塩基多型を用いた品質管理方法
WO2024011929A1 (zh) 检测胎儿染色体非整倍体异常的方法、装置及存储介质
CN108229099B (zh) 数据处理方法、装置、存储介质及处理器
CN110191964B (zh) 确定生物样本中预定来源的游离核酸比例的方法及装置
WO2016176846A1 (zh) 检测染色体非整倍性的试剂盒、装置和方法
CN114267409A (zh) 无创产前基因检测测序数据的分析方法、装置及存储介质
CN109979535B (zh) 一种胚胎植入前遗传学筛查装置
CN110970089B (zh) 胎儿浓度计算的预处理方法、预处理装置及其应用
CN113710818A (zh) 病毒相关联的癌症风险分层
CN109686401B (zh) 一种识别异源低频基因组信号唯一性的方法及其应用

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17900163

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17900163

Country of ref document: EP

Kind code of ref document: A1