CNV detection device
This application is a divisional application of Chinese patent application No. 201811623637.0, filed on December 28, 2018 and entitled 'CNV detection device'.
Technical Field
The invention relates to a non-invasive CNV detection device and a method for non-invasively detecting CNV by using the same.
Background
Copy number variations (hereinafter abbreviated as CNVs) are a clinically important class of structural variation. Most microdeletions and microduplications are polymorphic, but some are pathogenic or lethal. Identifying pathogenic or lethal CNVs before birth and intervening early can therefore reduce neonatal defects.
At present, non-invasive prenatal testing (hereinafter referred to as NIPT screening) sequences and analyzes maternal peripheral blood on a next-generation sequencing (NGS) platform; analytical methods filter out system noise and enhance the fetal signal so that chromosomal aneuploidy can be detected. Non-invasive CNV detection builds on NIPT by dividing the chromosomes into windows and performing signal amplification and a significance test independently for each window.
Since most of the signal in the sequencing data comes from the mother, the fetal signal is easily masked when a maternal CNV or placental mosaicism is present. In addition, when the experimental system is unstable, GC bias or system noise can easily skew the result, producing false positive or false negative results. Fetal concentration is also an important variable affecting the determination: the higher the concentration, the higher the confidence in the result.
Disclosure of Invention
In view of the above-described drawbacks of the prior art, an object of the present invention is to provide a detection apparatus and a detection method with higher detection sensitivity for CNVs.
Specifically, the object of the present invention is achieved by the following means.
1. A copy number variation detection apparatus, comprising:
the sequencing data acquisition module is used for sequencing based on the acquired maternal peripheral blood free DNA to obtain chromosome sequencing data of a sample to be detected and chromosome sequencing data from a background library sample;
a windowing fragmentation module, which is used for comparing the sequencing data to a reference genome sequence, cutting the sequencing data into windows with equal length, enabling an intersection to exist between every two adjacent windows, and counting window parameters of each window, wherein the window parameters comprise read, Unique Read (UR), capability, genomic GC and/or unique reads GC;
a module for detecting CNV based on reads, which calculates Z value based on each window, calculates CNV probability, and estimates fetal concentration by CNV probability, thereby judging whether the sample to be detected is suspected to be positive CNV, and eliminating interference of maternal CNV;
a module for detecting CNV based on unique reads number, wherein a sliding step length m is regulated according to the detection resolution, the module calculates average reads (Mr) and average GC (Mgc) based on adjacent m windows, and constructs a window specificity linear regression model, thereby judging whether the sample to be detected is suspected to be CNV;
and the model result summarizing module is used for comparing, analyzing and outputting a final result based on the output results of the two modules for detecting the CNV.
2. The detection apparatus according to item 1, wherein the module for detecting CNV based on reads number includes the following sub-modules:
a data pre-processing and normalization module for GC correction of the reads to eliminate inter-library differences; and carrying out homogenization correction after carrying out GC correction so as to enable all the samples to be detected and the background library samples to have comparability;
the Z test signal amplification module calculates the mean value and the variance of each window by using the background library samples and calculates the Z value of each window through Z test;
the chromosome slicing module is used for slicing the chromosome by using the continuity window Z value, combining the continuity windows with similar states into a to-be-detected interval and judging the attributes of the interval including dup, del and normal;
a module for calculating a confidence interval of the Z value, which is used for calculating the median of the Z value of a continuous window existing in the same interval of the background library sample aiming at each interval to be detected merged by the chromosome slicing module, calculating and setting a confidence interval range according to the mean value and the variance of the median distribution, judging whether the interval to be detected falls into the confidence interval or not, and judging the interval which does not fall into the confidence interval as a potential CNV interval;
a module for calculating the CNV probability, which calculates the summation of reads of windows in the interval in the same interval of the background library samples aiming at the potential CNV interval to obtain probability density distribution, calculates the significance probability according to the reads of the CNV interval to be detected, and carries out negative logarithm conversion on the significance probability and compares the significance probability with a given threshold value;
and a module for calculating the CNV concentration, wherein the module is used for fitting the UR and the real GC in the same interval of the background library sample aiming at the potential CNV interval to determine the UR and the GC in the potential CNV interval, calculating the CNV concentration by using the UR and the GC in the potential CNV interval, and judging whether the sample to be detected is suspected to be maternal CNV or placenta chimerism according to the comparison between the calculated CNV concentration and the real fetal concentration.
3. The detection apparatus according to item 1 or 2, wherein the module for detecting CNV based on unique reads number includes the following sub-modules:
a MiniModel construction module, which carries out pretreatment for eliminating the difference of data amount among different libraries, after the pretreatment, the step length m is regulated according to the resolution, every adjacent m windows are combined into a unit to calculate the average reads (Mr) and the average GC (Mgc), the distribution of Mr ' and Mgc ' in the same interval is calculated by using a background library sample, the Mr ' and Mgc ' are fitted, the residual error is calculated according to the theoretical value corresponding to the measured values Mr and Mgc, the attributes of windows including dup, del and normal are judged according to the residual error, the weight is calculated according to the correlation R, Mgc of Mr ' and Mgc ' and the standard deviation sd of the background data Mr ', and the confidence coefficient is judged;
a chromosome segmentation and slicing module, which utilizes a given model or algorithm to identify adjacent regions with significant differences from normal distribution of two different mean values, so as to perform segmentation and slicing processing on a chromosome and identify the boundary position of a CNV;
and the significance evaluation module randomly extracts the same number of window values from other areas of the chromosome of the sample to be tested aiming at the section interval, and repeats the process so as to determine the significance of the real value in the background distribution.
4. The detection apparatus according to item 3, wherein in the MiniModel building module, calculating the residual error according to the theoretical values corresponding to the measured values Mr and Mgc and determining the confidence level further includes:
for each unit, calculating the standard deviation of all background library samples Mr ', the Pearson correlation coefficient of Mr' and Mgc ', the quantile of the sample Mgc to be detected distributed on the background library sample Mgc', and integrating the standard deviation, the correlation coefficient and the quantile to calculate the weight, thereby judging the confidence.
5. The detection device according to any one of items 1 to 4, wherein, in the model result summarizing module, if the output results of both the module for detecting CNV based on the reads number and the Z value and the module for detecting CNV based on the UR number and the mean value report a target CNV interval for the sample to be detected, and the coincidence rate of the target CNV interval is determined to exceed a set threshold, the coincident region is reported as a CNV; if the results of the two modules for the interval to be detected are not consistent, a false positive result is output.
6. The detection apparatus according to any one of items 3 to 5, wherein the process is repeated 10000 times in the significance evaluation module.
7. A computer-readable storage medium having a computer program stored thereon, the computer program being configured to perform the steps of:
a sequencing data acquisition step, wherein sequencing is carried out on the basis of the acquired maternal peripheral blood free DNA so as to obtain chromosome sequencing data of a sample to be detected and chromosome sequencing data from a background library sample;
a windowing fragmentation step, which is used for comparing the sequencing data to a reference genome sequence, cutting the sequencing data into windows with equal length, enabling an intersection to exist between every two adjacent windows, and counting window parameters of each window, wherein the window parameters comprise read, Unique Read (UR), capability, genomic GC and/or unique reads GC;
a step of detecting CNV based on the reads number, which is to calculate a Z value based on each window, calculate the CNV probability, estimate the fetal concentration by using the CNV probability, thereby judging whether the sample to be detected is suspected to be positive CNV or not and eliminating the interference of maternal CNV;
a step of detecting CNV based on the unique reads number, in which the length m of a sliding window is specified according to the resolution, average reads (Mr) and average GC (Mgc) are calculated based on adjacent m windows, and a window-specific linear regression model is constructed, thereby judging whether the sample to be detected is suspected to be CNV;
and a model result summarizing step, namely comparing, analyzing and outputting a final result based on the output results of the two modules for detecting the CNV.
8. The computer-readable storage medium according to item 7, having stored thereon a computer program, wherein the computer program is further configured to perform the steps of:
a data pre-processing and normalization step for GC correction of the reads to eliminate inter-library differences; and carrying out homogenization correction after carrying out GC correction so as to enable all the samples to be detected and the background library samples to have comparability;
a Z test signal amplification step, which uses a background library sample to calculate the mean value and the variance of each window and calculates the Z value of each window through the Z test;
a chromosome slicing step, which is to slice the chromosome by using a continuity window Z value, combine the continuity windows with similar states into a section to be detected and judge the attributes of the section including dup, del and normal;
calculating a Z value confidence interval, namely calculating the median of Z values of continuous windows existing in the same interval of the background library samples aiming at each interval to be detected merged by the chromosome slicing module, calculating a 95% confidence interval range according to the mean value and the variance of the median distribution, judging whether the interval to be detected falls into the confidence interval or not, and judging the interval which does not fall into the confidence interval as a potential CNV interval;
calculating the CNV probability, namely calculating the sum of reads of windows in the same interval of the background library sample aiming at the potential CNV interval to obtain probability density distribution, calculating the significance probability according to the reads of the CNV interval to be detected, and performing negative logarithm conversion on the significance probability and comparing the significance probability with a given threshold value;
and calculating the CNV concentration, namely fitting the potential CNV interval by using UR and real GC in the same interval of the background library sample, determining UR and GC in the potential CNV interval, calculating the CNV concentration by using the UR and GC in the potential CNV interval, and judging whether the sample to be detected is suspected to be maternal CNV or placental chimerism according to the comparison between the calculated CNV concentration and the real fetal concentration.
9. The computer-readable storage medium according to item 7, having stored thereon a computer program, wherein the computer program is further configured to perform the steps of:
a MiniModel construction step, which carries out pretreatment for eliminating the difference of data amount among different libraries, after the pretreatment, the length m of a sliding window is regulated according to the resolution, every adjacent m windows are combined into a unit to calculate the average reads (Mr) and the average GC (Mgc), the Mr ' and Mgc ' distribution of the same interval are calculated by using a background library sample, the Mr ' and Mgc ' are fitted, the residual error is calculated according to the theoretical value corresponding to the measured values Mr and Mgc, the attributes including dup, del and normal of the window are judged according to the residual error, the weight is calculated according to the relevance R, Mgc of the Mr ' and Mgc ' and the standard difference sd of the background data Mr ', and the confidence coefficient is judged;
a chromosome segmentation and slicing step, wherein a given model or algorithm is used for identifying adjacent regions which are normally distributed from two different mean values and have significant difference, so that the chromosome is segmented and sliced, and the boundary position of the CNV is identified;
and a significance evaluation step of randomly extracting the same number of window values from other regions of the chromosome of the sample to be tested for the section, and repeating the process to determine the significance of the true value in the background distribution.
10. The computer-readable storage medium according to item 7, having stored thereon a computer program, wherein the computer program is further configured to perform the steps of:
if the output results of both the module for detecting CNV based on the reads number and the Z value and the module for detecting CNV based on the UR number and the mean value report a target CNV interval for the sample to be detected, and the coincidence rate of the target CNV interval is determined to exceed a set threshold, the coincident region is reported as a CNV; if the results of the two modules for the interval to be detected are not consistent, a false positive result is output.
11. A method for detecting copy number variation, comprising the steps of:
a sequencing data acquisition step, wherein sequencing is carried out on the basis of the acquired maternal peripheral blood free DNA so as to obtain chromosome sequencing data of a sample to be detected and chromosome sequencing data from a background library sample;
a step of window fragmentation, which is to compare the sequencing data to a reference genome sequence, cut the sequencing data into windows with equal length, enable an intersection to exist between every two adjacent windows, and count window parameters of each window, including read, Unique Read (UR), capability, genomic GC and/or unique reads GC;
a step of detecting CNV based on reads, in which Z value is calculated based on each window, CNV probability is calculated, and fetal concentration is estimated by using the CNV probability, thereby judging whether the sample to be detected is suspected to be positive CNV, and eliminating the interference of maternal CNV;
a step of detecting CNV based on the unique reads number, in which average reads (Mr) and average GC (Mgc) are calculated based on 10 adjacent windows, and a window-specific linear regression model is constructed, thereby judging whether the sample to be detected is suspected to be CNV;
and a model result summarizing step, wherein a final result is output by comparison and analysis based on output results of the two modules for detecting the CNV.
12. The detection method according to item 11, wherein the step of detecting CNVs based on reads number includes the steps of:
a data pre-processing and normalization step for GC correction of the reads to eliminate inter-library differences; and carrying out homogenization correction after carrying out GC correction so as to enable all the samples to be detected and the background library samples to have comparability;
a Z test signal amplification step, which uses a background library sample to calculate the mean value and the variance of each window and calculates the Z value of each window through the Z test;
a chromosome slicing step, which is to slice the chromosome by using a continuity window Z value, combine the continuity windows with similar states into a section to be detected and judge the attributes of the section including dup, del and normal;
calculating a Z value confidence interval, namely calculating the median of Z values of continuous windows existing in the same interval of the background library samples aiming at each interval to be detected merged by the chromosome slicing module, calculating a 95% confidence interval range according to the mean value and the variance of the median distribution, judging whether the interval to be detected falls into the confidence interval or not, and judging the interval which does not fall into the confidence interval as a potential CNV interval;
calculating the CNV probability, namely calculating the sum of reads of windows in the same interval of the background library sample aiming at the potential CNV interval to obtain probability density distribution, calculating the significance probability according to the reads of the CNV interval to be detected, and performing negative logarithm conversion on the significance probability and comparing the significance probability with a given threshold value;
and calculating the CNV concentration, namely fitting the potential CNV interval by using UR and real GC in the same interval of the background library sample, determining UR and GC in the potential CNV interval, calculating the CNV concentration by using the UR and GC in the potential CNV interval, and judging whether the sample to be detected is suspected to be maternal CNV or placental chimerism according to the comparison between the calculated CNV concentration and the real fetal concentration.
13. The detection method according to item 11 or 12, wherein the step of detecting CNV based on unique reads number includes the steps of:
a MiniModel construction step, which carries out pretreatment for eliminating the difference of data amount among different libraries, after the pretreatment, the length m of a sliding window is regulated according to the resolution, every adjacent m windows are combined into a unit to calculate the average reads (Mr) and the average GC (Mgc), the Mr ' and Mgc ' distribution of the same interval are calculated by using a background library sample, the Mr ' and Mgc ' are fitted, the residual error is calculated according to the theoretical value corresponding to the measured values Mr and Mgc, the attributes including dup, del and normal of the window are judged according to the residual error, the weight is calculated according to the relevance R, Mgc of the Mr ' and Mgc ' and the standard difference sd of the background data Mr ', and the confidence coefficient is judged;
a chromosome segmentation and slicing step, wherein a given model or algorithm is used for identifying adjacent regions which are normally distributed from two different mean values and have significant difference, so that the chromosome is segmented and sliced, and the boundary position of the CNV is identified;
and a significance evaluation step of randomly extracting the same number of window values from other regions of the chromosome of the sample to be tested for the section, and repeating the process to determine the significance of the true value in the background distribution.
14. The detection method according to item 13, wherein, in the MiniModel construction step, calculating the residual error and determining the confidence level according to the theoretical values corresponding to the measured values Mr and Mgc further includes:
for each unit, calculating the standard deviation of all background library samples Mr ', the Pearson correlation coefficient of Mr' and Mgc ', the quantile of the sample Mgc to be detected distributed on the background library sample Mgc', and integrating the standard deviation, the correlation coefficient and the quantile to calculate the weight, thereby judging the confidence.
15. The detection method according to any one of items 11 to 14, wherein, in the model result summarizing step, if the output results of both the module for detecting CNV based on the reads number and the Z value and the module for detecting CNV based on the UR number and the mean value report a target CNV interval for the sample to be detected, and the coincidence rate of the target CNV interval is determined to exceed a set threshold, the coincident region is reported as a CNV; if the results of the two modules for the interval to be detected are not consistent, a false positive result is output.
16. The detection method according to any one of items 13 to 15, wherein the process is repeated 10000 times in a significance assessment module.
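The significance evaluation recited in items 3, 6, 13 and 16 (randomly drawing the same number of window values from other regions of the chromosome and repeating the process 10000 times) can be sketched as a resampling test. This is only a minimal sketch under stated assumptions: the text does not specify the test statistic, so the mean of the window values, a two-sided comparison, sampling without replacement and the add-one correction are all assumptions, and the function name and toy values are hypothetical.

```python
import random


def permutation_p_value(segment_values, other_values, n_repeats=10_000, seed=0):
    """Empirical significance of a segment against windows drawn from other regions.

    Assumed statistic: absolute shift of the mean from the background mean.
    """
    rng = random.Random(seed)
    k = len(segment_values)
    background_mean = sum(other_values) / len(other_values)
    observed_shift = abs(sum(segment_values) / k - background_mean)
    hits = 0
    for _ in range(n_repeats):
        # Draw the same number of window values from the other regions.
        drawn = rng.sample(other_values, k)
        if abs(sum(drawn) / k - background_mean) >= observed_shift:
            hits += 1
    # Add-one correction so the estimate is never exactly zero (assumption).
    return (hits + 1) / (n_repeats + 1)


# Toy background: window values between -1.0 and 1.0 (hypothetical data).
other_values = [((i * 37) % 21 - 10) / 10 for i in range(200)]

p_shifted = permutation_p_value([3.0, 3.2, 2.9, 3.1, 3.0], other_values)
p_typical = permutation_p_value([0.0, 0.1, -0.1, 0.0, 0.1], other_values)
print(p_shifted, p_typical)
```

A segment whose values lie far outside the background yields a very small empirical probability, while a segment that resembles the background does not.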
In the invention, N negative samples are used to establish a background library, and the sample to be detected (i.e., the fetus) is compared with the background library for significance testing. In the device and the method, the sample to be detected and the background library undergo the same preprocessing, which mainly comprises the following steps. Chromosome windowing: each chromosome is cut into windows of equal length, with an intersection between every two adjacent windows. LOWESS GC correction: each chromosome to be tested is GC-corrected together with chromosome 1 and/or chromosome 2. Chromosomes 1 and 2 are relatively stable and have a relatively high volume ratio and diversity, so they can serve as a reference for effectively evaluating deletion or duplication of the chromosome to be detected. In addition, using chromosomes 1 and 2 as a reference eliminates, to some extent, differences in data amount between libraries. For each window, the mean and variance across the N negative samples in the background library are calculated, and the signal is amplified through three Z-tests. Finally, a window with a Z value greater than 1 is considered duplicated, a window with a Z value less than -1 is considered deleted, and all other windows are regarded as normal fluctuation. Windows of the same category are merged, the fetal concentration is calculated from the UR of the merged windows, and false positive results caused by data fluctuation are filtered out by combining the Z value and the fetal concentration. All CNVs are matched against the DGV and OMIM databases, and the corresponding annotation information, including polymorphism, pathogenicity and the like, is output.
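The per-window Z-test and the classification at Z values of 1 and -1 described above can be sketched as follows. This is a minimal illustration, not the implementation of the invention: it performs a single Z-test pass rather than the three passes mentioned, it omits GC correction and the fetal-concentration filter, and the function names and toy read counts are hypothetical.

```python
from itertools import groupby
from statistics import mean, stdev


def window_z_scores(sample_counts, background_counts):
    """Z value of each window against the same window in the background samples."""
    z_values = []
    for i, count in enumerate(sample_counts):
        background = [counts[i] for counts in background_counts]
        mu = mean(background)
        sd = stdev(background)
        z_values.append((count - mu) / sd if sd > 0 else 0.0)
    return z_values


def classify(z, threshold=1.0):
    """Label a window as dup, del or normal using the thresholds from the text."""
    if z > threshold:
        return "dup"
    if z < -threshold:
        return "del"
    return "normal"


def merge_windows(z_values, threshold=1.0):
    """Merge consecutive windows of the same category into (label, first, last)."""
    labels = [classify(z, threshold) for z in z_values]
    segments = []
    index = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, index, index + length - 1))
        index += length
    return segments


# Four hypothetical negative samples, five windows each.
background = [
    [100, 100, 100, 100, 100],
    [110, 110, 110, 110, 110],
    [90, 90, 90, 90, 90],
    [100, 100, 100, 100, 100],
]
sample = [101, 120, 125, 99, 80]

z_values = window_z_scores(sample, background)
segments = merge_windows(z_values)
print(segments)
```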
In the invention, the whole chromosome is cut into windows, so that the influence of local microdeletions or microduplications on the whole chromosome can be effectively avoided. The windows are of equal length, and the length can be calculated from the sequencing depth; for example, the number of free DNA fragments aligned to each window is not less than the reciprocal of the lower sequencing concentration limit. In the present invention, each window preferably has a length of 100 kb, with an intersection of 50 kb between every two adjacent windows.
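For illustration, the preferred windowing (equal-length windows of 100 kb with a 50 kb intersection between neighbours) might be generated as in the sketch below. The function name, the half-open coordinate convention and the choice to drop a trailing partial window are assumptions made for this example.

```python
def make_windows(chrom_length, window_size=100_000, overlap=50_000):
    """Return half-open (start, end) windows of equal length along a chromosome."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be non-negative and smaller than window_size")
    step = window_size - overlap
    windows = []
    start = 0
    # A trailing partial window is dropped so that all windows have equal length.
    while start + window_size <= chrom_length:
        windows.append((start, start + window_size))
        start += step
    return windows


windows = make_windows(chrom_length=350_000)
for start, end in windows:
    print(start, end)
```

With these defaults each window starts 50 kb after the previous one, so every pair of adjacent windows shares 50 kb.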
In the present invention, m may be any integer. The smaller m is, the higher the resolution, but the stronger the fluctuation of each combined bin and the lower the stability. The larger m is, the lower the resolution, but the more stable each combined bin and the more significant the correlation between unique reads and GC. For example, m may be any integer from 5 to 20, corresponding to a resolution of 0.25 to 1 Mb.
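The combination of m adjacent windows into a unit, and the relation between m and resolution, can be sketched as follows. The sketch assumes non-overlapping units and a 50 kb step between window starts (100 kb windows with a 50 kb intersection); under that assumption m multiplied by the step reproduces the stated range of 0.25 to 1 Mb for m from 5 to 20. The function names and toy values are hypothetical.

```python
from statistics import mean


def combine_units(unique_reads, unique_gc, m):
    """Combine every m adjacent windows into one unit and return (Mr, Mgc) pairs."""
    units = []
    for start in range(0, len(unique_reads) - m + 1, m):
        mr = mean(unique_reads[start:start + m])
        mgc = mean(unique_gc[start:start + m])
        units.append((mr, mgc))
    return units


def resolution_bp(m, step_bp=50_000):
    """Approximate resolution in base pairs for units of m windows (assumed relation)."""
    return m * step_bp


units = combine_units(
    unique_reads=[10, 20, 30, 40, 50, 60],
    unique_gc=[0.40, 0.40, 0.50, 0.50, 0.60, 0.60],
    m=2,
)
print(units)
print(resolution_bp(5), resolution_bp(20))
```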
In the present invention, the above-mentioned set threshold is used to evaluate the consistency of the two CNV detection modules. Because the segmentation modules of the two CNV detection modules differ, the CNV boundaries they identify may deviate somewhat. The higher the set threshold, the stricter the requirement on consistency between the two modules; the lower the threshold, the more relaxed the requirement. In the present invention, the threshold is preferably set to 50%.
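A possible form of this consistency check is sketched below. The text does not define how the coincidence rate is computed, so the sketch assumes it is the overlap length divided by the length of the union of the two intervals; the function name, return format and coordinates are likewise hypothetical.

```python
def summarize_calls(interval_a, interval_b, threshold=0.5):
    """Compare the CNV intervals reported by the two modules.

    Assumed coincidence rate: overlap length / union length.
    """
    a_start, a_end = interval_a
    b_start, b_end = interval_b
    start, end = max(a_start, b_start), min(a_end, b_end)
    overlap = max(0, end - start)
    union = (a_end - a_start) + (b_end - b_start) - overlap
    rate = overlap / union if union > 0 else 0.0
    if rate > threshold:
        # Report only the coincident region as the CNV.
        return "CNV", (start, end), rate
    return "false positive", None, rate


consistent = summarize_calls((1_000_000, 2_000_000), (1_200_000, 2_100_000))
inconsistent = summarize_calls((1_000_000, 2_000_000), (1_900_000, 3_000_000))
print(consistent)
print(inconsistent)
```

Other definitions, such as overlap divided by the shorter interval, would be more lenient towards boundary deviations; which one applies is not stated in the text.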
In the present invention, the set confidence interval may be a value or range commonly employed by those skilled in the art, such as 95% or 99%.
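The confidence-interval check performed by the module for calculating the Z value confidence interval (item 2) can be sketched as follows. The sketch assumes a normal approximation, so that a 95% interval is the mean plus or minus 1.96 standard deviations of the background medians (2.576 would correspond to 99%), and it assumes the interval to be detected is represented by the median of its own window Z values. The function name and toy Z values are hypothetical.

```python
from statistics import mean, median, stdev


def is_potential_cnv(sample_z, background_z, z_crit=1.96):
    """Flag an interval whose median Z falls outside the background confidence interval.

    sample_z: Z values of the windows in the interval for the sample to be detected.
    background_z: one list of window Z values per background sample, same interval.
    """
    background_medians = [median(z) for z in background_z]
    mu = mean(background_medians)
    sd = stdev(background_medians)
    lower, upper = mu - z_crit * sd, mu + z_crit * sd
    outside = not (lower <= median(sample_z) <= upper)
    return outside, (lower, upper)


background_z = [
    [0.1, -0.2, 0.0],
    [0.3, 0.1, 0.2],
    [-0.1, -0.3, -0.2],
    [0.0, 0.2, -0.1],
]

flagged, bounds = is_potential_cnv([2.1, 2.5, 1.9], background_z)
not_flagged, _ = is_potential_cnv([0.1, 0.0, -0.1], background_z)
print(flagged, not_flagged, bounds)
```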
In the present invention, CNV boundaries are identified by chromosome segmentation, relying on a model or algorithm that segments sequence data into normal distributions with different means. Because the mean of a CNV region differs significantly from that of the adjacent chromosomal regions, the boundary information of the CNV can be identified using the given model or algorithm described above.
Non-invasive CNV detection differs from NIPT chromosomal aneuploidy detection: when experimental conditions are unstable, system noise such as data fluctuation is more likely to appear in the results as false positives. One of the main features of a noisy system is bias in the true GC of the reads, and this type of data fluctuation cannot be removed by genomic GC correction.
As described above, the device according to the present invention detects microdeletions and microduplications of the autosomes and the X chromosome of a sample based on the NIPT platform. The invention provides a non-invasive CNV detection device with higher detection sensitivity, which can reduce the probability of false positives or false negatives and greatly improve the accuracy and sensitivity of detecting fetal CNVs.
Drawings
Various other advantages and benefits of the present invention will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. Also, like parts are designated by like reference numerals throughout the drawings.
Fig. 1 shows the flow of data analysis performed by the detection apparatus of the present invention.
Fig. 2 is a graph showing the results of CNV determination using the method of comparative example 1.
Fig. 3 is a graph showing the results of CNV determination by the method of example 1.
Detailed Description
The present invention relates to the following definitions.
High-throughput sequencing: high-throughput sequencing, also known as "Next-generation" sequencing technology, is used to sequence hundreds of thousands to millions of DNA molecules in parallel at a time.
Window (sliding window): generally refers to a fixed length region on the genome.
Background library: a sample library consisting of N (generally N ≥ 20) healthy human samples.
Reads: the plural of read; short sequence fragments generated by a high-throughput sequencing platform.
Unique read: a read that aligns to a unique position in the genome. During sequencing, some reads align to multiple positions of the genome at the same time; unique reads are what remains after these multiply aligned reads are filtered out of all non-duplicate reads.
Capability: in some windows, short sequences are less unique, probably because of repeats from large blocks of heterochromatin or more complex biological causes. The capability parameter is used to calculate the efficiency of each window and is compared with a threshold of 0.625; windows below the threshold are excluded from the calculation.
Genomic GC: this parameter represents the genomic GC for each window, which is identical in all libraries. In addition, in model one described below, this parameter is used for GC correction in order to correct for differences in reads due to GC bias.
Reads GC: GC corresponding to all reads in each window.
Unique reads GC: the GC corresponding to the unique reads in each window. In model one below, it is used to calculate the concentration of the CNV; in model two below, it is used to fit the background data for a data point P synthesized from 10 consecutive windows, thereby calculating the residual for P.
Dup: duplication, a duplicated region, representing the presence of 3 copies of the target CNV.
Del: deletion, a deleted region, representing the presence of a single copy of the target CNV.
Normal: represents the normal 2 copies.
True GC (also referred to as real GC): defined relative to the native genomic GC. The true GC is the GC corresponding to unique reads; it is the sequence GC information actually observed under the sequencing process and experimental environment.
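To make the use of unique reads GC in model two more concrete, the sketch below fits a straight line to background data and computes the residual of a data point P. It is a hedged illustration only: the text states that the background Mr' and Mgc' are fitted and a residual is calculated, but regressing Mr on Mgc by ordinary least squares is an assumption, as are the function names and toy values. No residual threshold for dup, del or normal is given in the text, so none is applied here.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def unit_residual(sample_mgc, sample_mr, background_mgc, background_mr):
    """Residual of the measured Mr against the value expected at the measured Mgc."""
    slope, intercept = fit_line(background_mgc, background_mr)
    expected_mr = slope * sample_mgc + intercept
    return sample_mr - expected_mr


# Background values of the same unit across four hypothetical negative samples.
background_mgc = [0.40, 0.42, 0.44, 0.46]
background_mr = [100.0, 102.0, 104.0, 106.0]

residual = unit_residual(0.43, 110.0, background_mgc, background_mr)
print(residual)
```

A positive residual means the unit has more unique reads than the background predicts at its GC, and a negative residual means fewer.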
The invention is based on the NIPT platform of low-depth whole genome sequencing, and is used for detecting microdeletions and microduplications of the autosomes and the X chromosome of a sample.
In one embodiment, the copy number variation detecting apparatus of the present invention includes:
the device comprises a sequencing data acquisition module, a windowing fragmentation module, a module for detecting CNV based on all reads, a module for detecting CNV based on unique reads and a model result summarizing module.
First, the sequencing data acquisition module performs sequencing on the acquired maternal peripheral blood free DNA to obtain chromosome sequencing data of the sample to be tested and chromosome sequencing data from the background library samples. The module is used for extraction, amplification, library construction and SE40 sequencing of the mixed DNA in maternal peripheral blood. The reads are then aligned to the chromosomes by an information analysis method so that the chromosome information can be analyzed. The extraction, amplification, library construction and sequencing of the mixed DNA in maternal peripheral blood can use methods commonly used in the field.
In this embodiment, the number of the background library samples is not fixed, and can be determined according to different time periods, different reagents and different experimental conditions. For example, the background library sample comprises more than 1000 negative samples, preferably more than 2000 negative samples, preferably more than 3000 negative samples, preferably more than 3500 negative samples, and more preferably, for example, 4000 negative samples.
For the windowing fragmentation module, the module is used for aligning the sequencing data to a reference genome sequence, cutting the sequencing data into windows with equal length, enabling an intersection to exist between every two adjacent windows, and counting window parameters of each window, wherein the window parameters comprise read, Unique Read (UR), capability and/or unique reads GC.
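The windowing described above can be illustrated with a minimal sketch. The window length of 100 kb and the overlap of 50 kb are the values used in Example 1; the function names make_windows and count_reads and the toy read positions are hypothetical and serve only as an illustration, not as the implementation of the module.

```python
def make_windows(chrom_length, window=100_000, step=50_000):
    """Equal-length half-open windows; neighbours overlap by window - step."""
    return [(s, s + window) for s in range(0, chrom_length - window + 1, step)]


def count_reads(read_starts, windows):
    """Count aligned read start positions that fall inside each window."""
    return [sum(1 for p in read_starts if start <= p < end)
            for start, end in windows]


# Toy chromosome of 250 kb with four aligned read start positions.
windows = make_windows(250_000)
counts = count_reads([10, 60_000, 100_000, 240_000], windows)
```

Because adjacent windows overlap, a read located in the shared region is counted in both windows.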
In the present invention, there is no limitation on the reference genomic sequence, and any known reference sequence of the human genome can be used as long as it is ensured that the same set of sequences is used for all samples to be aligned. In a specific embodiment, the reference genomic sequence is the hg19 reference sequence.
The module for detecting CNV based on the number of all reads is used to execute model one described below and comprises the following sub-modules:
a data preprocessing and normalization module, which performs GC correction on all reads to eliminate inter-library differences and then performs normalization correction so that all samples to be tested and the background library samples are comparable;
a Z-test signal amplification module, which calculates the mean and variance of each window using the background library samples and calculates the Z value of each window by Z-test;
a chromosome slicing module, which slices the chromosome using the Z values of contiguous windows, merges contiguous windows with similar states into an interval to be tested, and determines the attribute of the interval (dup, del or normal);
a module for calculating the confidence interval of the Z value, which, for each interval to be tested merged by the chromosome slicing module, calculates the median of the Z values of the contiguous windows in the same interval of each background library sample, calculates the 95% confidence interval from the mean and variance of the distribution of medians, determines whether the interval to be tested falls within the confidence interval, and designates an interval that does not fall within it as a potential CNV interval;
a module for calculating the CNV probability, which, for a potential CNV interval, sums all reads of the windows in the same interval of each background library sample to obtain a probability density distribution, calculates the significance probability from all reads of the CNV interval to be tested, and compares the negative logarithm of the significance probability with a given threshold;
and a module for calculating the CNV concentration, which, for a potential CNV interval, fits the UR against the real GC in the same interval of the background library samples, determines the UR and GC in the potential CNV interval, calculates the CNV concentration from them, and determines, by comparing the calculated CNV concentration with the true fetal concentration, whether the sample to be tested is suspected of maternal CNV or placental mosaicism.
Model one
Model one comprises the following steps:
Step one, data preprocessing and normalization, which comprises the following substeps:
(1) GC correction
In model one, GC correction of the reads is performed using the lowess algorithm. To eliminate differences between libraries and to evaluate chromosome fluctuation objectively, each chromosome to be tested is corrected together with chromosomes 1 and 2. Abnormalities of chromosomes 1 and 2 have a low incidence, and these two chromosomes cover a wide GC range, which increases the stability of the lowess correction. The smoothing coefficient f is set to 0.67. The correction uses high-quality reads, i.e., unique reads/(capability +1) > -0.625, and the reads of the low-quality windows are then estimated using the corrected overall mean and variance.
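As an illustration of a lowess-type GC correction with f = 0.67, the following is a simplified sketch: a local linear regression with tricube weights and without the robustness iterations of the full lowess algorithm. How the fitted trend is applied to the reads is not specified in this description; rescaling each window to the overall median is an assumption, and the names lowess_fit and gc_correct are hypothetical. In practice the input would pool the windows of the chromosome to be tested with those of chromosomes 1 and 2.

```python
import math
from statistics import median


def lowess_fit(x, y, f=0.67):
    """Local linear fit with tricube weights (no robustness iterations)."""
    n = len(x)
    k = max(2, math.ceil(f * n))            # neighbours used in each local fit
    fitted = []
    for xi in x:
        dist = [abs(xj - xi) for xj in x]
        h = sorted(dist)[k - 1] or 1e-12    # bandwidth: k-th nearest distance
        w = [(1 - min(d / h, 1.0) ** 3) ** 3 for d in dist]
        sw = sum(w)
        mx = sum(wi * xj for wi, xj in zip(w, x)) / sw
        my = sum(wi * yj for wi, yj in zip(w, y)) / sw
        sxx = sum(wi * (xj - mx) ** 2 for wi, xj in zip(w, x))
        sxy = sum(wi * (xj - mx) * (yj - my) for wi, xj, yj in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (xi - mx))
    return fitted


def gc_correct(gc, reads, f=0.67):
    """Assumed correction: rescale each window so the GC trend becomes flat."""
    trend = lowess_fit(gc, reads, f)
    target = median(reads)
    return [r * target / t if t > 0 else r for r, t in zip(reads, trend)]


# Toy windows whose read counts depend linearly on GC content.
gc = [0.30 + 0.03 * i for i in range(10)]
reads = [100 + 200 * g for g in gc]
corrected = gc_correct(gc, reads)
```

After correction, the toy windows lie at the same level regardless of their GC content, which is the stated purpose of this substep.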
(2) Normalization correction
To make all samples to be tested comparable with the reference samples, model one estimates the variance from the GC-corrected chromosome window reads (with outliers removed) and divides the window reads of the chromosome to be tested by the corresponding standard deviation, thereby scaling the variance to 1.
Here, the purpose of GC correction is to correct the GC bias inherent in the sequencing process, so that after correction the reads at different positions on the chromosome tend toward the same level. Chromosomes 1 and 2 are used as background and corrected together with the chromosome to be tested in order to eliminate inter-library variation: although the data volume differs between libraries, the relative relationship between chromosomes within a library is stable, so using chromosomes 1 and 2 as references can eliminate the data-volume differences between libraries to a certain extent.
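The normalization correction can be sketched as follows. The rule for removing outliers is not given in this description; the cutoff based on the median absolute deviation (MAD) and the name normalize_windows are assumptions made for illustration.

```python
from statistics import median, stdev


def normalize_windows(reads, k=3.0):
    """Scale window reads by the standard deviation estimated without outliers."""
    med = median(reads)
    mad = median([abs(r - med) for r in reads])
    cutoff = k * 1.4826 * mad               # assumed MAD-based outlier rule
    kept = [r for r in reads if abs(r - med) <= cutoff] if mad > 0 else list(reads)
    sd = stdev(kept)
    return [r / sd for r in reads], sd


scaled, sd = normalize_windows([98, 100, 102, 101, 99, 100, 500])
```

In this toy input the window with 500 reads is excluded from the variance estimate, so that a single extreme window does not inflate the standard deviation used for scaling.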
Step two, amplifying the signal by Z-test
The mean and variance of each window are calculated using the background library samples, and the Z value of each window is calculated by Z-test. Each round of the Z-test converges the data and thus yields a smaller variance, thereby amplifying the signal; the Z-test process is repeated three times.
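A minimal sketch of the window-wise Z-test is given below. The description states that the test is repeated three times with converging data but does not state how the convergence is performed; trimming background values that lie more than three standard deviations from the mean before re-estimating is one possible reading and is an assumption here, as is the name window_z.

```python
from statistics import mean, stdev


def window_z(sample, background, rounds=3, clip=3.0):
    """Z value per window; the background is trimmed (assumption) each round."""
    z_values = []
    for w, value in enumerate(sample):
        ref = [b[w] for b in background]    # background values for window w
        for _ in range(rounds):
            mu, sd = mean(ref), stdev(ref)
            kept = [r for r in ref if abs(r - mu) <= clip * sd]
            if len(kept) == len(ref) or len(kept) < 3:
                break                        # converged, or too few values left
            ref = kept
        mu, sd = mean(ref), stdev(ref)
        z_values.append((value - mu) / sd if sd > 0 else 0.0)
    return z_values


# One window, 20 background samples, one of which is an extreme value.
background = [[v] for v in [99] * 9 + [101] * 9 + [100, 200]]
z = window_z([104], background)
```

In this toy case the extreme background value inflates the variance in the first round; once it is trimmed, the variance shrinks and the same sample value produces a much larger Z value, which is the amplification effect described above.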
Step three, slicing the chromosome with a sliding window
To distinguish CNV intervals such as dup and del from normal intervals on the chromosome to be tested, model one first slices the chromosome using the Z values of contiguous windows. Here, using a sliding window method, contiguous windows with similar states are merged into one interval, and the attribute of the interval (dup, del or normal) is then determined.
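The merging of contiguous windows with similar states can be sketched as a run-length grouping. The Z cutoff of 3 used to assign a state to each window is a hypothetical value, and the functions label and merge_windows are illustrative names; the actual sliding window procedure may differ.

```python
from itertools import groupby


def label(z, cutoff=3.0):
    """Assign a state to a single window from its Z value (assumed cutoff)."""
    if z > cutoff:
        return "dup"
    if z < -cutoff:
        return "del"
    return "normal"


def merge_windows(z_values, cutoff=3.0):
    """Merge runs of adjacent windows with the same state into intervals."""
    intervals, start = [], 0
    for state, run in groupby(z_values, key=lambda z: label(z, cutoff)):
        n = len(list(run))
        intervals.append((state, start, start + n - 1))  # inclusive indices
        start += n
    return intervals


intervals = merge_windows([0.2, -0.5, 4.1, 3.8, 5.0, 0.1, -3.5, -4.2, 0.3])
```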
Step four, calculating a confidence interval of the Z value
For each interval after slicing, we calculate, for each background library sample, the median of the Z values of the contiguous windows in the same interval, and estimate the 95% confidence interval from the mean and variance of the distribution of these medians. If the interval to be tested falls within the confidence interval, it is considered to be the normal 2 copies; otherwise, it may be a potential CNV interval.
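The confidence interval check can be sketched as follows, assuming that the medians of the background samples are approximately normally distributed so that the 95% interval is the mean plus or minus 1.96 standard deviations. The names interval_ci and is_potential_cnv are hypothetical.

```python
from statistics import mean, median, stdev


def interval_ci(background_z, first, last, level_z=1.96):
    """95% interval of the per-sample median Z value in windows first..last."""
    meds = [median(sample[first:last + 1]) for sample in background_z]
    mu, sd = mean(meds), stdev(meds)
    return mu - level_z * sd, mu + level_z * sd


def is_potential_cnv(sample_z, background_z, first, last):
    """True if the median Z value of the test interval lies outside the interval."""
    low, high = interval_ci(background_z, first, last)
    return not (low <= median(sample_z[first:last + 1]) <= high)


# Five toy background samples with three windows each.
background_z = [[-1, -1, -1], [-0.5, -0.5, -0.5], [0, 0, 0],
                [0.5, 0.5, 0.5], [1, 1, 1]]
flag_high = is_potential_cnv([4.2, 3.9, 4.0], background_z, 0, 2)
flag_normal = is_potential_cnv([0.2, 0.3, 0.4], background_z, 0, 2)
```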
Step five, calculating the CNV probability
For a potential CNV interval, the reads of the windows in the same interval are summed for each background library sample to obtain a probability density distribution; the significance probability is calculated from the reads of the CNV interval to be tested, and its negative logarithm is compared with a threshold.
Here, the significance probability P is calculated, transformed by the negative logarithm, and compared with the threshold. The threshold is defined by the lower detection limit of the positive samples, i.e., it is the threshold that ensures that the CNV intervals of true positive samples are reported.
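A minimal sketch of the probability calculation follows. It assumes that the background sums are modelled by a normal density, that the significance probability is two-sided, and that the logarithm is taken to base 10; none of these choices is stated in this description, and the name neg_log_p is hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def neg_log_p(sample_reads, background_reads, first, last):
    """-log10 of the two-sided tail probability of the interval read sum."""
    sums = [sum(b[first:last + 1]) for b in background_reads]
    dist = NormalDist(mean(sums), stdev(sums))   # assumed normal background
    observed = sum(sample_reads[first:last + 1])
    tail = min(dist.cdf(observed), 1 - dist.cdf(observed))
    p = max(2 * tail, 1e-300)                    # avoid log10(0)
    return -math.log10(p)


# Five toy background samples with two windows each.
background_reads = [[95, 100], [100, 100], [105, 100], [100, 95], [100, 105]]
score_null = neg_log_p([100, 100], background_reads, 0, 1)
score_dup = neg_log_p([110, 110], background_reads, 0, 1)
```

The resulting score would then be compared with a threshold derived from the lower detection limit of known positive samples.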
Step six, calculating the CNV concentration
For the interval where the CNV is located, a fitting line is calculated using the UR and real GC in the same interval of the background library samples, and the concentration is calculated using the UR and GC of the potential CNV. The CNV concentration is compared with the true fetal concentration: if it is significantly lower than the fetal concentration, the CNV is considered a possible false positive caused by data fluctuation or noise; if it is significantly higher than the fetal concentration, maternal CNV or mosaicism is suspected.
Here, the true fetal concentration can be determined as follows: for a male fetus, the true fetal concentration is calculated from the Y chromosome content; for a female fetus, the true concentration can be estimated from the gestational week and the maternal weight, and this estimation method does not affect the identification of maternal CNV.
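Step six can be illustrated with the sketch below. The conversion of the deviation from the fitting line into a concentration is not given in this description; the sketch assumes a heterozygous fetal CNV, for which a fetal concentration f changes the read count of the region by about f/2, so that the concentration is taken as twice the relative deviation. The tolerance factors 0.5 and 2.0 used to decide "significantly lower" and "significantly higher" and all function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through the background points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def cnv_concentration(ur, gc, background_ur, background_gc):
    """Assumed conversion: twice the relative deviation from the fitted UR."""
    slope, intercept = fit_line(background_gc, background_ur)
    expected = slope * gc + intercept
    return 2 * abs(ur - expected) / expected


def interpret(cnv_conc, fetal_conc, low=0.5, high=2.0):
    """Compare the CNV concentration with the fetal concentration."""
    if cnv_conc < low * fetal_conc:
        return "possible false positive (fluctuation or noise)"
    if cnv_conc > high * fetal_conc:
        return "suspected maternal CNV or mosaicism"
    return "consistent with fetal CNV"


background_gc = [0.40, 0.42, 0.44, 0.46]
background_ur = [1000, 1010, 1020, 1030]
conc = cnv_concentration(1065.75, 0.43, background_ur, background_gc)
call = interpret(conc, fetal_conc=0.10)
```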
The module for detecting CNV based on the number of unique reads is used to execute model two described below and comprises the following sub-modules:
a MiniModel construction module, which performs preprocessing to eliminate differences in data volume between libraries; after preprocessing, specifies the sliding window length m according to the resolution and, for every m adjacent windows, merges the windows and calculates the average reads (Mr) and average GC (Mgc); calculates the distributions of Mr' and Mgc' in the same interval using the background library samples and fits Mr' against Mgc'; calculates the residual between the measured Mr and the theoretical value corresponding to Mgc; determines the attribute of the windows (dup, del or normal) from the residual; and calculates a weight from the correlation R of Mr' and Mgc', Mgc, and the standard deviation sd of the background data Mr', to determine the confidence;
a chromosome segmentation and slicing module, which uses a given model or algorithm to identify adjacent regions that differ significantly and follow normal distributions with two different means, thereby segmenting and slicing the chromosome and identifying the boundary positions of a CNV;
specifically, the module can use the HaarSeg model to slice the chromosome and identify chromosome intervals with the same copy number; the parameter BreaksFdrQ of the model is calculated adaptively by the model, i.e., it is converged stepwise with a specified step length until the results of two consecutive slicing cycles are consistent, at which point the model is stable, i.e., the number of slices no longer changes;
and a significance evaluation module, which, for each sliced interval, randomly draws the same number of window values from other regions of the chromosome of the sample to be tested and repeats the process, for example, 10000 times, to determine the significance of the true value within the background distribution.
Model two
Model two comprises the following steps:
Step one, constructing a MiniModel
For the chromosome to be tested, to eliminate differences in data volume between libraries, the reads of each window are divided by the median of the chromosome 1 window reads. After preprocessing, the sliding window length m is specified according to the resolution; for every m adjacent windows, the windows are merged and the average reads (Mr) and average GC (Mgc) are calculated. Meanwhile, the distributions of Mr' and Mgc' in the same interval are calculated using the background library samples and fitted with a linear regression model. The residual between the measured Mr and the theoretical value corresponding to Mgc is then calculated: the larger the residual, the more likely the m windows belong to dup; the smaller the residual, the more likely they belong to del; the closer the residual is to 0, the more likely they are the normal 2 copies. Finally, a weight is calculated from the correlation R of Mr' and Mgc', Mgc, and the standard deviation sd of the background data Mr'; the higher the weight, the higher the confidence.
In detail, we first divide the number of unique reads of every window by the average number of unique reads of chromosome 1, eliminating the difference in data volume between samples. Then, taking every 10 adjacent windows as a unit, we calculate the mean of the corrected unique reads (Mr) in the sample to be tested and the mean GC content (Mgc) of the corresponding region. Similarly, we calculate Mr' and Mgc' of the same region for each background library sample. From the Mr' and Mgc' vectors calculated from the background library samples, a fit line of Mr against Mgc for the target region is obtained by regression analysis, and the residual between the observed value and the theoretical value is converted into a concentration value, thereby separating the fetal signal from the mixed signal. However, owing to the limitations of low-data-volume sequencing and the fragment bias of DNA in the sequencing process, the unique reads are not uniformly distributed on the chromosome. This means that calculating the residual of each unit directly from the fit line is not equally fair to all units. Therefore, for each unit we additionally calculate the standard deviation of Mr' over all background library samples, the Pearson correlation coefficient of Mr' and Mgc', and the quantile of the Mgc of the sample to be tested within the distribution of the background Mgc', and calculate the weight by combining these three variables. The larger the standard deviation, the smaller the correlation coefficient, or the closer the quantile is to the boundary, the lower the sequencing quality of the region corresponding to the unit or the weaker the correlation between unique reads and GC; the confidence is therefore lower and the resulting weight smaller, so that the influence of low-confidence units on the surrounding regions is eliminated.
Conversely, units with high confidence receive a large weight and therefore have a large influence on the result.
All fragmented regions in step one are classified as dup (duplication), del (deletion) or normal; dup and del are finally reported as CNV. The fitting of the Mr' and Mgc' distributions is an analysis of the reference samples in the background library; that is, Mr' and Mgc' of the same window interval are calculated using the reference samples.
For example, 1000 reference samples yield 1000 Mr' values and the corresponding 1000 Mgc' values in the same interval. Plotting these 1000 data points with Mgc' on the horizontal axis and Mr' on the vertical axis gives the scatter distribution of the background, from which a fit line can be obtained; any position on the fit line represents the theoretical value of Mr' corresponding to Mgc'.
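The MiniModel for a single unit can be sketched as follows. The regression, the residual, and the three ingredients of the weight (standard deviation of Mr', Pearson correlation of Mr' and Mgc', quantile of Mgc within the background Mgc') follow the description above. The formula that combines the three ingredients into one weight is not given in this description; the product used here is purely hypothetical and only reproduces the stated monotonic behaviour. The name minimodel_unit is also hypothetical.

```python
from statistics import stdev


def minimodel_unit(mr, mgc, bg_mr, bg_mgc):
    """Residual and confidence weight of one unit of m merged windows."""
    n = len(bg_mr)
    mx, my = sum(bg_mgc) / n, sum(bg_mr) / n
    sxx = sum((x - mx) ** 2 for x in bg_mgc)
    syy = sum((y - my) ** 2 for y in bg_mr)
    sxy = sum((x - mx) * (y - my) for x, y in zip(bg_mgc, bg_mr))
    slope = sxy / sxx
    intercept = my - slope * mx
    residual = mr - (slope * mgc + intercept)   # > 0 hints dup, < 0 hints del
    r = sxy / (sxx * syy) ** 0.5                # Pearson correlation R
    sd = stdev(bg_mr)                           # spread of the background Mr'
    q = sum(g <= mgc for g in bg_mgc) / n       # quantile of Mgc in Mgc'
    # Hypothetical combination: the weight shrinks with larger sd, smaller |R|
    # and a quantile closer to the boundary (0 or 1).
    weight = abs(r) * (1 - abs(2 * q - 1)) / (1 + sd)
    return residual, weight


bg_mgc = [0.40, 0.41, 0.42, 0.43, 0.44]
bg_mr = [0.98, 0.99, 1.00, 1.01, 1.02]
residual, weight = minimodel_unit(1.05, 0.42, bg_mr, bg_mgc)
_, weight_edge = minimodel_unit(1.05, 0.44, bg_mr, bg_mgc)
```

In this toy example a unit whose Mgc lies at the edge of the background distribution receives a smaller weight than a unit whose Mgc lies near its centre.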
Step two, chromosome segmentation and slicing
Model two uses the HaarSeg model to slice the chromosome. The parameter BreaksFdrQ is calculated adaptively by the model, i.e., it is converged stepwise with a specified step length until the results of two consecutive slicing cycles are consistent, at which point the model is considered stable.
The HaarSeg model is an analytical model for array CGH analysis; it segments chromosomes and identifies chromosome intervals with the same copy number. The larger BreaksFdrQ is, the higher the model resolution and the more slices; conversely, the lower the resolution and the fewer slices. The number of slices changes as BreaksFdrQ changes; once the number of slices no longer changes between two adjacent cycles, the model is considered stable. The exception is the case of a single slice, in which the number of slices is unaffected by different BreaksFdrQ values. For the HaarSeg model, reference may be made to, for example: http:// webe. technique. ac. il/Sites/peoples/yonina eldar/Info/software/haarseg. htm.
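The adaptive choice of BreaksFdrQ can be sketched as a loop that changes the parameter stepwise until two consecutive slicing results agree. The starting value, the halving step and the direction of the search are assumptions, since the description only refers to a specified step length. The function toy_segment is a simple stand-in used to make the sketch runnable; it is not the HaarSeg algorithm, and in practice the segment argument would wrap a HaarSeg implementation.

```python
import math


def adapt_breaks_fdr_q(segment, signal, q_start=1e-3, factor=0.5, max_iter=50):
    """Change q stepwise until two consecutive segmentations are identical."""
    q = q_start
    previous = segment(signal, q)
    for _ in range(max_iter):
        q_next = q * factor                  # assumed step rule
        current = segment(signal, q_next)
        if current == previous:              # stable: slices no longer change
            return q, previous
        q, previous = q_next, current
    return q, previous


def toy_segment(signal, q):
    """Stand-in segmenter (NOT HaarSeg): break where the jump exceeds a
    threshold that grows as q shrinks, so a larger q gives more slices."""
    threshold = 0.5 * -math.log10(q)
    return [i for i in range(1, len(signal))
            if abs(signal[i] - signal[i - 1]) > threshold]


signal = [0, 0, 0, 0, 1.6, 1.6, 5, 5, 5, 5, 0, 0]
q_final, breakpoints = adapt_breaks_fdr_q(toy_segment, signal)
```

In the toy signal the small step at index 4 is reported as a breakpoint with the starting value but disappears after one step, after which the breakpoints remain unchanged and the loop stops.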
Step three, significance evaluation
For each sliced interval, the same number of window values is randomly drawn from other regions of the chromosome to be tested, and the process is repeated 10000 times to estimate the significance of the true value within the background distribution.
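The significance evaluation by repeated random sampling can be sketched as follows. Using the absolute value of the mean as the test statistic and adding one to numerator and denominator of the empirical probability are assumptions; the name empirical_p and the toy values are hypothetical.

```python
import random


def empirical_p(values, first, last, n_iter=10_000, seed=0):
    """Empirical significance of the interval first..last within one chromosome."""
    rng = random.Random(seed)
    segment = values[first:last + 1]
    others = values[:first] + values[last + 1:]   # windows outside the interval
    observed = abs(sum(segment) / len(segment))
    hits = 0
    for _ in range(n_iter):
        draw = rng.sample(others, len(segment))   # same number of windows
        if abs(sum(draw) / len(draw)) >= observed:
            hits += 1
    return (hits + 1) / (n_iter + 1)


# 40 toy window values with an elevated interval at windows 10 to 14.
values = [0.1, -0.2, 0.0, 0.3, -0.1] * 8
values[10:15] = [3.0, 3.2, 2.9, 3.1, 3.0]
p_cnv = empirical_p(values, 10, 14)
p_null = empirical_p(values, 0, 4)
```

The elevated interval is never matched by a random draw from the remaining windows and therefore receives a very small empirical probability, whereas an ordinary interval does not.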
As described above, model one counts all reads, whereas model two counts unique reads.
The model result summarizing module compares and analyzes the output results of the two modules for detecting CNV and outputs a final result.
Summarizing the results of the two models
According to the output results of the two models, if the target CNV interval is reported by both models and the coincidence rate exceeds 50%, the coincident region is reported as a CNV. Otherwise, the results for the interval to be tested are inconsistent between the two models, and the interval may be a false positive.
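The summarizing rule can be sketched as an overlap test between the intervals reported by the two models. The denominator of the coincidence rate is not defined in this description; the sketch assumes a reciprocal overlap, i.e., the overlapping length relative to the longer of the two intervals. The name consensus and the coordinates are hypothetical.

```python
def consensus(a, b, min_rate=0.5):
    """Return the coincident region of intervals a and b, or None."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    if end <= start:
        return None                               # no overlap at all
    longer = max(a[1] - a[0], b[1] - b[0])
    rate = (end - start) / longer                 # assumed reciprocal overlap
    return (start, end) if rate > min_rate else None


# Intervals as (start, end) positions reported by model one and model two.
reported = consensus((20_000_000, 25_000_000), (21_000_000, 26_000_000))
rejected = consensus((0, 10), (8, 30))
```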
Examples
The present invention will be described more specifically with reference to the following examples, but the present invention is not limited to these examples.
In the following examples and comparative examples, peripheral blood of a pregnant woman submitted for testing at a hospital in Beijing in January 2017 was used. Clinical examination indicated a low risk of CNV, and subsequent follow-up showed that a normal infant without CNV had been delivered.
Comparative example 1
The sample was sequenced to obtain the chromosome sequencing data of the sample to be tested and the chromosome sequencing data of the background library samples.
The sample was analyzed by the method described in "Statistical Approach to Decreasing the Error Rate of Noninvasive Prenatal Aneuploid Detection caused by Maternal Copy Number Variation" (published online 2015 Nov 4, doi:10.1038/srep16106, PMCID: PMC4632076); following the procedure described in that document, the analysis results shown in FIG. 2 were obtained. Based on the analysis results, the sample was judged to carry a duplicated fragment on the long arm of chromosome 15.
The basis for this judgment is as follows: all windows were normalized so that the normal two-copy regions were consistent with the background library signal and the residuals followed a normal distribution with a mean of 0. Taking the 95% confidence interval as the threshold, contiguous windows above the threshold tend to be multi-copy, and contiguous windows below the threshold tend to be single-copy. The chromosome was sliced by the HaarSeg algorithm (for the HaarSeg algorithm, see https://academic.oup.com/bioinformatics/article/24/16/i139/199827); the front end of the long arm of chromosome 15 was significantly above the threshold and was therefore highly suspected to be a microduplication CNV region.
Example 1
The sample was sequenced to obtain the chromosome sequencing data of the sample to be tested and the chromosome sequencing data of the background library samples.
The sequencing data of Example 1 were cut into equal-length windows of 100 kb, with an overlap of 50 kb between every two adjacent windows, and the window parameters of each window were counted, the window parameters including reads, unique reads (UR), capability, genomic GC and/or unique reads GC;
CNV detection based on reads was performed: the Z value was calculated for each window obtained, the CNV probability was calculated, and the fetal concentration was estimated using the CNV probability, thereby determining whether the sample to be tested is suspected of a positive CNV and eliminating the interference of maternal CNV. The analysis result of this step is shown in the model one panel of FIG. 3. Model one performed forward and backward successive difference calculation, combined with wavelet analysis for smoothing and noise reduction, identified the potential CNV boundaries, and evaluated the significance of each potential CNV region; comparison between samples showed that the signal at the front end of the long arm of chromosome 15 was not significant, so the region was judged to be the normal two copies.
CNV detection based on unique reads was performed: the module calculated the average reads (Mr) and average GC (Mgc) for every 10 adjacent windows and constructed a window-specific linear regression model to determine whether the sample to be tested is suspected of CNV. The analysis result of this step is shown in the model two panel of FIG. 3. Model two extracted the fetal signal using unique reads, sliced the chromosome into regions with the HaarSeg model, and adaptively defined a threshold according to the fluctuation within the sample; the front end of the long arm of chromosome 15 did not exceed the threshold and was therefore regarded as signal fluctuation, and the region was judged to be the normal two copies.
The results were summarized: the output results of the two modules for detecting CNV were compared and analyzed to output the final result. Because both models gave a negative result, the slightly elevated signal on the long arm of chromosome 15 was attributed to system noise rather than a true microduplication, and the sample was judged negative.
For the specific operation of each step, see the scheme described above in this specification.
As can be seen from FIG. 3, chromosome 15 of the above sample was judged to have a normal karyotype by the method of Example 1, which was confirmed to be consistent with the actual outcome.
It can be seen that the method of the present invention utilizes multiple calibration and filtering criteria, greatly reducing the false positive rate.