Disclosure of Invention
Therefore, the present invention provides a method and an apparatus for detecting genomic copy number variation, which can accurately detect copy number variation including microdeletion/microduplication.
In a first aspect, the present invention provides a method for detecting copy number variation, comprising the steps of:
(1) obtaining a genome sequencing sequence of a sample to be detected;
(2) aligning the sequencing sequence to a human genome reference sequence and determining the position of unique alignment to the genome reference sequence;
(3) dividing the genome reference sequence into equal-length windows, and counting the number of uniquely-compared sequencing sequences falling into each window to obtain the effective data volume of each window;
(4) carrying out dynamic data correction on the effective data volume of each window to obtain the corrected effective data volume of each window;
(5) standardizing the corrected effective data quantity to obtain an effective depth value CR of each window;
(6) filtering noise by using a Fused Lasso algorithm, and identifying potential copy number variation areas through the constraint of a difference term;
(7) calculating a copy number value (SCN) in the potential copy number variation region according to the following formula, and comparing the SCN with a reference range of the copy number value to obtain an accurate copy number variation detection result;
SCN=CR*×CNNorm,
wherein, CR*For effective depth values in the potential copy number variation region, CNNormIs the theoretical copy number of the potential copy number variation region in a negative sample, which is 2 for autosomes and female X chromosomes and 1 for male sex chromosomes.
In one embodiment, the genomic sequencing sequence of the test sample is from second generation high throughput sequencing platforms single-ended sequencing or double-ended sequencing, such as from Illumina, NextSeq, NovaSeq, and any other high throughput sequencing platform known in the art.
In one embodiment, "human genome reference sequence" refers to a standard human genome reference sequence in the NCBI database, which may be, for example, hg18, NCBI Build 36; hg19, NCBI Build 37. Human genome reference sequences can be obtained in the genetic data of NCBI, Ensembl and UCSC.
In one embodiment, alignment of the sequencing result sequence to the human genome reference sequence can be performed using algorithms or software known to those skilled in the art. Examples of such algorithms or software include, but are not limited to: BLAST, BLAT, MAQ, SOAP, Bowtie, BWA, SSAHA, ELAND. In one embodiment, the alignment can remove regions of the genomic reference sequence where repetitive sequences are present. In one embodiment, a non-fault tolerant alignment scheme is used and no empty base gaps are allowed.
In one embodiment, step (3) comprises dividing the genomic reference sequence into contiguous windows in units of equal length over the regions that are uniquely aligned in position (i.e., non-repetitive regions). The length of the window can be determined as desired by one skilled in the art. For example, the genomic reference sequence may be divided into contiguous windows of 15Kbp, 20Kbp, 25Kbp, 30Kbp, 35Kbp, and the like.
In one embodiment, the "dynamic data correction" in step (4) includes GC correction, alignment ratio correction, and data amount correction. As used herein, "GC correction" refers to a dynamic GC correction coefficient based on the ratio of the median of the effective data volume for all windows of a sample to the median of the effective data volume for all windows in the sample having the same GC content as the current window. And multiplying the correction coefficient by the original data amount of each window in the sample to respectively obtain the effective data amount after GC correction of each window. The GC correction can effectively correct GC preference in the data, thereby ensuring the accuracy of the detection result. As used herein, "alignment correction" refers to a dynamic alignment correction factor based on the ratio of the median of the effective data amount for all windows of a sample to the median of the effective data amount for all windows in the sample for which the alignment is the same as the current window. And multiplying the comparison rate correction coefficient by the effective data volume after GC correction of each window in the sample to obtain the effective data volume after the comparison rate correction of each window. As used herein, "data amount correction" refers to a ratio of the data amount based on the sample to the effective data amount of all autosomes after alignment correction as a data amount correction coefficient. And multiplying the data volume correction coefficient by the effective data volume after the comparison ratio correction of each window in the sample to respectively obtain the effective data volume after the data volume correction of each window.
In one embodiment, the normalization of step (5) can be performed by comparison to a control set, which is the mean of the corrected effective data amounts for each window from a scale of negative samples (i.e., samples without copy number variation). For example, normalization can be performed by the following formula:
wherein CR is the effective depth value for each window;
is the effective data volume of each window of the sample to be measured after correction;
is the mean of the corrected effective data volume for each window of multiple negative samples.
Due to uneven read coverage on each position of a genome, complexity of a sample, experimental operation, a sequencing process and the like, a large amount of noise data is inevitably contained in sequencing data, so that subsequent data analysis is interfered, and the accuracy of a detection result is seriously influenced. Therefore, it is very necessary to filter noise in data analysis so as to ensure sensitivity and accuracy of detection results. Therefore, in one embodiment, the method of the present invention utilizes the Fused Lasso algorithm to fit the trend of the variation of the normalized effective data volume, thereby achieving the effect of filtering noise. In one embodiment, the amount of valid data after filtering noise is subject to a differential term constraint and potential copy number change sites and potential copy number change regions, i.e., corresponding regions between each potential copy number change site, are identified.
In one embodiment, the copy number value (SCN) within the potential copy number variation region is calculated according to the following formula:
SCN=CR*×CNNorm,
wherein, CR*For effective depth values in the potential copy number variation region, CNNormThe theoretical copy number of a negative sample in a certain region of the genome is 2 for autosomes and female X chromosomes and 1 for male sex chromosomes.
In one embodiment, the reference range of copy numbers in step (7) may be calculated by establishing a mathematical statistical model, such as a statistical model based on poisson distribution. For example, a statistical model is used to model the copy number state and the corresponding copy number value of a sample of a certain scale with known copy number state (copy number missing, copy number normal or copy number repeat), and then the copy number value range corresponding to different copy number states is calculated according to 99% significance level as the reference range of the copy number value.
In one embodiment, the reference ranges for copy number are as follows:
-for autosomes: the normal diploid of the corresponding chromosome detection area is judged when the SCN is more than or equal to 1.78 and less than or equal to 2.24; the SCN is more than or equal to 2.72, and the corresponding chromosome detection region is judged to be repeated; judging that the corresponding chromosome detection region is deleted when the SCN is less than or equal to 1.16; 1.16 < SCN < 1.78 is judged as haploid mosaic of the corresponding chromosome detection region; judging that the chromosome detection area is triploid mosaic when the SCN is more than 2.24 and less than 2.72;
-for sex chromosomes: when the presence of the Y chromosome is not detected: the X chromosome is the autosome with the judgment standard; when the presence of the Y chromosome is detected: SCN of 0.84-1.16 is judged as normal haploid of the corresponding chromosome detection area; the SCN is more than or equal to 1.78, and the corresponding chromosome detection region is judged to be repeated; SCN <0.84 is judged as corresponding chromosome detection region deletion; the result of 1.16 < SCN < 1.78 was judged as diploid chimerism in the corresponding chromosome detection region.
In a second aspect, the present invention provides an apparatus for detecting copy number variation, comprising:
-a sequence acquisition unit (21) for acquiring a genomic sequencing sequence of a sample to be tested;
-a sequence alignment unit (22) for aligning the sequenced sequence to a human genomic reference sequence and determining the position of the unique alignment to the genomic reference sequence;
-an effective data volume statistics unit (23) for dividing the genome reference sequence into equal length windows and counting the number of uniquely aligned sequencing sequences falling into each window to obtain an effective data volume for each window;
-a valid data amount correction unit (24) for correcting the number of sequences per window; preferably, the unit comprises a GC correction module (241), a comparison ratio correction module (242) and a data volume correction module (243);
-an effective data volume normalization unit (25) for normalizing the corrected effective data volume to obtain an effective depth value for each window;
-a copy number variation region identification unit (26) for filtering noise and identifying potential copy number variation regions by constraints on the difference term;
-a copy number calculation unit (27) for calculating a copy number value (SCN) of the respective area,
-a detection result output unit (28) for comparing the SCN with a reference range of copy numbers and outputting a copy number variation detection result.
In a third aspect, the present invention provides a detection apparatus for copy number variation, comprising:
a memory configured to store one or more programs;
a processing unit coupled to the memory and configured to execute the one or more programs to cause the management system to perform a plurality of actions, the actions comprising:
(1) obtaining a genome sequencing sequence of a sample to be detected;
(2) aligning the sequencing sequence to a human genome reference sequence and determining the position of unique alignment to the genome reference sequence;
(3) dividing the genome reference sequence into equal-length windows, and counting the number of uniquely-compared sequencing sequences falling into each window to obtain the effective data volume of each window;
(4) carrying out dynamic data correction on the effective data volume of each window to obtain the corrected effective data volume of each window;
(5) standardizing the corrected effective data quantity to obtain an effective depth value CR of each window;
(6) filtering noise by using a Fused Lasso algorithm, and identifying potential copy number variation areas through the constraint of a difference term;
(7) calculating a copy number value (SCN) in the potential copy number variation region according to the following formula, and comparing the SCN with a reference range of the copy number to obtain an accurate copy number variation detection result; SCN ═ CR*×CNNorm,
Wherein, CR*For effective depth values in the potential copy number variation region, CNNormIs the theoretical copy number of the potential copy number variation region in a negative sample, which is 2 for autosomes and female X chromosomes and 1 for male sex chromosomes.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon machine executable instructions which, when executed, cause a machine to perform the steps of the method for detecting genomic copy number variations according to the present invention.
In the present invention, a computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium includes, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The machine-executable instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives machine-executable instructions from the network and forwards the machine-executable instructions for storage in a computer-readable storage medium in the respective computing/processing device.
Machine executable instructions for performing the operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C + + or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The machine-executable instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, by utilizing state information of machine-executable instructions to personalize an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), the electronic circuit can execute the machine-executable instructions to implement aspects of the present disclosure.
These machine-executable instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions of various aspects of the present invention. These machine-executable instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the functions of the various aspects of the present invention.
The machine-executable instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer-implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions of aspects of the present invention.
The method, the device and the equipment can effectively process the noise in the sequencing data and accurately identify the copy number variation area, thereby accurately detecting various copy number variation conditions in a wide length range, including but not limited to: chromosomal microdeletions, duplications, and an abnormal number of whole chromosomes. In addition, the invention establishes a mathematical model for calculating the copy number value SCN for the first time, and determines the theoretical reference range of the copy number value corresponding to the copy number state of the genome region through a mathematical statistic model.
The invention will be further elucidated with reference to the drawings and examples.
Example 1 detection of genomic copy number variation according to the method of the invention
The method for detecting whether the copy number variation exists in the sample to be detected specifically comprises the following steps (see figure 1):
s100: and acquiring a genome sequencing sequence of the sample to be detected.
High throughput sequencing data was obtained for 9 samples of flow products. The data were obtained by library construction using a chromosome copy number variation detection kit (reversible end-stop sequencing) and using a NextSeq CN500 gene sequencer (registration for medical instruments: national institutes of medicine 20153400460, Beirui Hangzhou and Kangshui Gene diagnostics, Inc.).
S102: aligning the sequenced sequence to a human genome reference sequence and determining a unique alignment to a genome reference
The position of the sequence.
The genome reference sequence selected by the embodiment of the invention is human reference genome NCBI built 37. In order to avoid the influence of repeated sequences on the copy number variation detection result and improve the quality of sequencing data, sequences which cannot be matched in the alignment process and sequences aligned to multiple positions are removed, and only the uniquely aligned sequences (valid data) are reserved for subsequent analysis by using a non-fault-tolerant alignment mode and not allowing empty base gaps (gaps).
S104: dividing the genomic reference sequence into equal length windows, counting the uniquely aligned sequencing falling within each window
And the number of sequences is obtained, and the effective data quantity of each window is obtained.
The region in which the repetitive sequence exists (repetitive region) is removed and the region whose position can be uniquely matched (non-repetitive region) is retained, using the human reference genome NCBI built 37 as a standard. On the non-repetitive region, the reference genome is divided into continuous windows by taking the length of 20Kbp as a unit, the effective data amount (namely, the number of uniquely aligned sequencing sequences) in each 20Kbp window is counted, and simultaneously, the GC content of the effective data in each window is counted for subsequent statistical analysis.
S106: carrying out dynamic data correction on the effective data volume of each window to obtain the corrected effective data volume of each window
The amount of data.
(1) And (3) GC correction: in high throughput sequencing data, a correlation between effective data volume and GC content, referred to as GC bias, is often shown. In order to ensure the accuracy of the subsequent analysis result, the region with abnormal GC content needs to be filtered out, and the effective data volume of the remaining region is corrected by using the following formula:
wherein the content of the first and second substances,
corrected effective data volume, ER, for each window GC
iFor the raw effective data amount of each window, m is the median of the effective data amounts of all windows of the sample, m
GCIs the median of the effective data amounts for all windows in the sample having the same GC content as the current window. The effects before and after GC correction are shown in fig. 3.
(2) And (3) correcting the comparison ratio: besides the influence of GC content, the comparison rate (non-repeat region ratio) of different regions in the reference genome is greatly different, so that the real effective data volume is influenced, and the accuracy of the copy number variation detection result is influenced. Therefore, in order to ensure the accuracy of the subsequent analysis result, a window with abnormal comparison rate needs to be filtered out. The effective data amount after GC correction is also corrected using the following formula:
wherein the content of the first and second substances,
the ratio corrected effective data amount is compared for each window,
the corrected effective data volume for each window GC, m being the median of the effective data volumes for all windows of the sample, m
MAPThe median of the effective data volume for all windows in the sample where the alignment rate is the same as the current window. The effects before and after the alignment ratio correction are shown in FIG. 4.
(3) And (3) correcting the data amount: because the actual data amount obtained by sequencing different samples may have differences, the data amount of all samples also needs to be corrected in order to ensure the comparability of effective data amount among samples. The effective data volume for all samples was corrected for the 5M data volume using the following formula:
the resulting effective data amount corrected for the data amount in each window
Can be used for subsequent statistical analysis.
S108: and standardizing the corrected effective data quantity to obtain an effective depth value of each window.
The sequencing data for 50 negative samples were used and the corrected effective data amount was normalized using the following formula:
wherein CR is the effective depth value for each window;
is the effective data volume of each window of the sample to be measured after correction;
is the mean of the corrected effective data volume for each window of 50 negative samples.
S110: filtering noise using Fused Lasso algorithm and identifying potential copy number by constraining difference terms
A region of variation.
After obtaining the effective depth value (CR) of each window, the trend of the change is fitted from the effective depth values using the Fused Lasso algorithm shown in the following formula, and the noise is filtered.
Wherein, X: x is I, and I is an identity matrix; y: an effective depth value (CR) for each window; beta: is a parameter to be estimated;
is the estimated result (effective depth value after noise reduction processing); lambda [ alpha ]
1And λ
2L being variables and differential terms, respectively
1A regularization term. Through the noise reduction processing of the formula, the L1 regular term (namely lambda) of the variable beta is removed
1) The L1 regular term (i.e., λ) for the β difference term is preserved
2)。
After λ of each stage is obtained, it is necessary to select an appropriate value of λ to constrain the differential term. In this example, a K-Fold (K ═ 5) method was used to select an appropriate λ value, and the following steps were performed:
dividing all data sets into 5 equal parts;
sequentially selecting each equal score as a test set, and taking the remaining 4 equal scores as a training set;
predicting the test set by using the divided training set, and calculating the Mean Squared Error (MSE) and the standard deviation (sd) of the MSE corresponding to each lambda:
then calculate the λ selection threshold: cutoffλ=min(MSE)+sd[min(MSE)];
At less than cutoffλIn the MSEs of (1), the lambda value corresponding to the MSE with the maximum MSE is selected as the lambda of the model.
After the lambda value is obtained, the effective depth value after noise filtering can be obtained.
Then, a first-order difference algorithm is used for detecting the variation trend of the effective depth values, and potential copy number variation areas are identified. For example, the partial results after the filtration by the last Fused Lasso step are: x is the number ofn1,1,1,1,2,2,2, 1,1, x can be calculatedn-1-xnX is obtained when the value is 0,0,0,1,0,0,0, -1,0,0n-1-xnAnd is not equal to 0 at bits 4 and 8. Therefore, the two sites are determined as potential copy number variation sites, and the corresponding region between the two sites is the potential copy number variation region.
S112: calculating a copy number value (SCN) in the potential copy number variation region and comparing the SCN with a reference range of the copy number value
And comparing to obtain an accurate copy number variation detection result.
The copy number SCN can be calculated according to the following formula:
SCN=CR*×CNNorm
wherein, CR*Effective depth values in the potential copy number variation area; CNNormThe theoretical copy number for the potential copy number variation region in negative samples is 2 for autosomes and female X chromosomes and 1 for male sex chromosomes.
Since the effective data amount (ER) in a certain region on the genome is positively correlated with the length of the region on the genome and conforms to the poisson distribution, for a certain detection region of size W, the theoretical distribution of the effective data amount in the region can be obtained: (1) when the copy number of the region is not changed, ER satisfies the parameter of lambda0Poisson distribution Po (λ)0),λ0W/G (where G represents the human reference genome size; N represents the total effective data volume of the sample); (2) when the copy number of the region is repeated (i.e. three copies), the actual detection region size becomes (3/2) × W, and ER satisfies the parameter λdupPoisson distribution Po (λ)dup),λdup=(N*(1.5 *W)/G)=1.5*λ0(ii) a (3) When the region is duplicated in copy number (i.e. is a single copy), the actual detection region size becomes (1/2) × W, and ER satisfies the parameter λdelPoisson distribution Po (λ)del),λdel=(N*(0.5*W)/G)=0.5*λ0。
And (3) carrying out simulation for multiple times according to Poisson distribution satisfied by ER under different copy number states by knowing the total effective data volume (N) and the size (W) of the detection area of the sample under the known copy number state, so as to obtain ER values of the detection area under the conditions of normal copy number, deletion and repetition. Then, the copy number value SCN is calculated by using the formula of the invention, and the reference range of the copy number value is calculated according to the distribution condition of the SCN. In a specific embodiment, the reference ranges for copy numbers are as follows:
-for autosomes: the normal diploid of the corresponding chromosome detection area is judged when the SCN is more than or equal to 1.78 and less than or equal to 2.24; the SCN is more than or equal to 2.72, and the corresponding chromosome detection region is judged to be repeated; judging that the corresponding chromosome detection region is deleted when the SCN is less than or equal to 1.16; 1.16 < SCN < 1.78 is judged as haploid mosaic of the corresponding chromosome detection region; judging that the chromosome detection area is triploid mosaic when the SCN is more than 2.24 and less than 2.72;
-for sex chromosomes: when the presence of the Y chromosome is not detected: the X chromosome is the autosome with the judgment standard; when the presence of the Y chromosome is detected: SCN of 0.84-1.16 is judged as normal haploid of the corresponding chromosome detection area; the SCN is more than or equal to 1.78, and the corresponding chromosome detection region is judged to be repeated; SCN <0.84 is judged as corresponding chromosome detection region deletion; the result of 1.16 < SCN < 1.78 was judged as diploid chimerism in the corresponding chromosome detection region.
In addition, the copy number status of 9 samples was tested using a chip to verify the accuracy of the test method of the present invention. The results of the detection and the results of the verification are shown in Table 1 (in the table, dup represents an increase in copy number and del represents a deletion in copy number).
TABLE 1.9 CNV assay results for samples
As can be seen from Table 1, the present invention can accurately detect copy number variation from chromosomal microdeletion to chromosomal number abnormality and provide location information of the copy number variation with an accuracy of 100%. The method of the present invention can detect a wide range of lengths, and can detect copy number variations in a range of lengths from less than 1M (e.g., 0.75M) to 80M, and even to the entire chromosome.
The foregoing is merely an alternative embodiment of the present application and is not intended to limit the present disclosure, as numerous modifications and variations will readily occur to those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Reference to the literature
Yoon S,Xuan Z,Makarov V,Ye K,Sebat J.Sensitive and accurate detection of copy number variants using read depth of coverage.Genome Res.2009,19(9):1586-1592.
Mason-Suares H,Landry L,Lebo MS.Detecting Copy Number Variation via Next Generation Technology.Curr.Genet.Med. Report.2016,4(3):1-12.
Boeva V.,et al.Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization,Bioinformatics,2011,vol.27(pg.268-269)。