CN111916150A - Method and device for detecting genome copy number variation - Google Patents

Publication number
CN111916150A
Authority
CN
China
Prior art keywords
copy number
window
scn
number variation
effective data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910389538.9A
Other languages
Chinese (zh)
Inventor
刘成琨
程涛
刘鹤
张建光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Berry Genomics Co Ltd
Original Assignee
Berry Genomics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Berry Genomics Co Ltd filed Critical Berry Genomics Co Ltd
Priority to CN201910389538.9A priority Critical patent/CN111916150A/en
Publication of CN111916150A publication Critical patent/CN111916150A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs

Abstract

The invention provides a method for detecting genome copy number variation, comprising the following steps: obtaining the genome sequencing sequences of a sample to be tested; aligning the sequencing sequences to a human genome reference sequence and determining the positions of unique alignment on the genome reference sequence; dividing the genome reference sequence into equal-length windows and counting the uniquely aligned sequencing sequences falling into each window to obtain the effective data volume of each window; performing dynamic data correction on the effective data volume of each window to obtain the corrected effective data volume of each window; normalizing the corrected effective data volume to obtain an effective depth value for each window; filtering noise with the Fused Lasso algorithm and identifying potential copy number variation regions through a constraint on the difference term; and calculating the copy number value (SCN) within each potential copy number variation region and comparing the SCN with the reference range of copy number values to obtain an accurate copy number variation detection result. The invention also provides a device and equipment for implementing the method. The invention establishes, for the first time, a mathematical model for calculating the copy number value SCN and determines reference intervals for the copy number state of a genome region. In addition, the invention can effectively handle noise in sequencing data and accurately identify copy number variation regions.

Description

Method and device for detecting genome copy number variation
Technical Field
The present invention relates to the fields of bioinformatics and genomic mutation detection. More particularly, the present invention relates to a method and apparatus for detecting Copy Number Variation (CNV) of a genome.
Background
Copy number variation is a structural abnormality present in the genome, meaning that a DNA fragment of a region of the genome is present in a different number of copies than in the normal population. Common copy number variations include deletions, duplications, chromosomal aneuploidies.
At present, the most common approach to detecting CNVs from NGS data is based on read depth: the depth of a given region is compared with the depth of the corresponding region in a normal reference sample, and the resulting relative level is compared with a pre-calculated theoretical value to determine whether a CNV is present in the region (Yoon et al., 2009; Mason-Suares et al., 2016). However, CNV detection still faces certain difficulties. On the one hand, uneven read coverage across the genome, sample complexity, experimental handling, the sequencing process, and similar factors introduce varying degrees of noise into the sequencing data, which seriously affects the accuracy of the detection result (Boeva et al., 2011). On the other hand, research on the theoretical values of the relative depth level is still very limited. To guarantee the validity of the detection result, these theoretical values need to be studied in depth, and a scientific and reasonable set of reference ranges for copy number states needs to be established.
Thus, current methods for determining copy number variation remain to be improved.
Disclosure of Invention
Therefore, the present invention provides a method and an apparatus for detecting genomic copy number variation, which can accurately detect copy number variation including microdeletion/microduplication.
In a first aspect, the present invention provides a method for detecting copy number variation, comprising the steps of:
(1) obtaining a genome sequencing sequence of a sample to be detected;
(2) aligning the sequencing sequence to a human genome reference sequence and determining the position of unique alignment to the genome reference sequence;
(3) dividing the genome reference sequence into equal-length windows, and counting the number of uniquely aligned sequencing sequences falling into each window to obtain the effective data volume of each window;
(4) carrying out dynamic data correction on the effective data volume of each window to obtain the corrected effective data volume of each window;
(5) standardizing the corrected effective data quantity to obtain an effective depth value CR of each window;
(6) filtering noise by using a Fused Lasso algorithm, and identifying potential copy number variation areas through the constraint of a difference term;
(7) calculating a copy number value (SCN) in the potential copy number variation region according to the following formula, and comparing the SCN with a reference range of the copy number value to obtain an accurate copy number variation detection result;
$$SCN = CR^{*} \times CN_{Norm}$$
wherein $CR^{*}$ is the effective depth value in the potential copy number variation region, and $CN_{Norm}$ is the theoretical copy number of the potential copy number variation region in a negative sample, which is 2 for autosomes and the female X chromosome and 1 for the male sex chromosomes.
In one embodiment, the genome sequencing sequences of the sample to be tested are obtained by single-end or paired-end sequencing on a second-generation high-throughput sequencing platform, such as an Illumina platform (e.g., NextSeq, NovaSeq) or any other high-throughput sequencing platform known in the art.
In one embodiment, "human genome reference sequence" refers to a standard human genome reference sequence in the NCBI database, which may be, for example, hg18, NCBI Build 36; hg19, NCBI Build 37. Human genome reference sequences can be obtained in the genetic data of NCBI, Ensembl and UCSC.
In one embodiment, alignment of the sequencing result sequence to the human genome reference sequence can be performed using algorithms or software known to those skilled in the art. Examples of such algorithms or software include, but are not limited to: BLAST, BLAT, MAQ, SOAP, Bowtie, BWA, SSAHA, ELAND. In one embodiment, the alignment can remove regions of the genomic reference sequence where repetitive sequences are present. In one embodiment, a non-fault tolerant alignment scheme is used and no empty base gaps are allowed.
In one embodiment, step (3) comprises dividing the genomic reference sequence into contiguous windows in units of equal length over the regions that are uniquely aligned in position (i.e., non-repetitive regions). The length of the window can be determined as desired by one skilled in the art. For example, the genomic reference sequence may be divided into contiguous windows of 15Kbp, 20Kbp, 25Kbp, 30Kbp, 35Kbp, and the like.
In one embodiment, the "dynamic data correction" in step (4) includes GC correction, alignment rate correction, and data volume correction. As used herein, "GC correction" uses a dynamic GC correction coefficient: the ratio of the median effective data volume of all windows of a sample to the median effective data volume of those windows in the sample that have the same GC content as the current window. The raw effective data volume of each window is multiplied by this coefficient to obtain the GC-corrected effective data volume of that window. GC correction effectively corrects GC bias in the data and thereby ensures the accuracy of the detection result. As used herein, "alignment rate correction" uses a dynamic alignment rate correction coefficient: the ratio of the median effective data volume of all windows of a sample to the median effective data volume of those windows in the sample that have the same alignment rate as the current window. The GC-corrected effective data volume of each window is multiplied by this coefficient to obtain the alignment-rate-corrected effective data volume of that window. As used herein, "data volume correction" uses, as the data volume correction coefficient, the ratio of the data volume based on the sample to the alignment-rate-corrected effective data volume of all autosomes. The alignment-rate-corrected effective data volume of each window is multiplied by this coefficient to obtain the data-volume-corrected effective data volume of that window.
In one embodiment, the normalization in step (5) can be performed by comparison with a control set, which is the mean of the corrected effective data volume of each window over a set of negative samples (i.e., samples without copy number variation). For example, normalization can be performed by the following formula:
$$CR = \frac{\widehat{ER}}{\overline{ER}_{neg}}$$
wherein $CR$ is the effective depth value of each window; $\widehat{ER}$ is the corrected effective data volume of each window of the sample to be tested; and $\overline{ER}_{neg}$ is the mean of the corrected effective data volume of each window over multiple negative samples.
Due to uneven read coverage on each position of a genome, complexity of a sample, experimental operation, a sequencing process and the like, a large amount of noise data is inevitably contained in sequencing data, so that subsequent data analysis is interfered, and the accuracy of a detection result is seriously influenced. Therefore, it is very necessary to filter noise in data analysis so as to ensure sensitivity and accuracy of detection results. Therefore, in one embodiment, the method of the present invention utilizes the Fused Lasso algorithm to fit the trend of the variation of the normalized effective data volume, thereby achieving the effect of filtering noise. In one embodiment, the amount of valid data after filtering noise is subject to a differential term constraint and potential copy number change sites and potential copy number change regions, i.e., corresponding regions between each potential copy number change site, are identified.
In one embodiment, the copy number value (SCN) within the potential copy number variation region is calculated according to the following formula:
$$SCN = CR^{*} \times CN_{Norm}$$
wherein $CR^{*}$ is the effective depth value in the potential copy number variation region, and $CN_{Norm}$ is the theoretical copy number of that genomic region in a negative sample, which is 2 for autosomes and the female X chromosome and 1 for the male sex chromosomes.
In one embodiment, the reference range of copy number values in step (7) may be calculated by establishing a mathematical statistical model, such as a statistical model based on the Poisson distribution. For example, a statistical model is fitted to the copy number states and corresponding copy number values of a set of samples with known copy number states (deletion, normal, or duplication), and the range of copy number values corresponding to each copy number state is then calculated according to the 99% significance level and used as the reference range of the copy number value.
In one embodiment, the reference ranges for copy number are as follows:
- for autosomes: 1.78 ≤ SCN ≤ 2.24 is judged as normal diploid for the corresponding chromosome detection region; SCN ≥ 2.72 is judged as duplication of the region; SCN ≤ 1.16 is judged as deletion of the region; 1.16 < SCN < 1.78 is judged as haploid mosaicism of the region; and 2.24 < SCN < 2.72 is judged as triploid mosaicism of the region;
- for sex chromosomes: when the Y chromosome is not detected, the X chromosome is judged by the same criteria as the autosomes; when the Y chromosome is detected, 0.84 ≤ SCN ≤ 1.16 is judged as normal haploid for the corresponding chromosome detection region; SCN ≥ 1.78 is judged as duplication of the region; SCN < 0.84 is judged as deletion of the region; and 1.16 < SCN < 1.78 is judged as diploid mosaicism of the region.
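The SCN formula and the reference ranges listed above can be expressed as a small lookup. The following is a minimal sketch, not the patented implementation: the function names and the returned labels are hypothetical, and the 0.84 to 1.16 range is assumed to be inclusive at both ends.

```python
def copy_number_value(cr_star: float, cn_norm: int) -> float:
    """SCN = CR* x CN_Norm (CN_Norm is 2 or 1, as defined in the text)."""
    return cr_star * cn_norm


def classify_autosome(scn: float) -> str:
    """Apply the autosomal reference ranges quoted in the text."""
    if scn <= 1.16:
        return "deletion"
    if scn < 1.78:
        return "haploid mosaic"
    if scn <= 2.24:
        return "normal diploid"
    if scn < 2.72:
        return "triploid mosaic"
    return "duplication"


def classify_sex_chromosome(scn: float, y_detected: bool) -> str:
    """Apply the sex-chromosome reference ranges quoted in the text."""
    if not y_detected:
        # Without a Y chromosome, X is judged by the autosomal criteria.
        return classify_autosome(scn)
    if scn < 0.84:
        return "deletion"
    if scn <= 1.16:  # assumption: both ends of 0.84-1.16 are inclusive
        return "normal haploid"
    if scn < 1.78:
        return "diploid mosaic"
    return "duplication"
```

For instance, an autosomal region with an effective depth value of 1.0 gives SCN = 2.0, which falls in the normal diploid range.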
In a second aspect, the present invention provides an apparatus for detecting copy number variation, comprising:
-a sequence acquisition unit (21) for acquiring a genomic sequencing sequence of a sample to be tested;
-a sequence alignment unit (22) for aligning the sequenced sequence to a human genomic reference sequence and determining the position of the unique alignment to the genomic reference sequence;
-an effective data volume statistics unit (23) for dividing the genome reference sequence into equal length windows and counting the number of uniquely aligned sequencing sequences falling into each window to obtain an effective data volume for each window;
-an effective data volume correction unit (24) for correcting the number of sequences in each window; preferably, the unit comprises a GC correction module (241), an alignment rate correction module (242) and a data volume correction module (243);
-an effective data volume normalization unit (25) for normalizing the corrected effective data volume to obtain an effective depth value for each window;
-a copy number variation region identification unit (26) for filtering noise and identifying potential copy number variation regions by constraints on the difference term;
-a copy number calculation unit (27) for calculating a copy number value (SCN) of the respective area,
-a detection result output unit (28) for comparing the SCN with a reference range of copy numbers and outputting a copy number variation detection result.
In a third aspect, the present invention provides a detection apparatus for copy number variation, comprising:
a memory configured to store one or more programs;
a processing unit coupled to the memory and configured to execute the one or more programs to cause the management system to perform a plurality of actions, the actions comprising:
(1) obtaining a genome sequencing sequence of a sample to be detected;
(2) aligning the sequencing sequence to a human genome reference sequence and determining the position of unique alignment to the genome reference sequence;
(3) dividing the genome reference sequence into equal-length windows, and counting the number of uniquely aligned sequencing sequences falling into each window to obtain the effective data volume of each window;
(4) carrying out dynamic data correction on the effective data volume of each window to obtain the corrected effective data volume of each window;
(5) standardizing the corrected effective data quantity to obtain an effective depth value CR of each window;
(6) filtering noise by using a Fused Lasso algorithm, and identifying potential copy number variation areas through the constraint of a difference term;
(7) calculating a copy number value (SCN) in the potential copy number variation region according to the following formula, and comparing the SCN with a reference range of the copy number to obtain an accurate copy number variation detection result;
$$SCN = CR^{*} \times CN_{Norm}$$
wherein $CR^{*}$ is the effective depth value in the potential copy number variation region, and $CN_{Norm}$ is the theoretical copy number of the potential copy number variation region in a negative sample, which is 2 for autosomes and the female X chromosome and 1 for the male sex chromosomes.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon machine executable instructions which, when executed, cause a machine to perform the steps of the method for detecting genomic copy number variations according to the present invention.
In the present invention, a computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium includes, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The machine-executable instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives machine-executable instructions from the network and forwards the machine-executable instructions for storage in a computer-readable storage medium in the respective computing/processing device.
Machine executable instructions for performing the operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The machine-executable instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, by utilizing state information of machine-executable instructions to personalize an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), the electronic circuit can execute the machine-executable instructions to implement aspects of the present disclosure.
These machine-executable instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions of various aspects of the present invention. These machine-executable instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the functions of the various aspects of the present invention.
The machine-executable instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer-implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions of aspects of the present invention.
The method, the device and the equipment can effectively process the noise in the sequencing data and accurately identify the copy number variation area, thereby accurately detecting various copy number variation conditions in a wide length range, including but not limited to: chromosomal microdeletions, duplications, and an abnormal number of whole chromosomes. In addition, the invention establishes a mathematical model for calculating the copy number value SCN for the first time, and determines the theoretical reference range of the copy number value corresponding to the copy number state of the genome region through a mathematical statistic model.
The invention will be further elucidated with reference to the drawings and examples.
Drawings
FIG. 1 shows a flow chart of the method of detecting copy number variation according to the present invention.
FIG. 2 shows a copy number variation detection apparatus according to the present invention.
Fig. 3 shows data distribution before and after dynamic GC correction.
Fig. 4 shows data distribution before and after dynamic alignment ratio correction.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. The described embodiments are illustrative only and are not to be construed as limiting the invention.
Example 1 detection of genomic copy number variation according to the method of the invention
The method for detecting whether the copy number variation exists in the sample to be detected specifically comprises the following steps (see figure 1):
S100: Acquiring the genome sequencing sequences of the sample to be tested.
High-throughput sequencing data were obtained for 9 miscarriage-product samples. The data were obtained by constructing libraries with a chromosome copy number variation detection kit (reversible terminator sequencing) and sequencing on a NextSeq CN500 gene sequencer (medical device registration No. 20153400460, Beirui Hangzhou and Kangshui Gene diagnostics, Inc.).
S102: Aligning the sequencing sequences to a human genome reference sequence and determining the positions of unique alignment on the genome reference sequence.
The genome reference sequence selected in this example is the human reference genome NCBI Build 37. To avoid the influence of repetitive sequences on the copy number variation detection result and to improve the quality of the sequencing data, a non-fault-tolerant alignment mode that allows no gaps is used; sequences that cannot be aligned and sequences that align to multiple positions are removed, and only uniquely aligned sequences (effective data) are retained for subsequent analysis.
S104: Dividing the genome reference sequence into equal-length windows and counting the number of uniquely aligned sequencing sequences falling into each window to obtain the effective data volume of each window.
Using the human reference genome NCBI Build 37 as the standard, regions containing repetitive sequences (repetitive regions) are removed and regions to which positions can be uniquely matched (non-repetitive regions) are retained. Over the non-repetitive regions, the reference genome is divided into contiguous windows of 20 Kbp; the effective data volume (i.e., the number of uniquely aligned sequencing sequences) in each 20 Kbp window is counted, and the GC content of the effective data in each window is recorded at the same time for subsequent statistical analysis.
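The window counting in this step can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, reads are given as (chromosome, 0-based start) pairs that have already been filtered to unique alignments, and windows are laid out on plain chromosome coordinates rather than on the non-repetitive regions only.

```python
from collections import Counter

WINDOW_SIZE = 20_000  # 20 Kbp, the window length used in this example


def count_reads_per_window(read_positions, window_size=WINDOW_SIZE):
    """Count uniquely aligned reads per (chromosome, window index).

    read_positions: iterable of (chromosome, 0-based start position).
    """
    counts = Counter()
    for chrom, start in read_positions:
        counts[(chrom, start // window_size)] += 1
    return counts


def gc_fraction(sequence):
    """Fraction of G/C among the called bases (A, C, G, T) of a sequence."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called
```

Keying the counter by (chromosome, window index) keeps the sketch independent of any particular alignment-file library.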
S106: Performing dynamic data correction on the effective data volume of each window to obtain the corrected effective data volume of each window.
(1) GC correction: High-throughput sequencing data often show a correlation between effective data volume and GC content, referred to as GC bias. To ensure the accuracy of subsequent analysis, regions with abnormal GC content are filtered out, and the effective data volume of the remaining regions is corrected using the following formula:
$$\widehat{ER}_{GC} = ER_{i} \times \frac{m}{m_{GC}}$$
wherein $\widehat{ER}_{GC}$ is the GC-corrected effective data volume of each window, $ER_{i}$ is the raw effective data volume of each window, $m$ is the median of the effective data volumes of all windows of the sample, and $m_{GC}$ is the median of the effective data volumes of all windows in the sample that have the same GC content as the current window. The data distributions before and after GC correction are shown in Fig. 3.
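The GC correction described above multiplies each window's raw count by the ratio of the sample-wide median to the median of windows sharing the same GC content. A minimal sketch follows; the function name is hypothetical, and grouping windows by a pre-computed GC bin (for example, GC percentage rounded to an integer) is an assumption about how "same GC content" is determined.

```python
from collections import defaultdict
from statistics import median


def gc_correct(counts, gc_bins):
    """Scale each window count by m / m_GC.

    counts[i]  : raw effective data volume ER_i of window i
    gc_bins[i] : GC bin of window i (assumed: windows in the same bin
                 count as having the same GC content)
    """
    m = median(counts)
    by_bin = defaultdict(list)
    for count, gc_bin in zip(counts, gc_bins):
        by_bin[gc_bin].append(count)
    m_gc = {gc_bin: median(values) for gc_bin, values in by_bin.items()}
    # A bin whose median is 0 cannot be rescaled; its windows are set to 0.
    return [
        count * m / m_gc[gc_bin] if m_gc[gc_bin] > 0 else 0.0
        for count, gc_bin in zip(counts, gc_bins)
    ]
```

The same median-ratio pattern applies to the alignment rate correction that follows, with windows grouped by alignment rate instead of GC content.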
(2) Alignment rate correction: In addition to GC content, the alignment rate (proportion of non-repetitive region) differs greatly among regions of the reference genome, which affects the true effective data volume and hence the accuracy of the copy number variation detection result. Therefore, to ensure the accuracy of subsequent analysis, windows with an abnormal alignment rate are filtered out, and the GC-corrected effective data volume is further corrected using the following formula:
$$\widehat{ER}_{MAP} = \widehat{ER}_{GC} \times \frac{m}{m_{MAP}}$$
wherein $\widehat{ER}_{MAP}$ is the alignment-rate-corrected effective data volume of each window, $\widehat{ER}_{GC}$ is the GC-corrected effective data volume of each window, $m$ is the median of the effective data volumes of all windows of the sample, and $m_{MAP}$ is the median of the effective data volumes of all windows in the sample that have the same alignment rate as the current window. The data distributions before and after alignment rate correction are shown in Fig. 4.
(3) Data volume correction: Because the actual amount of data obtained by sequencing may differ between samples, the data volume of every sample is also corrected to ensure that effective data volumes are comparable across samples. The effective data volume of all samples is corrected to a data volume of 5M using the following formula:
$$\widehat{ER} = \widehat{ER}_{MAP} \times \frac{5 \times 10^{6}}{\sum_{\text{autosomes}} \widehat{ER}_{MAP}}$$
The resulting data-volume-corrected effective data volume of each window, $\widehat{ER}$, can be used for subsequent statistical analysis.
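The data volume correction can be sketched as a single rescaling so that the autosomal total matches a fixed target. This is an illustration only: the function name is hypothetical, and reading "5M" as 5 million and taking the coefficient as target divided by the autosomal total are assumptions drawn from the description of the data volume correction coefficient.

```python
TARGET_DATA_VOLUME = 5_000_000  # "5M"; reading M as million is an assumption


def scale_to_target(window_counts, is_autosomal, target=TARGET_DATA_VOLUME):
    """Rescale all window counts so autosomal windows sum to `target`.

    window_counts[i] : alignment-rate-corrected count of window i
    is_autosomal[i]  : True if window i lies on an autosome
    """
    autosomal_total = sum(
        count for count, auto in zip(window_counts, is_autosomal) if auto
    )
    if autosomal_total <= 0:
        raise ValueError("no autosomal data to scale against")
    factor = target / autosomal_total
    return [count * factor for count in window_counts]
```

Sex-chromosome windows are scaled by the same factor but do not contribute to it, so a sample's sex does not change the scaling.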
S108: Normalizing the corrected effective data volume to obtain an effective depth value for each window.
Using the sequencing data of 50 negative samples, the corrected effective data volume is normalized with the following formula:
$$CR = \frac{\widehat{ER}}{\overline{ER}_{neg}}$$
wherein $CR$ is the effective depth value of each window; $\widehat{ER}$ is the corrected effective data volume of each window of the sample to be tested; and $\overline{ER}_{neg}$ is the mean of the corrected effective data volume of each window over the 50 negative samples.
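The normalization step can be sketched as a per-window ratio against the negative-sample mean. This is a minimal sketch, assuming that CR is the ratio of the sample's corrected count to the control mean (so that a normal region gives CR close to 1); the function name is hypothetical.

```python
from statistics import mean


def effective_depth(sample_counts, negative_counts):
    """Per-window CR: sample count divided by the negative-sample mean.

    sample_counts   : corrected counts of the test sample, one per window
    negative_counts : list of per-sample count lists (same window order)
    """
    reference = [mean(column) for column in zip(*negative_counts)]
    # Windows with no control coverage cannot be normalized.
    return [
        s / r if r > 0 else float("nan")
        for s, r in zip(sample_counts, reference)
    ]
```

With this convention a heterozygous deletion is expected near CR = 0.5 and a three-copy duplication near CR = 1.5 on autosomes.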
S110: filtering noise using Fused Lasso algorithm and identifying potential copy number by constraining difference terms A region of variation.
After obtaining the effective depth value (CR) of each window, the trend of the change is fitted from the effective depth values using the Fused Lasso algorithm shown in the following formula, and the noise is filtered.
Figure BDA0002055981530000111
Wherein, X: x is I, and I is an identity matrix; y: an effective depth value (CR) for each window; beta: is a parameter to be estimated;
Figure BDA0002055981530000112
is the estimated result (effective depth value after noise reduction processing); lambda [ alpha ]1And λ2L being variables and differential terms, respectively1A regularization term. Through the noise reduction processing of the formula, the L1 regular term (namely lambda) of the variable beta is removed1) The L1 regular term (i.e., λ) for the β difference term is preserved2)。
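With X equal to the identity and the penalty on β itself removed, the Fused Lasso reduces to one-dimensional total-variation denoising. The sketch below solves that reduced problem by projected gradient descent on its dual; the solver choice, the function name, the 1/2 scaling of the squared error, and the iteration count are all assumptions for illustration and are not stated in the text.

```python
def fused_lasso_1d(y, lam, n_iter=5000):
    """Minimise 0.5 * sum((y_i - b_i)^2) + lam * sum(|b_{i+1} - b_i|).

    Projected gradient on the dual problem: b = y - D^T u with |u_i| <= lam,
    where (D b)_i = b_{i+1} - b_i.
    """
    u = [0.0] * (len(y) - 1)  # one dual variable per adjacent window pair

    def primal(dual):
        b = list(y)
        for i, ui in enumerate(dual):
            b[i] += ui
            b[i + 1] -= ui
        return b

    for _ in range(n_iter):
        b = primal(u)
        # Step size 1/4 is safe because the largest eigenvalue of D D^T
        # is below 4; clipping projects back onto the box |u_i| <= lam.
        u = [
            max(-lam, min(lam, ui + 0.25 * (b[i + 1] - b[i])))
            for i, ui in enumerate(u)
        ]
    return primal(u)
```

A larger `lam` merges more neighbouring windows into flat segments; `lam = 0` returns the input unchanged, and a very large `lam` flattens the profile to its mean.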
After λ of each stage is obtained, it is necessary to select an appropriate value of λ to constrain the differential term. In this example, a K-Fold (K ═ 5) method was used to select an appropriate λ value, and the following steps were performed:
dividing the full data set into 5 equal parts;
selecting each part in turn as the test set, with the remaining 4 parts as the training set;
predicting the test set from the training set, and calculating the mean squared error (MSE) and the standard deviation (sd) of the MSE corresponding to each λ:
$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
then calculating the λ selection threshold: $cutoff_{\lambda} = \min(MSE) + sd[\min(MSE)]$;
among the MSEs smaller than $cutoff_{\lambda}$, selecting the λ value corresponding to the largest MSE as the λ of the model.
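The selection rule in the last two steps can be sketched as below. The per-fold test MSEs are taken as given inputs, since the source does not spell out how held-out windows are predicted. The function name `select_lambda`, the use of the sample standard deviation across folds, and the numbers are all assumptions for illustration.

```python
import numpy as np

def select_lambda(lambdas, fold_mse):
    """Pick lambda by the threshold rule described in the text.

    fold_mse[i, k] is the test MSE of lambdas[i] on fold k (assumed given).
    """
    fold_mse = np.asarray(fold_mse, dtype=float)
    mse = fold_mse.mean(axis=1)              # MSE per lambda
    sd = fold_mse.std(axis=1, ddof=1)        # sd of the MSE across folds
    best = int(np.argmin(mse))
    cutoff = mse[best] + sd[best]            # min(MSE) + sd[min(MSE)]
    below = np.flatnonzero(mse < cutoff)
    if below.size == 0:                      # degenerate case: sd is zero
        return lambdas[best], cutoff
    chosen = below[np.argmax(mse[below])]    # largest MSE under the cutoff
    return lambdas[chosen], cutoff

lambdas = [0.01, 0.1, 1.0, 10.0]
fold_mse = [[1.00, 1.20, 0.80, 1.10, 0.90],   # mean 1.00
            [1.05, 1.15, 0.95, 1.10, 1.00],   # mean 1.05
            [1.10, 1.20, 1.00, 1.20, 1.10],   # mean 1.12
            [2.00, 2.10, 1.90, 2.00, 2.00]]   # mean 2.00
lam, cutoff = select_lambda(lambdas, fold_mse)
```

Choosing the largest MSE that still lies under the cutoff trades a small loss of fit for a more strongly regularized, and therefore smoother, profile.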
After the lambda value is obtained, the effective depth value after noise filtering can be obtained.
Then, a first-order difference algorithm is used to detect changes in the effective depth values and to identify potential copy number variation regions. For example, suppose part of the result after the preceding Fused Lasso filtering is $x_n$ = 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1. The first-order differences between adjacent values, $x_{n+1} - x_n$, are 0, 0, 0, 1, 0, 0, 0, -1, 0, 0, which are non-zero at positions 4 and 8. These two sites are therefore determined to be potential copy number variation sites, and the region between them is the potential copy number variation region.
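The first-order difference step can be reproduced in a few lines. In this sketch the input sequence is the one consistent with the difference values quoted in the text, the differences are taken as next value minus current value, and the function name `breakpoints` and the tolerance `tol` are hypothetical.

```python
import numpy as np

def breakpoints(denoised, tol=1e-8):
    """Return first-order differences and the 1-based positions where they are non-zero."""
    d = np.diff(np.asarray(denoised, dtype=float))
    # A small tolerance guards against floating-point residue after smoothing.
    positions = (np.flatnonzero(np.abs(d) > tol) + 1).tolist()
    return d, positions

x = [1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1]
d, pos = breakpoints(x)
# Windows strictly after the first breakpoint and up to the second one
# form the candidate region (1-based window indices).
region = (pos[0] + 1, pos[1])
```

For this input the candidate region spans windows 5 to 8, which are exactly the windows with value 2.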
S112: calculating a copy number value (SCN) in the potential copy number variation region and comparing the SCN with a reference range of the copy number value to obtain an accurate copy number variation detection result.
The copy number SCN can be calculated according to the following formula:
$SCN = CR^{*} \times CN_{Norm}$
wherein $CR^{*}$ is the effective depth value in the potential copy number variation region, and $CN_{Norm}$ is the theoretical copy number of the potential copy number variation region in negative samples, which is 2 for autosomes and female X chromosomes and 1 for male sex chromosomes.
Since the effective data amount (ER) in a region of the genome is positively correlated with the length of the region and follows a Poisson distribution, the theoretical distribution of ER for a detection region of size W can be obtained as follows: (1) when the copy number of the region is unchanged, ER follows the Poisson distribution $Po(\lambda_0)$ with $\lambda_0 = N \times W / G$ (where G is the size of the human reference genome and N is the total effective data amount of the sample); (2) when the region is duplicated (i.e., three copies), the actual detection region size becomes (3/2) × W, and ER follows $Po(\lambda_{dup})$ with $\lambda_{dup} = N \times (1.5 \times W) / G = 1.5 \times \lambda_0$; (3) when the region is deleted (i.e., a single copy), the actual detection region size becomes (1/2) × W, and ER follows $Po(\lambda_{del})$ with $\lambda_{del} = N \times (0.5 \times W) / G = 0.5 \times \lambda_0$.
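The three Poisson models can be sampled directly to see how SCN spreads in each copy number state. The sketch below rests on several assumptions that are not in the source: the genome size of 3.1e9, the approximation of CR by ER divided by λ0, the 0.5% and 99.5% quantiles as interval bounds, and the function name `simulate_scn`. The intervals it produces are illustrative and are not expected to reproduce the reference ranges of the embodiment.

```python
import numpy as np

def simulate_scn(n_reads, window, genome=3.1e9, cn_norm=2, n_sim=10000, seed=0):
    """Simulate SCN for deleted, normal and duplicated regions.

    Returns, per state, the 0.5%, 50% and 99.5% quantiles of simulated SCN.
    """
    rng = np.random.default_rng(seed)
    lam0 = n_reads * window / genome          # expected ER at normal copy number
    result = {}
    for state, factor in (("del", 0.5), ("normal", 1.0), ("dup", 1.5)):
        er = rng.poisson(lam0 * factor, size=n_sim)
        scn = er / lam0 * cn_norm             # CR approximated by ER / lam0
        result[state] = np.quantile(scn, [0.005, 0.5, 0.995])
    return result

# 5 million effective reads and a 1 Mb detection region (toy values).
ranges = simulate_scn(n_reads=5_000_000, window=1_000_000)
```

Smaller regions or fewer reads lower λ0, which widens all three distributions and eventually makes the states overlap; this is why the achievable resolution depends on the data amount.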
Given the total effective data amount (N) of a sample with a known copy number state and the size (W) of the detection region, simulations are run multiple times from the Poisson distribution that ER follows in each copy number state, yielding ER values of the detection region under normal copy number, deletion, and duplication. The copy number value SCN is then calculated using the formula of the invention, and the reference range of the copy number value is calculated from the distribution of SCN. In a specific embodiment, the reference ranges for copy number values are as follows:
-for autosomes: 1.78 ≤ SCN ≤ 2.24 is judged as normal diploid for the corresponding chromosome detection region; SCN ≥ 2.72 is judged as duplication; SCN ≤ 1.16 is judged as deletion; 1.16 < SCN < 1.78 is judged as haploid mosaic; and 2.24 < SCN < 2.72 is judged as triploid mosaic;
-for sex chromosomes: when the Y chromosome is not detected, the X chromosome is judged by the autosomal criteria; when the Y chromosome is detected, 0.84 ≤ SCN ≤ 1.16 is judged as normal haploid for the corresponding chromosome detection region; SCN ≥ 1.78 is judged as duplication; SCN < 0.84 is judged as deletion; and 1.16 < SCN < 1.78 is judged as diploid mosaic.
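The SCN formula and the reference ranges above translate into a small lookup. This is a sketch for illustration only: the thresholds are those listed in the text, while the function names `copy_number_value` and `classify_scn` and the `chrom_type` labels are hypothetical.

```python
def copy_number_value(cr_region, cn_norm):
    """SCN = CR* x CNNorm (cn_norm is 2 for autosomes and female X, 1 for male sex chromosomes)."""
    return cr_region * cn_norm

def classify_scn(scn, chrom_type="autosome"):
    """Map an SCN value to a call.

    chrom_type: 'autosome' (also used for X when no Y is detected)
                or 'sex_with_y' (sex chromosomes when Y is detected).
    """
    if chrom_type == "autosome":
        if scn <= 1.16:
            return "deletion"
        if scn < 1.78:
            return "haploid mosaic"
        if scn <= 2.24:
            return "normal diploid"
        if scn < 2.72:
            return "triploid mosaic"
        return "duplication"
    if chrom_type == "sex_with_y":
        if scn < 0.84:
            return "deletion"
        if scn <= 1.16:
            return "normal haploid"
        if scn < 1.78:
            return "diploid mosaic"
        return "duplication"
    raise ValueError(f"unknown chrom_type: {chrom_type!r}")

# An autosomal region whose effective depth is 1.5 times the control mean.
call = classify_scn(copy_number_value(1.5, 2))
```

Note that the boundary value 1.16 is a deletion for autosomes but still normal haploid for sex chromosomes when Y is detected, exactly as the ranges are written.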
In addition, the copy number status of 9 samples was tested using a chip to verify the accuracy of the test method of the present invention. The results of the detection and the results of the verification are shown in Table 1 (in the table, dup represents an increase in copy number and del represents a deletion in copy number).
Table 1. CNV detection results for the 9 samples
[Table 1 is provided as an image in the original publication: Figures BDA0002055981530000131 and BDA0002055981530000141.]
As can be seen from Table 1, the present invention can accurately detect copy number variation from chromosomal microdeletion to chromosomal number abnormality and provide location information of the copy number variation with an accuracy of 100%. The method of the present invention can detect a wide range of lengths, and can detect copy number variations in a range of lengths from less than 1M (e.g., 0.75M) to 80M, and even to the entire chromosome.
The foregoing is merely an alternative embodiment of the present application and is not intended to limit the present disclosure, as numerous modifications and variations will readily occur to those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
References
Yoon S, Xuan Z, Makarov V, Ye K, Sebat J. Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res. 2009, 19(9): 1586-1592.
Mason-Suares H, Landry L, Lebo MS. Detecting Copy Number Variation via Next Generation Technology. Curr. Genet. Med. Rep. 2016, 4(3): 1-12.
Boeva V, et al. Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization. Bioinformatics. 2011, 27: 268-269.

Claims (12)

1. A method for detecting copy number variation, comprising the steps of:
(1) obtaining a genome sequencing sequence of a sample to be detected;
(2) aligning the sequencing sequence to a human genome reference sequence and determining the position of unique alignment to the genome reference sequence;
(3) dividing the genome reference sequence into equal-length windows, and counting the number of uniquely aligned sequencing sequences falling into each window to obtain the effective data volume of each window;
(4) carrying out dynamic data correction on the effective data volume of each window to obtain the corrected effective data volume of each window;
(5) standardizing the corrected effective data quantity to obtain an effective depth value CR of each window;
(6) filtering noise by using a Fused Lasso algorithm, and identifying potential copy number variation areas through the constraint of a difference term;
(7) calculating a copy number value (SCN) in the potential copy number variation region according to the following formula, and comparing the SCN with a reference range of the copy number value to obtain an accurate copy number variation detection result:
$SCN = CR^{*} \times CN_{Norm}$
wherein $CR^{*}$ is the effective depth value in the potential copy number variation region, and $CN_{Norm}$ is the theoretical copy number of the potential copy number variation region in a negative sample, which is 2 for autosomes and female X chromosomes and 1 for male sex chromosomes.
2. The method of claim 1, wherein the alignment in step (2) removes regions of the genomic reference sequence where repeated sequences are present.
3. The method of claim 1, wherein step (2) uses a non-fault tolerant alignment scheme and does not allow for empty base gaps.
4. The method of claim 1, wherein step (3) comprises partitioning the genomic reference sequence into contiguous windows of equal length over the uniquely aligned regions.
5. The method of claim 1, wherein the correction in step (4) comprises GC correction, alignment correction, and data volume correction.
6. The method of claim 1, wherein the normalization in step (5) is performed by comparison to a control set, which is the mean of the corrected effective data amounts for each window from a negative sample of a given size.
7. The method of claim 1, wherein the reference range of copy number values is calculated by: statistically modeling, with a mathematical statistical model, the copy number states and corresponding copy number values of a set of samples of a given size with known copy number states, and then calculating the copy number value reference ranges corresponding to the different copy number states at the 99% significance level.
8. The method of claim 1, wherein the reference ranges of copy numbers are as follows:
-for autosomes: 1.78 ≤ SCN ≤ 2.24 is judged as normal diploid for the corresponding chromosome detection region; SCN ≥ 2.72 is judged as duplication; SCN ≤ 1.16 is judged as deletion; 1.16 < SCN < 1.78 is judged as haploid mosaic; and 2.24 < SCN < 2.72 is judged as triploid mosaic;
-for sex chromosomes: when the Y chromosome is not detected, the X chromosome is judged by the autosomal criteria; when the Y chromosome is detected, 0.84 ≤ SCN ≤ 1.16 is judged as normal haploid for the corresponding chromosome detection region; SCN ≥ 1.78 is judged as duplication; SCN < 0.84 is judged as deletion; and 1.16 < SCN < 1.78 is judged as diploid mosaic.
9. An apparatus for detecting copy number variation, comprising:
-a sequence acquisition unit for acquiring a genomic sequencing sequence of a sample to be tested;
-a sequence alignment unit for aligning the sequenced sequence to a human genome reference sequence and determining the position of the unique alignment to the genome reference sequence;
-an effective data volume statistics unit for dividing the genome reference sequence into equal length windows and counting the number of uniquely aligned sequencing sequences falling into each window to obtain an effective data volume for each window;
-a valid data amount correction unit for correcting the number of sequences per window;
-an effective data amount normalization unit for normalizing the corrected effective data amount to obtain an effective depth value for each window;
-a copy number variation region identification unit for filtering noise and identifying potential copy number variation regions by constraints on the difference terms;
-a copy number calculation unit for calculating a copy number value (SCN) of the corresponding region;
-a detection result output unit for comparing the SCN with a reference range of copy numbers and outputting a copy number variation detection result.
10. The apparatus of claim 9, wherein the effective data amount correction unit includes a GC correction module, an alignment correction module, and a data amount correction module.
11. A detection apparatus for copy number variation, comprising:
a memory configured to store one or more programs;
a processing unit coupled to the memory and configured to execute the one or more programs to cause the apparatus to perform a plurality of actions, the actions comprising:
(1) obtaining a genome sequencing sequence of a sample to be detected;
(2) aligning the sequencing sequence to a human genome reference sequence and determining the position of unique alignment to the genome reference sequence;
(3) dividing the genome reference sequence into equal-length windows, and counting the number of uniquely aligned sequencing sequences falling into each window to obtain the effective data volume of each window;
(4) carrying out dynamic data correction on the effective data volume of each window to obtain the corrected effective data volume of each window;
(5) standardizing the corrected effective data quantity to obtain an effective depth value CR of each window;
(6) filtering noise by using a Fused Lasso algorithm, and identifying potential copy number variation areas through the constraint of a difference term;
(7) calculating a copy number value (SCN) in the potential copy number variation region according to the following formula, and comparing the SCN with a reference range of the copy number value to obtain an accurate copy number variation detection result:
$SCN = CR^{*} \times CN_{Norm}$
wherein $CR^{*}$ is the effective depth value in the potential copy number variation region, and $CN_{Norm}$ is the theoretical copy number of the potential copy number variation region in a negative sample, which is 2 for autosomes and female X chromosomes and 1 for male sex chromosomes.
12. A computer readable storage medium having stored thereon machine executable instructions which, when executed, cause a machine to perform the steps of the method of any one of claims 1 to 8.
CN201910389538.9A 2019-05-10 2019-05-10 Method and device for detecting genome copy number variation Pending CN111916150A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910389538.9A CN111916150A (en) 2019-05-10 2019-05-10 Method and device for detecting genome copy number variation

Publications (1)

Publication Number Publication Date
CN111916150A true CN111916150A (en) 2020-11-10

Family

ID=73242222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910389538.9A Pending CN111916150A (en) 2019-05-10 2019-05-10 Method and device for detecting genome copy number variation

Country Status (1)

Country Link
CN (1) CN111916150A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112599189A (en) * 2020-12-29 2021-04-02 北京优迅医学检验实验室有限公司 Data quality evaluation method for whole genome sequencing and application thereof
CN113299342A (en) * 2021-06-17 2021-08-24 苏州贝康医疗器械有限公司 Copy number variation detection method and device based on chip data
CN113327646A (en) * 2021-06-30 2021-08-31 南京医基云医疗数据研究院有限公司 Sequencing sequence processing method and device, storage medium and electronic equipment
CN113571132A (en) * 2021-09-24 2021-10-29 苏州赛美科基因科技有限公司 Method for judging sample degradation based on CNV result
CN113674803A (en) * 2021-08-30 2021-11-19 广州燃石医学检验所有限公司 Detection method of copy number variation and application thereof
CN114582427A (en) * 2022-03-22 2022-06-03 成都基因汇科技有限公司 Method for identifying introgression section and computer readable storage medium
CN114758720A (en) * 2022-06-14 2022-07-15 北京贝瑞和康生物技术有限公司 Methods, apparatus, and media for detecting copy number variation
CN114792548A (en) * 2022-06-14 2022-07-26 北京贝瑞和康生物技术有限公司 Methods, apparatus and media for correcting sequencing data, detecting copy number variations
CN114999573A (en) * 2022-04-14 2022-09-02 哈尔滨因极科技有限公司 Genome variation detection method and detection system
CN115273984A (en) * 2022-09-30 2022-11-01 北京诺禾致源科技股份有限公司 Method and device for identifying genome tandem repeat region
CN113327646B (en) * 2021-06-30 2024-04-23 南京医基云医疗数据研究院有限公司 Sequencing sequence processing method and device, storage medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150012252A1 (en) * 2012-01-20 2015-01-08 Bgi Diagnosis Co., Ltd. Method and system for determining whether copy number variation exists in sample genome, and computer readable medium
CN104745718A (en) * 2015-04-23 2015-07-01 北京嘉宝仁和医疗科技有限公司 Method for detecting chromosome microdeletion and micro-duplication of human embryo
CN105574361A (en) * 2015-11-05 2016-05-11 上海序康医疗科技有限公司 Method for detecting variation of copy numbers of genomes
WO2018161245A1 (en) * 2017-03-07 2018-09-13 深圳华大基因研究院 Method and device for detecting chromosomal variations
US20180327844A1 (en) * 2015-11-16 2018-11-15 Sequenom, Inc. Methods and processes for non-invasive assessment of genetic variations

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150012252A1 (en) * 2012-01-20 2015-01-08 Bgi Diagnosis Co., Ltd. Method and system for determining whether copy number variation exists in sample genome, and computer readable medium
CN104745718A (en) * 2015-04-23 2015-07-01 北京嘉宝仁和医疗科技有限公司 Method for detecting chromosome microdeletion and micro-duplication of human embryo
CN105574361A (en) * 2015-11-05 2016-05-11 上海序康医疗科技有限公司 Method for detecting variation of copy numbers of genomes
US20180327844A1 (en) * 2015-11-16 2018-11-15 Sequenom, Inc. Methods and processes for non-invasive assessment of genetic variations
WO2018161245A1 (en) * 2017-03-07 2018-09-13 深圳华大基因研究院 Method and device for detecting chromosomal variations

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ROBERT TIBSHIRANI ET AL.: "Sparsity and smoothness via the fused lasso", 《J. R. STATIST. SOC. B》, pages 91 *
张环: "Fused-LASSO惩罚最小一乘回归的统计分析与优化算法", 《中国优秀硕士学位论文全文数据库 基础科学辑》, pages 002 - 138 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112599189A (en) * 2020-12-29 2021-04-02 北京优迅医学检验实验室有限公司 Data quality evaluation method for whole genome sequencing and application thereof
CN113299342A (en) * 2021-06-17 2021-08-24 苏州贝康医疗器械有限公司 Copy number variation detection method and device based on chip data
CN113299342B (en) * 2021-06-17 2024-03-15 苏州贝康医疗器械有限公司 Copy number variation detection method and detection device based on chip data
CN113327646A (en) * 2021-06-30 2021-08-31 南京医基云医疗数据研究院有限公司 Sequencing sequence processing method and device, storage medium and electronic equipment
CN113327646B (en) * 2021-06-30 2024-04-23 南京医基云医疗数据研究院有限公司 Sequencing sequence processing method and device, storage medium and electronic equipment
CN113674803B (en) * 2021-08-30 2023-08-08 广州燃石医学检验所有限公司 Copy number variation detection method, device, storage medium and application thereof
CN113674803A (en) * 2021-08-30 2021-11-19 广州燃石医学检验所有限公司 Detection method of copy number variation and application thereof
CN113571132A (en) * 2021-09-24 2021-10-29 苏州赛美科基因科技有限公司 Method for judging sample degradation based on CNV result
CN114582427A (en) * 2022-03-22 2022-06-03 成都基因汇科技有限公司 Method for identifying introgression section and computer readable storage medium
CN114999573A (en) * 2022-04-14 2022-09-02 哈尔滨因极科技有限公司 Genome variation detection method and detection system
CN114792548B (en) * 2022-06-14 2022-09-09 北京贝瑞和康生物技术有限公司 Methods, apparatus and media for correcting sequencing data, detecting copy number variations
CN114758720B (en) * 2022-06-14 2022-09-02 北京贝瑞和康生物技术有限公司 Method, apparatus and medium for detecting copy number variation
CN114792548A (en) * 2022-06-14 2022-07-26 北京贝瑞和康生物技术有限公司 Methods, apparatus and media for correcting sequencing data, detecting copy number variations
CN114758720A (en) * 2022-06-14 2022-07-15 北京贝瑞和康生物技术有限公司 Methods, apparatus, and media for detecting copy number variation
CN115273984A (en) * 2022-09-30 2022-11-01 北京诺禾致源科技股份有限公司 Method and device for identifying genome tandem repeat region
CN115273984B (en) * 2022-09-30 2022-11-29 北京诺禾致源科技股份有限公司 Method and device for identifying genome tandem repeat region

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40040594

Country of ref document: HK

CB02 Change of applicant information

Address after: 102206 8th Floor 801, No. 5 Building, No. 4 Life Garden Road, Changping District Science Park, Beijing

Applicant after: BERRY GENOMICS Co.,Ltd.

Address before: 102299 room 801, floor 8, building 5, courtyard 4, shengshengyuan Road, science and Technology Park, Changping District, Beijing

Applicant before: BERRY GENOMICS Co.,Ltd.