CN111599407B

CN111599407B - Method and device for detecting copy number variation

Info

Publication number: CN111599407B
Application number: CN202010403942.XA
Authority: CN
Inventors: 曹善柏; 王文平; 张萌萌; 郭璟; 楼峰
Original assignee: Beijing Xiangxin Medical Technology Co ltd; Tianjin Xiangxin Biotechnology Co ltd; Beijing Xiangxin Biotechnology Co ltd
Current assignee: Beijing Xiangxin Medical Technology Co ltd; Tianjin Xiangxin Biotechnology Co ltd; Beijing Xiangxin Biotechnology Co ltd
Priority date: 2020-05-13
Filing date: 2020-05-13
Publication date: 2021-10-15
Anticipated expiration: 2040-05-13
Also published as: CN111599407A

Abstract

The invention provides a method and a device for detecting copy number variation. The detection method comprises the following steps: obtaining sequencing comparison data of a sample to be detected; calculating the sequencing depth of each base site in the sequencing comparison data; dividing the reference genome into a plurality of bins, and calculating the copy number of each bin of the sample to be detected by using the sequencing depth of each base site; and merging bins whose copy numbers differ from the ploidy of the designated contig to obtain the region where germline copy number variation occurs. Compared with the chip method in the prior art, the method has higher coverage, higher resolution and more accurate copy number evaluation when detecting CNV; it can detect not only copy number variation at certain known sites but also previously unknown copy number variation, which improves detection sensitivity.

Description

Method and device for detecting copy number variation

Technical Field

The invention relates to the field of biological information analysis, in particular to a method and a device for detecting copy number variation.

Background

CNV refers to copy number polymorphisms greater than 1 kb in length and is a type of genomic structural variation (SV), including deletions, insertions, duplications, and complex multi-site variants. One of the mechanisms by which CNVs arise is DNA recombination, including non-allelic homologous recombination (NAHR), non-homologous end-joining (NHEJ), and the like. CNVs caused by DNA recombination can affect gene expression in several ways: (1) gene dosage; (2) gene disruption; (3) gene fusion; (4) position effects; (5) unmasking of recessive alleles, and the like.

The detection of CNV is currently performed by the following methods:

In the multiplex ligation-dependent probe amplification (MLPA) technique, two adjacent probes are designed for each target gene to be detected. After the probes hybridize to the target sequence, the two adjacent probes are joined by a ligation reaction, and the ligation products are amplified by PCR with universal primers. The amount of ligation product is directly proportional to the copy number of the target gene, so the gene copy number can be analyzed from the electrophoresis results after PCR amplification.

In chip technology, the targets of interest are made into a microarray chip, and key regions of the genome are scanned systematically. At present, the widely used chips are mainly Comparative Genomic Hybridization (CGH) chips and SNP chips. This technique can only detect known CNVs.

Disclosure of Invention

The invention mainly aims to provide a method and a device for detecting copy number variation, so as to solve the problem of low sensitivity of mutation detection in the prior art.

In order to achieve the above object, according to one aspect of the present invention, there is provided a method for detecting copy number variation, the method comprising: obtaining sequencing comparison data of a sample to be detected; calculating the sequencing depth of each base site in the sequencing comparison data; dividing the reference genome into a plurality of bins, and calculating the copy number of each bin of the sample to be detected by using the sequencing depth of each base site; and merging bins whose copy numbers differ from the ploidy of the designated contig to obtain the region where germline copy number variation occurs.

Further, obtaining sequencing comparison data of the sample to be tested comprises: obtaining sequencing raw data of the sample to be tested; and performing quality control on the sequencing raw data to obtain the sequencing comparison data. Preferably, performing quality control on the sequencing raw data to obtain the sequencing comparison data comprises: preprocessing the sequencing raw data to remove at least one of the following kinds of reads and obtain preprocessed data: (1) reads containing an adapter; (2) reads with quality lower than a threshold; comparing the preprocessed data with a reference genome sequence to obtain comparison result data; and filtering the comparison result data to remove reads with duplicate comparison results, obtaining the sequencing comparison data. More preferably, filtering the comparison result data further comprises removing reads outside the target capture region.

Further, before the copy number of each bin of the sample to be detected is calculated in a mode of dividing the reference genome into a plurality of bins, the detection method further comprises the steps of dividing the reference genome into a plurality of bins and carrying out normalization processing on the sequencing depth of each bin of the sample to be detected; then calculating the copy number of each bin by using the sequencing depth after normalization; preferably, the normalization process comprises: establishing a normalization model by using a principal component analysis method according to the sequencing depth of a sample for establishing a base line in each bin; normalizing the sequencing depth of each bin in the sample to be tested by using a normalization model; preferably, the Viterbi algorithm is used to calculate the copy number of each bin of the sample to be tested, using the sequencing depth after normalization.

Further, combining bins with copy numbers different from the ploidy of the designated contigs to obtain regions with copy number variation comprises: screening bins with copy numbers different from the ploidy of the designated contig according to the copy number of each bin to obtain a differential bin set; combining a plurality of different bins belonging to the same exon of the same gene in the differential bin set to obtain a region with copy number variation.

According to a second aspect of the present application, there is provided a device for detecting copy number variation, the device comprising: the acquisition module is used for acquiring sequencing comparison data of a sample to be detected; the depth calculation module is used for calculating the sequencing depth of each base site in the sequencing comparison data; the copy number calculation module is used for dividing the reference genome into a plurality of bins and calculating the copy number of each bin of the sample to be detected by using the sequencing depth of each base locus; and the merging module is used for merging bins with copy numbers different from the ploidy of the designated contigs to obtain a region with germline copy number variation.

Further, the acquisition module includes: an acquisition submodule for acquiring sequencing raw data of the sample to be detected; and a quality control module for performing quality control on the sequencing raw data to obtain the sequencing comparison data. Preferably, the quality control module comprises: a removing module for preprocessing the sequencing raw data to remove at least one of the following kinds of reads and obtain preprocessed data: (1) reads containing an adapter; (2) reads with quality lower than a threshold; a comparison module for comparing the preprocessed data with the reference genome sequence to obtain comparison result data; and a first filtering module for filtering the comparison result data to remove reads with duplicate comparison results, obtaining the sequencing comparison data. More preferably, the quality control module further includes a second filtering module for filtering the comparison result data to remove reads outside the target capture region.

Further, the copy number calculation module includes: the normalization module is used for dividing the reference genome into a plurality of bins and carrying out normalization processing on the sequencing depth of each bin of the sample to be tested; a copy number calculation submodule for calculating the copy number of each bin using the normalized sequencing depth; preferably, the normalization module comprises: the model establishing module is used for establishing a normalization model by utilizing a principal component analysis method according to the sequencing depth of the sample for establishing the base line in each bin; the normalization submodule is used for normalizing the sequencing depth of each bin in the sample to be tested by using the normalization model; more preferably, the copy number calculation sub-module is a Viterbi module.

Further, the merging module includes: the screening module is used for screening bins with copy numbers different from the ploidy of the designated contig according to the copy number of each bin to obtain a differential bin set; and the merging submodule is used for merging a plurality of different bins belonging to the same exon of the same gene in the differential bin set to obtain a region with copy number variation.

According to a third aspect of the present application, there is provided a storage medium including a stored program, wherein the apparatus on which the storage medium is located is controlled to perform any one of the above-described copy number variation detection methods when the program is executed.

According to a fourth aspect of the present application, there is provided a processor for executing a program, wherein the program executes any one of the above methods for detecting copy number variation.

By applying the technical scheme of the invention, the ploidy (i.e., the copy number) of each bin is obtained in a bin-based manner and compared with the ploidy of the designated contig; differential bins that were split into a plurality of different bins but belong to the same gene on the same chromosome are then merged, so that deletions or duplications of gene exons longer than 1000 bp are detected. Compared with the chip method in the prior art, the method has higher coverage, higher resolution and more accurate copy number evaluation when detecting CNV; it can detect not only copy number variation at certain known sites but also previously unknown copy number variation, which improves detection sensitivity.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a flow chart illustrating a method for detecting copy number variation according to a preferred embodiment of the present invention;

FIG. 2 is a flowchart showing details of a method for detecting copy number variation according to example 2 of the present invention;

FIG. 3 is a graph showing verification of the detection result of copy number variation of a known sample according to example 3 of the present invention;

FIG. 4 is a schematic structural diagram of a copy number variation detection apparatus according to a preferred embodiment of the present invention.

Detailed Description

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail with reference to examples.

Interpretation of terms:

Somatic CNV: copy number alterations/aberrations (CNAs) resulting from copy number changes in somatic tissue (e.g., only in tumor tissue); normal tissue is often required as a control in assays.

Germline CNV: copy number alterations/aberrations (CNAs) resulting from copy number changes in germline cells (and therefore present in all tissue cells).

Reads: the sequences generated by a high-throughput sequencing platform are called reads.

Contig: assembly software joins reads based on the overlap regions between them; a sequence obtained by this assembly is called a contig.

Designated contig: refers to the contigs of the reference genome of the species to be tested. The designated contigs of human are the 24 chromosomes. Ploidy of the designated contig: the ploidy of autosomes is 2, and the ploidy of the X and Y chromosomes is 1.

Sequencing depth: the ratio of the total number of bases obtained by sequencing to the size of the genome to be detected. For example, if a genome is 2 Mb in size and is sequenced to a depth of 10X, the total amount of data obtained is 20 Mb.
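By way of illustration only, the arithmetic in this definition can be sketched in Python. The function names `sequencing_depth` and `required_bases` are hypothetical and are not part of the described method.

```python
def sequencing_depth(total_bases: int, genome_size: int) -> float:
    """Average depth = total sequenced bases / size of the genome to be detected."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def required_bases(genome_size: int, depth: float) -> float:
    """Total amount of data needed to reach a given average depth."""
    return genome_size * depth


# The example from the definition: a 2 Mb genome sequenced at 10X needs 20 Mb of data.
depth = sequencing_depth(total_bases=20_000_000, genome_size=2_000_000)
data = required_bases(genome_size=2_000_000, depth=10)
```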

As mentioned in the background art, the existing CNV detection method can only detect the known CNV, but cannot detect other possible unknown CNVs, and therefore, in order to overcome the defect of low detection sensitivity of the prior art, the present application proposes a new improvement scheme.

Example 1

In this embodiment, a method for detecting copy number variation is provided, as shown in fig. 1, the method includes:

step S101, obtaining sequencing comparison data of a sample to be detected;

step S103, calculating the sequencing depth of each base site in the sequencing comparison data;

step S104, dividing the reference genome into a plurality of bins, and calculating the copy number of each bin of the sample to be detected by using the sequencing depth of each base locus;

step S107, bins with copy numbers different from the ploidy of the designated contig are combined to obtain a region in which germline copy number variation occurs.

In this method for detecting copy number variation, the ploidy (i.e., the copy number) of each bin is obtained in a bin-based manner and compared with the ploidy of the designated contig; differential bins that were split into a plurality of different bins but belong to the same gene on the same chromosome are then merged, so that deletions or duplications of gene exons longer than 1000 bp are detected. Compared with the chip method in the prior art, the method has higher coverage, higher resolution and more accurate copy number evaluation when detecting CNV; it can detect not only copy number variation at certain known sites but also previously unknown copy number variation, which improves detection sensitivity.

It should be noted that raw data coming off the sequencer generally need to be preprocessed into valid data (clean data) before they can be used for data processing. Such a data preprocessing step is also included in the present application. However, the preprocessing differs slightly depending on whether the data to be analyzed are whole genome sequencing data or sequencing data from a capture library targeting a region of interest. When the data come from a capture library, the preprocessing also includes a quality control step that removes reads outside the target region.

In a preferred embodiment, obtaining sequencing alignment data for a test sample comprises: obtaining sequencing original data of a sample to be detected; and performing quality control on the sequencing original data to obtain sequencing comparison data.

In some preferred embodiments, performing quality control on the sequencing raw data to obtain the sequencing alignment data comprises: preprocessing the sequencing raw data to remove at least one of the following kinds of reads and obtain preprocessed data: (1) reads containing an adapter; (2) reads with quality lower than a threshold; comparing the preprocessed data with a reference genome sequence to obtain comparison result data; and filtering the comparison result data to remove reads with duplicate comparison results, obtaining the sequencing comparison data. When the sequencing data are whole genome sequencing data, the result of this embodiment is the valid data obtained by preprocessing and quality control of the whole genome data. When the sequencing data are capture library sequencing data, the result of this embodiment is likewise valid data after conventional quality control; although they contain a small number of sequences from non-target regions, these have little influence on the detection result.

Reads with quality lower than the threshold (i.e., low-quality reads) include: reads containing more than one N base, and reads in which the average sequencing quality of 5 consecutive nucleotides is below a threshold such as 20 or 30. Low quality is used herein in the same sense as in the conventional high-throughput sequencing art and refers broadly to data that cannot be processed efficiently or that significantly and adversely affect the processing results. An N base indicates a base in the raw sequencing data that could not be determined. Many software tools can report the sequencing quality of each base, so reads in which the average sequencing quality of 5 consecutive nucleotides is below 20 or 30 can be screened out conveniently.
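As a minimal sketch of the low-quality criteria just described, the following Python function flags a read that has more than one N base or whose mean quality over any 5 consecutive bases falls below a threshold. The function name, the parameter names and the use of a list of integer quality scores are assumptions made for illustration.

```python
from typing import List


def is_low_quality(seq: str, quals: List[int], max_n: int = 1,
                   window: int = 5, min_mean_q: float = 20.0) -> bool:
    """Return True if the read should be discarded as low quality."""
    # Criterion 1: the read contains more than `max_n` undetermined (N) bases.
    if seq.upper().count("N") > max_n:
        return True
    # Criterion 2: some window of `window` consecutive bases has a mean
    # quality below `min_mean_q`. A running sum keeps this linear in read length.
    if len(quals) >= window:
        window_sum = sum(quals[:window])
        if window_sum / window < min_mean_q:
            return True
        for i in range(window, len(quals)):
            window_sum += quals[i] - quals[i - window]
            if window_sum / window < min_mean_q:
                return True
    return False
```

The threshold of 20 follows the text; passing `min_mean_q=30.0` gives the stricter variant mentioned above.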

In other preferred embodiments, in the step of filtering the comparison result data, the step of filtering and removing reads outside the target capture area is further included, so that the validity of the comparison data of the target capture area is further improved, the interference of the comparison data of the non-target area is avoided, and the accuracy of subsequent analysis is improved.

In a preferred embodiment, before the reference genome is divided into a plurality of bins and the copy number of each bin of the sample to be detected is calculated, the detection method further comprises dividing the reference genome into a plurality of bins and performing normalization processing on the sequencing depth of each bin of the sample to be detected; the sequencing depth after normalization was then used to calculate the copy number of each bin.

In further preferred embodiments, the normalization process comprises: establishing a normalization model by using a principal component analysis method according to the sequencing depth of a sample for establishing a base line in each bin; normalizing the sequencing depth of each bin in the sample to be tested by using a normalization model; preferably, the Viterbi algorithm is used to calculate the copy number of each bin of the sample to be tested, using the sequencing depth after normalization.

The above-described embodiment normalizes the sequencing depth of the sample to be tested against a sequencing depth baseline established beforehand from control samples (e.g., healthy samples), and then obtains the ploidy of each bin using the Viterbi algorithm. The Viterbi algorithm uses dynamic programming to find the most likely state sequence of an HMM (hidden Markov model), and thereby determines whether the state of the sample to be tested in a given bin is neutral, deleted or duplicated. The matrix used by the HMM is obtained by transformation according to the PCA model.
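For illustration, a three-state Viterbi decoder over normalized depth values might look like the sketch below. The Gaussian emission model, the state means of -3, 0 and +3, the standard deviation and the transition probability `p_change` are all assumptions chosen for the example; the source does not specify the HMM parameters.

```python
import math
from typing import List, Sequence

STATES = ("deletion", "neutral", "duplication")


def gaussian_logpdf(x: float, mean: float, sd: float) -> float:
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(z_scores: Sequence[float], means=(-3.0, 0.0, 3.0),
            sd: float = 1.0, p_change: float = 1e-3) -> List[str]:
    """Most likely state per bin, decoded in log space by dynamic programming."""
    if not z_scores:
        return []
    n = len(STATES)
    log_stay = math.log(1.0 - p_change)
    log_switch = math.log(p_change / (n - 1))
    # Assumed prior: bins start in the neutral state with high probability.
    log_init = [log_switch, log_stay, log_switch]
    score = [log_init[s] + gaussian_logpdf(z_scores[0], means[s], sd) for s in range(n)]
    back = []  # back[t][s] = best previous state for state s at bin t + 1
    for z in z_scores[1:]:
        new_score, pointers = [], []
        for s in range(n):
            candidates = [score[p] + (log_stay if p == s else log_switch) for p in range(n)]
            best_prev = max(range(n), key=lambda p: candidates[p])
            new_score.append(candidates[best_prev] + gaussian_logpdf(z, means[s], sd))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [STATES[s] for s in path]
```

Because switching states is penalized, a single noisy bin is not called on its own; only a run of consistently low or high bins is decoded as a deletion or duplication.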

Principal Component Analysis (PCA) is a statistical method for data dimensionality reduction. The principle of PCA is to transform, by orthogonal transformation, a set of possibly correlated variables into a set of linearly uncorrelated variables, which are called the principal components. In this way the factors with the largest influence in multidimensional data can be extracted for analysis, which simplifies data processing and reduces the bias of the analysis result. The process of PCA normalization is as follows: a sequencing depth matrix is established from the divided bins (for example, T bins) and the samples used to construct the baseline (for example, S samples); the mean sequencing depth of each bin across the samples is taken, and principal components of length T are established, the PCA model comprising M component vectors.
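One possible reading of this normalization is sketched below in plain Python: the per-bin means and the leading principal components are fitted on the baseline matrix (S samples by T bins), and the test sample is centered and has its projection onto those components removed. Extracting components by power iteration with deflation, and all function and parameter names, are assumptions for illustration; the source does not state how the PCA model is computed or applied.

```python
import math
import random
from typing import List, Sequence, Tuple


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def fit_baseline_pca(baseline: List[List[float]], n_components: int = 1,
                     n_iter: int = 100, seed: int = 0) -> Tuple[List[float], List[List[float]]]:
    """Return (per-bin means, list of unit-length component vectors of length T)."""
    n_bins = len(baseline[0])
    means = [sum(row[j] for row in baseline) / len(baseline) for j in range(n_bins)]
    residual = [[x - m for x, m in zip(row, means)] for row in baseline]
    rng = random.Random(seed)
    components: List[List[float]] = []
    for _component in range(n_components):
        v = [rng.gauss(0.0, 1.0) for _j in range(n_bins)]
        converged = True
        for _step in range(n_iter):
            scores = [_dot(row, v) for row in residual]              # X v
            w = [sum(s * row[j] for s, row in zip(scores, residual))  # X^T (X v)
                 for j in range(n_bins)]
            norm = math.sqrt(_dot(w, w))
            if norm < 1e-12:      # no variance left to explain
                converged = False
                break
            v = [x / norm for x in w]
        if not converged:
            break
        components.append(v)
        # Deflate: remove the found direction before extracting the next one.
        deflated = []
        for row in residual:
            s = _dot(row, v)
            deflated.append([x - s * vj for x, vj in zip(row, v)])
        residual = deflated
    return means, components


def normalize_sample(depths: Sequence[float], means: Sequence[float],
                     components: List[List[float]]) -> List[float]:
    """Center a test sample on the baseline and remove the baseline components."""
    centered = [d - m for d, m in zip(depths, means)]
    for v in components:
        s = _dot(centered, v)
        centered = [c - s * vj for c, vj in zip(centered, v)]
    return centered
```

In this sketch, systematic variation shared by the baseline samples is absorbed by the components, so what remains in the normalized depth is mainly sample-specific signal such as a drop in one bin.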

In a preferred embodiment, the combining bins with copy numbers different from the ploidy of the designated contigs to obtain the region with copy number variation comprises: screening bins with copy numbers different from the ploidy of the designated contig according to the copy number of each bin to obtain a differential bin set; combining a plurality of different bins belonging to the same exon of the same gene in the differential bin set to obtain a region with copy number variation.

In the CNV detection, the reference genome is divided into different bins according to a predetermined size (e.g., 50 bp, 100 bp, etc.). The ploidy of a designated contig is the ploidy preset for each contig of the reference genome; for example, for the human genome (22 autosomes plus X and Y), the ploidy is 2 for each autosomal contig and 1 for each of the X and Y chromosomes. The ploidy of the designated contig is compared with the detected copy number of each bin, thereby selecting the differential bins.
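The binning and screening step can be sketched as follows. The contig naming (`chr1`, `chrX`, `chrY`), the half-open coordinate convention and the function names are assumptions for illustration; the ploidy values follow the text.

```python
from typing import Dict, List, NamedTuple, Sequence, Tuple


class Bin(NamedTuple):
    contig: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def make_bins(contig_lengths: Dict[str, int], bin_size: int = 100) -> List[Bin]:
    """Divide each contig into consecutive bins of `bin_size` (last bin may be shorter)."""
    bins = []
    for contig, length in contig_lengths.items():
        for start in range(0, length, bin_size):
            bins.append(Bin(contig, start, min(start + bin_size, length)))
    return bins


def designated_ploidy(contig: str) -> int:
    """Preset ploidy: 2 for autosomes, 1 for the X and Y chromosomes."""
    return 1 if contig in ("chrX", "chrY") else 2


def differential_bins(bins: Sequence[Bin], copy_numbers: Sequence[int]) -> List[Tuple[Bin, int]]:
    """Keep the bins whose copy number differs from the designated contig ploidy."""
    return [(b, cn) for b, cn in zip(bins, copy_numbers)
            if cn != designated_ploidy(b.contig)]
```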

The merging described above refers to merging regions of the same exon of the same gene that were split into different bins. In the CNV detection, the reference genome is divided into different bins according to a set size (50 bp, 100 bp, etc.) and detection is performed bin by bin; the bins whose copy number (or ploidy) differs from the ploidy of the reference genome are obtained and then merged according to the exons of the genes. The above embodiments merge bins with differential ploidy that belong to the same gene on the same chromosome, and finally report only deletions or duplications of gene exons longer than 1000 bp.
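A minimal sketch of this merging step is given below, assuming each differential bin has already been annotated with a gene name and an exon number. The `Call` record, the rule that only touching or overlapping bins with the same copy number are joined, and the gene name `GENE_A` in the usage example are hypothetical; the 1000 bp reporting cutoff follows the text.

```python
from typing import List, NamedTuple, Sequence


class Call(NamedTuple):
    contig: str
    start: int        # 0-based, inclusive
    end: int          # 0-based, exclusive
    gene: str
    exon: int
    copy_number: int


def merge_calls(calls: Sequence[Call], min_length: int = 1000) -> List[Call]:
    """Merge differential bins of the same exon of the same gene, then
    report only merged regions longer than `min_length` bp."""
    merged: List[Call] = []
    ordered = sorted(calls, key=lambda c: (c.contig, c.gene, c.exon, c.copy_number, c.start))
    for c in ordered:
        last = merged[-1] if merged else None
        same_group = (last is not None and
                      (last.contig, last.gene, last.exon, last.copy_number) ==
                      (c.contig, c.gene, c.exon, c.copy_number))
        if same_group and c.start <= last.end:
            # Adjacent or overlapping bin: extend the current region.
            merged[-1] = last._replace(end=max(last.end, c.end))
        else:
            merged.append(c)
    return [r for r in merged if r.end - r.start > min_length]
```

For example, twelve consecutive 100 bp bins with copy number 1 in exon 2 of `GENE_A` merge into one 1200 bp deletion that is reported, while three such bins in another exon merge into a 300 bp region that is filtered out.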

Example 2

The present embodiment provides a specific method for detecting copy number variation, as shown in fig. 2, which includes the following steps:

1. data pre-processing

The input is an NGS data file in fastq format, and the output is a sequence comparison result in bam format.

1) Preprocessing the raw sequencer output to remove low-quality reads and reads containing adapters;

2) comparing the processed raw data with a reference genome to obtain a comparison result file in bam format;

3) removing the sequence comparison results outside the range of the capture chip;

4) removing duplicate reads from the comparison result file to obtain a bam file that does not contain duplicate comparison results;
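Sub-steps 3) and 4) can be illustrated on simplified alignment records as sketched below. Treating two alignments as duplicates when they share contig, start, end and strand is an assumed simplification, and the `Alignment` record and function names are hypothetical; a real pipeline would operate on the bam file with dedicated tools.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Alignment(NamedTuple):
    read_name: str
    contig: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    strand: str  # "+" or "-"


Target = Tuple[str, int, int]  # (contig, start, end), half-open capture interval


def overlaps_target(aln: Alignment, targets: Iterable[Target]) -> bool:
    """True if the alignment overlaps at least one capture interval."""
    return any(contig == aln.contig and aln.start < end and start < aln.end
               for contig, start, end in targets)


def filter_alignments(alignments: Iterable[Alignment], targets: List[Target]) -> List[Alignment]:
    """Drop off-target alignments, then keep the first of each duplicate group."""
    seen = set()
    kept = []
    for aln in alignments:
        if not overlaps_target(aln, targets):
            continue
        key = (aln.contig, aln.start, aln.end, aln.strand)
        if key in seen:
            continue
        seen.add(key)
        kept.append(aln)
    return kept
```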

2. calculating the sequencing depth of each base site

The sequencing depth of each base site in the specified region of the bam file is calculated, and the result is written to a file in tsv format.
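As a sketch of what this step computes, per-base depth over a region can be derived from alignment intervals with a difference array. The half-open interval convention, the function names and the three-column layout (contig, position, depth) of the tsv lines are assumptions for illustration.

```python
from typing import Iterable, List, Tuple


def per_base_depth(alignments: Iterable[Tuple[int, int]],
                   region_start: int, region_end: int) -> List[int]:
    """Depth at each position of [region_start, region_end) from half-open
    alignment intervals (start, end) on the same contig."""
    diff = [0] * (region_end - region_start + 1)
    for start, end in alignments:
        s = max(start, region_start)
        e = min(end, region_end)
        if s >= e:
            continue  # alignment lies outside the region
        diff[s - region_start] += 1   # coverage begins here
        diff[e - region_start] -= 1   # coverage ends just before here
    depths, running = [], 0
    for d in diff[:-1]:
        running += d
        depths.append(running)
    return depths


def to_tsv_lines(contig: str, region_start: int, depths: List[int]) -> List[str]:
    """Format one tab-separated line per base: contig, 0-based position, depth."""
    return [f"{contig}\t{region_start + i}\t{d}" for i, d in enumerate(depths)]
```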

3. determining the CNV-occurring region according to the sequencing depth and the designated contig ploidy

1) Sequencing depth normalization

A model is established by the PCA algorithm according to the sequencing depth, at each site, of the samples used to establish the baseline, and the depth of the sample to be analyzed is normalized.

2) The genome is divided into bins, and the copy number of each bin is determined using the Viterbi algorithm.

3) Bins whose copy numbers differ from the set contig ploidy are merged to obtain the region in which CNV occurs.

4. annotation and filtering of CNV

The regions where CNV occurs are annotated with genes, and adjacent regions that were split into different bins are merged according to the exon/intron to which they belong.

Example 3

The method of Example 2 was used to test positive samples with known copy number variation. The results are shown in FIG. 3, in which the horizontal axis represents the sample, the vertical axis represents the copy number, dark gray represents the known result of the positive sample, and light gray represents the detection result. The detection results are consistent with the known results, showing that the detection method detects all of the variant sites and does so with high accuracy.

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for causing a computing device to execute the methods according to the embodiments of the present invention or a processor to execute the methods according to the embodiments of the present invention.

Example 4

The present embodiment provides a device for detecting copy number variation, as shown in fig. 4, the device includes: the system comprises an acquisition module 20, a depth calculation module 40, a copy number calculation module 60 and a merging module 80, wherein the acquisition module 20 is used for acquiring sequencing comparison data of a sample to be detected; the depth calculation module 40 is used for calculating the sequencing depth of each base site in the sequencing alignment data; the copy number calculation module 60 is configured to calculate the copy number of each bin of the sample to be detected by using the sequencing depth of each base site in a manner of dividing the reference genome into a plurality of bins; the merging module 80 is used to merge bins with copy numbers different from the ploidy of the designated contigs, and obtain the region with copy number variation.
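Purely as an illustration of how the four modules chain together, the device can be modeled as a small Python class whose fields are interchangeable callables. The class name, field names and type signatures are hypothetical; the comments map each field to the reference numerals used above.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class CnvDetectionDevice:
    acquisition_module: Callable[[str], Any]                         # module 20
    depth_module: Callable[[Any], Sequence[float]]                   # module 40
    copy_number_module: Callable[[Sequence[float]], Sequence[int]]   # module 60
    merging_module: Callable[[Sequence[int]], List[Any]]             # module 80

    def detect(self, sample_id: str) -> List[Any]:
        """Run the modules in order and return the regions with copy number variation."""
        alignments = self.acquisition_module(sample_id)
        depths = self.depth_module(alignments)
        copy_numbers = self.copy_number_module(depths)
        return self.merging_module(copy_numbers)


# Toy stand-ins for each module, only to show the data flow.
device = CnvDetectionDevice(
    acquisition_module=lambda sample_id: [10.0, 5.0, 10.0],
    depth_module=lambda alignments: alignments,
    copy_number_module=lambda depths: [round(d / 5.0) for d in depths],
    merging_module=lambda cns: [i for i, cn in enumerate(cns) if cn != 2],
)
regions = device.detect("sample-1")
```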

The device for detecting copy number variation obtains the sequencing depth of each base site through the acquisition module and the depth calculation module, uses the copy number calculation module to obtain the ploidy (i.e., the copy number) of each bin in a bin-based manner, and then, through the merging module, compares the copy number of each bin with the ploidy of the designated contig and merges the differential bins that were split into a plurality of different bins but belong to the same gene on the same chromosome, thereby detecting deletions or duplications of gene exons whose length exceeds a certain proportion. Compared with the chip method in the prior art, the device has higher coverage, higher resolution and more accurate copy number evaluation when detecting CNV; it can detect not only copy number variation at certain known sites but also previously unknown copy number variation, which improves detection sensitivity.

In a preferred embodiment, the obtaining module includes: the acquisition submodule is used for acquiring sequencing original data of a sample to be detected; and the quality control module is used for performing quality control on the sequencing original data to obtain sequencing comparison data.

In a preferred embodiment, the quality control module comprises: a removing module for preprocessing the sequencing raw data to remove at least one of the following kinds of reads and obtain preprocessed data: (1) reads containing an adapter; (2) reads with quality lower than a threshold; a comparison module for comparing the preprocessed data with the reference genome sequence to obtain comparison result data; and a first filtering module for filtering the comparison result data to remove reads with duplicate comparison results, obtaining the sequencing comparison data.

In a preferred embodiment, the quality control module further includes a second filtering module for filtering the comparison result data to remove reads outside the target capture region.

In a preferred embodiment, the copy number calculation module comprises: the normalization module is used for dividing the reference genome into a plurality of bins and carrying out normalization processing on the sequencing depth of each bin of the sample to be tested; and the copy number calculation submodule is used for calculating the copy number of each bin by using the sequencing depth after normalization.

In a preferred embodiment, the normalization module comprises: the model establishing module is used for establishing a normalization model by utilizing a principal component analysis method according to the sequencing depth of the sample for establishing the base line in each bin; the normalization submodule is used for normalizing the sequencing depth of each bin in the sample to be tested by using the normalization model.

In a preferred embodiment, the copy number calculation submodule is a Viterbi module.

In a preferred embodiment, the merging module comprises: the screening module is used for screening bins with copy numbers different from the ploidy of the designated contig according to the copy number of each bin to obtain a differential bin set; and the merging submodule is used for merging a plurality of different bins belonging to the same exon of the same gene in the differential bin set to obtain a region with copy number variation.

Example 5

The embodiment also provides a storage medium, which comprises a stored program, wherein when the program runs, the device on which the storage medium is located is controlled to execute any one of the above methods for detecting copy number variation.

The embodiment also provides a processor, which is used for running the program, wherein when the program runs, any one of the above methods for detecting copy number variation is performed.

From the above description, it can be seen that the above-described embodiments of the present invention achieve the following technical effects: the ploidy (copy number) of each bin is obtained in a bin-based manner; specifically, the sequencing depth of the sample to be detected is normalized against a baseline, the ploidy of each bin is then obtained (using the Viterbi algorithm), the bins whose ploidy differs are found by comparison with the ploidy of the designated contig, and finally the differential bins that were split into a plurality of different bins but belong to the same gene on the same chromosome are merged, so that deletions or duplications of gene exons whose length exceeds a certain proportion are detected. Compared with the chip method in the prior art, the method has higher coverage, higher resolution and more accurate copy number evaluation when detecting CNV; it can detect not only copy number variation at certain known sites but also previously unknown copy number variation, which improves detection sensitivity.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for detecting copy number variation, the method comprising:

obtaining sequencing comparison data of a sample to be detected;

calculating the sequencing depth of each base site in the sequencing comparison data;

dividing a reference genome into a plurality of bins, and calculating the copy number of each bin of the sample to be detected by using the sequencing depth of each base site;

combining the bins whose copy number differs from the ploidy of the designated contig to obtain a region with germline copy number variation;

wherein, combining the bins with copy numbers different from the ploidy of the designated contigs to obtain the region with germline copy number variation comprises:

screening, according to the copy number of each bin, the bins whose copy number differs from the ploidy of the designated contig to obtain a differential bin set;

combining a plurality of different bins belonging to the same exon of the same gene in the differential bin set to obtain the region with the germline copy number variation.

2. The method of claim 1, wherein obtaining sequencing comparison data for a sample to be tested comprises:

obtaining sequencing original data of a sample to be detected;

and performing quality control on the sequencing original data to obtain the sequencing comparison data.

3. The detection method of claim 2, wherein performing quality control on the sequencing original data to obtain the sequencing comparison data comprises:

preprocessing the sequencing original data to remove at least one of the following kinds of reads, thereby obtaining preprocessed data: (1) reads containing a linker; (2) reads with a quality lower than a threshold value;

comparing the preprocessed data with a reference genome sequence to obtain comparison result data;

and filtering the comparison result data to remove reads with repeated comparison results, thereby obtaining the sequencing comparison data.

4. The detection method according to claim 3, wherein filtering the comparison result data further comprises filtering to remove reads outside the target capture area.

5. The method according to claim 1, wherein before calculating the copy number of each bin of the sample to be detected by dividing a reference genome into a plurality of bins, the method further comprises:

dividing a reference genome into a plurality of bins, and carrying out normalization processing on the sequencing depth of each bin of a sample to be detected;

and then using the sequencing depth after normalization to calculate the copy number of each of the bins.

6. The detection method according to claim 5, wherein the normalization process comprises:

establishing a normalization model by using a principal component analysis method according to the sequencing depth, in each bin, of the samples used for constructing the baseline;

and normalizing the sequencing depth of each bin in the sample to be tested by using the normalization model.

7. The method of claim 6, wherein the copy number of each bin of the sample to be tested is calculated from the sequencing depth after normalization by using a Viterbi algorithm.

8. An apparatus for detecting copy number variation, the apparatus comprising:

the acquisition module is used for acquiring sequencing comparison data of a sample to be detected;

the depth calculation module is used for calculating the sequencing depth of each base site in the sequencing comparison data;

the copy number calculation module is used for dividing the reference genome into a plurality of bins and calculating the copy number of each bin of the sample to be detected by using the sequencing depth of each base site;

a merging module, configured to merge bins with copy numbers different from ploidy of the designated contigs to obtain a region where germline copy number variation occurs;

wherein the merging module comprises:

the screening module is used for screening, according to the copy number of each bin, the bins whose copy number differs from the ploidy of the designated contig to obtain a differential bin set;

and the merging submodule is used for merging a plurality of different bins belonging to the same exon of the same gene in the differential bin set to obtain the region with the germline copy number variation.

9. The detection device according to claim 8, wherein the acquisition module comprises:

the acquisition submodule is used for acquiring sequencing original data of a sample to be detected;

and the quality control module is used for performing quality control on the sequencing original data to obtain the sequencing comparison data.

10. The detection device according to claim 9, wherein the quality control module comprises:

the removing module is used for preprocessing the sequencing original data to remove at least one of the following kinds of reads, thereby obtaining preprocessed data: (1) reads containing a linker; (2) reads with a quality lower than a threshold value;

the comparison module is used for comparing the preprocessed data with a reference genome sequence to obtain comparison result data;

and the first filtering module is used for filtering the comparison result data to remove reads with repeated comparison results, thereby obtaining the sequencing comparison data.

11. The detection device according to claim 10, wherein the quality control module further comprises a second filtering module for filtering the comparison result data to remove the reads outside the target capture area.

12. The detection apparatus according to claim 8, wherein the copy number calculation module comprises:

the normalization module is used for dividing the reference genome into a plurality of bins and carrying out normalization processing on the sequencing depth of each bin of the sample to be tested;

and the copy number calculation submodule is used for calculating the copy number of each bin by using the sequencing depth after normalization.

13. The detection apparatus according to claim 12, wherein the normalization module comprises:

the model establishing module is used for establishing a normalization model by using a principal component analysis method according to the sequencing depth, in each bin, of the samples used for constructing the baseline;

and the normalization submodule is used for normalizing the sequencing depth of each bin in the sample to be tested by utilizing the normalization model.

14. The detection apparatus according to claim 12, wherein the copy number calculation sub-module is a Viterbi module.

15. A storage medium comprising a stored program, wherein the program, when executed, controls a device in which the storage medium is located to perform the method for detecting copy number variation according to any one of claims 1 to 7.

16. A processor configured to execute a program, wherein the program executes the method for detecting copy number variation according to any one of claims 1 to 7.