CN111599407B - Method and device for detecting copy number variation - Google Patents

Publication number
CN111599407B
Authority
CN
China
Prior art keywords
copy number
sequencing
data
bin
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010403942.XA
Other languages
Chinese (zh)
Other versions
CN111599407A (en)
Inventor
曹善柏
王文平
张萌萌
郭璟
楼峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiangxin Medical Technology Co ltd
Tianjin Xiangxin Biotechnology Co ltd
Beijing Xiangxin Biotechnology Co ltd
Original Assignee
Beijing Xiangxin Medical Technology Co ltd
Tianjin Xiangxin Biotechnology Co ltd
Beijing Xiangxin Biotechnology Co ltd
Application filed by Beijing Xiangxin Medical Technology Co ltd, Tianjin Xiangxin Biotechnology Co ltd, Beijing Xiangxin Biotechnology Co ltd
Priority to CN202010403942.XA
Publication of CN111599407A
Application granted
Publication of CN111599407B

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10 Ploidy or copy number detection
    • G16B20/30 Detection of binding sites or motifs

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The invention provides a method and a device for detecting copy number variation. The detection method comprises the following steps: obtaining sequencing alignment data of a sample to be tested; calculating the sequencing depth of each base site in the sequencing alignment data; dividing the reference genome into a plurality of bins, and calculating the copy number of each bin of the sample to be tested from the sequencing depth of each base site; and merging bins whose copy number differs from the ploidy of the designated contig to obtain the regions in which germline copy number variation occurs. Compared with the chip method in the prior art, the method offers higher coverage, higher resolution, and more accurate copy number estimation when detecting CNV; it can detect copy number variation at known sites as well as previously unknown copy number variation, which improves detection sensitivity.

Description

Method and device for detecting copy number variation
Technical Field
The invention relates to the field of biological information analysis, in particular to a method and a device for detecting copy number variation.
Background
Copy number variation (CNV) refers to copy number polymorphism of DNA segments longer than 1 kb. It is a type of genomic structural variation (SV) and includes deletions, insertions, duplications, and complex multi-site variants. One mechanism that produces CNV is DNA recombination, including non-allelic homologous recombination (NAHR) and non-homologous end joining (NHEJ). CNV caused by DNA recombination can affect gene expression in several ways: (1) gene dosage; (2) gene disruption; (3) gene fusion; (4) position effects; (5) dominant/recessive alleles, and so on.
The detection of CNV is currently performed by the following methods:
Multiplex ligation-dependent probe amplification (MLPA): two adjacent probes are designed for each target gene to be detected. After the probes hybridize to the target sequence, the two adjacent probes are joined by a ligation reaction, and the amount of ligation product is directly proportional to the copy number of the target gene. After the ligation products are amplified by PCR with universal primers, the gene copy number is analyzed from the electrophoresis results.
Chip technology: the targets of interest are fixed on a microarray chip, and key regions of the genome are scanned systematically. The chips in wide use today are mainly comparative genomic hybridization (CGH) chips and SNP chips. This technique can only detect known CNVs.
Disclosure of Invention
The invention mainly aims to provide a method and a device for detecting copy number variation, so as to solve the problem of low sensitivity of mutation detection in the prior art.
To achieve the above object, according to one aspect of the present invention, there is provided a method for detecting copy number variation, the method comprising: obtaining sequencing alignment data of a sample to be tested; calculating the sequencing depth of each base site in the sequencing alignment data; dividing the reference genome into a plurality of bins, and calculating the copy number of each bin of the sample to be tested from the sequencing depth of each base site; and merging bins whose copy number differs from the ploidy of the designated contig to obtain the regions in which germline copy number variation occurs.
Further, obtaining the sequencing alignment data of the sample to be tested comprises: obtaining raw sequencing data of the sample to be tested; and performing quality control on the raw sequencing data to obtain the sequencing alignment data. Preferably, the quality control comprises: preprocessing the raw sequencing data to remove at least one of (1) reads containing a linker and (2) reads with quality below a threshold, thereby obtaining preprocessed data; aligning the preprocessed data to the reference genome sequence to obtain alignment result data; and filtering the alignment result data to remove duplicate reads, thereby obtaining the sequencing alignment data. More preferably, filtering the alignment result data further comprises removing reads outside the target capture region.
Further, before the copy number of each bin of the sample to be tested is calculated, the detection method further comprises dividing the reference genome into a plurality of bins and normalizing the sequencing depth of each bin of the sample to be tested; the normalized sequencing depth is then used to calculate the copy number of each bin. Preferably, the normalization comprises: establishing a normalization model by principal component analysis from the sequencing depth, in each bin, of the samples used to establish the baseline; and normalizing the sequencing depth of each bin of the sample to be tested with the normalization model. Preferably, the Viterbi algorithm is used to calculate the copy number of each bin of the sample to be tested from the normalized sequencing depth.
Further, combining bins with copy numbers different from the ploidy of the designated contigs to obtain regions with copy number variation comprises: screening bins with copy numbers different from the ploidy of the designated contig according to the copy number of each bin to obtain a differential bin set; combining a plurality of different bins belonging to the same exon of the same gene in the differential bin set to obtain a region with copy number variation.
According to a second aspect of the present application, there is provided a device for detecting copy number variation, the device comprising: the acquisition module is used for acquiring sequencing comparison data of a sample to be detected; the depth calculation module is used for calculating the sequencing depth of each base site in the sequencing comparison data; the copy number calculation module is used for dividing the reference genome into a plurality of bins and calculating the copy number of each bin of the sample to be detected by using the sequencing depth of each base locus; and the merging module is used for merging bins with copy numbers different from the ploidy of the designated contigs to obtain a region with germline copy number variation.
Further, the acquisition module includes: an acquisition submodule for acquiring raw sequencing data of the sample to be tested; and a quality control module for performing quality control on the raw sequencing data to obtain the sequencing alignment data. Preferably, the quality control module comprises: a removing module for preprocessing the raw sequencing data to remove at least one of (1) reads containing a linker and (2) reads with quality below a threshold, thereby obtaining preprocessed data; an alignment module for aligning the preprocessed data to the reference genome sequence to obtain alignment result data; and a first filtering module for filtering the alignment result data to remove duplicate reads, thereby obtaining the sequencing alignment data. More preferably, the quality control module further includes a second filtering module for filtering the alignment result data to remove reads outside the target capture region.
Further, the copy number calculation module includes: the normalization module is used for dividing the reference genome into a plurality of bins and carrying out normalization processing on the sequencing depth of each bin of the sample to be tested; a copy number calculation submodule for calculating the copy number of each bin using the normalized sequencing depth; preferably, the normalization module comprises: the model establishing module is used for establishing a normalization model by utilizing a principal component analysis method according to the sequencing depth of the sample for establishing the base line in each bin; the normalization submodule is used for normalizing the sequencing depth of each bin in the sample to be tested by using the normalization model; more preferably, the copy number calculation sub-module is a Viterbi module.
Further, the merging module includes: the screening module is used for screening bins with copy numbers different from the ploidy of the designated contig according to the copy number of each bin to obtain a differential bin set; and the merging submodule is used for merging a plurality of different bins belonging to the same exon of the same gene in the differential bin set to obtain a region with copy number variation.
According to a third aspect of the present application, there is provided a storage medium including a stored program, wherein the apparatus on which the storage medium is located is controlled to perform any one of the above-described copy number variation detection methods when the program is executed.
According to a fourth aspect of the present application, there is provided a processor for executing a program, wherein the program executes any one of the above methods for detecting copy number variation.
By applying the technical scheme of the invention, the ploidy (i.e., the copy number) of each bin is obtained in a bin-based manner and compared with the ploidy of the designated contig; differential bins that were split into several different bins but belong to the same gene on the same chromosome are then merged, so that deletions or duplications of gene exons longer than 1000 bp are detected. Compared with the chip method in the prior art, the method offers higher coverage, higher resolution, and more accurate copy number estimation when detecting CNV; it can detect copy number variation at known sites as well as previously unknown copy number variation, which improves detection sensitivity.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart illustrating a method for detecting copy number variation according to a preferred embodiment of the present invention;
FIG. 2 is a flowchart showing details of a method for detecting copy number variation according to example 2 of the present invention;
FIG. 3 is a graph showing verification of the detection result of copy number variation of a known sample according to example 3 of the present invention;
FIG. 4 is a schematic structural diagram of a copy number variation detection apparatus according to a preferred embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail with reference to examples.
Interpretation of terms:
Somatic CNV: copy number alterations/aberrations (CNAs) that result from copy number changes in somatic tissue (e.g., tumor tissue only); normal tissue is usually required as a control in the assay.
Germline CNV: copy number variation that results from copy number changes in germline cells (and is therefore present in all tissue cells).
Reads: sequences generated by a high-throughput sequencing platform are called reads.
Contig: assembly software joins reads based on the overlap regions between them; the sequence obtained by this assembly is called a contig.
Designated contig: a contig of the reference genome of the species being tested. The designated contigs for human are the 24 chromosomes. Ploidy of the designated contig: the ploidy of the autosomes is 2, and the ploidy of the X and Y chromosomes is 1.
Sequencing depth: the ratio of the total number of bases obtained by sequencing to the size of the genome being tested. For example, if a genome is 2 Mb in size and is sequenced to a depth of 10X, the total amount of data obtained is 20 Mb.
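The depth definition above is simple arithmetic, and a minimal sketch may make the example concrete. The function names `mean_depth` and `bases_required` are illustrative only and do not come from the patent.

```python
def mean_depth(total_bases: int, genome_size: int) -> float:
    """Average sequencing depth: total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def bases_required(genome_size: int, depth: float) -> float:
    """Total amount of data needed to reach a given average depth."""
    return genome_size * depth


# The example from the text: a 2 Mb target sequenced at 10X needs 20 Mb of data.
print(bases_required(2_000_000, 10))       # 20000000
print(mean_depth(20_000_000, 2_000_000))   # 10.0
```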
As mentioned in the background art, the existing CNV detection method can only detect the known CNV, but cannot detect other possible unknown CNVs, and therefore, in order to overcome the defect of low detection sensitivity of the prior art, the present application proposes a new improvement scheme.
Example 1
In this embodiment, a method for detecting copy number variation is provided, as shown in fig. 1, the method includes:
step S101, obtaining sequencing comparison data of a sample to be detected;
step S103, calculating the sequencing depth of each base site in the sequencing comparison data;
step S104, dividing the reference genome into a plurality of bins, and calculating the copy number of each bin of the sample to be detected by using the sequencing depth of each base locus;
step S107, bins with copy numbers different from the ploidy of the designated contig are combined to obtain a region in which germline copy number variation occurs.
This method obtains the ploidy (i.e., the copy number) of each bin in a bin-based manner, compares it with the ploidy of the designated contig, and merges differential bins that were split into several different bins but belong to the same gene on the same chromosome, thereby detecting deletions or duplications of gene exons longer than 1000 bp. Compared with the chip method in the prior art, the method offers higher coverage, higher resolution, and more accurate copy number estimation when detecting CNV; it can detect copy number variation at known sites as well as previously unknown copy number variation, which improves detection sensitivity.
It should be noted that raw data (Raw Data) coming off the sequencer generally has to be preprocessed into valid data (Clean Data) before it can be used for data processing. The present application also includes such a data preprocessing step. The preprocessing differs slightly, however, depending on whether the data to be analyzed are whole-genome sequencing data or sequencing data from a capture library targeting specific regions. When the data come from a capture library, the preprocessing also includes a quality control step that removes reads outside the target region.
In a preferred embodiment, obtaining sequencing alignment data for a test sample comprises: obtaining sequencing original data of a sample to be detected; and performing quality control on the sequencing original data to obtain sequencing comparison data.
In some preferred embodiments, performing quality control on the raw sequencing data to obtain the sequencing alignment data comprises: preprocessing the raw sequencing data to remove at least one of (1) reads containing a linker and (2) reads with quality below a threshold, thereby obtaining preprocessed data; aligning the preprocessed data to the reference genome sequence to obtain alignment result data; and filtering the alignment result data to remove duplicate reads, thereby obtaining the sequencing alignment data. When the sequencing data are whole-genome sequencing data, this embodiment yields the valid, preprocessed whole-genome data after quality control. When the sequencing data come from a capture library, this embodiment likewise yields valid data after conventional quality control; although the data still contain a small number of sequences from non-target regions, these have little influence on the detection result.
Reads with quality below the threshold (i.e., low-quality reads) include: reads containing more than one N base, and reads in which the average sequencing quality of 5 consecutive nucleotides is below a threshold such as 20 or 30. "Low quality" is used here in the same sense as in conventional high-throughput sequencing and refers broadly to data that cannot be processed effectively or that would significantly degrade the processing results. An N base indicates a base that could not be determined in the raw sequencing data. Many software tools can report the sequencing quality of each base, so reads in which the average quality of 5 consecutive nucleotides is below 20 or 30 can be screened out easily.
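The two low-quality criteria above can be sketched as a small filter. This is only an illustration under stated assumptions: the function names, the default thresholds (`max_n=1`, `window=5`, `min_mean_q=20`), and the Phred+33 quality encoding are choices made for the example and are not specified by the patent.

```python
from typing import List


def phred33_to_scores(quality_string: str) -> List[int]:
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def is_low_quality(seq: str, quals: List[int], max_n: int = 1,
                   window: int = 5, min_mean_q: float = 20.0) -> bool:
    """Return True if the read fails either criterion described in the text."""
    # Criterion 1: the read contains more than `max_n` undetermined (N) bases.
    if seq.upper().count("N") > max_n:
        return True
    # Criterion 2: some run of `window` consecutive bases has a mean quality
    # below `min_mean_q`. A running sum avoids re-adding each window.
    if len(quals) >= window:
        window_sum = sum(quals[:window])
        if window_sum / window < min_mean_q:
            return True
        for i in range(window, len(quals)):
            window_sum += quals[i] - quals[i - window]
            if window_sum / window < min_mean_q:
                return True
    return False


print(is_low_quality("ACGTACGT", [30] * 8))                          # False
print(is_low_quality("ANGTNCGT", [30] * 8))                          # True
print(is_low_quality("ACGTACGT", [30, 30, 10, 10, 10, 10, 10, 30]))  # True
```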
In other preferred embodiments, in the step of filtering the comparison result data, the step of filtering and removing reads outside the target capture area is further included, so that the validity of the comparison data of the target capture area is further improved, the interference of the comparison data of the non-target area is avoided, and the accuracy of subsequent analysis is improved.
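The two alignment-level filters described in this embodiment (removing duplicate reads and removing reads outside the target capture region) might be sketched as follows. The `Alignment` record, the half-open coordinate convention, and the duplicate definition (same contig, start, end, and strand) are simplifying assumptions made for illustration; production tools apply more elaborate duplicate criteria to bam records.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Alignment(NamedTuple):
    read_name: str
    contig: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    strand: str


# Target capture regions per contig, as half-open (start, end) intervals.
Targets = Dict[str, List[Tuple[int, int]]]


def overlaps_target(aln: Alignment, targets: Targets) -> bool:
    """True if the alignment overlaps any capture interval on its contig."""
    return any(aln.start < t_end and t_start < aln.end
               for t_start, t_end in targets.get(aln.contig, []))


def filter_alignments(alignments: Iterable[Alignment],
                      targets: Targets) -> List[Alignment]:
    """Drop off-target alignments, then keep the first of each duplicate set."""
    seen = set()
    kept = []
    for aln in alignments:
        if not overlaps_target(aln, targets):
            continue
        key = (aln.contig, aln.start, aln.end, aln.strand)
        if key in seen:
            continue
        seen.add(key)
        kept.append(aln)
    return kept


alignments = [
    Alignment("r1", "chr1", 100, 250, "+"),
    Alignment("r2", "chr1", 100, 250, "+"),    # duplicate of r1
    Alignment("r3", "chr1", 5000, 5150, "-"),  # outside the capture region
]
print([a.read_name for a in filter_alignments(alignments, {"chr1": [(0, 1000)]})])
```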
In a preferred embodiment, before the copy number of each bin of the sample to be tested is calculated, the detection method further comprises dividing the reference genome into a plurality of bins and normalizing the sequencing depth of each bin of the sample to be tested; the normalized sequencing depth is then used to calculate the copy number of each bin.
In further preferred embodiments, the normalization process comprises: establishing a normalization model by using a principal component analysis method according to the sequencing depth of a sample for establishing a base line in each bin; normalizing the sequencing depth of each bin in the sample to be tested by using a normalization model; preferably, the Viterbi algorithm is used to calculate the copy number of each bin of the sample to be tested, using the sequencing depth after normalization.
The above embodiment normalizes the sequencing depth of the sample to be tested against a sequencing depth baseline established from control samples (e.g., healthy samples) before testing, and then obtains the ploidy of each bin using the Viterbi algorithm. The Viterbi algorithm is a dynamic programming algorithm that finds the most likely sequence of hidden states of a hidden Markov model (HMM); here it determines whether the state of the sample to be tested in a given bin is neutral, deleted, or duplicated. The HMM matrix is transformed according to the PCA model.
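A minimal sketch of Viterbi decoding over three hidden states (deletion, neutral, duplication) is given below. The patent does not disclose its HMM parameters, so everything numeric here is an assumption for illustration: Gaussian emissions on a normalized depth score with means -3, 0, and 3, a shared standard deviation of 1, a probability of 0.9 of staying in the same state, and a strong prior for starting in the neutral state.

```python
import numpy as np

STATES = ("deletion", "neutral", "duplication")


def viterbi(obs, means=(-3.0, 0.0, 3.0), sd=1.0, p_stay=0.9, p_neutral_start=0.98):
    """Most likely state path for a sequence of normalized per-bin depths."""
    obs = np.asarray(obs, dtype=float)
    n, k = len(obs), len(STATES)
    if n == 0:
        return []
    rest = (1.0 - p_neutral_start) / 2.0
    log_init = np.log([rest, p_neutral_start, rest])
    # Transition matrix: stay with p_stay, otherwise move uniformly.
    trans = np.full((k, k), (1.0 - p_stay) / (k - 1))
    np.fill_diagonal(trans, p_stay)
    log_trans = np.log(trans)
    # Gaussian log-likelihoods; the normalizing constant is identical for
    # all states (shared sd), so it is dropped.
    log_emit = -0.5 * ((obs[:, None] - np.asarray(means)[None, :]) / sd) ** 2

    score = np.empty((n, k))
    back = np.zeros((n, k), dtype=int)
    score[0] = log_init + log_emit[0]
    for t in range(1, n):
        cand = score[t - 1][:, None] + log_trans   # cand[i, j]: state i -> j
        back[t] = np.argmax(cand, axis=0)
        score[t] = cand[back[t], np.arange(k)] + log_emit[t]

    # Trace back from the best final state.
    path = [int(np.argmax(score[-1]))]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    path.reverse()
    return [STATES[i] for i in path]


print(viterbi([0.1, -0.2, 0.0, -3.1, -2.9, -3.2, 0.1, 0.0]))
```

With these assumed parameters, the three strongly negative bins in the middle of the example are decoded as a deletion and the surrounding bins as neutral.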
Principal component analysis (PCA) is a statistical method for dimensionality reduction. PCA applies an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables, which are called the principal components. In this way, the factors with the greatest influence in multidimensional data can be extracted for analysis, which simplifies data processing and reduces bias in the analysis results. PCA normalization proceeds as follows: a sequencing depth matrix is built from the divided bins (for example, T bins) and the samples used to construct the baseline (for example, S samples); a PCA formula of length T is then established using the mean sequencing depth of the samples in each bin, where each PCA formula comprises M vectors.
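One plausible reading of this PCA normalization is sketched below: the per-bin mean and the top M principal components are learned from a baseline depth matrix (S samples by T bins), and both are then removed from the test sample. The patent does not give the exact formula, so the function names, the use of SVD, and the choice to subtract the projection onto the top components are assumptions made for illustration.

```python
import numpy as np


def fit_pca_baseline(baseline_depth, n_components):
    """Learn the per-bin mean and top principal components from baseline samples.

    baseline_depth: array of shape (S, T), S baseline samples by T bins.
    Returns (bin_mean of shape (T,), components of shape (M, T)).
    """
    x = np.asarray(baseline_depth, dtype=float)
    bin_mean = x.mean(axis=0)
    centered = x - bin_mean
    # Rows of vt are orthonormal principal directions in bin space.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return bin_mean, vt[:n_components]


def normalize_sample(sample_depth, bin_mean, components):
    """Remove the baseline mean and the systematic components from one sample."""
    x = np.asarray(sample_depth, dtype=float) - bin_mean
    # Subtract the projection of x onto the span of the components.
    return x - components.T @ (components @ x)


# Toy baseline: every sample is the mean profile plus a multiple of one
# systematic pattern (standing in for a batch or capture-efficiency effect).
mean_profile = np.array([100.0, 50.0, 80.0, 120.0])
pattern = np.array([1.0, -1.0, 2.0, 0.0])
baseline = np.array([mean_profile + a * pattern for a in (-2, -1, 0, 1, 2)])

bin_mean, components = fit_pca_baseline(baseline, n_components=1)

# A test sample with the same systematic effect plus a real drop in the last bin.
sample = mean_profile + 3 * pattern + np.array([0.0, 0.0, 0.0, -40.0])
print(np.round(normalize_sample(sample, bin_mean, components), 6))
```

After normalization the systematic pattern is gone and only the drop in the last bin remains, which is the kind of signal the downstream copy number step is meant to pick up.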
In a preferred embodiment, the combining bins with copy numbers different from the ploidy of the designated contigs to obtain the region with copy number variation comprises: screening bins with copy numbers different from the ploidy of the designated contig according to the copy number of each bin to obtain a differential bin set; combining a plurality of different bins belonging to the same exon of the same gene in the differential bin set to obtain a region with copy number variation.
In CNV detection, the reference genome is divided into bins of a preset size (e.g., 50 bp, 100 bp, etc.). The ploidy of a designated contig is the ploidy preset for each contig of the reference genome; for the human genome (22 autosomes plus X and Y), for example, the ploidy is 2 for each autosomal contig and 1 for each of the X and Y chromosomes. The ploidy of the designated contig is compared with the detected copy number of each bin in order to select the differential bins.
The merging described above refers to merging the regions of the same exon of the same gene that were split into different bins. Because the reference genome is divided into bins of a set size (50 bp, 100 bp, etc.) and detection is performed bin by bin, the result of detection is a set of bins whose copy number (or ploidy) differs from the ploidy of the reference genome; these bins are then merged according to the exons of the genes. The above embodiment merges the bins with differential ploidy that belong to the same gene on the same chromosome, and finally reports only deletions or duplications of gene exons longer than 1000 bp.
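The binning and comparison steps can be sketched in a few lines. The contig names (`chr1` to `chr22`, `chrX`, `chrY`), the half-open coordinates, and the helper names are assumptions for this example; the ploidy table follows the values stated in the text.

```python
from typing import Dict, List, Sequence, Tuple

# Preset ploidy of the designated human contigs, as stated in the text:
# 2 for each autosome, 1 for X and for Y.
EXPECTED_PLOIDY: Dict[str, int] = {f"chr{i}": 2 for i in range(1, 23)}
EXPECTED_PLOIDY.update({"chrX": 1, "chrY": 1})


def make_bins(contig_length: int, bin_size: int) -> List[Tuple[int, int]]:
    """Split a contig into consecutive half-open bins; the last may be shorter."""
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    return [(start, min(start + bin_size, contig_length))
            for start in range(0, contig_length, bin_size)]


def bin_mean_depth(per_base_depth: Sequence[float],
                   bins: Sequence[Tuple[int, int]]) -> List[float]:
    """Average the per-base depth inside each bin."""
    return [sum(per_base_depth[s:e]) / (e - s) for s, e in bins]


def differential_bins(contig: str, copy_numbers: Sequence[int]) -> List[int]:
    """Indices of bins whose copy number differs from the contig's preset ploidy."""
    expected = EXPECTED_PLOIDY[contig]
    return [i for i, cn in enumerate(copy_numbers) if cn != expected]


print(make_bins(250, 100))                          # [(0, 100), (100, 200), (200, 250)]
print(differential_bins("chr1", [2, 2, 1, 3, 2]))   # [2, 3]
print(differential_bins("chrX", [1, 2, 1]))         # [1]
```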
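The merge-and-report step might look like the sketch below. The `DiffBin` record and its annotation fields are hypothetical, and grouping by (contig, gene, exon, event type) and reporting the span of each group is one simplified reading of the text. The 1000 bp cutoff comes from the description, while the 500 bp bins in the usage example are chosen only to keep the example short.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class DiffBin:
    """A differential bin with hypothetical gene/exon annotation attached."""
    contig: str
    start: int        # 0-based, inclusive
    end: int          # 0-based, exclusive
    copy_number: int
    gene: str
    exon: int


Region = Tuple[str, int, int, str, int, str]


def merge_bins(bins: Sequence[DiffBin], ploidy: Dict[str, int],
               min_length: int = 1000) -> List[Region]:
    """Merge differential bins sharing contig, gene, exon, and event type,
    then keep only merged regions longer than `min_length` bases."""
    def group_key(b: DiffBin):
        kind = "deletion" if b.copy_number < ploidy[b.contig] else "duplication"
        return (b.contig, b.gene, b.exon, kind)

    ordered = sorted(bins, key=lambda b: (group_key(b), b.start))
    regions: List[Region] = []
    for (contig, gene, exon, kind), group in groupby(ordered, key=group_key):
        members = list(group)
        start = min(b.start for b in members)
        end = max(b.end for b in members)
        if end - start > min_length:
            regions.append((contig, start, end, gene, exon, kind))
    return regions


bins = [
    DiffBin("chr1", 1000, 1500, 1, "GENE_A", 2),
    DiffBin("chr1", 1500, 2000, 1, "GENE_A", 2),
    DiffBin("chr1", 2000, 2500, 1, "GENE_A", 2),
    DiffBin("chr1", 3000, 3500, 3, "GENE_B", 1),   # only 500 bp, filtered out
]
print(merge_bins(bins, {"chr1": 2}))
```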
Example 2
The present embodiment provides a specific method for detecting copy number variation, as shown in fig. 2, which includes the following steps:
1. Data preprocessing
NGS data in fastq format are taken as input, and sequence alignment results in bam format are generated.
1) Preprocess the raw off-machine data to remove low-quality reads and reads containing linkers (adapters);
2) Align the processed data to the reference genome to obtain an alignment result file in bam format;
3) Remove alignment results that fall outside the range of the capture chip;
4) Remove duplicate reads from the alignment result file to obtain a bam file free of duplicate alignments.
2. Calculating the sequencing depth of each base site
The sequencing depth of each base site in the specified region is calculated from the bam file, yielding a file in tsv format.
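As an illustration of this step, the sketch below computes per-base depth over a region from a list of aligned intervals and renders it as tab-separated text. Reading the bam file itself is left out; the half-open interval input, the three-column layout (contig, 1-based position, depth), and the function names are assumptions, since the patent does not specify the tsv columns.

```python
from typing import Iterable, List, Tuple


def per_base_depth(intervals: Iterable[Tuple[int, int]],
                   region_start: int, region_end: int) -> List[int]:
    """Depth at each position of [region_start, region_end) given half-open
    aligned intervals, computed with a difference array."""
    length = region_end - region_start
    diff = [0] * (length + 1)
    for start, end in intervals:
        start = max(start, region_start)
        end = min(end, region_end)
        if start < end:
            diff[start - region_start] += 1
            diff[end - region_start] -= 1
    depth, running = [], 0
    for delta in diff[:-1]:
        running += delta
        depth.append(running)
    return depth


def to_tsv(contig: str, region_start: int, depth: List[int]) -> str:
    """Render depth as lines of contig, 1-based position, and depth."""
    return "".join(f"{contig}\t{region_start + i + 1}\t{d}\n"
                   for i, d in enumerate(depth))


depth = per_base_depth([(0, 3), (2, 5)], 0, 6)
print(depth)                     # [1, 1, 2, 1, 1, 0]
print(to_tsv("chr1", 0, depth), end="")
```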
3. Determining the regions in which CNV occurs according to the sequencing depth and the ploidy of the designated contigs
1) Sequencing depth normalization
A model is built with the PCA algorithm from the sequencing depth, at each site, of the samples used to establish the baseline, and the depth of the sample to be analyzed is normalized.
2) The genome is divided into bins, and the copy number of each bin is determined using the Viterbi algorithm.
3) Bins whose copy number differs from the preset contig ploidy are merged to obtain the regions in which CNV occurs.
4. Annotation and filtering of CNV
The regions in which CNV occurs are annotated to genes, and adjacent regions that were split into different bins are merged according to the exon/intron to which they belong.
Example 3
Positive samples with known copy number variation were tested with the method of Example 2. The results are shown in FIG. 3, in which the horizontal axis represents the sample, the vertical axis represents the copy number, dark gray represents the known result for the positive sample, and light gray represents the detection result. The detection results are consistent with the known results: the method detects all of the variant sites and does so with high accuracy.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for causing a computing device to execute the methods according to the embodiments of the present invention or a processor to execute the methods according to the embodiments of the present invention.
Example 4
The present embodiment provides a device for detecting copy number variation, as shown in fig. 4, the device includes: the system comprises an acquisition module 20, a depth calculation module 40, a copy number calculation module 60 and a merging module 80, wherein the acquisition module 20 is used for acquiring sequencing comparison data of a sample to be detected; the depth calculation module 40 is used for calculating the sequencing depth of each base site in the sequencing alignment data; the copy number calculation module 60 is configured to calculate the copy number of each bin of the sample to be detected by using the sequencing depth of each base site in a manner of dividing the reference genome into a plurality of bins; the merging module 80 is used to merge bins with copy numbers different from the ploidy of the designated contigs, and obtain the region with copy number variation.
The detection device for copy number variation obtains the sequencing depth of each base site through the acquisition module and the depth calculation module, then adopts the copy number calculation module to obtain the ploidy (namely the copy number) of each bin in a bin mode, then compares the copy number of each bin with the ploidy of a designated contig through the merging module, and merges the differential bins which are divided into a plurality of different bins but belong to the same chromosome of the same gene, thereby detecting the deletion or the repetition of the gene exons with the length exceeding a certain proportion. Compared with the chip method in the prior art, the device has higher coverage, higher resolution and more accurate copy number evaluation when detecting the CNV, not only can detect the copy number variation condition of certain known sites, but also can detect the unknown copy number variation condition, and improves the detection sensitivity.
In a preferred embodiment, the obtaining module includes: the acquisition submodule is used for acquiring sequencing original data of a sample to be detected; and the quality control module is used for performing quality control on the sequencing original data to obtain sequencing comparison data.
In a preferred embodiment, the quality control module comprises: a removing module for preprocessing the raw sequencing data to remove at least one of (1) reads containing a linker and (2) reads with quality below a threshold, thereby obtaining preprocessed data; an alignment module for aligning the preprocessed data to the reference genome sequence to obtain alignment result data; and a first filtering module for filtering the alignment result data to remove duplicate reads, thereby obtaining the sequencing alignment data.
In a preferred embodiment, the quality control device further includes a second filtering module, configured to filter the comparison result data to filter and remove reads outside the target capture area.
In a preferred embodiment, the copy number calculation module comprises: the normalization module is used for dividing the reference genome into a plurality of bins and carrying out normalization processing on the sequencing depth of each bin of the sample to be tested; and the copy number calculation submodule is used for calculating the copy number of each bin by using the sequencing depth after normalization.
In a preferred embodiment, the normalization module comprises: the model establishing module is used for establishing a normalization model by principal component analysis according to the sequencing depth, in each bin, of the samples used to construct the baseline; and the normalization submodule is used for normalizing the sequencing depth of each bin in the sample to be tested by using the normalization model.
In a preferred embodiment, the copy number calculation submodule is a Viterbi module.
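One possible reading of the PCA-based normalization is sketched below: the leading principal component of the baseline samples is treated as systematic noise and subtracted from the test sample. Removing exactly one component, finding it by power iteration, and returning a centered residual are all assumptions of this sketch, and the function names are hypothetical; the text itself only states that a normalization model is built by principal component analysis.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_baseline_model(baseline, n_iter=50):
    """Return per-bin baseline means and the leading principal component.

    baseline: list of samples, each a list of per-bin depths in the same bin order.
    """
    n_bins = len(baseline[0])
    means = [sum(s[j] for s in baseline) / len(baseline) for j in range(n_bins)]
    centered = [[s[j] - means[j] for j in range(n_bins)] for s in baseline]
    # Power iteration on X^T X. The start vector is an arbitrary deterministic
    # choice; a start orthogonal to the leading component would not converge to it.
    v = [float(j + 1) for j in range(n_bins)]
    norm = math.sqrt(_dot(v, v))
    v = [x / norm for x in v]
    for _ in range(n_iter):
        scores = [_dot(row, v) for row in centered]
        w = [sum(s * row[j] for s, row in zip(scores, centered)) for j in range(n_bins)]
        norm = math.sqrt(_dot(w, w))
        if norm == 0.0:
            # No variation along the current direction: remove nothing.
            return means, [0.0] * n_bins
        v = [x / norm for x in w]
    return means, v


def normalize_sample(depths, means, component):
    """Center a test sample on the baseline and remove the fitted component."""
    centered = [d - m for d, m in zip(depths, means)]
    score = _dot(centered, component)
    return [c - score * pc for c, pc in zip(centered, component)]
```

A sample that differs from the baseline only along the fitted component normalizes to approximately zero in every bin, while a localized depth change leaves a residual that a downstream copy number caller can use.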
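To show how a Viterbi module could assign a copy number to each bin, here is a minimal hidden Markov model sketch. The three states (copy numbers 1, 2, 3), the Gaussian emission model on depth ratios, and the values of `stay` and `sd` are illustrative assumptions and are not taken from the patent.

```python
import math

STATES = (1, 2, 3)  # hypothetical copy number states: loss, normal, gain


def _log_gauss(x, mu, sd):
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mu) ** 2 / (2 * sd * sd)


def viterbi_copy_number(ratios, stay=0.98, sd=0.15, ploidy=2):
    """Most likely copy number per bin, given depth ratios (1.0 = expected depth).

    Emissions: ratio ~ Normal(copy_number / ploidy, sd).
    Transitions: probability `stay` of keeping the state, the rest shared equally.
    """
    if not ratios:
        return []
    n = len(STATES)
    log_stay = math.log(stay)
    log_move = math.log((1.0 - stay) / (n - 1))

    def trans(i, j):
        return log_stay if i == j else log_move

    # Uniform prior over states for the first bin.
    score = [math.log(1.0 / n) + _log_gauss(ratios[0], cn / ploidy, sd) for cn in STATES]
    back = []
    for x in ratios[1:]:
        new_score, ptr = [], []
        for j, cn in enumerate(STATES):
            best_i = max(range(n), key=lambda i: score[i] + trans(i, j))
            new_score.append(score[best_i] + trans(best_i, j) + _log_gauss(x, cn / ploidy, sd))
            ptr.append(best_i)
        score = new_score
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return [STATES[i] for i in path]
```

With these parameters a run of several low-depth bins is called as a loss, whereas a single outlying bin is smoothed over, which is the main reason to prefer an HMM over per-bin thresholding.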
In a preferred embodiment, the merging module comprises: the screening module is used for screening bins with copy numbers different from the ploidy of the designated contig according to the copy number of each bin to obtain a differential bin set; and the merging submodule is used for merging a plurality of different bins belonging to the same exon of the same gene in the differential bin set to obtain a region with copy number variation.
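The screening and merging steps can be sketched as follows. The `Bin` record, the gene labels, and the decision to merge only bins that share the same copy number are assumptions made for this example; the text specifies only that differential bins of the same exon of the same gene are merged.

```python
from collections import defaultdict
from typing import NamedTuple


class Bin(NamedTuple):
    chrom: str
    start: int
    end: int
    gene: str
    exon: int
    copy_number: int


def merge_differential_bins(bins, contig_ploidy=2):
    """Screen bins whose copy number differs from the contig ploidy, then merge
    those that share chromosome, gene, exon and copy number into one region."""
    differential = [b for b in bins if b.copy_number != contig_ploidy]
    groups = defaultdict(list)
    for b in differential:
        groups[(b.chrom, b.gene, b.exon, b.copy_number)].append(b)
    regions = []
    for (chrom, gene, exon, cn), members in groups.items():
        start = min(m.start for m in members)
        end = max(m.end for m in members)
        regions.append(Bin(chrom, start, end, gene, exon, cn))
    return sorted(regions)
```

Each returned record spans from the first to the last differential bin of an exon, so an exon that was split across several bins is reported as a single region.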
Example 5
The embodiment also provides a storage medium, which comprises a stored program, wherein when the program runs, the device on which the storage medium is located is controlled to execute any one of the above methods for detecting copy number variation.
The embodiment also provides a processor, which is used for running the program, wherein when the program runs, any one of the above methods for detecting copy number variation is performed.
From the above description, it can be seen that the above embodiments of the present invention achieve the following technical effects. The ploidy (copy number) of each bin is obtained in a bin-based manner: the sequencing depth of the sample to be detected is first normalized against a baseline, and the ploidy of each bin is then calculated (using the Viterbi algorithm). Bins whose ploidy differs from that of the designated contig are then identified, and finally the differential bins that were split into several different bins but belong to the same exon of the same gene are merged, so that deletions or duplications covering more than a certain proportion of a gene exon are detected. Compared with the prior-art chip method, the method offers higher coverage, higher resolution, and more accurate copy number estimation when detecting CNVs; it can detect copy number variation not only at known sites but also at previously unknown sites, which improves detection sensitivity.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented on a general-purpose computing device. They may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from that described herein. They may also be fabricated as individual integrated circuit modules, or several of them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (16)

1. A method for detecting copy number variation, the method comprising:
obtaining sequencing comparison data of a sample to be detected;
calculating the sequencing depth of each base site in the sequencing comparison data;
dividing a reference genome into a plurality of bins, and calculating the copy number of each bin of the sample to be detected by using the sequencing depth of each base site;
merging the bins whose copy number differs from the ploidy of the designated contig to obtain a region with germline copy number variation;
wherein merging the bins whose copy number differs from the ploidy of the designated contig to obtain the region with germline copy number variation comprises:
screening, according to the copy number of each bin, the bins whose copy number differs from the ploidy of the designated contig to obtain a differential bin set;
and merging a plurality of different bins belonging to the same exon of the same gene in the differential bin set to obtain the region with the germline copy number variation.
2. The method of claim 1, wherein obtaining sequencing comparison data for a sample to be tested comprises:
obtaining sequencing raw data of a sample to be detected;
and performing quality control on the sequencing raw data to obtain the sequencing comparison data.
3. The detection method of claim 2, wherein performing quality control on the sequencing raw data to obtain the sequencing comparison data comprises:
preprocessing the sequencing raw data to remove at least one of the following kinds of reads and thereby obtain preprocessed data: (1) reads containing an adapter; (2) reads with quality lower than a threshold value;
comparing the preprocessed data with a reference genome sequence to obtain comparison result data;
and filtering the comparison result data to remove duplicate reads and obtain the sequencing comparison data.
4. The detection method according to claim 3, wherein filtering the comparison result data further comprises removing reads outside the target capture area.
5. The method according to claim 1, wherein, before calculating the copy number of each bin of the sample to be detected by dividing a reference genome into a plurality of bins, the method further comprises:
dividing the reference genome into a plurality of bins, and normalizing the sequencing depth of each bin of the sample to be detected;
the normalized sequencing depth is then used to calculate the copy number of each of the bins.
6. The detection method according to claim 5, wherein the normalization process comprises:
establishing a normalization model by principal component analysis according to the sequencing depth, in each bin, of the samples used to construct the baseline;
and normalizing the sequencing depth of each bin in the sample to be tested by using the normalization model.
7. The method of claim 6, wherein the copy number of each bin of the sample to be tested is calculated from the normalized sequencing depth using the Viterbi algorithm.
8. An apparatus for detecting copy number variation, the apparatus comprising:
the acquisition module is used for acquiring sequencing comparison data of a sample to be detected;
the depth calculation module is used for calculating the sequencing depth of each base site in the sequencing comparison data;
the copy number calculation module is used for dividing the reference genome into a plurality of bins and calculating the copy number of each bin of the sample to be detected by using the sequencing depth of each base site;
a merging module, configured to merge bins with copy numbers different from ploidy of the designated contigs to obtain a region where germline copy number variation occurs;
wherein the merging module comprises:
the screening module is used for screening, according to the copy number of each bin, the bins whose copy number differs from the ploidy of the designated contig to obtain a differential bin set;
and the merging submodule is used for merging a plurality of different bins belonging to the same exon of the same gene in the differential bin set to obtain the region with the germline copy number variation.
9. The detection device according to claim 8, wherein the acquisition module comprises:
the acquisition submodule is used for acquiring sequencing raw data of a sample to be detected;
and the quality control module is used for performing quality control on the sequencing raw data to obtain the sequencing comparison data.
10. The detection device according to claim 9, wherein the quality control module comprises:
the removing module is used for preprocessing the sequencing raw data to remove at least one of the following kinds of reads and thereby obtain preprocessed data: (1) reads containing an adapter; (2) reads with quality lower than a threshold value;
the comparison module is used for comparing the preprocessed data with a reference genome sequence to obtain comparison result data;
and the first filtering module is used for filtering the comparison result data to remove duplicate reads and obtain the sequencing comparison data.
11. The detection device according to claim 10, wherein the quality control module further comprises a second filtering module for filtering the comparison result data to remove reads outside the target capture area.
12. The detection apparatus according to claim 8, wherein the copy number calculation module comprises:
the normalization module is used for dividing the reference genome into a plurality of bins and carrying out normalization processing on the sequencing depth of each bin of the sample to be tested;
and the copy number calculation submodule is used for calculating the copy number of each bin by using the sequencing depth after normalization.
13. The detection apparatus according to claim 12, wherein the normalization module comprises:
the model establishing module is used for establishing a normalization model by principal component analysis according to the sequencing depth, in each bin, of the samples used to construct the baseline;
and the normalization submodule is used for normalizing the sequencing depth of each bin in the sample to be tested by utilizing the normalization model.
14. The detection apparatus according to claim 12, wherein the copy number calculation sub-module is a Viterbi module.
15. A storage medium comprising a stored program, wherein the program, when executed, controls a device in which the storage medium is located to perform the method for detecting copy number variation according to any one of claims 1 to 7.
16. A processor configured to execute a program, wherein the program executes the method for detecting copy number variation according to any one of claims 1 to 7.
CN202010403942.XA 2020-05-13 2020-05-13 Method and device for detecting copy number variation Active CN111599407B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010403942.XA CN111599407B (en) 2020-05-13 2020-05-13 Method and device for detecting copy number variation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010403942.XA CN111599407B (en) 2020-05-13 2020-05-13 Method and device for detecting copy number variation

Publications (2)

Publication Number Publication Date
CN111599407A CN111599407A (en) 2020-08-28
CN111599407B true CN111599407B (en) 2021-10-15

Family

ID=72182406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010403942.XA Active CN111599407B (en) 2020-05-13 2020-05-13 Method and device for detecting copy number variation

Country Status (1)

Country Link
CN (1) CN111599407B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766428B (en) * 2021-04-08 2021-07-02 臻和(北京)生物科技有限公司 Tumor molecule typing method and device, terminal device and readable storage medium
CN113257360B (en) * 2021-06-24 2021-10-15 北京橡鑫生物科技有限公司 Cancer screening model, and construction method and construction device of cancer screening model
CN113337501B (en) * 2021-08-06 2022-02-18 北京橡鑫生物科技有限公司 Hairpin type joint and application thereof in double-end index library construction
CN113789371B (en) * 2021-09-17 2024-09-10 广州燃石医学检验所有限公司 Batch correction-based copy number variation detection method
CN113674803B (en) * 2021-08-30 2023-08-08 广州燃石医学检验所有限公司 Copy number variation detection method, device, storage medium and application thereof
CN114400046B (en) * 2022-03-08 2022-12-13 北京吉因加医学检验实验室有限公司 Method and device for detecting gene copy number variation based on probe superposition
CN117095744A (en) * 2023-08-21 2023-11-21 上海信诺佰世医学检验有限公司 Copy number variation detection method based on single-sample high-throughput transcriptome sequencing data
CN116978453B (en) * 2023-09-22 2024-01-23 北京诺禾致源科技股份有限公司 Method and electronic device for judging authenticity of fusion gene
CN117935907B (en) * 2024-01-31 2024-09-03 苏州贝康医疗器械有限公司 Method and device for detecting copy number variation of true and false genes

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256292A (en) * 2016-12-29 2018-07-06 安诺优达基因科技(北京)有限公司 A kind of copy number variation detection device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
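Research on copy number variation detection methods and parallel algorithms for multi-core platforms (面向多核平台的拷贝数变异检测方法及并行算法研究); Liu Shaojie (刘绍颉); China Master's Theses Full-text Database, Medicine and Health Sciences (《中国优秀硕士学位论文全文数据库医药卫生科技辑》); 20170331; pp. 11-24 *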

Also Published As

Publication number Publication date
CN111599407A (en) 2020-08-28

Similar Documents

Publication Publication Date Title
CN111599407B (en) Method and device for detecting copy number variation
CN110029157B (en) Method for detecting haploid copy number variation of tumor single cell genome
CN104302781B (en) A kind of method and device detecting chromosomal structural abnormality
Schrider Background selection does not mimic the patterns of genetic diversity produced by selective sweeps
CN114999573B (en) Genome variation detection method and detection system
Staaf et al. Normalization of array-CGH data: influence of copy number imbalances
CN111863125B (en) Method for detecting single parent diploid based on NGS-trio and application
CN108304694B (en) Method for analyzing gene mutation based on second-generation sequencing data
CN105046105B (en) The Haplotype map and its construction method of chromosome span
CN112233722B (en) Variety identification method, and method and device for constructing prediction model thereof
CN106650254A (en) Method for detecting fusion gene based on transcriptome sequencing data
CN116386718B (en) Method, apparatus and medium for detecting copy number variation
Ahsan et al. A survey of algorithms for the detection of genomic structural variants from long-read sequencing data
CN113724781B (en) Method and apparatus for detecting homozygous deletions
CN114921536A (en) Method, device, storage medium and equipment for detecting uniparental diploid and loss of heterozygosity
CN111210873A (en) Exon sequencing data-based copy number variation detection method and system, terminal and storage medium
JP7333838B2 (en) Systems, computer programs and methods for determining genetic patterns in embryos
WO2003074739A2 (en) Automated allele determination using fluorometric genotyping
EP2977466B1 (en) Detecting chromosomal aneuploidy
CN107208152B (en) Method and apparatus for detecting mutant clusters
CN104598775B (en) A kind of rna editing event recognition method
CN114974415A (en) Method and device for detecting chromosome copy number abnormality
WO2022027212A1 (en) Method for detecting uniparental disomy on basis of ngs-trio and use thereof
CN109390039B (en) Method, device and storage medium for counting DNA copy number information
CN114703263B (en) Group chromosome copy number variation detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant