CN110910954A - Method and system for detecting low-depth whole genome gene copy number variation

Info

Publication number: CN110910954A
Application number: CN201911224400.XA
Authority: CN
Inventors: 顾丽朋; 陈珺
Original assignee: Hunan Jieyi Medical Laboratory Co Ltd; Shanghai Jieyi Biotechnology Co ltd
Current assignee: Hunan Jieyi Medical Laboratory Co Ltd; Shanghai Jieyi Biotechnology Co ltd
Priority date: 2019-12-04
Filing date: 2019-12-04
Publication date: 2020-03-24

Abstract

The invention provides a method and a system for detecting gene copy number variation from low-depth whole genome sequencing data. The method comprises the following steps: aligning the sample to be tested to the genome according to the control sample; counting the data blocks on the genome and the data volume of each data block; calculating the similarity of two adjacent data blocks, aggregating them if they are similar, and iterating until no data blocks on the genome can be aggregated further; and calculating, for each data block, the ratio value of the sample to be tested to the control sample, and identifying abnormal data blocks, namely candidate gene copy number variations, according to a preset normal value range. The system comprises a data preprocessing module, a data statistics module, a data aggregation module and a gene copy number variation determining module connected in sequence. The method and the system address the problems of low accuracy and low timeliness.

Description

Method and system for detecting low-depth whole genome gene copy number variation

Technical Field

The invention relates to the technical field of gene copy number variation, in particular to a method and a system for detecting low-depth whole genome gene copy number variation.

Background

With the continued development of next-generation sequencing (NGS), detecting gene copy number variation (CNV) by NGS has become preferable to chip-based methods. High-depth paired-end whole genome sequencing generally offers high sensitivity and resolution for CNV detection. However, high-depth whole genome sequencing (WGS) is very expensive, which has prompted the use of low-depth WGS for CNV detection. The development of CNV detection methods for low-depth data has nevertheless lagged behind, and when methods designed for high-depth WGS are applied directly to low-depth WGS data, neither the timeliness nor the accuracy of the analysis results is satisfactory.

Disclosure of Invention

The invention provides a method and a system for detecting low-depth whole genome gene copy number variation, which are superior to other detection methods and can solve the problems of low accuracy and low timeliness.

According to the first aspect of the invention, a method for detecting low-depth genome-wide gene copy number variation is provided, which comprises the following steps:

S11: data preprocessing, in which the sample to be tested is aligned to the genome according to the control sample;

S12: data statistics, in which the data blocks on the genome and the data volume of each data block are counted;

S13: calculating the similarity of two adjacent data blocks, aggregating them if they are similar, and iterating until no data blocks on the genome can be aggregated further;

S14: calculating, for each data block, the ratio value of the sample to be tested to the control sample, and identifying abnormal data blocks, namely candidate gene copy number variations, according to a preset normal value range.

Optionally, the control sample used for alignment to the genome in S11 is generated as follows: 1/N of the data is randomly extracted from each of N samples, and the N extracted subsets are then combined to obtain the final control sample.

Optionally, counting the data blocks on the genome in S12 specifically comprises: counting the regions in which data appears in either one or both of the sample to be tested and the control sample.

Optionally, calculating the similarity of two adjacent data blocks in S13 and aggregating them if they are similar specifically comprises: determining whether the likelihood values of the two adjacent data blocks are similar and, if so, aggregating them; further, the likelihood value formula is:

wherein n is the data volume of the sample to be detected, m is the data volume of the control sample, T is the total data volume of the sample to be detected on the whole genome, and C is the total data volume of the control sample on the whole genome.

Optionally, in S14, the ratio value of the sample to be tested to the control sample for each data block is calculated by the following formula:

Optionally, the preset normal value range in S14 is determined as follows: the results of multiple sequencing runs of the control sample are used as a control, and the normal value range is determined by the box plot method.

Optionally, before the sample to be tested is aligned to the genome according to the control sample in S11, the method further comprises: performing quality control on the sample to be tested; and/or, after the sample to be tested is aligned to the genome according to the control sample, the method further comprises: sorting and/or deduplicating the aligned data.

According to a second aspect of the present invention, there is provided a system for detecting low-depth genome-wide gene copy number variation, which implements the above detection method and comprises a data preprocessing module, a data statistics module, a data aggregation module and a gene copy number variation determining module connected in sequence, wherein:

the data preprocessing module is used for performing quality control on the sample to be tested, aligning it to the genome according to the control sample, and sorting and deduplicating the aligned data;

the data statistics module is used for counting the data blocks on the genome and the data volume of each data block;

the data aggregation module is used for calculating the similarity of two adjacent data blocks, aggregating them if they are similar, and iterating until no data blocks on the genome can be aggregated further;

the gene copy number variation determining module is used for calculating, for each data block, the ratio value of the sample to be tested to the control sample, and identifying abnormal data blocks, namely candidate gene copy number variations, according to a preset normal value range.

Compared with the prior art, the invention has the following beneficial effects:

(1) the method and the system provided by the invention divide data blocks by aggregating similar blocks, so that similar data regions are joined accurately and candidate CNV regions are delimited;

(2) the method and the system provided by the invention generate the control sample rather than using a single real sample. Compared with a real sample, the generated control is closer to ideal: its data is distributed more evenly over the genome and it approaches a "perfect CNV-free" sample. Detection against this control is more accurate and overcomes the analysis bias caused by a single real sample carrying many CNVs or containing experimental and sequencing errors;

(3) the method and the system provided by the invention count only the regions in which data appears in either one or both of the sample to be tested and the control sample, and do not count desert regions without data. This greatly reduces the consumption of statistical and computing resources and improves operating efficiency; compared with other methods, running time is reduced by a factor of several to several tens;

(4) the method and the system provided by the invention use the results of multiple sequencing runs of a sample as a control and determine the normal range of the ratio value by the box plot method, so that potential CNVs are found accurately and the detection results are more accurate and reliable.

Of course, it is not necessary for any product in which the invention is practiced to achieve all of the above-described advantages at the same time.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flow chart of a method for detecting low-depth whole genome gene copy number variation according to an embodiment of the present invention;

FIG. 2 shows the data statistics approach of a prior-art method;

FIG. 3 shows the data statistics approach according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The technical solution of the present invention will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.

FIG. 1 is a flow chart of a method for detecting low-depth whole genome gene copy number variation according to an embodiment of the present invention.

Referring to FIG. 1, the detection method of this embodiment includes the following steps:

S11: data preprocessing, in which the sample to be tested is aligned to the genome according to the control sample;

S12: data statistics, in which the data blocks on the genome and the data volume of each data block are counted;

S13: calculating the similarity of two adjacent data blocks, aggregating them if they are similar, and iterating until no data blocks on the genome can be aggregated further;

S14: calculating, for each data block, the ratio value of the sample to be tested to the control sample, and identifying abnormal data blocks, namely candidate gene copy number variations, according to a preset normal value range.

In a preferred embodiment, before the sample to be tested is aligned to the genome according to the control sample in S11, the method further comprises: performing quality control on the sample to be tested; and/or, after the sample to be tested is aligned to the genome according to the control sample, the method further comprises: sorting and/or deduplicating the aligned data.

In one embodiment, quality control of the sample to be tested in step S11 uses FastQC (a quality assessment tool for next-generation sequencing data) to check whether the quality of the sequencing data is acceptable. Alignment to the genome according to the control sample uses BWA (a tool that aligns sequences to a reference genome) to map the data back to the genome. Sorting uses samtools (a tool set for manipulating SAM and BAM files, which are alignment data formats). Deduplication uses the Picard package to remove duplicate sequences from the BAM file, eliminating redundant data introduced by the experiment.
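The preprocessing chain of this embodiment can be sketched as a list of command lines. This is a minimal sketch rather than the invocation actually used: the text names only the tools, so the subcommands and options shown (`bwa mem -o`, `samtools sort -o`, Picard `MarkDuplicates` with `REMOVE_DUPLICATES=true`), the `picard` launcher name, the single-end input, and every file name are assumptions made for illustration.

```python
from typing import List


def preprocessing_commands(fastq: str, reference: str, prefix: str) -> List[List[str]]:
    """Build the command lines for QC, alignment, sorting and deduplication.

    Nothing is executed here. The file naming scheme based on 'prefix'
    is hypothetical, and the tool options are assumptions (see lead-in).
    """
    sam = f"{prefix}.sam"
    sorted_bam = f"{prefix}.sorted.bam"
    dedup_bam = f"{prefix}.dedup.bam"
    metrics = f"{prefix}.dup_metrics.txt"
    return [
        # Quality control of the raw reads.
        ["fastqc", fastq],
        # Alignment of the reads to the (already indexed) reference genome.
        ["bwa", "mem", "-o", sam, reference, fastq],
        # Coordinate sorting of the alignments.
        ["samtools", "sort", "-o", sorted_bam, sam],
        # Removal of duplicate reads.
        ["picard", "MarkDuplicates", f"I={sorted_bam}", f"O={dedup_bam}",
         f"M={metrics}", "REMOVE_DUPLICATES=true"],
    ]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` in order; keeping command construction separate from execution makes the chain easy to inspect before anything is run.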

In a preferred embodiment, the control sample used for alignment to the genome in S11 is generated as follows: 1/N of the data is randomly extracted from each of N samples, and the N extracted subsets are then combined to obtain the final control sample. For example, with 20 samples, a random 1/20 of the data is taken from each sample; this step uses the downsampling function of samtools to perform the random sampling. After extraction, the 20 subsets are combined into a virtual sample that closely approximates a standard sample without CNV, and whose data covers the whole genome more uniformly than a real single control sample. It is therefore a near-ideal control sample. Using it as the control to detect the CNVs carried by a real sample to be tested gives higher accuracy and overcomes the analysis bias that arises when a real single sample is used as the control and itself carries many CNVs or contains experimental and sequencing errors. It should be noted that this embodiment may be implemented as a refinement of any of the above embodiments.
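The pooling idea can be illustrated in a few lines. This is a minimal sketch under stated assumptions: reads are held in memory as Python lists, which stand in for the BAM files that samtools would subsample in practice, and the function name `build_pooled_control` and the `seed` parameter are hypothetical.

```python
import random


def build_pooled_control(samples, seed=0):
    """Draw 1/N of the reads from each of N samples and merge them.

    samples: list of N lists of reads (any hashable read representation).
    Returns one list that serves as the virtual control sample.
    """
    rng = random.Random(seed)
    n_samples = len(samples)
    pooled = []
    for reads in samples:
        # 1/N of this sample's reads, rounded down, drawn without replacement.
        k = len(reads) // n_samples
        pooled.extend(rng.sample(reads, k))
    return pooled
```

Because each sample contributes only 1/N of the pool, a CNV carried by any single sample is diluted in the combined control, and the pooled control has roughly the depth of one ordinary sample.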

In a preferred embodiment, counting the data blocks (blocks) on the genome in S12 specifically comprises: counting the regions in which data appears in either one or both of the sample to be tested and the control sample, and not counting desert regions that contain no data. The traditional method counts window by window: as shown in FIG. 2, the whole genome is divided uniformly into windows (regions) of the same size. This embodiment instead counts block by block on a "what you see is what you get" principle, recording each data block actually measured on the genome (its start point, end point, and data volume, i.e. number of reads), as shown in FIG. 3. This greatly reduces the consumption of statistical and computing resources. It should be noted that this embodiment may be implemented as a refinement of any of the above embodiments.
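A block-by-block count of this kind can be sketched as follows. The text does not spell out how block boundaries are derived, so this sketch assumes that a block is a maximal run of overlapping reads from either sample on one chromosome; the `Block` class and `count_blocks` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Block:
    start: int
    end: int
    n: int  # number of reads from the sample to be tested
    m: int  # number of reads from the control sample


def count_blocks(test_reads, control_reads):
    """Merge overlapping reads from both samples into covered blocks.

    Each read is a (start, end) pair on a single chromosome, end exclusive.
    Regions with no reads in either sample never produce a block.
    """
    tagged = [(s, e, 0) for s, e in test_reads]
    tagged += [(s, e, 1) for s, e in control_reads]
    tagged.sort()
    blocks = []
    for start, end, source in tagged:
        if blocks and start < blocks[-1].end:
            # The read overlaps the current block: extend the block.
            blocks[-1].end = max(blocks[-1].end, end)
        else:
            # A gap without data precedes this read: open a new block.
            blocks.append(Block(start, end, 0, 0))
        if source == 0:
            blocks[-1].n += 1
        else:
            blocks[-1].m += 1
    return blocks
```

The work done is proportional to the number of reads rather than to the length of the genome, which is why skipping empty regions pays off at low sequencing depth.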

In a preferred embodiment, calculating the similarity of two adjacent data blocks in S13 and aggregating them if they are similar specifically comprises: determining whether the likelihood values (L1, L2) of two adjacent data blocks are similar, that is, highly correlated. The closer the two values the better; they may be identical, or differ by no more than a preset small range. If they are similar, the two data blocks are merged into one larger block, whose start is the smaller of the two start positions, whose end is the larger of the two end positions, and whose data volume is the sum of the data volumes of block1 and block2. In this way adjacent, similar data across the whole genome is merged accurately, and in the final division adjacent regions are not similar to one another. Further, the likelihood value formula is:

wherein n is the data volume of the sample to be tested, m is the data volume of the control sample, T is the total data volume of the sample to be tested on the whole genome, and C is the total data volume of the control sample on the whole genome. Further, in S14, the ratio value of the sample to be tested to the control sample for each data block is calculated by the following formula:

It should be noted that this embodiment may be implemented as a refinement of any of the above embodiments.
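The iterative aggregation of S13 and the ratio of S14 can be sketched as below. Neither formula is reproduced in the text, so both functions here are stand-ins and not the patent's formulas: `score` is a hypothetical similarity measure (a log2 depth-normalised ratio with a 0.5 pseudocount), `tol` is a hypothetical tolerance, and `ratio` assumes the depth-normalised form (n/T)/(m/C), chosen only because it equals 1 when the two samples contribute proportionally.

```python
import math


def score(n, m, T, C):
    """Stand-in similarity score; NOT the patent's likelihood formula."""
    return math.log2(((n + 0.5) / T) / ((m + 0.5) / C))


def aggregate(blocks, T, C, tol=0.5):
    """Merge adjacent blocks with similar scores until nothing changes.

    blocks: list of (start, end, n, m) tuples sorted by start.
    """
    blocks = list(blocks)
    merged_any = True
    while merged_any:
        merged_any = False
        out = blocks[:1]
        for start, end, n, m in blocks[1:]:
            p_start, p_end, p_n, p_m = out[-1]
            if abs(score(p_n, p_m, T, C) - score(n, m, T, C)) <= tol:
                # New start is the minimum, new end the maximum, and the
                # data volumes are summed, as described in the text.
                out[-1] = (min(p_start, start), max(p_end, end), p_n + n, p_m + m)
                merged_any = True
            else:
                out.append((start, end, n, m))
        blocks = out
    return blocks


def ratio(n, m, T, C):
    """Depth-normalised ratio of test to control (assumed form)."""
    if m == 0:
        return math.inf
    return (n / T) / (m / C)
```

With `T = C = 1000`, the four blocks `(0, 100, 10, 10)`, `(150, 250, 12, 11)`, `(300, 400, 30, 10)` and `(450, 550, 28, 10)` collapse into two: one with a ratio near 1 and one with a ratio near 2.9, which would then be compared against the normal value range in S14.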

In a preferred embodiment, the preset normal value range in S14 is determined as follows: the results of multiple sequencing runs of the control sample are used as a control, and the normal value range is determined by the box plot method. For example, two sequencing results of the control sample are compared with each other. In theory every block should have a ratio value of 1, that is, the case and control data on the block are present at 1:1. However, the data comes from two separate sequencing runs, so it may be affected by sequencing bias or experimental error and may contain outliers. The box plot method is therefore used: because it is only slightly affected by outliers, it delimits the normal value range accurately and avoids the missed detections that an empirically or theoretically defined threshold can cause. In one embodiment, 5 groups of samples were each sequenced twice, giving two runs of data per sample; the ratio values of this scheme were calculated for each sample, and the box plot method was applied to the ratio values of all blocks to define the normal value range, giving an upper ratio limit of 1.4 and a lower ratio limit of 0.6. When CNVs are detected in a sample, any block outside the normal value range is a potential CNV site. Since the ratio of a normal block is 1, a ratio value greater than 1.4 indicates a CNV gain (duplication), and a ratio value less than 0.6 indicates a CNV loss (deletion). It should be noted that this embodiment may be implemented as a refinement of any of the above embodiments.
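The box plot step and the subsequent classification can be sketched as follows. The text does not state how the box plot fences are computed, so this sketch assumes the common Tukey convention of 1.5 times the interquartile range beyond the quartiles; `boxplot_range`, `classify` and the multiplier `k` are hypothetical names, while the default limits 0.6 and 1.4 are the values reported in this embodiment.

```python
import statistics


def boxplot_range(ratios, k=1.5):
    """Return (lower, upper) fences of a box plot over replicate ratios.

    k = 1.5 is the usual Tukey multiplier, assumed here.
    """
    q1, _, q3 = statistics.quantiles(ratios, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def classify(ratio, lower=0.6, upper=1.4):
    """Label a block by comparing its ratio with the normal value range."""
    if ratio > upper:
        return "gain"  # duplication
    if ratio < lower:
        return "loss"  # deletion
    return "normal"
```

Because the fences depend only on the quartiles, a single extreme replicate ratio barely moves them, which is the robustness to outliers that the paragraph above relies on.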

The invention also provides a system for detecting low-depth whole genome gene copy number variation, which implements the detection method of the above embodiments. In one embodiment, it comprises a data preprocessing module, a data statistics module, a data aggregation module and a gene copy number variation determining module connected in sequence, wherein:

the data preprocessing module is used for performing quality control on the sample to be tested, aligning it to the genome according to the control sample, and sorting and deduplicating the aligned data;

the data statistics module is used for counting the data blocks on the genome and the data volume of each data block;

the data aggregation module is used for calculating the similarity of two adjacent data blocks, aggregating them if they are similar, and iterating until no data blocks on the genome can be aggregated further;

the gene copy number variation determining module is used for calculating, for each data block, the ratio value of the sample to be tested to the control sample, and identifying abnormal data blocks, namely candidate gene copy number variations, according to a preset normal value range.
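"Connected in sequence" can be read as plain function composition, where each module consumes the output of the previous one. The sketch below illustrates only that wiring; the helper `connect_in_sequence` and the toy stand-in modules in the usage note are hypothetical and carry none of the actual processing.

```python
from typing import Any, Callable, Sequence

Module = Callable[[Any], Any]


def connect_in_sequence(modules: Sequence[Module]) -> Module:
    """Chain modules so that each one receives the previous one's output."""
    def run(data: Any) -> Any:
        for module in modules:
            data = module(data)
        return data
    return run
```

For instance, `connect_in_sequence([preprocess, count, aggregate, call_cnv])(sample)` would run four placeholder callables in the order of the four modules, so any one module can be replaced or tested in isolation.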

In summary, the method and system for detecting low-depth genome-wide gene copy number variation provided by the invention can accurately connect similar data regions through a data block division mode of similar data block aggregation, thereby dividing candidate CNV regions.

Meanwhile, the invention generates the control sample by a control generation mode, and compared with a real sample, the generated control is more ideal, and the distribution of data on the genome is more even and is close to 'perfect no CNV'. The detection result with the method as the control has higher accuracy, and overcomes the analysis deviation caused by a single real sample with a large amount of CNV or experimental and sequencing errors.

In addition, the invention counts only the regions in which data appears in either one or both of the sample to be tested and the control sample, and does not count desert regions without data, which greatly reduces the consumption of statistical and computing resources and improves operating efficiency.

In addition, the invention uses the results of multiple sequencing runs of a sample as a control and determines the normal range of the ratio value by the box plot method, so that potential CNVs are found accurately and the detection results are more accurate and reliable; the detection specificity for CNVs longer than 10 kb exceeds 95%.

Every step of the method safeguards both computing speed and the accuracy of the result, so that the method is faster, more efficient and more accurate than other methods.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

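1. A method for detecting low-depth genome-wide gene copy number variation, characterized by comprising the following steps:

S11: data preprocessing, in which the sample to be tested is aligned to the genome according to the control sample;

S12: data statistics, in which the data blocks on the genome and the data volume of each data block are counted;

S13: calculating the similarity of two adjacent data blocks, aggregating them if they are similar, and iterating until no data blocks on the genome can be aggregated further;

S14: calculating, for each data block, the ratio value of the sample to be tested to the control sample, and identifying abnormal data blocks, namely candidate gene copy number variations, according to a preset normal value range.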

2. The method for detecting low-depth genome-wide gene copy number variation according to claim 1, wherein the control sample used for alignment to the genome in S11 is generated as follows: 1/N of the data is randomly extracted from each of N samples, and the N extracted subsets are then combined to obtain the final control sample.

3. The method for detecting low-depth genome-wide gene copy number variation according to claim 1, wherein counting the data blocks on the genome in S12 specifically comprises: counting the regions in which data appears in either one or both of the sample to be tested and the control sample.

4. The method for detecting low-depth genome-wide gene copy number variation according to claim 2, wherein counting the data blocks on the genome in S12 specifically comprises: counting the regions in which data appears in either one or both of the sample to be tested and the control sample.

5. The method for detecting low-depth genome-wide gene copy number variation according to any one of claims 1 to 4, wherein calculating the similarity of two adjacent data blocks in S13 and aggregating them if they are similar specifically comprises: determining whether the likelihood values of the two adjacent data blocks are similar and, if so, aggregating them; further, the likelihood value formula is:

6. The method for detecting low-depth genome-wide gene copy number variation according to any one of claims 1 to 4, wherein the preset normal value range in S14 is determined as follows: the results of multiple sequencing runs of the control sample are used as a control, and the normal value range is determined by the box plot method.

7. The method for detecting low-depth genome-wide gene copy number variation according to claim 5, wherein the preset normal value range in S14 is determined as follows: the results of multiple sequencing runs of the control sample are used as a control, and the normal value range is determined by the box plot method.

8. The method for detecting low-depth genome-wide gene copy number variation according to claim 6, wherein before the sample to be tested is aligned to the genome according to the control sample in S11, the method further comprises: performing quality control on the sample to be tested; and/or, after the sample to be tested is aligned to the genome according to the control sample, the method further comprises: sorting and/or deduplicating the aligned data.

9. The method for detecting low-depth genome-wide gene copy number variation according to claim 7, wherein before the sample to be tested is aligned to the genome according to the control sample in S11, the method further comprises: performing quality control on the sample to be tested; and/or, after the sample to be tested is aligned to the genome according to the control sample, the method further comprises: sorting and/or deduplicating the aligned data.

10. A system for detecting low-depth genome-wide gene copy number variation, which implements the method for detecting low-depth genome-wide gene copy number variation according to any one of claims 1 to 9 and comprises a data preprocessing module, a data statistics module, a data aggregation module and a gene copy number variation determining module connected in sequence.