CN114613434A

CN114613434A - Method and system for detecting gene copy number variation based on population sample depth information

Info

Publication number: CN114613434A
Application number: CN202011422203.1A
Authority: CN
Inventors: 张通达; 王琳; 尹珍珍; 杨颖�; 李建标; 郭健; 金鑫
Original assignee: BGI Shenzhen Co Ltd
Current assignee: BGI Shenzhen Co Ltd
Priority date: 2020-12-08
Filing date: 2020-12-08
Publication date: 2022-06-10

Abstract

The invention belongs to the technical field of biology and discloses a method and a system for detecting gene copy number variation based on population sample depth information. The method comprises the following steps: (1) obtaining sequencing data for a plurality of samples; (2) calculating the copy ratio of the region to be detected relative to the whole genome; (3) sorting the copy ratio values of the plurality of samples from small to large and grouping them; and (4) determining, for each group, the nearest copy number reference ratio according to the distance between the copy ratio and the copy number reference ratios, and thereby determining the copy number of each group. The method accurately detects different types of CNV based on the depth information of the population sample and calculates the copy number value; CNVs of common frequency can be accurately detected, and the specific copy number, including for duplications (DUP), can be calculated; different types of CNV in the population sample can be detected and correctly typed.

Description

Method and system for detecting gene copy number variation based on population sample depth information

Technical Field

The invention relates to the technical field of biology, in particular to a method and a system for detecting gene copy number variation based on population sample depth information.

Background

Gene copy number variation (CNV) detection generally relies on one of several kinds of sequencing information: (1) sequence assembly information, where the assembled sequence is compared with a reference genome sequence; (2) sequencing depth information, where the sequencing depth of the target region is compared with that of the surrounding regions and with the same region in a control population to see whether a difference exists; (3) read information, where paired-end (PE) alignment results are examined for abnormal insert lengths and for reads that can be split and aligned to different positions. In the prior art, the article "PSCC: Sensitive and Reliable Population-Scale Copy Number Variation Detection Method" proposes a CNV detection method (the PSCC method for short) that calls CNVs based on sequencing depth.

However, the prior art methods can only detect rare CNVs and cannot reliably detect CNVs of common frequency. To ensure accuracy, depth-based algorithms treat the population sample as a normal control, i.e. as carrying no variation; this improves accuracy but sacrifices sensitivity, especially for CNVs of common frequency. Alternatively, CNVs of common frequency are detected but with poor accuracy. The prior art methods also cannot accurately and correctly type CNVs inherited within a population, and they do not resolve the specific copy number of copy number amplifications (DUP).

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a method and a system for detecting gene copy number variation, which effectively solve the problems of common frequency CNV detection, population hereditary CNV typing and DUP copy number refinement.

Accordingly, in one aspect, the present invention provides a method for detecting gene copy number variation based on population sample depth information, the method comprising:

(1) obtaining sequencing data for a plurality of samples;

(2) for each sample, calculating the average sequencing depth of the region to be detected, calculating the average sequencing depth of the whole genome, and dividing the average sequencing depth of the region to be detected by the average depth of the whole genome to obtain the copy ratio of the region relative to the whole genome;

(3) sorting the copy ratio values of the plurality of samples from small to large, sequentially comparing each value with the next value starting from the minimum value, and combining the two samples into the same group if the difference between the two values is less than a first threshold value, until all the samples are grouped;

(4) for each group, determining the closest copy number reference ratio according to the distance between the copy ratio of the group and the copy number reference ratios, and thereby determining the copy number of each group.

In one embodiment, in (1), the plurality of samples is more than 50 samples.

In one embodiment, in (2), the average sequencing depth of the region to be detected is corrected by GC.

In one embodiment, in (3), the first threshold is less than 0.15, more preferably less than 0.12, and most preferably less than 0.1.

In one embodiment, in (3), groups in which the number of samples is less than the second threshold are removed, and groups in which the number of samples is greater than the second threshold but whose distribution does not conform to a normal distribution are also removed.

In one embodiment, determining whether a group conforms to a normal distribution includes calculating the mean, maximum, minimum, and variance of the copy ratio values of each group.

In one embodiment, the second threshold is greater than 25, preferably greater than 30.

In one embodiment, in (4), the copy number reference ratios form an arithmetic sequence starting from 0 with a common difference of 1/N, where N is the ploidy of the species. Preferably, the sample is from a diploid (2-ploid) species, and the sequence increases from 0 in steps of 0.5. For example, the copy number reference ratios include 0, 0.5, 1, 1.5, 2.

In one embodiment, in (4), the copy number to which each group belongs is determined according to the distance between the average of the copy ratio values of the group and the copy number reference ratios. Preferably, the formula for determining the copy number to which each group belongs is: copy number = N × (nearest copy number reference ratio), where N is the ploidy of the species.
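
The preferred rule above can also be written as a formula. The following is only one possible reading: the symbols \bar{c}_g (mean copy ratio of group g), r_g^{*} (nearest reference ratio) and K (largest copy number considered) are introduced here for illustration and do not appear in the original text; \operatorname* assumes the amsmath package.

```latex
\[
  r_g^{*} \;=\; \operatorname*{arg\,min}_{r \,\in\, \{0,\ 1/N,\ 2/N,\ \dots,\ K/N\}} \bigl|\, \bar{c}_g - r \,\bigr|,
  \qquad
  \mathrm{CN}_g \;=\; N \cdot r_g^{*}
\]
```

Under this reading, for a diploid species (N = 2) a group whose mean copy ratio is nearest to 0.5 is assigned a copy number of 2 × 0.5 = 1, which agrees with the diploid example given in the description.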

In another aspect, the present invention provides a system for detecting gene copy number variation based on population sample depth information, the system comprising:

the sequencing data acquisition module is used for acquiring sequencing data of the sample;

the copy ratio calculation module is used for calculating the average sequencing depth of the region to be detected based on the sequencing data of the sample, calculating the average sequencing depth of the whole genome, and dividing the average depth of the region to be detected by the average depth of the whole genome to obtain the copy ratio of the region relative to the whole genome;

the sample copy ratio grouping module is used for sorting the copy ratios of the samples from small to large, sequentially comparing each value with the next value starting from the minimum value, and combining the two samples into the same group if the difference between the two values is smaller than a first threshold value, until all the samples are grouped;

a copy number determining module, configured to determine the closest copy number reference ratio according to the distance between the copy ratio of the group and the copy number reference ratios, and thereby determine the copy number of the group, where preferably, the formula for determining the copy number to which each group belongs is: copy number = N × (nearest copy number reference ratio), where N is the ploidy of the species.

In one embodiment, the system further comprises a group verification module for performing a normal distribution test on groups having a number of samples greater than a second threshold.

In one embodiment, the sequencing data acquisition module comprises a sequencing instrument.

In one embodiment, the copy ratio calculation module is further configured to perform GC correction on the average sequencing depth of the region to be detected.

In one embodiment, the first threshold is less than 0.15, more preferably less than 0.12, and most preferably less than 0.1.

In one embodiment, the group verification module is configured to calculate the mean, maximum, minimum, and variance of the copy ratio values of each group.

In one embodiment, the second threshold is greater than 25, preferably greater than 30, more preferably greater than 50.

In one embodiment, the copy number reference ratios form an arithmetic sequence starting from 0 with a common difference of 1/N, where N is the ploidy of the species. Preferably, the sample is from a diploid (2-ploid) species, and the sequence increases from 0 in steps of 0.5. For example, the copy number reference ratios include 0, 0.5, 1, 1.5, 2.

In one embodiment, the copy number determination module determines the copy number to which each group belongs according to the distance between the average of the copy ratio values of the group and the copy number reference ratios. Preferably, the formula for determining the copy number to which each group belongs is: copy number = N × (nearest copy number reference ratio), where N is the ploidy of the species.

The method of the invention comes closer to the real situation: it accurately detects different types of CNV based on the depth information of the population sample and calculates the copy number value. It can accurately detect CNVs of common frequency and calculate the specific copy number, including for DUP. It can detect the different types of CNV present in a population sample and type them correctly.

Drawings

The invention is illustrated by the following figures.

FIG. 1 shows an exemplary distribution of the target region copy ratios of 790 samples.

FIG. 2 shows an exemplary depth and copy ratio analysis of the region chr8:39226335-39388919 of sample SZCH0056.

Detailed Description

Without wishing to be bound by any theory, the inventors hypothesize that the target region may take different copy number types in the test sample and the population samples, such as CN0, CN1, CN2, etc. For samples of the same type, such as CN2, the sequencing depth ratios calculated for the target region should follow a normal distribution, and the distributions of samples with different copy numbers should be clearly separated.

Firstly, the sum of the sequencing depths over the target region is calculated and divided by the length of the target region to obtain the average sequencing depth of the region; preferably, the average sequencing depth of the region is GC-corrected. The target region is at least 500 bp in length and may be tens of thousands of bp long; it is preferably a region known to have copy number variation, a region predicted to have copy number variation, or a candidate region for copy number variation. A long target region may be segmented, for example into segments of 500 bp or 1 kb; after the method or system of the present invention has been applied to each segment, adjacent segments with the same copy number variation result are merged to obtain the total length of the copy number variation.

The GC correction can follow the correction method of PSCC, which is briefly described as follows. The genome is divided into windows, and the depth value and the GC base ratio of each window are counted; for windows with the same GC ratio, the median of their depth values is taken as the depth value at that GC ratio. Lasso regression is then performed on all GC ratios and their corresponding depth values to obtain a regression depth value for each GC ratio, and the actual sequencing depth of a region is normalized by the regression depth at the GC ratio of that region.

Meanwhile, the average sequencing depth of the whole genome is calculated for each sample. This average may be computed over all sequenced reads of the sample; it need not cover the entire genome of the species, since the genome contains a large number of repetitive sequences that are not suitable for sequencing.
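
The GC correction outlined above could be sketched roughly as follows. This is a simplified illustration only, not the PSCC implementation: the regression smoothing step is deliberately omitted and the per-GC-ratio median is used directly as the expected depth. The function name `gc_correct`, the rounding of GC ratios to two decimals, and the rescaling to the overall median are all assumptions made for this sketch.

```python
from collections import defaultdict
from statistics import median


def gc_correct(windows):
    """Return GC-corrected depths for a list of (gc_fraction, depth) windows.

    Simplification (assumption): the median depth observed at each GC ratio
    is used as the expected depth; no regression smoothing is applied.
    """
    # Collect the depths of all windows that share the same (rounded) GC ratio.
    depths_by_gc = defaultdict(list)
    for gc, depth in windows:
        depths_by_gc[round(gc, 2)].append(depth)

    # Median depth at each GC ratio, as in the description.
    expected = {gc: median(depths) for gc, depths in depths_by_gc.items()}

    # Overall median keeps the corrected values on the original depth scale.
    overall = median(depth for _, depth in windows)

    corrected = []
    for gc, depth in windows:
        exp = expected[round(gc, 2)]
        corrected.append(depth * overall / exp if exp > 0 else 0.0)
    return corrected


# Two GC classes with different raw depths are brought to a common level.
example = [(0.40, 30.0), (0.40, 30.0), (0.60, 20.0), (0.60, 20.0)]
corrected_example = gc_correct(example)
```

In a fuller implementation the per-ratio medians would be replaced by the regression depth described in the text before normalizing.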

Then, the average sequencing depth of the region to be detected is divided by the average sequencing depth of the whole genome to obtain the copy ratio of the target region relative to the whole genome.
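
A minimal sketch of the copy ratio calculation, assuming per-base depths for the target region are already available as a list of numbers and that the whole-genome average depth has been computed separately. The function names are hypothetical.

```python
def mean_depth(per_base_depths):
    """Sum of per-base depths divided by the region length."""
    if not per_base_depths:
        raise ValueError("region must contain at least one position")
    return sum(per_base_depths) / len(per_base_depths)


def copy_ratio(region_depths, genome_mean_depth):
    """Average depth of the target region relative to the whole genome."""
    if genome_mean_depth <= 0:
        raise ValueError("whole-genome mean depth must be positive")
    return mean_depth(region_depths) / genome_mean_depth


# A region covered at about 15x in a sample sequenced at 30x overall.
ratio = copy_ratio([15, 15, 16, 14], 30.0)
```

For a diploid sample a ratio near 0.5 corresponds to the reference point for one copy, as described below.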

Then, the copy ratio values of all samples are sorted from small to large. Starting from the smallest copy ratio value, if the difference between the second sample's copy ratio and the first sample's copy ratio is not more than 0.1, the two samples are combined into one group. The third sample's copy ratio is then compared with the second sample's; if the difference does not exceed 0.1, the third sample joins the same group, and so on. When the copy ratio of the next sample differs from that of the previous sample by more than 0.1, the current group is closed and a new group is started, until all sample copy ratios have been grouped.
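
The grouping procedure described above can be sketched as follows. This is an illustrative reading of the text, with `group_copy_ratios` as a hypothetical name and 0.1 as the default threshold taken from the description.

```python
def group_copy_ratios(ratios, threshold=0.1):
    """Split sorted copy ratios into groups wherever adjacent values differ
    by more than `threshold`."""
    groups = []
    for value in sorted(ratios):
        # Compare with the last value of the current group (the previous sample).
        if groups and value - groups[-1][-1] <= threshold:
            groups[-1].append(value)
        else:
            # Gap exceeds the threshold: close the current group, start a new one.
            groups.append([value])
    return groups


groups = group_copy_ratios([0.05, 0.62, 0.07, 1.20, 0.58, 1.15])
```

Because each value is compared only with its immediate predecessor, a group can span more than the threshold in total as long as no single gap exceeds it.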

Then, normal distribution quality control is performed on all groups with more than 30 samples: the mean, maximum, minimum and variance of the copy ratio values of each group are calculated. The distances between the group means, and the distance of each group mean from each reference point (e.g., 0, 0.5, 1, 1.5, 2, etc. for diploid species), are also calculated. For an N-ploid species the step between reference points differs from that of a diploid species: the reference points are 0/N, 1/N, 2/N, 3/N, and so on. N may be 1. For sperm or eggs, whose genome may be understood as haploid, N may be taken as 1, and the reference points may include 0, 1, 2, 3, 4, and so on.
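
As a rough sketch of these per-group statistics, assuming the population variance is what is meant by "variance" and that reference points are truncated at a maximum copy number (`max_copies`, a parameter introduced here for illustration):

```python
from statistics import mean, pvariance


def reference_points(ploidy=2, max_copies=4):
    """Reference ratios 0/N, 1/N, 2/N, ... up to max_copies/N."""
    return [k / ploidy for k in range(max_copies + 1)]


def summarize_group(group, ploidy=2, max_copies=4):
    """Mean, extremes, variance and distance of the mean to each reference point."""
    m = mean(group)
    return {
        "mean": m,
        "min": min(group),
        "max": max(group),
        "variance": pvariance(group),  # assumption: population variance
        "distances": {r: abs(m - r) for r in reference_points(ploidy, max_copies)},
    }


diploid_refs = reference_points(2)
haploid_refs = reference_points(1)
summary = summarize_group([0.4, 0.6])
```

The distances between neighbouring group means can be obtained in the same way by subtracting consecutive `mean` values.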

Finally, the copy number of the target region for the samples in each group is determined from the reference point closest to the group mean; for an N-ploid species, the copy number can be determined from the ploidy and the group mean. Preferably, the formula for determining the copy number to which each group belongs is: copy number = N × (nearest copy number reference ratio), where N is the ploidy of the species. For example, for a diploid species, the copy number is 0 if the group mean is close to 0, 1 if it is close to 0.5, and so on.
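
The assignment step can be illustrated as below, under the assumption that the copy number equals the ploidy multiplied by the nearest reference ratio. `call_copy_numbers` and `max_copies` are hypothetical names introduced for this sketch.

```python
from statistics import mean


def call_copy_numbers(groups, ploidy=2, max_copies=8):
    """Return one copy number per group: ploidy x nearest reference ratio."""
    refs = [k / ploidy for k in range(max_copies + 1)]
    calls = []
    for group in groups:
        group_mean = mean(group)
        # Reference point with the smallest absolute distance to the group mean.
        nearest = min(refs, key=lambda r: abs(r - group_mean))
        calls.append(round(ploidy * nearest))
    return calls


calls = call_copy_numbers([[0.02, 0.05], [0.48, 0.55], [0.97, 1.03]])
```

Every sample in a group receives the copy number assigned to that group.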

In another aspect, the present invention can be embodied as a system. For example, the invention relates to a system for detecting gene copy number variation based on population sample depth information, which comprises a sequencing data acquisition module, a copy ratio calculation module, a sample copy ratio grouping module, a copy number determination module and, preferably, a group verification module.

In the present invention, the sequencing data acquisition module is used for acquiring sequencing data of a sample. The sequencing data acquisition module comprises a sequencing instrument, and the sequencing instrument can sequence a sample to obtain sequencing data. The sequencing data acquisition module may also acquire sequencing data of the sample from elsewhere, such as sequencing data stored on a server local or remote to the computer, or sequencing data stored on media such as compact disks, floppy disks, hard disks.

In the invention, the copy ratio calculation module is used for calculating the average sequencing depth of the region to be detected based on the sequencing data of the sample, calculating the average sequencing depth of the whole genome, and dividing the average depth of the region to be detected by the average depth of the whole genome to obtain the copy ratio of the region relative to the whole genome. The sequencing depth can be understood as the ratio of the total number of bases (bp) obtained by sequencing to the genome size, and is one of the indicators for evaluating the amount of sequencing. The sequencing depth can also be calculated for a particular sequence on the genome, as the ratio of the number of bases (bp) sequenced from that sequence to the length of the sequence.
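
The depth definition above amounts to a simple ratio. The sketch below assumes fixed-length reads when deriving the base count from a read count; `read_count` and `read_length` are illustrative parameters that are not part of the original text.

```python
def sequencing_depth(total_bases, target_length):
    """Bases obtained by sequencing divided by the size of the genome or sequence."""
    if target_length <= 0:
        raise ValueError("target length must be positive")
    return total_bases / target_length


def depth_from_reads(read_count, read_length, target_length):
    """Depth when every read has the same length (an assumption of this sketch)."""
    return sequencing_depth(read_count * read_length, target_length)


genome_depth = sequencing_depth(90_000_000_000, 3_000_000_000)
region_depth = depth_from_reads(1000, 100, 50_000)
```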

In the invention, the sample copy ratio grouping module is used for sorting the copy ratios of a plurality of samples from small to large, comparing each value with the next value in sequence starting from the minimum value, and combining the two samples into the same group if the difference between the two values is less than a first threshold value, until all the samples are grouped. Grouping the sample copy ratio values in effect separates samples with different copy numbers in the target region, so that samples with the same copy number fall into the same group. The choice of threshold is important: a threshold that is too small produces too many groups and splits samples that belong together into different groups, while a threshold that is too large merges samples that belong to different groups into one group. Either case biases the detected gene copy number variation. In one embodiment of the invention, the first threshold is less than 0.15, more preferably less than 0.12, most preferably less than 0.1; preferably, these thresholds are applied to diploid species. For sperm or eggs, whose genome may be understood as haploid, the first threshold may be less than 0.2. For an N-ploid species with N greater than 2, the first threshold differs from that of a diploid species and may be less than 0.05, or even less than 0.03.
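
The effect of the threshold can be illustrated with a small made-up data set. The values below are invented for demonstration and are not taken from the patent; `count_groups` is a hypothetical helper that only counts how many groups the adjacent-difference rule would produce.

```python
def count_groups(ratios, threshold):
    """Number of groups formed when adjacent sorted values split at gaps > threshold."""
    if not ratios:
        return 0
    ordered = sorted(ratios)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return 1 + sum(gap > threshold for gap in gaps)


# Illustrative copy ratios: one cluster near 0.5 and one near 1.0.
demo = [0.45, 0.47, 0.52, 0.55, 0.96, 0.99, 1.03, 1.06]

too_small = count_groups(demo, 0.01)   # every sample ends up alone
reasonable = count_groups(demo, 0.1)   # the two clusters are recovered
too_large = count_groups(demo, 0.5)    # both clusters collapse into one group
```

With these values the three thresholds yield 8, 2 and 1 groups respectively, which mirrors the over-splitting and over-merging described above.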

In the present invention, the group verification module is configured to perform a normal distribution test on groups whose number of samples is greater than a second threshold. The inventors found that the copy ratios of population samples obtained from the depth calculation follow a normal distribution within the same genotype and differ significantly between different genotypes, and that copy number variation can be detected or judged on this basis. The method of the invention can be carried out on the copy ratio and also on the copy number. The method of the present invention can be used to detect CNVs and also to verify whether a CNV is real. A normal distribution requires a large number of samples, and the normal distribution of a group is difficult to observe with single-digit sample sizes. In one embodiment, the second threshold is greater than 25, preferably greater than 30, more preferably greater than 50. The normal distribution test statistics serve as auxiliary indicators of the reliability of the call; if a group contains fewer than 25 samples, it is not tested and the auxiliary indicator is output as null.
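
The text does not name a specific normality test. As one possible stand-in, the sketch below uses a Jarque-Bera style statistic (based on skewness and excess kurtosis) compared against the 5% critical value of a chi-squared distribution with two degrees of freedom. The choice of test, the function name `looks_normal`, and the default of 30 for the second threshold are assumptions of this sketch.

```python
from statistics import NormalDist, mean


def looks_normal(values, second_threshold=30, critical=5.991):
    """Return None for groups too small to test, otherwise True or False.

    Assumption: a Jarque-Bera statistic stands in for the unspecified
    normal distribution test.
    """
    n = len(values)
    if n <= second_threshold:
        return None  # small group: not tested, auxiliary indicator is null
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    if m2 == 0:
        return None  # no spread at all: the statistic is undefined
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    jarque_bera = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    return jarque_bera < critical


# Evenly spaced normal quantiles: a deliberately well-behaved group.
quantile_group = [1.0 + 0.05 * NormalDist().inv_cdf((i + 0.5) / 100) for i in range(100)]
# Two merged clusters: the kind of group the check is meant to flag.
bimodal_group = [0.5] * 50 + [1.0] * 50

results = (looks_normal(quantile_group), looks_normal(bimodal_group), looks_normal([0.5, 0.6]))
```

A group that fails such a check may contain samples from more than one copy number class, for example because the first threshold was too large.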

In the present invention, the copy number determination module is configured to determine the closest copy number reference ratio according to the distance between the copy ratio of the group and the copy number reference ratios, and to determine the copy number of the group from that closest reference ratio. The mean, maximum, minimum and variance of the copy ratio values of each group are calculated. The distances between the group means, and the distance of each group mean from each reference point (e.g., 0, 0.5, 1, 1.5, 2, etc. for diploid species), are also calculated. Finally, the copy number of the target region for the samples in each group is determined from the reference point closest to the group mean; for an N-ploid species, the copy number can be determined from the ploidy and the group mean. For example, for a diploid species, the copy number is 0 if the group mean is close to 0, 1 if it is close to 0.5, and so on.

The copy ratio calculation module, the sample copy ratio grouping module, the group verification module and the copy number determination module may be implemented by a computer program, for example by writing a computer-executable program in a programming language according to the calculation formula of the sequencing depth and then running it on computer hardware.

It will be understood by those skilled in the art that the division and order of the steps in the method for detecting gene copy number variation based on population sample depth information according to the present invention are merely illustrative and not restrictive, and that those skilled in the art may make omissions, additions, substitutions, modifications and changes without departing from the spirit and scope of the present invention as set forth in the appended claims and their equivalents.

The invention may be implemented as a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, causes the steps of the method of the invention to be performed. In one embodiment, the computer program is distributed across a plurality of computer devices or processors coupled by a network such that the computer program is stored, accessed, and executed by one or more computer devices or processors in a distributed fashion. A single method step/operation, or two or more method steps/operations, may be performed by a single computer device or processor or by two or more computer devices or processors. One or more method steps/operations may be performed by one or more computer devices or processors, and one or more other method steps/operations may be performed by one or more other computer devices or processors. One or more computer devices or processors may perform a single method step/operation, or perform two or more method steps/operations.

Example 1

The present invention is exemplified using the region chr20:1561287-1594229 of the human genome as the target region.

First, for each of the 790 samples, the total sequencing depth of the target region is calculated from the sequencing data and divided by the length of the target region to obtain the average sequencing depth of the region, which is then GC-corrected. Meanwhile, the average sequencing depth of the whole genome of each sample is calculated, and the average sequencing depth of the target region is divided by the average sequencing depth of the whole genome to obtain the copy ratio of the target region relative to the whole genome. The distribution of the target region copy ratios of the 790 samples is shown in FIG. 1;

then, the copy ratio values of all samples are sorted from small to large. Starting from the smallest copy ratio value, if the difference between the second sample's copy ratio and the first sample's copy ratio is not more than 0.1, the two samples are combined into one group; the third sample's copy ratio is then compared with the second sample's, and if the difference does not exceed 0.1, the third sample joins the same group, and so on; when the copy ratio of the next sample differs from that of the previous sample by more than 0.1, the current group is closed and a new group is started, until all sample copy ratios have been grouped;

then, normal distribution quality control is performed on all groups with more than 30 samples: the mean, maximum, minimum and variance of the copy ratio values of each group are calculated, together with the distances between the group means and the distance of each group mean from each reference point (0, 0.5, 1, 1.5, 2, etc.);

finally, the target region copy ratio values of the 790 samples fall into three groups. The first group has a mean of 0.067 and a standard deviation of 0.013, lies at a distance of 0.561 from the second group, and is closest to the reference point 0, so the target region has 0 copies in all samples of this group. The second group has a mean of 0.627 and a standard deviation of 0.031, lies at a distance of 0.561 from the first group and 0.560 from the third group, and is closest to the reference point 0.5, so the region has 1 copy in all samples of this group. The third group has a mean of 1.187 and a standard deviation of 0.569, lies at a distance of 0.560 from the second group, and is closest to the reference point 1, so the region has 2 copies in all samples of this group.
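
The three calls of Example 1 can be reproduced from the reported group means alone. The sketch below uses only the means stated above and the diploid reference points; the variable names are illustrative.

```python
PLOIDY = 2
REFERENCE_POINTS = [0.0, 0.5, 1.0, 1.5, 2.0]  # diploid reference ratios

# Group means reported in Example 1.
group_means = [0.067, 0.627, 1.187]

example_calls = []
for group_mean in group_means:
    # Nearest reference point by absolute distance.
    nearest = min(REFERENCE_POINTS, key=lambda r: abs(r - group_mean))
    example_calls.append((group_mean, nearest, round(PLOIDY * nearest)))

for group_mean, nearest, copies in example_calls:
    print(f"mean {group_mean:.3f} -> reference {nearest} -> {copies} copies")
```

The nearest reference points are 0, 0.5 and 1, giving copy numbers 0, 1 and 2, in agreement with the example.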

Example 2

The sequencing depth and copy ratio of the region chr8:39226335-39388919 of sample SZCH0056 were analyzed; the distribution of the population samples is shown in FIG. 2. The copy ratio of the sample was 0.022, which falls in the part of the population distribution around the reference point 0, so the copy number is 0. This variant was detected by the present method but not by the PSCC algorithm, and was verified as true using the Affymetrix CytoScan 750K chromosomal microarray (CMA).

The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations are described; however, any combination of these technical features should be considered within the scope of this specification as long as it involves no contradiction.

While the invention has been described with reference to exemplary embodiments, it is to be understood that the invention is not limited to the construction and methods of the embodiments described above. On the contrary, the invention is intended to cover various modifications and equivalent arrangements. In addition, while the various elements and method steps disclosed herein are shown in various example combinations and configurations, other combinations, including more, less or all, of the elements or method steps are also within the scope of the invention.

Claims

1. A method of detecting gene copy number variation based on population sample depth information, the method comprising:

(1) obtaining sequencing data for a plurality of samples;

(3) sorting the copy ratio values of the plurality of samples from small to large, comparing each value with the next value in turn from the minimum value, and if the difference between the two values is smaller than a first threshold, combining the samples into a group, and grouping all the samples, wherein the first threshold is preferably smaller than 0.15, more preferably smaller than 0.12, and most preferably smaller than 0.1;

2. The method according to claim 1, wherein in (3), groups in which the number of samples is smaller than the second threshold, and groups in which the number of samples is larger than the second threshold but whose distribution does not conform to a normal distribution, are removed, preferably the second threshold is larger than 25, preferably larger than 30.

3. The method according to claim 1 or 2, wherein in (2), the average sequencing depth of the region to be detected is corrected by GC.

4. The method of any one of claims 1 to 3, wherein in (4), the copy number reference ratios form an arithmetic sequence starting from 0 with a common difference of 1/N, N being the ploidy of the species, preferably the sample is from a diploid species and the sequence increases from 0 in steps of 0.5, for example the copy number reference ratios comprise 0, 0.5, 1, 1.5, 2.

5. The method according to any one of claims 1 to 4, wherein in (4), the copy number to which each group belongs is determined according to the distance between the average of the copy ratio values of the group and the copy number reference ratios, preferably, the formula for determining the copy number to which each group belongs is: copy number = N × (nearest copy number reference ratio), where N is the ploidy of the species.

6. A system for detecting gene copy number variation based on population sample depth information, the system comprising:

a sample copy ratio grouping module, configured to sort the copy ratio of multiple samples from small to large, compare each value with a subsequent value in turn from a minimum value, and if a difference between the previous value and the subsequent value is smaller than a first threshold, combine the samples into a group, and group all the samples, where the first threshold is preferably smaller than 0.15, more preferably smaller than 0.12, and most preferably smaller than 0.1;

and the copy number determining module is used for determining the nearest copy number reference ratio according to the distance between the copy ratio of the group and the copy number reference ratios, and determining the copy number of the group from the nearest copy number reference ratio.

7. The system according to claim 6, further comprising a group verification module for performing a normal distribution test on groups with a number of samples greater than a second threshold, preferably the second threshold is greater than 25, preferably greater than 30, more preferably greater than 50.

8. The system of claim 6 or 7, wherein the copy ratio calculation module is further configured to perform GC correction on the average sequencing depth of the region to be detected.

9. The system of any one of claims 6 to 8, wherein the copy number reference ratios form an arithmetic sequence starting from 0 with a common difference of 1/N, N being the ploidy of the species, preferably the sample is from a diploid species and the sequence increases from 0 in steps of 0.5, for example the copy number reference ratios comprise 0, 0.5, 1, 1.5, 2.

10. The system according to any one of claims 6 to 9, wherein the copy number determination module determines the copy number to which each group belongs according to the distance between the average of the copy ratio values of the group and the copy number reference ratios, and preferably, the formula for determining the copy number to which each group belongs is: copy number = N × (nearest copy number reference ratio), where N is the ploidy of the species.