CN108319813B

CN108319813B - Method and device for detecting circulating tumor DNA copy number variation

Info

Publication number: CN108319813B
Application number: CN201711321172.9A
Authority: CN
Inventors: 郭昊; 韩天澄; 于佳宁; 林小静; 宋雪
Original assignee: Wuxi Zhenhe Biotechnology Co ltd
Current assignee: Wuxi Zhenhe Biotechnology Co ltd
Priority date: 2017-11-30
Filing date: 2017-12-12
Publication date: 2020-12-25
Anticipated expiration: 2037-12-12
Also published as: CN108319813A

Abstract

The invention discloses a method and a device for detecting circulating tumor DNA copy number variation. Wherein, the method comprises the following steps: determining a normal population sequencing result and a target to be detected sequencing result respectively according to the normal population gene data and the target to be detected gene data; acquiring a first capture interval parameter and a second capture interval parameter; respectively carrying out standardization processing on the first capture interval parameter and the second capture interval parameter to obtain a first depth of the first capture interval parameter and a second depth of the second capture interval parameter; determining a first copy number variation score according to the first depth; determining a second copy number variation score from the first depth and the second depth; and determining the copy number variation detection result of the target gene data to be detected according to the first copy number variation score and the second copy number variation score. The invention solves the technical problem of poor accuracy of circulating tumor DNA copy number variation detection in the prior art.

Description

Method and device for detecting circulating tumor DNA copy number variation

Technical Field

The invention relates to the field of genes, in particular to a method and a device for detecting circulating tumor DNA copy number variation.

Background

Circulating tumor DNA (ctDNA) is extracellular DNA in a cell-free state, and is present in body fluids such as blood, synovial fluid, pleural and peritoneal fluid, and cerebrospinal fluid, and particularly, is abundant in plasma free DNA (cfDNA). It is mainly composed of single-stranded or double-stranded DNA and a mixture of single-stranded and double-stranded DNA. It is a tumor marker with wide application prospect, high sensitivity and high specificity, and is suitable for various tumors. Tumors can be staged, prognostic, and dynamically monitored by ctDNA. The tumor information carried by the device can also be used for medication guidance, and the device can be used for sampling easily and mastering the tumor information more comprehensively. At present, the liquid biopsy technology aiming at the plasma cfDNA of a tumor patient has important monitoring and guiding significance on clinical tumor treatment and guiding of targeted drug administration.

Copy Number Variation (CNV) of DNA fragments is a common structural variation of the genome, and is ubiquitous in the human population. CNV of certain specific genes is considered as a clinical index of tumor progression and prognosis, and has the potential of guiding drug administration of tumor patients. Currently, the common methods for detecting CNV mainly include two main types of experimental methods: a low-pass molecular biology experimental technique, comprising: chromosome banding, Fluorescence In Situ Hybridization (FISH), droplet digital PCR (ddPCR), and the like; and high-throughput next generation gene sequencing (NGS), which can probe CNVs of DNA fragments in the genome-wide range or target gene region.

However, the above several copy number variation detection methods all have certain defects, and the following specific description is made in sequence:

the main defects of the low-flux molecular biology experimental technology comprise: low resolution, complex operation, low detection flux, large deviation, specifically,

chromosome banding: low resolution, no variation less than 5Mb in length, and low accuracy when the cell division index is low or the chromosome morphology phenotype is poor. Moreover, cell culture is needed in operation, which is time-consuming;

FISH: at present, the gold standard method of gene CNV for clinical pathological examination is adopted. Compared with chromosome banding, although the resolution is greatly improved, the number of probes is limited, the flux is small, and the time is long. Only a few sites can be analyzed per test, and prior judgment on the type position of the CNV is needed before the test, so that the method is only suitable for verification aiming at individual sites and is not suitable for screening. In addition, different laboratories and detection mechanisms have large deviations on the result judgment;

ddPCR: when the copy number variation of the gene to be detected is calculated, points except the position to be detected are selected as reference points, the copy multiple of the gene to be detected relative to the reference gene is calculated, and if the copy number of the reference gene is not 2 (the genome is assumed to be diploid), the copy number detection result of the gene to be detected also generates deviation. Compared with capture sequencing, the method has the advantages of small flux, limited selectable sites of the probe, short probe coverage area, difficulty in confirming the breakpoint position, capability of detecting the gene in the designed probe area, high cost and suitability for verification rather than screening.

In contrast, capture sequencing has higher sensitivity and higher resolution in CNV detection of tumor tissue samples, but the analysis process is more complicated. High throughput capture sequencing detects major drawbacks of CNV technology: an algorithm for detecting CNV by capturing, sequencing and aiming at ctDNA is not specially adopted, and the existing analysis method has the disadvantages of complex process, low sensitivity and poor result stability. For example:

CNVkit: the output result has no prompt statistical significance and is difficult to interpret. For long genes, it is often the case that individual genes are divided into different CNV segments and the estimated copy number is different. In addition, the CNVkit only gives an estimation of the relative variation amplitude (log2Ratio) of the copy number, whether the fragment is CNV or not is not checked, and a user needs to set a threshold value for judgment;

Control-FREEC: the final output result is the copy number is rounded, so the detection effect of the sample with unclear tumor cell ratio or low ratio is not good, and the algorithm can not detect the CNV with the copy number below 2.5, which is not good for the CNV detection of ctDNA. Although the software can calculate the purity of the tumor DNA and use this percentage to correct the final copy number results, the final estimated copy number results may deviate significantly from the actual situation due to inaccuracies in the estimation of purity. Therefore, this software is not suitable for the identification of ctDNA.

Therefore, the existing analysis software aiming at the captured data is only partially suitable for sequencing the whole exon, and the software suitable for capturing the data of non-whole exons has some defects. No software is designed for the ctDNA detection CNV, and the sensitivity is low for samples with fine amplification or deletion multiples. Because the content proportion of ctDNA in the plasma cfDNA is small, and the content is usually less than 1%, the development of a CNV detection method with high sensitivity and high accuracy is urgently needed.

Aiming at the problem of poor accuracy of circulating tumor DNA copy number variation detection in the prior art, no effective solution is provided at present.

Disclosure of Invention

The embodiment of the invention provides a method and a device for detecting copy number variation of circulating tumor DNA, which at least solve the technical problem of poor detection accuracy of the copy number variation of the circulating tumor DNA in the prior art.

According to an aspect of the embodiments of the present invention, there is provided a method for detecting a variation in copy number of circulating tumor DNA, comprising: determining a normal population sequencing result and a target to be detected sequencing result respectively according to the normal population gene data and the target to be detected gene data; acquiring a first capture interval parameter and a second capture interval parameter, wherein the first capture interval parameter is used for representing the parameter of the normal population sequencing result on the capture interval, and the second capture interval parameter is used for representing the parameter of the target sequencing result to be detected on the capture interval; respectively carrying out standardization processing on the first capture interval parameter and the second capture interval parameter to obtain a first depth of the first capture interval parameter and a second depth of the second capture interval parameter; determining a first copy number variation score according to the first depth, wherein the first copy number variation score is used for representing the fluctuation range of the copy number variation of the gene data of the normal population; determining a second copy number variation score according to the first depth and the second depth, wherein the second copy number variation score is used for representing the fluctuation condition of the copy number variation of the target gene data to be detected; and comparing the second copy number variation score with the first copy number variation score to determine the copy number variation detection result of the target gene data to be detected.

According to another aspect of the embodiments of the present invention, there is also provided a device for detecting copy number variation, including: the determining module is used for determining a normal population sequencing result and a target to be detected sequencing result respectively according to the normal population gene data and the target to be detected gene data; the acquisition module is used for acquiring a first capture interval parameter and a second capture interval parameter, wherein the first capture interval parameter is used for representing the parameter of the normal population sequencing result on the capture interval, and the second capture interval parameter is used for representing the parameter of the target to be tested sequencing result on the capture interval; the standardization processing module is used for respectively carrying out standardization processing on the first capture interval parameter and the second capture interval parameter to obtain a first depth of the first capture interval parameter and a second depth of the second capture interval parameter; the first scoring module is used for determining a first copy number variation score according to the first depth, wherein the first copy number variation score is used for representing the fluctuation range of the copy number variation of the normal population gene data; the second scoring module is used for determining a second copy number variation score according to the first depth and the second depth, wherein the second copy number variation score is used for representing the fluctuation condition of the copy number variation of the target gene data to be detected; and the detection scoring module is used for determining a copy number variation detection result of the target gene data to be detected according to the first copy number variation score and the second copy number variation score.

According to another aspect of the embodiments of the present invention, there is also provided a storage medium including a stored program, wherein when the program is executed, the apparatus on which the storage medium is located is controlled to execute the method for detecting copy number variation of circulating tumor DNA.

According to another aspect of the embodiments of the present invention, there is also provided a processor for executing a program, wherein the program is executed to perform the method for detecting copy number variation of circulating tumor DNA.

In the embodiment of the invention, the gene copy number variation score of the normal population is calculated, and the copy number variation score range of the normal population is determined according to the gene copy number variation score of the normal population, so that the gene copy number of the target to be detected is detected according to the range determined by the copy number variation score, the technical problem of poor detection accuracy of the circulating tumor DNA copy number variation in the prior art caused by setting an improper copy number threshold is solved, and the purpose of improving the detection accuracy of the circulating tumor DNA copy number variation is achieved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flow chart of a method of detecting copy number variation according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a method for detecting copy number variation of circulating tumor DNA according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram showing the results of a CNV identification according to an embodiment of the present application;

FIG. 4 is a schematic diagram of an apparatus for detecting copy number variation according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

In the following, some terms appearing in the following examples of the present application are explained for the sake of understanding the present solution:

ctDNA: circulating tumor DNA, tumor cells actively secrete DNA fragments which have undergone gene mutation into body fluid during division and proliferation.

GC content: among the four bases of DNA, the ratio of guanine and cytosine is the GC content.

A bam file: and the binary compression format of the sam file is used for storing the result of the sequence sequencing back attached to the reference genome, wherein the sam is a sequence alignment format used for storing the result of the sequence sequencing back attached to the reference genome.

The bed file: the bed file comprises at least 4 columns of information: chromosome information, genome starting position, genome ending position, and gene information (such as gene name), wherein each line of the target capture interval bed file represents a capture interval.

reads: sequencing reads the length, genome or transcriptome sequence fragments detected by the sequencer.

A fasta: a text-based format for representing a nucleotide sequence or an amino acid sequence.

fastq: one common type of high-throughput sequencing file is the type of file in which raw sequencing data is stored.

bwa: an alignment method software is used for searching the position of a sequencing sequence in a human gene reference sequence and outputting a bam format result file.

GCS (gene-specific-score): statistics of statistical information for a gene at each capture interval are summarized for gene-specific statistical testing.

samtools: a tool for processing bam/sam files.

picard: a tool for processing high-throughput sequencing data can be used for processing a sam/bam equal ratio result file.

Comparing quality: for quantifying the likelihood of alignment to a wrong location, a higher value indicates a lower likelihood.

Example 1

In accordance with an embodiment of the present invention, there is provided an embodiment of a method for detecting copy number variation, it is noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer executable instructions, and that while a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than that presented herein.

Fig. 1 is a flowchart of a method for detecting copy number variation according to an embodiment of the present invention, as shown in fig. 1, the method including the steps of:

and step S11, determining a normal population sequencing result and a target to be detected sequencing result respectively according to the normal population gene data and the target to be detected gene data.

Specifically, the normal population gene data can be capture sequencing fastq data of healthy populations, the target to be detected can be any one of the populations to be detected, and the target gene data to be detected can be capture sequencing fastq data of a cfDNA sample of a tumor patient to be detected.

It should be noted that the gene data of the normal population or the target to be detected may be extracted from blood of the normal population or the target to be detected, or may be extracted from other body fluids such as ascites, pleural effusion, cerebrospinal fluid and the like of the normal population or the target to be detected, which is not limited herein.

In an optional embodiment, capturing and sequencing fastq data of a healthy population and capturing and sequencing fastq data of a plasma cfDNA sample of a tumor patient to be detected are obtained, the two fastq data are respectively compared with preset human reference genome fasta data, sequencing and marking repeated processing are carried out on comparison results, a normal population sequencing result and a target sequencing result to be detected are obtained, and both the normal population sequencing result and the target sequencing result to be detected can be bam files.

And step S13, acquiring a first capture interval parameter and a second capture interval parameter, wherein the first capture interval parameter is used for representing the parameter of the normal population sequencing result on the capture interval, and the second capture interval parameter is used for representing the parameter of the target to be tested sequencing result on the capture interval.

Specifically, the first capture interval parameter and the first capture interval parameter may both include: number of sequencing Reads (RC), GC content, and interval length.

In an alternative embodiment, the foregoing steps may be executed by the bed file processing module provided in this embodiment, the bed file processing module inputs the target capture interval bed file and the human reference genome sequence fasta file, and the bed file processing module processes the target capture interval bed file, so that the GC content, the interval length, and the located gene information of each capture interval can be obtained, and the target capture interval bed file is sorted according to the chromosome and the genome position.

The specific treatment method can be as follows: the human reference genome sequence (fasta file) contains sequence composition information of all human chromosomes, including A, T, C, G base distribution and position information of each base. The bed file processing module can stick each capture interval in the bed file back to the human reference genome sequence, and count the length between the start point and the end point of the capture interval (the end point minus the start point), the GC content percentage of the sequence (the number of G and C bases and/length of the capture interval), and the module can simultaneously extract the unique gene name of each capture interval to obtain the gene information.

Step S15, normalizing the first capture interval parameter and the second capture interval parameter, respectively, to obtain a first capture interval parameter depth and a second capture interval depth.

In particular, the normalization process is used to reduce the impact between multiple capture interval parameters. In an alternative embodiment, the capturing interval parameters including RC, GC content, and interval length are used to illustrate that the RC can be normalized by the GC content and interval length in the capturing interval parameters.

Step S17, determining a first copy number variation score according to the first depth, wherein the first copy number variation score is used to characterize a fluctuation range of the copy number variation of the normal population gene data.

Specifically, the above steps are used to establish a baseline level of copy number for the normal population. The normal population has a plurality of normal individual, and the fluctuation range of the normal population gene data with copy number variation can be obtained according to the first copy number variation score of each normal individual in the normal population.

And step S19, determining a second copy number variation score according to the first depth and the second depth, wherein the second copy number variation score is used for representing the fluctuation condition of the copy number variation of the target gene data to be detected.

Specifically, in the above step, the second copy number variation score is determined by both the first depth and the second depth, so that the deviation degree of the copy number of the target to be detected in the capture interval from the baseline level of the normal population can be determined, and the second copy number variation score is used as a reference for detecting the copy number variation of the target to be detected.

And step S111, comparing the second copy number variation score with the first copy number variation score, and determining the copy number variation detection result of the target gene data to be detected.

It should be noted here that in interpreting the results of DNA copy number variation, the user is actually more concerned about the gene level copy number mutation. At present, the final output results of all copy number variation identification software do not take genes as copy number variation units, but determine the breakpoint of each mutation through an algorithm, so that the copy number variation results aiming at the genes cannot be given. In addition, when the software is used for identification, the method mainly depends on setting a copy number threshold value, and whether variation exists is judged according to whether the copy number of the sample to be detected exceeds the threshold value, so that the method is very unfavorable for identifying the cfDNA sample with low ctDNA concentration, and the sensitivity is very low. The scheme of the invention takes the distribution situation of the copy number in the population as a standard, and identifies whether the copy number variation exists or not through statistical test. When the sample gene passes through statistical test and proves that the copy number of a sample to be tested and a reference human group on a certain gene has a significant difference, the identification result of copy number mutation can be given, so that the problem of low detection sensitivity caused by setting an improper copy number threshold value under the condition of unknown ctDNA concentration is solved, and each copy number mutation identified by the sample gene can be ensured to be reliable.

As can be seen from the above, in the embodiments of the present application, the gene copy number variation score of the normal population is calculated, and the score range of the copy number variation of the normal population is determined according to the gene copy number variation score of the normal population, so that the gene copy number of the target to be detected is detected according to the range determined by the copy number variation score, and thus the technical problem of poor detection accuracy of the copy number variation of the circulating tumor DNA in the prior art due to setting of an inappropriate copy number threshold is solved, and the purpose of improving the detection accuracy of the copy number variation of the circulating tumor DNA is achieved.

Optionally, according to the above embodiment of the present application, the capturing interval parameter includes: the number of sequencing reads, the GC content and the interval length, wherein the corresponding depths of the number of sequencing reads, the GC content and the interval length are respectively as follows: the method comprises the following steps of respectively standardizing a first capture interval parameter and a second capture interval parameter to obtain a first depth of the first capture interval parameter and a second depth of the second capture interval parameter, wherein the coverage, the GC content depth and the interval length depth comprise:

step S151, the number of sequencing reads of each capture interval is standardized to obtain the coverage of the first capture interval parameter and the coverage of the second capture interval parameter.

Step S153, carrying out standardization processing on the GC content according to the coverage and the GC content to obtain the GC content depth of the first capture interval parameter and the GC content depth of the second capture interval parameter.

Step S155, standardizing the section length according to the GC content depth and the section length to obtain the section length depth of the first capture section parameter and the section length depth of the second capture section parameter.

In the above steps, the autosomal sequencing overlay data are normalized, GC content normalized and capture interval length normalized. The method adopts RC as the measurement of the coverage of the capture interval, and is influenced by GC content and the length of the capture interval. In order to avoid bias error caused by estimation of coverage by two factors, the method calculates the GC content and the interval length of each capturing interval, and utilizes the information to carry out standardization processing on the RC.

It should be noted here that there exists in the prior art authentication software that uses a group of people as a reference sample, but it usually adopts a median or a mean value to summarize the coverage of the reference group on each segment into a numerical value, and although this numerical value can reflect a part of the information of the overall distribution, the information related to the fluctuation of the coverage in the group of people is lost, and when the number of the reference group samples is more, the lost information is more. In the scheme of the application, the coverage is obtained by performing standardized processing on the sequencing read-length data, and the information is reserved, so that the deviation degree of the target to be detected and the reference population group baseline level can be measured more accurately, and the more population group samples, the more obvious the advantages of the identification software in the prior art are.

Next, the normalization method of the above three data will be described in order.

Optionally, according to the foregoing embodiment of the present application, the normalizing the number of sequencing reads of each capture interval to obtain the coverage of the first capture interval parameter and the coverage of the second capture interval parameter includes:

step S1511, standardizing the number of sequencing reads of each capture interval by the following formula to obtain the coverage of the first capture interval parameter and the coverage of the second capture interval parameter:

DepthA_i＝log2(RC_i)-log2(RC_m)；

wherein, DepthA_iIs the coverage after the sequencing read length number target normalization of the ith capture interval; RC (resistor-capacitor) capacitor_iIs the number of sequencing reads covered on the ith capture interval; RC (resistor-capacitor) capacitor_mThe median number of sequencing reads for capture intervals located on autosomes of all capture intervals.

Optionally, according to the above embodiment of the present application, the obtaining of the GC content depth of the first capture interval parameter and the GC content depth of the second capture interval parameter by normalizing the GC content according to the coverage and the GC content includes:

step S1531, the GC content is normalized by the following formula to obtain the GC content depth of the first capture interval parameter and the GC content depth of the second capture interval parameter:

DepthB_i＝DepthA_i-rolling_median(All_region_GC)_i；

wherein, DepthB_iIs the GC content depth after GC content normalization of the ith capture interval; DepthA_iThe depth after the sequencing read length number target normalization of the ith capture interval; rolling _ mean (All _ region _ GC)_iThe sliding median of the depth after the sequencing read number target of the ith interval is normalized after all the capture intervals are sequenced from large to small according to GC content.

Optionally, according to the above embodiment of the present application, normalizing the interval length according to the GC content depth and the interval length to obtain the interval length depth of the first capture interval parameter and the interval length depth of the second capture interval parameter includes:

step S1551, the interval length depth of the first capturing interval parameter and the interval length depth of the second capturing interval parameter are obtained by normalizing the interval lengths according to the following formula:

DepthC_i＝DepthB_i-rolling_median(All_region_size)_i；

wherein, DepthC_iIs the section length depth after the section length standardization of the ith capture section; DepthB_iIs the depth after the GC content of the ith capture interval is normalized; rolling _ mean (All _ region _ size)_iThe sliding median of the depth after the GC content of the ith interval is normalized after all the capture intervals are sorted from large to small according to the interval length.

Optionally, according to the above embodiments of the present application, determining the first copy number variation score according to the first depth includes:

step S171, determining a third copy number variation score of the normal population according to the first depth, wherein the third copy number variation score is used for representing the fluctuation range of the copy number variation of the gene data of the normal population in each capture interval;

in step S173, the first copy number variation score is determined according to the third copy number variation score.

The above steps can be executed by a reference module, which is used for analyzing the bam file of the normal population, and obtaining the baseline level and fluctuation statistics of the gene occurrence CNV events on the capture interval of the normal population, and the normalized Reads Count (RC) of each capture interval, i.e. the first copy number variation score (for example: GSC (gene-specific-score) file) of the normal population and the coverage of the plasma capture interval of the normal population (for example: COV file).

Optionally, according to the above embodiments of the present application, determining a third copy number variation score of the normal population according to the first depth includes:

step S1711, determining a third copy number variation score of the normal population by the following formula:

wherein RZ_iIs the third copy number variation score of the ith capture interval of the single sample of the normal population; depth C_iThe length depth of the section after the ith capturing section of the single sample of the normal crowd is standardized; mean (Depth)_{nc_i}) Is the mean value of the length and depth of the interval after the normalization of the ith capturing interval of all normal people; std (Depth)_{nc_i}) Is the standard deviation of the interval length depth of all normal population after the ith capture interval is normalized.

Optionally, according to the above embodiments of the present application, determining the first copy number variation score according to the third copy number variation score includes:

step S1731, determining a first copy number variation score according to the third copy number variation score by the following formula:

wherein, GCS_gIs the first copy number variation score of the gene g in a single sample of the normal population, and n is the number of capture intervals g has.

Optionally, according to the above embodiments of the present application, determining the second copy number variation score according to the first depth and the second depth includes:

step S175, determining a fourth copy number variation score of the target to be detected according to the first depth and the second depth, wherein the fourth copy number variation score is used for representing the fluctuation range of the copy number variation of the target gene data to be detected in each capture interval;

in step S177, a second copy number variation score is determined according to the fourth copy number variation score.

The above steps can be performed by using a sigtest module, and the sigtest module is used for statistically analyzing the difference and fluctuation statistical scoring between the gene CNV variation level in the cfDNA sample to be tested and the normal population baseline CNV level, so as to obtain a gene copy number variation statistical scoring GCS file (second copy number variation score) and a capture interval statistical scoring RZ file (namely fourth copy number variation score) of the cfDNA sample to be tested.

Optionally, according to the above embodiments of the present application, determining a fourth copy number variation score of the normal population according to the first depth and the second depth includes:

step S1751, calculating and determining a fourth copy number variation score of the normal population by the following formula:

wherein, RZ'_iIs the fourth copy number variation score of the ith capture interval of the target to be detected; depth' C_iThe length depth of the section after the ith capture section of the target to be detected is standardized; mean (Depth)_{nc_i}) Is the average value of the interval length and depth of the normal population after the ith capture interval is standardized; std (Depth)_{nc_i}) Is the standard deviation of the interval length depth of the normal population sample after the normalization of the ith capture interval.

Optionally, according to the foregoing embodiment of the present application, determining the second copy number variation score according to the fourth copy number variation score includes:

step S1771, determining a second copy number variation score according to the fourth copy number variation score by the following formula:

wherein, GCS_g' is the second copy number variation score of the gene g of the target sample to be tested, and n is the number of capture intervals that the gene g has.

Optionally, according to the above embodiment of the present application, determining a copy number variation detection result of target gene data to be detected according to the first copy number variation score and the second copy number variation score includes:

in step S191, it is determined whether the second copy number variation score falls within a fluctuation range determined according to the first copy number variation score.

Specifically, each sample in the normal population can obtain a first copy number variation score, the fluctuation range can be determined according to the first copy number variation score of each single sample in the normal population, and then whether the second copy number variation score of the target to be detected belongs to the fluctuation range determined by the normal population or not is judged.

And step S193, determining the copy number mutation of the target to be detected under the condition that the judgment result is that the target does not belong to the target.

In the above steps, when the judgment result is that the gene does not belong to the target, determining that the copy number of the target to be detected and the copy number of the normal population on the gene have a significant difference, and obtaining a detection result that the copy number of the gene is mutated.

Optionally, according to the above embodiments of the present application, determining the sequencing result of the normal population and the sequencing result of the target to be detected respectively through the gene data of the normal population and the gene data of the target to be detected includes:

and S113, comparing the normal population gene data and the target gene data to be detected with the human reference genome data respectively to obtain a first comparison result of the normal population gene data and a second comparison result of the target gene data to be detected.

And S115, sequencing and marking the first comparison result and the second comparison result respectively for repeated processing to obtain a sequencing result of the normal population and a sequencing result of the target to be detected.

Specifically, the human reference genome data may be a fasta file, and the positions of the genes of the normal population and the genes of the target to be detected in the human reference genome are determined by the above comparison.

Optionally, according to the embodiment of the present application, after determining the copy number variation detection result of the target gene data to be detected according to the first copy number variation score and the second copy number variation score, the method further includes:

and step S117, determining the copy number ratio of the target to be detected relative to the normal population according to the first copy number variation score and the second copy number variation score.

And step S119, determining the copy number of the target gene data to be detected in the capturing interval according to the copy number ratio and the copy number of the normal population gene data in the capturing interval.

Through the operations in the above steps, the finally output detection result may include the second copy number variation score, the copy number ratio and the specific copy number.

Further, the number of capture intervals and the second copy number variation score may be filtered, for example: the filtration conditions may be: region number >5, GCS > 95% quantile of GCS for normal population, and finally output a file containing genes for which copy number variation is assumed and detailed information on the capture interval for those genes. The genes which are determined to have copy number variation and the detailed information of the genes on the capture interval can be used as the information with statistical significance of the target to be detected to prepare a copy number variation gene list, and the positions and copy number distribution conditions of the copy number variation of the gene capture interval of the cfDNA sample of the target to be detected on different human chromosomes are drawn.

Optionally, according to the embodiment of the present application, after determining the copy number variation detection result of the target gene data to be detected according to the first copy number variation score and the second copy number variation score, the method further includes: and determining the sex parameter of the target to be detected according to the second depth of the target to be detected.

In the above step, the gender of the target to be detected can be judged and the result output by using the RC distribution of the target to be detected on the sex chromosome by counting the RC of the normal human group in each target capturing interval. The judgment standard may be: female: x chromosome copy number >1.8& Y chromosome copy number less than 0.2; male: x chromosome copy number >0.6& Y chromosome copy number greater than 0.5.

The scheme provided by the embodiment of the application can capture different numbers of reads in capture target regions (target regions) of different genes based on capture sequencing. Eliminating the influence of batch and sample quality on the number of captured reads in the gene interval by the standard correction of the sample data volume; the influence of different DNA sequences on the capture efficiency is eliminated through GC content standardization process correction; the data size is normalized by the capture interval length to eliminate the difference bias of the numbers of reads captured by target intervals of different lengths of different genes. Firstly, calculating the CNV standard level fluctuation range and the statistical score of each gene of the plasma of a normal population without cancer by using the corrected capture target interval standardized reads number, then calculating the CNV change multiple and the statistical score of the plasma sample of a cancer patient compared with the baseline of the normal population, finally judging whether the score is obvious or not, comprehensively considering the capture interval number of each gene, and outputting a gene list with the copy number having obvious change.

Fig. 2 is a schematic diagram of a method for detecting copy number variation of circulating tumor DNA according to an embodiment of the present application, and the method for detecting copy number variation of circulating tumor DNA is described below with reference to fig. 2, and mainly includes the following steps:

1. acquiring baseline DNA capture sequencing fastq data of a normal population and capture sequencing fastq data of a plasma cfDNA sample of a tumor patient to be detected, performing sequence comparison on the baseline DNA capture sequencing fastq data of the normal population and the capture sequencing fastq data of the plasma cfDNA sample of the tumor patient to be detected and a human reference genome fasta file by using genome comparison software bwa mem, and performing sequencing and mark repeated processing on the compared bam files by using samtools and picard software to obtain a normal population sequencing result bam file after mark sequencing and a sequencing result bam file of the cfDNA sample to be detected;

2. inputting a target capture interval bed file and a human reference genome sequence fasta file, processing the target capture interval bed file by using a ctCNV algorithm bed file processing module, acquiring GC content, interval length and gene information of each capture interval, and sequencing according to chromosome and genome positions;

3. analyzing the bam file of the normal population by using a ctCNV algorithm reference module to obtain the baseline level and fluctuation statistics of CNV events of the genes in the normal population capturing interval and the normalized Reads Count (RC) of each capturing interval, namely the CNV statistics scoring GCS file of the plasma genes of the normal population and the COV file of the coverage of the plasma capturing interval of the normal population;

4. analyzing the bam file of the to-be-detected blood plasma cfDNA sample and the capture interval information obtained by processing in the step 2 by using a ctCNV algorithm coverage module, and outputting a capture interval standard RC of the to-be-detected sample, namely a capture interval coverage COV file of the to-be-detected cfDNA sample;

5. statistically analyzing the difference and fluctuation statistical scoring of the gene CNV variation level in the cfDNA sample to be tested and the normal population baseline CNV level by using a ctCNV algorithm signature module to obtain a gene copy number variation statistical scoring GCS file and a capture interval statistical scoring RZ file of the cfDNA sample to be tested;

6. screening gene copy number variation with statistical significance and enough capture intervals by using a call module of a ctCNV algorithm to obtain a copy number variation gene list with statistical significance in a cfDNA sample to be detected;

7. the ctCNV algorithm scatter module can be selectively used for drawing the positions and copy number distribution conditions of the copy number variation of the gene capture interval of the cfDNA sample to be detected on different human chromosomes;

with the solution provided by the above embodiment, the required input file includes: a sequencing data file (bam format, including the name of each sequencing fragment, SAM mark, position information, comparison quality information, CIGAR string, mate pair information, fragment sequence, sequencing quality and the like), a target capture interval file (bed format, including chromosome, target capture interval starting point, terminating point and gene four-column information), and a human reference genome sequence (fasta format) are generated after the steps of comparing, sequencing, filtering, marking repetition and the like are carried out on a sample to be detected and each reference population group sample;

with the scheme provided by the above embodiment, the obtained output file includes: the method comprises the steps that a user self-constructs a COV file (containing coverage information of a person group on each section in a target capturing section) and a GCS file (containing copy number variation statistics scoring of the person group in units of genes), a COV file (containing coverage information of the reference person group and a sample to be detected on each capturing section), an RZ file (containing copy number variation statistics scoring of the sample to be detected on each capturing section), a GCS file (containing copy number variation statistics scoring of the sample to be detected in units of genes), a copy number identification result SCNA file (containing gene copy number identification results of the sample to be detected), a filtered copy number variation identification result SCNA file (containing gene copy number variation identification results of the sample to be detected) and an image file displaying the copy number variation results.

The modules present in the above embodiments are described below, and the present invention implements the overall process using the python language (version number 2.7.13). The algorithm mainly comprises a reference module, a coverage module, a sigtest module, a call module, an autocall module, a scatter module and a gusesex module.

First, it should be noted that, in addition to the above modules, the target capture interval bed file processing module is also included, and the module is an embedded module, and its basic functions are used in both the reference module and the coverage module. The input parameters are provided by a reference module or a coverage module (a target capture interval bed file and a human reference genome sequence fasta file), and the statistics of GC content, interval length and gene information of each genome capture interval contained in the target capture interval bed file are mainly realized, and the sequencing is carried out according to chromosome and genome positions.

Reference group matrix construction module (reference): the module requires input of minimum alignment quality (default to 0 if not input), output of folder path, output of file prefix, bam file of reference population group, target capture interval bed file and human reference genome sequence fasta file. And counting the RC of the reference human group in each target capture interval by using a bed file processing module and a reference module, and calculating the sequencing depth of each sample after the data volume standardization, the GC content standardization and the capture interval length standardization in each interval by using the data. And forming a matrix by the RC and the standardized sequencing depth together and outputting the matrix into a COV file format. The specific standardization method is shown in the above steps S151 to S155 and the sub-steps S1511, S1531, and S151 of these steps, and will not be described herein again.

On the basis of calculating the standardized sequencing depth of each interval of each sample of the reference human group, the reference human group matrix construction module (reference) is also used for calculating a Z-Score (RZ) taking a target capture interval as a unit, merging RZ values of the target capture intervals covering the same Gene, calculating a Gene Center Score (GCS) of the fluctuation of the sequencing depth of the Gene level, calculating a GCS threshold value aiming at the Gene level according to a preset confidence level, forming a GCS matrix and outputting a GCS file. Prefixes of output paths and output file names of the output COV file and the GCS file are specified by input parameters. The specific method is shown in the above steps S1711 and S1731, and is not described herein again.

A COV matrix construction module (coverage) of a sample to be detected: the module requires inputting the minimum alignment quality (default is 0 if not inputting), outputting the file prefix, the bam file of the reference human group, the path of the bed file of the target capture interval and the fasta file of the human reference genome sequence. By using the bed file processing module, the RC of the sample to be detected in each target capturing interval is counted, the data is used for calculating the sequencing depth of each sample after the data volume standardization, the GC content standardization and the capturing interval length standardization in each interval, and the RC and the standardized sequencing depth jointly form a matrix to output a COV file. The interval standardized sequencing depth calculation method is the same as the reference module. The output path is the current folder, and the prefix of the output file is specified by the input parameters.

GCS calculation module (sigtest) of the sample to be tested: the module requires the input and output file prefix, the reference people group and the path of the COV file of the sample to be tested. The method comprises the steps of calculating Z-score by taking a target capture interval as a unit by utilizing a reference human group and a COV matrix of a sample to be detected, generating an RZ file, calculating GCS by taking a gene as a unit by utilizing the Z-score, and generating a GCS file of the sample to be detected. The RZ and GCS calculation method is the same as the reference module. The output path is the current folder, and the prefix of the output file is specified by the input parameters. The specific method is shown in step S1751 and step S1771, which are not described herein again.

Result output module (call): the module requires the input and output file prefixes, the reference person group and the path of the sample GCS file to be tested. The method comprises the steps of calculating the copy number Ratio of a sample to be detected relative to a reference person group by taking genes as a unit by utilizing information of the reference person group, the sample to be detected GCS and the like, and outputting an SCNA file containing information of GCS, the copy number Ratio (log2Ratio), the copy number (copy number) of the gene of the sample to be detected, the number of target capturing sections (region number) of each gene and the like. The module also finally outputs a file containing the genes which are determined to have copy number variation and the detailed information of the genes on the capturing intervals on the basis of the number of the target capturing intervals of each gene and the GCS (filtering conditions: region number >5, GCS > 95% quantile of GCS of normal population). The output path is the current folder, and the prefix of the output file is specified by the input parameters.

Automated copy number variation identification module (autocall): the COV matrix construction module, the GCS calculation module and the result output module of the sample to be tested are integrated, all identification steps (coverage module + sigtest module + call module) can be completed at one time by inputting the information required by all the three modules at one time, and the copy number identification result is output.

Variant identification visualization module (scatter): the module requires the input of a path of a reference population group, a COV file of a sample to be tested, and a copy number variation result (SCNA) file of the sample to be tested after filtering. The method utilizes information provided in an input file, takes the chromosome as a horizontal axis (the target capture interval is a unit) and the copy number as a vertical axis, draws a scatter diagram to display the copy number of each target capture interval, enables the distribution of copy number amplified and deleted genes on each chromosome to be clear at a glance, and facilitates the explanation of an identification result.

Gender determination module (gusatex): the module requires inputting the minimum comparison quality (default is 0 if not inputting), outputting the file prefix, the bam file of the reference human group, the path of the bed file of the target capture interval and the fastq file of the human reference genome sequence. The method comprises the steps of calling a bed file processing module, counting the RC of a reference human group in each target capturing interval, and judging the sex of a sample by using the RC distribution of the sample to be detected on a sex chromosome and outputting a result. The output path is the current folder, and the prefix of the output file is specified by the input parameters.

The present invention provides two copy number detection modes of operation. The first method comprises the steps of respectively operating a COV matrix construction module (coverage module) of a sample to be tested, a GCS calculation module (sigtest module) of the sample to be tested and a result output module (call module); the second is an automatic copy number variation identification module (autocall module) which integrates the three modules, the automatic copy number variation identification module is operated in a single-thread mode, and only one sample to be detected can be identified at one time.

The embodiment can know that the scheme combines the reference of the normal population baseline gene CNV and the fluctuation level thereof, RC value standardization in the capture interval and gene level copy number statistical scoring, and can effectively improve the CNV detection sensitivity and accuracy of plasma ctDNA.

The scheme provided by the above embodiment is verified by way of example below.

1. Cell line culture

The cell lines NCI-N87(ATCC CRL-5822) and BEAS-2B (ATCC CRL-9609) were purchased from Bai Biotech Co., Ltd, of Nanjing Ke, and cultured according to the provided instructions by adding 10% fetal bovine serum to RPMI-1640 medium and culturing at 37 ℃.

2. Cell DNA extraction

After collecting the cell suspension, the cell suspension was centrifuged at 300g at room temperature for 5 minutes, the supernatant was discarded, the cells were resuspended with 200uLPBS, and then genomic DNA extraction was performed using the QIAamp DNA Mini Kit (cat # 51304; Qiagen, Germany). After lysis, the DNA was purified by column chromatography and eluted with low-TE buffer.

3. Determination of the copy number of ERBB2 in both cell lines by ddPCR

The ddPCR experiment was performed using genomic DNA extracted from the cells as a template. ddPCR was performed using a Berry instrument, a commercial probe, and a reaction system. The reaction system comprises the following components: 10ul ddPCR supermix for probes (no dUTP),1ul ERBB2 probe, 1ul EIF2C1 probe, and 20ng of test DNA. After the reaction system was prepared, chyle was generated according to the method of use of the apparatus, and chyle was aspirated to a 96-well PCR plate and Heat-sealed with a pierce Foil Heat Seal. The conditions of the PCR reaction were: activating enzyme at 95 deg.C for 8 min; melting at 94 ℃ for 30s, annealing and extending at 55 ℃ for 1min, and performing 39 cycles; inactivating enzyme at 98 deg.C for 10 min; and preserving heat at 4 ℃. After PCR amplification, the berle droplet reader reads the number of fluorescent droplets in each reaction well. Ultrapure water was used as a negative control for each batch of reactions. Three duplicate wells are made for each DNA to be tested as a technical duplicate.

The copy number of ERBB2 of BEAS-2B was 2.16, while the copy number of ERBB2 of NCI-N87 was 125.

4. Preparation of ERBB2 copy number gradient samples

The two cell lines were mixed in mass percent as in table one, plasma cfDNA samples were simulated, 10 samples were prepared, and the expected ERBB2 copy number was calculated.

Watch 1

	NCI-N87	BEAS-2B	Anticipating CNV
				sample1	100％	0.000％	125
sample 2	20％	80.000％	26.6
				sample 3	10％	90.000％	14.3
sample 4	5％	95.000％	8.15
				sample 5	2％	98.000％	4.46
sample 6	1％	99.000％	3.23
				sample 7	0.50％	99.500％	2.615
sample 8	0.20％	99.800％	2.246
				sample 9	0.10％	99.900％	2.123
sample 10	0％	100％	2

5. ddPCR of ERBB2 copy number gradient samples

The ddPCR experiments of 10 samples in the above table were performed in the same manner, and the experimental results obtained can be repeated as shown in Table II, using the cell line DNA samples mixed in mass ratio as the template, adding 20ng of sample DNA to each reaction system, and repeating three wells for each sample.

Watch two

6. Library construction, capture and sequencing of ERBB2 copy number gradient samples

The 10 sample DNAs are firstly broken into DNA fragments of about 200bp by covaris ultrasound, after qubit fluorescence quantification, 20ng of the fragmented DNA is taken as an initial amount to construct a library by adopting KAPA hyper prediction kit (Roche corporation), and the library is obtained by end repair, poly A addition at the 3' end, connection sequencing joint, unbiased amplification of 9 cycles and purification. The details are as follows:

6.1, flush ends and addition of A at the 3' end, the reaction system is as shown in Table three:

watch III

Reagent	Volume of
		Fragmented,double-stranded DNA	50μL
End Repair&A-Tailing Buffer	7μL
		End Repair&A-Tailing Enzyme Mix	3μL
Total volume	60μL

The Buffer and enzyme should be mixed in the EP tube in advance, and vortex-mixed with DNA and then reacted as follows. The reaction steps are shown in table four:

watch four

This step of operation sets the PCR tube lid temperature to 85 ℃ instead of 105 ℃. If the next experiment is performed immediately after the end of the run, the end temperature should be set at 20 ℃ instead of 4 ℃.

6.2, connecting the linker, and adopting 7.5uM linker for 20ng of DNA according to the guidance of the library construction instruction. The reaction system was prepared as per table five: watch five

Reagent	Volume of
		Reaction product	60μL
Volume of joint	5μL
		Ultrapure water	5μL
Connection Buffer	30μL
		DNA ligase	10μL
Total volume	110μL

Buffer and enzyme were mixed well in an EP tube, vortexed, centrifuged, and incubated at 20 ℃ for 15 minutes.

6.3, purification after connection: 88ul of Agencour AMPure XP purified magnetic beads are added into the reaction system (110ul) in the last step, and the mixture is fully vortexed, shaken and slightly centrifuged. Adsorbing for 5-15 minutes at room temperature, fully combining the DNA and the magnetic beads, placing the EP tube to a magnetic rack for adsorbing, clarifying the liquid, slowly absorbing the supernatant in the EP tube, and discarding. The EP tubes were incubated with 200. mu.L of 80% ethanol for 30 seconds and the ethanol was slowly aspirated and discarded. The ethanol washing of the magnetic beads was repeated once. The EP tube is dried for 3-5 minutes at room temperature until the ethanol is completely volatilized. The EP tube was removed from the magnetic frame, 22. mu.L of ultrapure water was added, vortexed, centrifuged slightly and incubated at room temperature for 2 minutes to elute DNA, the EP tube was placed on the magnetic frame and adsorbed until the liquid was clear, the supernatant was transferred to a new EP tube, 1. mu.L of the supernatant was taken to measure the DNA concentration, and the remainder was amplified.

6.4, PCR amplification: the PCR system was prepared as in Table six.

Watch six

Reagent	Volume of
		KAPA HiFi HotStart ReadyMix(2X)	25μL
KAPA Library Amplification Primer Mix	5μL
		(10X)*
Linker ligation libraries	20μL
		Total volume	50L

After sufficient shaking and rapid centrifugation, PCR reactions were performed according to the seven conditions in Table.

Watch seven

Step (ii) of	Temperature of	Time of day	Number of cycles
				Pre-denaturation	98℃	45Sec	1
Denaturation of the material	98℃	15Sec
				Annealing	60℃	30Sec	9
Extension of	72℃	30Sec
				Finally extend	72℃	1min	1
Preservation of	4℃	∞	1

6.5, purification after amplification: adding Agencour AMPure XP purified magnetic beads (50ul) with the same volume as that of a PCR reaction system, fully whirling and oscillating, slightly centrifuging, and adsorbing at room temperature for 5-15 minutes to fully combine the DNA with the magnetic beads. The EP tube is placed on a magnetic frame to be adsorbed until the liquid is clear, and the supernatant in the EP tube is slowly sucked and discarded. 200 μ L of 80% ethanol was added to the EP tube and incubated for 30 seconds, and the ethanol in the EP tube was slowly aspirated and discarded. The ethanol washing of the magnetic beads was repeated once. The EP tube is dried for 3-5 minutes at room temperature until the ethanol is completely volatilized. The EP tube was removed from the magnetic stand, 52. mu.L of ultrapure water was added, vortexed, and slightly centrifuged. The DNA was eluted by incubation at room temperature for 2 minutes, the EP tube was placed on a magnetic rack and the solution was clarified, the supernatant was transferred to a new EP tube, and 1. mu.L of the supernatant was taken for DNA concentration determination.

6.6 before sequencing, a probe capture method is adopted, and a Roche NimbleGen probe is used for enriching and further amplifying a target interval containing the ERBB2 gene to obtain a library of the target interval. And performing computer sequencing after q-PCR quantification.

7. Processing offline fastq data into input files usable by each software

After downloading data, firstly processing the downloaded data from the fastq file into a bam file, wherein the specifically used software and steps are as follows:

7.1 comparison

Calling bwa-0.7.12mem to align each pair of fastq files as paired Reads to the hg19 human reference genome sequence, and generating an initial bam file without using other parameter options except-M parameters and the ID of a specified Reads Group;

7.2 ordering

The SortSam module of picard-2.1.0 is invoked to SORT the initial bam files by chromosome position with the parameter set to "SORT.

7.3 screening

And calling samtools-1.3 view to screen the sorted bam files, and adopting '-F0 x 900' as a parameter.

7.4 creating an index

And calling an index module of samtools-1.3 to establish an index for the finally generated bam file, and generating a bai file matched with the bam file after the marking is repeated.

7.5, repeat labeling

Calling a MarkDuplicates module of picard-2.1.0 to mark the repeated sequences in the screened bam file, filtering the repeated sequences during subsequent copy number analysis, and analyzing by using the data after duplication removal;

8. identification of copy number of ERBB2 copy number gradient samples with other software

8.1、CNVkit

30 blood cell samples with no abnormal copy number are selected as a reference sample group, and capture sequencing and preprocessing of sequencing data are carried out in the same manner. Taking the bam files of 30 samples and the bed files for recording the capture interval, the human reference genome sequence fasta files and the bam files of each ERBB2 copy number gradient sample as input files, adopting default parameters, respectively identifying the copy number of each ERBB2 copy number gradient sample, and obtaining the log2Ratio of each sample relative to the copy number of the reference sample group. Since the CNVkit has no gene-based identification result, the log2Ratio of the CNV fragment containing the ERBB2 gene is taken as the identification result. The log2Ratio is converted to a copy number. The results are shown in table eight:

table eight

Compared with the identification result of ddPCR, the identification result of sample 1 is lower, the results of samples 2-6 are relatively accurate, and the results of samples 6-9 are greatly different. For samples with copy numbers below 3, the CNVkit sensitivity was poor.

8.2、Control-FREE

In identifying copy number variation, sample 10 (i.e., the BEAS-2B cell line) was used as a reference sample, and the bam file of this sample, the bed file of the capture interval, the human reference genomic sequence fasta file, and the bam file of each ERBB2 copy number gradient sample were used as input files. And (3) adopting Control-FREE official recommendation parameters aiming at the captured sequencing data for all parameters, and respectively trying two modes of non-correction identification and correction identification according to purity. Since there was no gene-based identification, the CNV fragment containing ERBB2 gene was taken as the identification of ERBB 2. The results are shown in Table nine:

watch nine

9. Compared with the identification result of ddPCR, the results of samples 1-6 are relatively accurate without correction, and one of three replicates of sample7 is unidentified, and neither sample 8 nor sample 9 is identified as CNV. For samples with copy numbers below 3, Control-FREE was less sensitive. In addition, when the purity is estimated and the result is corrected according to the purity, the deviation of the identification of the purity by the Control-FREE from the actual condition is large, the difference between the corrected result and the result of ddPCR is larger than that in the case of no correction, and the reliability is very poor.

Identification of copy number of ERBB2 copy number gradient samples with the present invention

30 blood cell samples with no abnormal copy number are selected as reference human group samples, and capture sequencing and preprocessing of sequencing data are carried out on the blood cell samples in the same mode. The bam files of 30 samples, the bed files of the recording capture interval and the fasta files of the human reference genome sequence are used as input files, and the COV and GCS files of the reference human group are generated by adopting the reference module.

Inputting a bam file of ERBB2 copy number gradient samples and COV and GCS files of reference people groups, respectively identifying the copy number of the gene covered by the captured interval of each sample by adopting the autocall module of the invention, and obtaining the RZ, COV and GCS files of each sample and two final SCNA result files.

According to the SCNA result file, the identification result of the ERBB2 gene copy number of each concentration gradient sample in the invention is shown in the table ten:

watch ten

As can be seen from the comparison of Table eight, Table nine and Table ten, using the protocol provided by the present invention, it was possible to stably identify the ERBB2 copy number amplification in sample 1-9, and the results of sample 2-9 were all very accurate, except for a certain deviation between the results of sample 1 and ddPCR. Even for samples with very low copy numbers, the invention can stably identify the copy numbers, carry out the statistical test on GCS and give reliable and sensitive test results. Compared with other CNV identification algorithms, the method is more stable, has higher sensitivity and lower detection lower limit no matter the CNV identification is carried out on high and low copy numbers, and is more suitable for the CNV identification of ctDNA samples.

Furthermore, a scatter module can be used to perform chromosome partition copy number variation display on CNV identification results of sample 3 and sample 5, fig. 3 is a schematic diagram of CNV identification result display according to an embodiment of the present application, and it can be known from fig. 3 that the ordinate in the diagram represents copy number (copy number), the abscissa represents CNV sample, and CNV of ERBB2 can be clearly seen on chromosome 17, where ERBB2 copy number of sample 3 is about 7, and ERBB2 copy number of sample 5 is about 4.

Example 2

According to an embodiment of the present invention, there is provided an apparatus for detecting copy number variation, and fig. 4 is a schematic diagram of an apparatus for detecting copy number variation according to an embodiment of the present invention, and with reference to fig. 4, the apparatus includes:

and the determining module 40 is used for determining a normal population sequencing result and a target to be detected sequencing result respectively according to the normal population gene data and the target to be detected gene data.

The obtaining module 42 is configured to obtain a first capture interval parameter and a second capture interval parameter, where the first capture interval parameter is used to represent a parameter of a normal population sequencing result in a capture interval, and the second capture interval parameter is used to represent a parameter of a target to be tested sequencing result in the capture interval.

The normalization processing module 44 is configured to perform normalization processing on the first capture interval parameter and the second capture interval parameter respectively to obtain a first depth of the first capture interval parameter and a second depth of the second capture interval parameter.

A first scoring module 46, configured to determine a first copy number variation score according to the first depth, where the first copy number variation score is used to characterize a fluctuation range of the copy number variation of the normal population gene data.

And a second scoring module 48, configured to determine a second copy number variation score according to the first depth and the second depth, where the second copy number variation score is used to characterize a fluctuation condition of the copy number variation of the target gene data to be detected.

And the detection module 50 is configured to determine a copy number variation detection result of the target gene data to be detected according to the first copy number variation score and the second copy number variation score.

Example 3

According to an embodiment of the present invention, a storage medium is provided, which includes a stored program, wherein when the program runs, the apparatus in which the storage medium is controlled to execute the method for detecting copy number variation of circulating tumor DNA described in embodiment 1.

Example 4

According to an embodiment of the present invention, a processor is provided, wherein the processor is configured to execute a program, and when the program is executed, the method for detecting copy number variation of circulating tumor DNA according to embodiment 1 is performed.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.

The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims

1. A method for detecting copy number variation in circulating tumor DNA, comprising:

determining a sequencing result of the normal population and a sequencing result of the target to be detected respectively through gene data of the normal population and gene data of the target to be detected;

acquiring a first capture interval parameter and a second capture interval parameter, wherein the first capture interval parameter is used for representing the parameter of the normal population sequencing result on a capture interval, and the second capture interval parameter is used for representing the parameter of the target sequencing result to be detected on the capture interval;

respectively carrying out standardization processing on the first capture interval parameter and the second capture interval parameter to obtain a first depth of the first capture interval parameter and a second depth of the second capture interval parameter;

determining a first copy number variation score according to the first depth, wherein the first copy number variation score is used for representing the fluctuation range of the copy number variation of the normal population gene data;

determining a second copy number variation score according to the first depth and the second depth, wherein the second copy number variation score is used for representing the fluctuation condition of the copy number variation of the target gene data to be detected;

comparing the second copy number variation score with the first copy number variation score to determine a copy number variation detection result of the target gene data to be detected;

wherein the capture interval parameters include: the method comprises the following steps of sequencing read length number, GC content and interval length, wherein the corresponding depths of the sequencing read length number, the GC content and the interval length are respectively as follows: the method comprises the following steps of respectively standardizing the first capture interval parameter and the second capture interval parameter to obtain a first depth of the first capture interval parameter and a second depth of the second capture interval parameter, wherein the coverage, the GC content depth and the interval length depth comprise:

standardizing the sequencing read number of each capture interval to obtain the coverage of the first capture interval parameter and the coverage of the second capture interval parameter;

carrying out standardization processing on the GC content according to the coverage and the GC content to obtain the GC content depth of the first capture interval parameter and the GC content depth of the second capture interval parameter;

and normalizing the interval length according to the GC content depth and the interval length to obtain the interval length depth of the first capture interval parameter as the first depth and the interval length depth of the second capture interval parameter as the second depth.

2. The method of claim 1, wherein normalizing the number of sequencing reads for each capture interval to obtain a coverage of the first capture interval parameter and a coverage of the second capture interval parameter comprises:

normalizing the number of sequencing reads for each capture interval by the following formula to obtain the coverage of the first capture interval parameter and the coverage of the second capture interval parameter:

；

wherein,

is the coverage after the sequencing read length number target normalization of the ith capture interval;

is the number of sequencing reads covered on the ith capture interval;

the median number of sequencing reads for capture intervals located on autosomes of all capture intervals.

3. The method of claim 2, wherein normalizing the GC-content according to the coverage and the GC-content to obtain a GC-content depth of the first capture interval parameter and a GC-content depth of the second capture interval parameter comprises:

normalizing the GC content by the following formula to obtain the GC content depth of the first capture interval parameter and the GC content depth of the second capture interval parameter:

；

wherein,

is the GC content depth after GC content normalization of the ith capture interval;

the depth after the sequencing read length number target normalization of the ith capture interval;

the sliding median of the depth after the sequencing read number target of the ith interval is normalized after all the capture intervals are sequenced from large to small according to GC content.

4. The method of claim 3, wherein normalizing the interval length according to the GC content depth and the interval length to obtain an interval length depth of the first capture interval parameter and an interval length depth of the second capture interval parameter comprises:

normalizing the section length by the following formula to obtain the section length depth of the first capture section parameter and the section length depth of the second capture section parameter:

；

wherein,

is the section length depth after the section length standardization of the ith capture section;

is the depth after the GC content of the ith capture interval is normalized;

the sliding median of the depth after the GC content of the ith interval is normalized after all the capture intervals are sorted from large to small according to the interval length.

5. The method of claim 1, wherein determining a first copy number variation score according to the first depth comprises:

determining a third copy number variation score of the normal population according to the first depth, wherein the third copy number variation score is used for representing the fluctuation range of the copy number variation of the normal population gene data in each capture interval;

determining the first copy number variation score according to the third copy number variation score.

6. The method of claim 5, wherein determining a third copy number variation score for the normal population from the first depth comprises:

determining a third copy number variation score for the normal population from the first depth by:

；

wherein,

(ii) is the third copy number variation score for the ith capture interval of the single sample from the normal population;

the length depth of the section after the ith capturing section of the single sample of the normal population is standardized;

is the mean value of the length and depth of the interval after the normalization of the ith capturing interval of all normal people;

is the standard deviation of the interval length depth of all normal population after the ith capture interval is normalized.

7. The method of claim 6, wherein determining the first copy number variation score from the third copy number variation score comprises:

determining the first copy number variation score from the third copy number variation score by:

；

wherein,

is the first copy number variation score of the gene g of the single sample of the normal population, and n is the number of capture intervals that the gene g has.

8. The method of claim 1, wherein determining the second copy number variation score from the first depth and the second depth comprises:

determining a fourth copy number variation score of the target to be detected according to the first depth and the second depth, wherein the fourth copy number variation score is used for representing the fluctuation range of the copy number variation of the target gene data to be detected in each capture interval;

determining the second copy number variation score according to the fourth copy number variation score.

9. The method of claim 8, wherein determining a fourth copy number variation score for the test object based on the first depth and the second depth comprises:

determining a fourth copy number variation score for the normal population by calculating:

；

wherein，

Is the fourth copy number variation score of the ith capture interval of the target to be detected;

the length depth of the section after the ith capture section of the target to be detected is standardized;

is the mean value of the interval length and depth of the normal population after the normalization of the ith capturing interval;

is the standard deviation of the interval length depth of the normal population sample after the normalization of the ith capture interval.

10. The method of claim 9, wherein determining the second copy number variation score from the fourth copy number variation score comprises:

determining the second copy number variation score from the fourth copy number variation score by:

；

wherein,

is the second copy number variation score of the target sample gene g to be detected, and n is the number of capture intervals of the gene g.

11. The method of claim 1, wherein determining the copy number variation detection result of the target gene data to be detected according to the first copy number variation score and the second copy number variation score comprises:

judging whether the second copy number variation score belongs to a fluctuation range determined according to the first copy number variation score;

and under the condition that the judgment result is that the target gene data does not belong to the target gene data, determining the copy number mutation of the target gene data to be detected.

12. The method of claim 1, wherein determining the normal population sequencing result and the target to be tested sequencing result from normal population genetic data and target to be tested genetic data, respectively, comprises:

comparing the normal population gene data and the target gene data to be detected with human reference genome data respectively to obtain a first comparison result of the normal population gene data and a second comparison result of the target gene data to be detected;

and respectively sequencing and repeatedly marking the first comparison result and the second comparison result to obtain the sequencing result of the normal population and the sequencing result of the target to be detected.

13. The method of claim 1, wherein after determining the copy number variation detection result of the target gene data to be detected according to the first copy number variation score and the second copy number variation score, the method further comprises:

determining the copy number ratio of the target to be detected relative to the normal population according to the first copy number variation score and the second copy number variation score;

and determining the copy number of the target gene data to be detected in the capturing interval according to the copy number ratio and the copy number of the normal population gene data in the capturing interval.

14. The method of any one of claims 1 to 13, wherein after determining the copy number variation detection result of the target gene data to be detected according to the first copy number variation score and the second copy number variation score, the method further comprises: and determining the sex parameter of the target to be detected according to the second depth of the target to be detected.

15. An apparatus for detecting copy number variation, comprising:

the determining module is used for determining a normal population sequencing result and a target to be detected sequencing result respectively according to the normal population gene data and the target to be detected gene data;

the acquisition module is used for acquiring a first capture interval parameter and a second capture interval parameter, wherein the first capture interval parameter is used for representing the parameter of the normal population sequencing result on a capture interval, and the second capture interval parameter is used for representing the parameter of the target sequencing result to be detected on the capture interval;

the standardization processing module is used for respectively carrying out standardization processing on the first capture interval parameter and the second capture interval parameter to obtain a first depth of the first capture interval parameter and a second depth of the second capture interval parameter;

a first scoring module, configured to determine a first copy number variation score according to the first depth, where the first copy number variation score is used to characterize a fluctuation range of copy number variation of the normal population genetic data;

a second scoring module, configured to determine a second copy number variation score according to the first depth and the second depth, where the second copy number variation score is used to characterize fluctuation of copy number variation of the target gene data to be detected;

the detection module is used for determining a copy number variation detection result of the target gene data to be detected according to the first copy number variation score and the second copy number variation score;

wherein the capture interval parameters include: the method comprises the following steps of sequencing read length number, GC content and interval length, wherein the corresponding depths of the sequencing read length number, the GC content and the interval length are respectively as follows: the coverage degree, the GC content depth and the interval length depth, wherein the standardization processing module is further used for carrying out standardization processing on the sequencing read number of each capture interval to obtain the coverage degree of the first capture interval parameter and the coverage degree of the second capture interval parameter; carrying out standardization processing on the GC content according to the coverage and the GC content to obtain the GC content depth of the first capture interval parameter and the GC content depth of the second capture interval parameter; and normalizing the interval length according to the GC content depth and the interval length to obtain the interval length depth of the first capture interval parameter as the first depth and the interval length depth of the second capture interval parameter as the second depth.

16. A storage medium comprising a stored program, wherein the program, when executed, controls a device on which the storage medium is located to perform the method for detecting copy number variation in circulating tumor DNA according to any one of claims 1 to 14.

17. A processor configured to execute a program, wherein the program is configured to execute the method for detecting copy number variation of circulating tumor DNA according to any one of claims 1 to 14.