Example 1
In accordance with an embodiment of the present invention, there is provided an embodiment of a method for detecting copy number variation, it is noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer executable instructions, and that while a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than that presented herein.
Fig. 1 is a flowchart of a method for detecting copy number variation according to an embodiment of the present invention, as shown in fig. 1, the method including the steps of:
and step S11, determining a normal population sequencing result and a target to be detected sequencing result respectively according to the normal population gene data and the target to be detected gene data.
Specifically, the normal population gene data can be capture sequencing fastq data of healthy populations, the target to be detected can be any one of the populations to be detected, and the target gene data to be detected can be capture sequencing fastq data of a cfDNA sample of a tumor patient to be detected.
It should be noted that the gene data of the normal population or the target to be detected may be extracted from blood of the normal population or the target to be detected, or may be extracted from other body fluids such as ascites, pleural effusion, cerebrospinal fluid and the like of the normal population or the target to be detected, which is not limited herein.
In an optional embodiment, capturing and sequencing fastq data of a healthy population and capturing and sequencing fastq data of a plasma cfDNA sample of a tumor patient to be detected are obtained, the two fastq data are respectively compared with preset human reference genome fasta data, sequencing and marking repeated processing are carried out on comparison results, a normal population sequencing result and a target sequencing result to be detected are obtained, and both the normal population sequencing result and the target sequencing result to be detected can be bam files.
And step S13, acquiring a first capture interval parameter and a second capture interval parameter, wherein the first capture interval parameter is used for representing the parameter of the normal population sequencing result on the capture interval, and the second capture interval parameter is used for representing the parameter of the target to be tested sequencing result on the capture interval.
Specifically, the first capture interval parameter and the first capture interval parameter may both include: number of sequencing Reads (RC), GC content, and interval length.
In an alternative embodiment, the foregoing steps may be executed by the bed file processing module provided in this embodiment, the bed file processing module inputs the target capture interval bed file and the human reference genome sequence fasta file, and the bed file processing module processes the target capture interval bed file, so that the GC content, the interval length, and the located gene information of each capture interval can be obtained, and the target capture interval bed file is sorted according to the chromosome and the genome position.
The specific treatment method can be as follows: the human reference genome sequence (fasta file) contains sequence composition information of all human chromosomes, including A, T, C, G base distribution and position information of each base. The bed file processing module can stick each capture interval in the bed file back to the human reference genome sequence, and count the length between the start point and the end point of the capture interval (the end point minus the start point), the GC content percentage of the sequence (the number of G and C bases and/length of the capture interval), and the module can simultaneously extract the unique gene name of each capture interval to obtain the gene information.
Step S15, normalizing the first capture interval parameter and the second capture interval parameter, respectively, to obtain a first capture interval parameter depth and a second capture interval depth.
In particular, the normalization process is used to reduce the impact between multiple capture interval parameters. In an alternative embodiment, the capturing interval parameters including RC, GC content, and interval length are used to illustrate that the RC can be normalized by the GC content and interval length in the capturing interval parameters.
Step S17, determining a first copy number variation score according to the first depth, wherein the first copy number variation score is used to characterize a fluctuation range of the copy number variation of the normal population gene data.
Specifically, the above steps are used to establish a baseline level of copy number for the normal population. The normal population has a plurality of normal individual, and the fluctuation range of the normal population gene data with copy number variation can be obtained according to the first copy number variation score of each normal individual in the normal population.
And step S19, determining a second copy number variation score according to the first depth and the second depth, wherein the second copy number variation score is used for representing the fluctuation condition of the copy number variation of the target gene data to be detected.
Specifically, in the above step, the second copy number variation score is determined by both the first depth and the second depth, so that the deviation degree of the copy number of the target to be detected in the capture interval from the baseline level of the normal population can be determined, and the second copy number variation score is used as a reference for detecting the copy number variation of the target to be detected.
And step S111, comparing the second copy number variation score with the first copy number variation score, and determining the copy number variation detection result of the target gene data to be detected.
It should be noted here that in interpreting the results of DNA copy number variation, the user is actually more concerned about the gene level copy number mutation. At present, the final output results of all copy number variation identification software do not take genes as copy number variation units, but determine the breakpoint of each mutation through an algorithm, so that the copy number variation results aiming at the genes cannot be given. In addition, when the software is used for identification, the method mainly depends on setting a copy number threshold value, and whether variation exists is judged according to whether the copy number of the sample to be detected exceeds the threshold value, so that the method is very unfavorable for identifying the cfDNA sample with low ctDNA concentration, and the sensitivity is very low. The scheme of the invention takes the distribution situation of the copy number in the population as a standard, and identifies whether the copy number variation exists or not through statistical test. When the sample gene passes through statistical test and proves that the copy number of a sample to be tested and a reference human group on a certain gene has a significant difference, the identification result of copy number mutation can be given, so that the problem of low detection sensitivity caused by setting an improper copy number threshold value under the condition of unknown ctDNA concentration is solved, and each copy number mutation identified by the sample gene can be ensured to be reliable.
As can be seen from the above, in the embodiments of the present application, the gene copy number variation score of the normal population is calculated, and the score range of the copy number variation of the normal population is determined according to the gene copy number variation score of the normal population, so that the gene copy number of the target to be detected is detected according to the range determined by the copy number variation score, and thus the technical problem of poor detection accuracy of the copy number variation of the circulating tumor DNA in the prior art due to setting of an inappropriate copy number threshold is solved, and the purpose of improving the detection accuracy of the copy number variation of the circulating tumor DNA is achieved.
Optionally, according to the above embodiment of the present application, the capturing interval parameter includes: the number of sequencing reads, the GC content and the interval length, wherein the corresponding depths of the number of sequencing reads, the GC content and the interval length are respectively as follows: the method comprises the following steps of respectively standardizing a first capture interval parameter and a second capture interval parameter to obtain a first depth of the first capture interval parameter and a second depth of the second capture interval parameter, wherein the coverage, the GC content depth and the interval length depth comprise:
step S151, the number of sequencing reads of each capture interval is standardized to obtain the coverage of the first capture interval parameter and the coverage of the second capture interval parameter.
Step S153, carrying out standardization processing on the GC content according to the coverage and the GC content to obtain the GC content depth of the first capture interval parameter and the GC content depth of the second capture interval parameter.
Step S155, standardizing the section length according to the GC content depth and the section length to obtain the section length depth of the first capture section parameter and the section length depth of the second capture section parameter.
In the above steps, the autosomal sequencing overlay data are normalized, GC content normalized and capture interval length normalized. The method adopts RC as the measurement of the coverage of the capture interval, and is influenced by GC content and the length of the capture interval. In order to avoid bias error caused by estimation of coverage by two factors, the method calculates the GC content and the interval length of each capturing interval, and utilizes the information to carry out standardization processing on the RC.
It should be noted here that there exists in the prior art authentication software that uses a group of people as a reference sample, but it usually adopts a median or a mean value to summarize the coverage of the reference group on each segment into a numerical value, and although this numerical value can reflect a part of the information of the overall distribution, the information related to the fluctuation of the coverage in the group of people is lost, and when the number of the reference group samples is more, the lost information is more. In the scheme of the application, the coverage is obtained by performing standardized processing on the sequencing read-length data, and the information is reserved, so that the deviation degree of the target to be detected and the reference population group baseline level can be measured more accurately, and the more population group samples, the more obvious the advantages of the identification software in the prior art are.
Next, the normalization method of the above three data will be described in order.
Optionally, according to the foregoing embodiment of the present application, the normalizing the number of sequencing reads of each capture interval to obtain the coverage of the first capture interval parameter and the coverage of the second capture interval parameter includes:
step S1511, standardizing the number of sequencing reads of each capture interval by the following formula to obtain the coverage of the first capture interval parameter and the coverage of the second capture interval parameter:
DepthAi=log2(RCi)-log2(RCm);
wherein, DepthAiIs the coverage after the sequencing read length number target normalization of the ith capture interval; RC (resistor-capacitor) capacitoriIs the number of sequencing reads covered on the ith capture interval; RC (resistor-capacitor) capacitormThe median number of sequencing reads for capture intervals located on autosomes of all capture intervals.
Optionally, according to the above embodiment of the present application, the obtaining of the GC content depth of the first capture interval parameter and the GC content depth of the second capture interval parameter by normalizing the GC content according to the coverage and the GC content includes:
step S1531, the GC content is normalized by the following formula to obtain the GC content depth of the first capture interval parameter and the GC content depth of the second capture interval parameter:
DepthBi=DepthAi-rolling_median(All_region_GC)i;
wherein, DepthBiIs the GC content depth after GC content normalization of the ith capture interval; DepthAiThe depth after the sequencing read length number target normalization of the ith capture interval; rolling _ mean (All _ region _ GC)iThe sliding median of the depth after the sequencing read number target of the ith interval is normalized after all the capture intervals are sequenced from large to small according to GC content.
Optionally, according to the above embodiment of the present application, normalizing the interval length according to the GC content depth and the interval length to obtain the interval length depth of the first capture interval parameter and the interval length depth of the second capture interval parameter includes:
step S1551, the interval length depth of the first capturing interval parameter and the interval length depth of the second capturing interval parameter are obtained by normalizing the interval lengths according to the following formula:
DepthCi=DepthBi-rolling_median(All_region_size)i;
wherein, DepthCiIs the section length depth after the section length standardization of the ith capture section; DepthBiIs the depth after the GC content of the ith capture interval is normalized; rolling _ mean (All _ region _ size)iThe sliding median of the depth after the GC content of the ith interval is normalized after all the capture intervals are sorted from large to small according to the interval length.
Optionally, according to the above embodiments of the present application, determining the first copy number variation score according to the first depth includes:
step S171, determining a third copy number variation score of the normal population according to the first depth, wherein the third copy number variation score is used for representing the fluctuation range of the copy number variation of the gene data of the normal population in each capture interval;
in step S173, the first copy number variation score is determined according to the third copy number variation score.
The above steps can be executed by a reference module, which is used for analyzing the bam file of the normal population, and obtaining the baseline level and fluctuation statistics of the gene occurrence CNV events on the capture interval of the normal population, and the normalized Reads Count (RC) of each capture interval, i.e. the first copy number variation score (for example: GSC (gene-specific-score) file) of the normal population and the coverage of the plasma capture interval of the normal population (for example: COV file).
Optionally, according to the above embodiments of the present application, determining a third copy number variation score of the normal population according to the first depth includes:
step S1711, determining a third copy number variation score of the normal population by the following formula:
wherein RZiIs the third copy number variation score of the ith capture interval of the single sample of the normal population; depth CiThe length depth of the section after the ith capturing section of the single sample of the normal crowd is standardized; mean (Depth)nc_i) Is the mean value of the length and depth of the interval after the normalization of the ith capturing interval of all normal people; std (Depth)nc_i) Is the standard deviation of the interval length depth of all normal population after the ith capture interval is normalized.
Optionally, according to the above embodiments of the present application, determining the first copy number variation score according to the third copy number variation score includes:
step S1731, determining a first copy number variation score according to the third copy number variation score by the following formula:
wherein, GCSgIs the first copy number variation score of the gene g in a single sample of the normal population, and n is the number of capture intervals g has.
Optionally, according to the above embodiments of the present application, determining the second copy number variation score according to the first depth and the second depth includes:
step S175, determining a fourth copy number variation score of the target to be detected according to the first depth and the second depth, wherein the fourth copy number variation score is used for representing the fluctuation range of the copy number variation of the target gene data to be detected in each capture interval;
in step S177, a second copy number variation score is determined according to the fourth copy number variation score.
The above steps can be performed by using a sigtest module, and the sigtest module is used for statistically analyzing the difference and fluctuation statistical scoring between the gene CNV variation level in the cfDNA sample to be tested and the normal population baseline CNV level, so as to obtain a gene copy number variation statistical scoring GCS file (second copy number variation score) and a capture interval statistical scoring RZ file (namely fourth copy number variation score) of the cfDNA sample to be tested.
Optionally, according to the above embodiments of the present application, determining a fourth copy number variation score of the normal population according to the first depth and the second depth includes:
step S1751, calculating and determining a fourth copy number variation score of the normal population by the following formula:
wherein, RZ'iIs the fourth copy number variation score of the ith capture interval of the target to be detected; depth' CiThe length depth of the section after the ith capture section of the target to be detected is standardized; mean (Depth)nc_i) Is the average value of the interval length and depth of the normal population after the ith capture interval is standardized; std (Depth)nc_i) Is the standard deviation of the interval length depth of the normal population sample after the normalization of the ith capture interval.
Optionally, according to the foregoing embodiment of the present application, determining the second copy number variation score according to the fourth copy number variation score includes:
step S1771, determining a second copy number variation score according to the fourth copy number variation score by the following formula:
wherein, GCSg' is the second copy number variation score of the gene g of the target sample to be tested, and n is the number of capture intervals that the gene g has.
Optionally, according to the above embodiment of the present application, determining a copy number variation detection result of target gene data to be detected according to the first copy number variation score and the second copy number variation score includes:
in step S191, it is determined whether the second copy number variation score falls within a fluctuation range determined according to the first copy number variation score.
Specifically, each sample in the normal population can obtain a first copy number variation score, the fluctuation range can be determined according to the first copy number variation score of each single sample in the normal population, and then whether the second copy number variation score of the target to be detected belongs to the fluctuation range determined by the normal population or not is judged.
And step S193, determining the copy number mutation of the target to be detected under the condition that the judgment result is that the target does not belong to the target.
In the above steps, when the judgment result is that the gene does not belong to the target, determining that the copy number of the target to be detected and the copy number of the normal population on the gene have a significant difference, and obtaining a detection result that the copy number of the gene is mutated.
Optionally, according to the above embodiments of the present application, determining the sequencing result of the normal population and the sequencing result of the target to be detected respectively through the gene data of the normal population and the gene data of the target to be detected includes:
and S113, comparing the normal population gene data and the target gene data to be detected with the human reference genome data respectively to obtain a first comparison result of the normal population gene data and a second comparison result of the target gene data to be detected.
And S115, sequencing and marking the first comparison result and the second comparison result respectively for repeated processing to obtain a sequencing result of the normal population and a sequencing result of the target to be detected.
Specifically, the human reference genome data may be a fasta file, and the positions of the genes of the normal population and the genes of the target to be detected in the human reference genome are determined by the above comparison.
Optionally, according to the embodiment of the present application, after determining the copy number variation detection result of the target gene data to be detected according to the first copy number variation score and the second copy number variation score, the method further includes:
and step S117, determining the copy number ratio of the target to be detected relative to the normal population according to the first copy number variation score and the second copy number variation score.
And step S119, determining the copy number of the target gene data to be detected in the capturing interval according to the copy number ratio and the copy number of the normal population gene data in the capturing interval.
Through the operations in the above steps, the finally output detection result may include the second copy number variation score, the copy number ratio and the specific copy number.
Further, the number of capture intervals and the second copy number variation score may be filtered, for example: the filtration conditions may be: region number >5, GCS > 95% quantile of GCS for normal population, and finally output a file containing genes for which copy number variation is assumed and detailed information on the capture interval for those genes. The genes which are determined to have copy number variation and the detailed information of the genes on the capture interval can be used as the information with statistical significance of the target to be detected to prepare a copy number variation gene list, and the positions and copy number distribution conditions of the copy number variation of the gene capture interval of the cfDNA sample of the target to be detected on different human chromosomes are drawn.
Optionally, according to the embodiment of the present application, after determining the copy number variation detection result of the target gene data to be detected according to the first copy number variation score and the second copy number variation score, the method further includes: and determining the sex parameter of the target to be detected according to the second depth of the target to be detected.
In the above step, the gender of the target to be detected can be judged and the result output by using the RC distribution of the target to be detected on the sex chromosome by counting the RC of the normal human group in each target capturing interval. The judgment standard may be: female: x chromosome copy number >1.8& Y chromosome copy number less than 0.2; male: x chromosome copy number >0.6& Y chromosome copy number greater than 0.5.
The scheme provided by the embodiment of the application can capture different numbers of reads in capture target regions (target regions) of different genes based on capture sequencing. Eliminating the influence of batch and sample quality on the number of captured reads in the gene interval by the standard correction of the sample data volume; the influence of different DNA sequences on the capture efficiency is eliminated through GC content standardization process correction; the data size is normalized by the capture interval length to eliminate the difference bias of the numbers of reads captured by target intervals of different lengths of different genes. Firstly, calculating the CNV standard level fluctuation range and the statistical score of each gene of the plasma of a normal population without cancer by using the corrected capture target interval standardized reads number, then calculating the CNV change multiple and the statistical score of the plasma sample of a cancer patient compared with the baseline of the normal population, finally judging whether the score is obvious or not, comprehensively considering the capture interval number of each gene, and outputting a gene list with the copy number having obvious change.
Fig. 2 is a schematic diagram of a method for detecting copy number variation of circulating tumor DNA according to an embodiment of the present application, and the method for detecting copy number variation of circulating tumor DNA is described below with reference to fig. 2, and mainly includes the following steps:
1. acquiring baseline DNA capture sequencing fastq data of a normal population and capture sequencing fastq data of a plasma cfDNA sample of a tumor patient to be detected, performing sequence comparison on the baseline DNA capture sequencing fastq data of the normal population and the capture sequencing fastq data of the plasma cfDNA sample of the tumor patient to be detected and a human reference genome fasta file by using genome comparison software bwa mem, and performing sequencing and mark repeated processing on the compared bam files by using samtools and picard software to obtain a normal population sequencing result bam file after mark sequencing and a sequencing result bam file of the cfDNA sample to be detected;
2. inputting a target capture interval bed file and a human reference genome sequence fasta file, processing the target capture interval bed file by using a ctCNV algorithm bed file processing module, acquiring GC content, interval length and gene information of each capture interval, and sequencing according to chromosome and genome positions;
3. analyzing the bam file of the normal population by using a ctCNV algorithm reference module to obtain the baseline level and fluctuation statistics of CNV events of the genes in the normal population capturing interval and the normalized Reads Count (RC) of each capturing interval, namely the CNV statistics scoring GCS file of the plasma genes of the normal population and the COV file of the coverage of the plasma capturing interval of the normal population;
4. analyzing the bam file of the to-be-detected blood plasma cfDNA sample and the capture interval information obtained by processing in the step 2 by using a ctCNV algorithm coverage module, and outputting a capture interval standard RC of the to-be-detected sample, namely a capture interval coverage COV file of the to-be-detected cfDNA sample;
5. statistically analyzing the difference and fluctuation statistical scoring of the gene CNV variation level in the cfDNA sample to be tested and the normal population baseline CNV level by using a ctCNV algorithm signature module to obtain a gene copy number variation statistical scoring GCS file and a capture interval statistical scoring RZ file of the cfDNA sample to be tested;
6. screening gene copy number variation with statistical significance and enough capture intervals by using a call module of a ctCNV algorithm to obtain a copy number variation gene list with statistical significance in a cfDNA sample to be detected;
7. the ctCNV algorithm scatter module can be selectively used for drawing the positions and copy number distribution conditions of the copy number variation of the gene capture interval of the cfDNA sample to be detected on different human chromosomes;
with the solution provided by the above embodiment, the required input file includes: a sequencing data file (bam format, including the name of each sequencing fragment, SAM mark, position information, comparison quality information, CIGAR string, mate pair information, fragment sequence, sequencing quality and the like), a target capture interval file (bed format, including chromosome, target capture interval starting point, terminating point and gene four-column information), and a human reference genome sequence (fasta format) are generated after the steps of comparing, sequencing, filtering, marking repetition and the like are carried out on a sample to be detected and each reference population group sample;
with the scheme provided by the above embodiment, the obtained output file includes: the method comprises the steps that a user self-constructs a COV file (containing coverage information of a person group on each section in a target capturing section) and a GCS file (containing copy number variation statistics scoring of the person group in units of genes), a COV file (containing coverage information of the reference person group and a sample to be detected on each capturing section), an RZ file (containing copy number variation statistics scoring of the sample to be detected on each capturing section), a GCS file (containing copy number variation statistics scoring of the sample to be detected in units of genes), a copy number identification result SCNA file (containing gene copy number identification results of the sample to be detected), a filtered copy number variation identification result SCNA file (containing gene copy number variation identification results of the sample to be detected) and an image file displaying the copy number variation results.
The modules present in the above embodiments are described below, and the present invention implements the overall process using the python language (version number 2.7.13). The algorithm mainly comprises a reference module, a coverage module, a sigtest module, a call module, an autocall module, a scatter module and a gusesex module.
First, it should be noted that, in addition to the above modules, the target capture interval bed file processing module is also included, and the module is an embedded module, and its basic functions are used in both the reference module and the coverage module. The input parameters are provided by a reference module or a coverage module (a target capture interval bed file and a human reference genome sequence fasta file), and the statistics of GC content, interval length and gene information of each genome capture interval contained in the target capture interval bed file are mainly realized, and the sequencing is carried out according to chromosome and genome positions.
Reference group matrix construction module (reference): the module requires input of minimum alignment quality (default to 0 if not input), output of folder path, output of file prefix, bam file of reference population group, target capture interval bed file and human reference genome sequence fasta file. And counting the RC of the reference human group in each target capture interval by using a bed file processing module and a reference module, and calculating the sequencing depth of each sample after the data volume standardization, the GC content standardization and the capture interval length standardization in each interval by using the data. And forming a matrix by the RC and the standardized sequencing depth together and outputting the matrix into a COV file format. The specific standardization method is shown in the above steps S151 to S155 and the sub-steps S1511, S1531, and S151 of these steps, and will not be described herein again.
On the basis of calculating the standardized sequencing depth of each interval of each sample of the reference human group, the reference human group matrix construction module (reference) is also used for calculating a Z-Score (RZ) taking a target capture interval as a unit, merging RZ values of the target capture intervals covering the same Gene, calculating a Gene Center Score (GCS) of the fluctuation of the sequencing depth of the Gene level, calculating a GCS threshold value aiming at the Gene level according to a preset confidence level, forming a GCS matrix and outputting a GCS file. Prefixes of output paths and output file names of the output COV file and the GCS file are specified by input parameters. The specific method is shown in the above steps S1711 and S1731, and is not described herein again.
A COV matrix construction module (coverage) of a sample to be detected: the module requires inputting the minimum alignment quality (default is 0 if not inputting), outputting the file prefix, the bam file of the reference human group, the path of the bed file of the target capture interval and the fasta file of the human reference genome sequence. By using the bed file processing module, the RC of the sample to be detected in each target capturing interval is counted, the data is used for calculating the sequencing depth of each sample after the data volume standardization, the GC content standardization and the capturing interval length standardization in each interval, and the RC and the standardized sequencing depth jointly form a matrix to output a COV file. The interval standardized sequencing depth calculation method is the same as the reference module. The output path is the current folder, and the prefix of the output file is specified by the input parameters.
GCS calculation module (sigtest) of the sample to be tested: the module requires the input and output file prefix, the reference people group and the path of the COV file of the sample to be tested. The method comprises the steps of calculating Z-score by taking a target capture interval as a unit by utilizing a reference human group and a COV matrix of a sample to be detected, generating an RZ file, calculating GCS by taking a gene as a unit by utilizing the Z-score, and generating a GCS file of the sample to be detected. The RZ and GCS calculation method is the same as the reference module. The output path is the current folder, and the prefix of the output file is specified by the input parameters. The specific method is shown in step S1751 and step S1771, which are not described herein again.
Result output module (call): the module requires the input and output file prefixes, the reference person group and the path of the sample GCS file to be tested. The method comprises the steps of calculating the copy number Ratio of a sample to be detected relative to a reference person group by taking genes as a unit by utilizing information of the reference person group, the sample to be detected GCS and the like, and outputting an SCNA file containing information of GCS, the copy number Ratio (log2Ratio), the copy number (copy number) of the gene of the sample to be detected, the number of target capturing sections (region number) of each gene and the like. The module also finally outputs a file containing the genes which are determined to have copy number variation and the detailed information of the genes on the capturing intervals on the basis of the number of the target capturing intervals of each gene and the GCS (filtering conditions: region number >5, GCS > 95% quantile of GCS of normal population). The output path is the current folder, and the prefix of the output file is specified by the input parameters.
Automated copy number variation identification module (autocall): the COV matrix construction module, the GCS calculation module and the result output module of the sample to be tested are integrated, all identification steps (coverage module + sigtest module + call module) can be completed at one time by inputting the information required by all the three modules at one time, and the copy number identification result is output.
Variant identification visualization module (scatter): the module requires the input of a path of a reference population group, a COV file of a sample to be tested, and a copy number variation result (SCNA) file of the sample to be tested after filtering. The method utilizes information provided in an input file, takes the chromosome as a horizontal axis (the target capture interval is a unit) and the copy number as a vertical axis, draws a scatter diagram to display the copy number of each target capture interval, enables the distribution of copy number amplified and deleted genes on each chromosome to be clear at a glance, and facilitates the explanation of an identification result.
Gender determination module (gusatex): the module requires inputting the minimum comparison quality (default is 0 if not inputting), outputting the file prefix, the bam file of the reference human group, the path of the bed file of the target capture interval and the fastq file of the human reference genome sequence. The method comprises the steps of calling a bed file processing module, counting the RC of a reference human group in each target capturing interval, and judging the sex of a sample by using the RC distribution of the sample to be detected on a sex chromosome and outputting a result. The output path is the current folder, and the prefix of the output file is specified by the input parameters.
The present invention provides two copy number detection modes of operation. The first method comprises the steps of respectively operating a COV matrix construction module (coverage module) of a sample to be tested, a GCS calculation module (sigtest module) of the sample to be tested and a result output module (call module); the second is an automatic copy number variation identification module (autocall module) which integrates the three modules, the automatic copy number variation identification module is operated in a single-thread mode, and only one sample to be detected can be identified at one time.
The embodiment can know that the scheme combines the reference of the normal population baseline gene CNV and the fluctuation level thereof, RC value standardization in the capture interval and gene level copy number statistical scoring, and can effectively improve the CNV detection sensitivity and accuracy of plasma ctDNA.
The scheme provided by the above embodiment is verified by way of example below.
1. Cell line culture
The cell lines NCI-N87(ATCC CRL-5822) and BEAS-2B (ATCC CRL-9609) were purchased from Bai Biotech Co., Ltd, of Nanjing Ke, and cultured according to the provided instructions by adding 10% fetal bovine serum to RPMI-1640 medium and culturing at 37 ℃.
2. Cell DNA extraction
After collecting the cell suspension, the cell suspension was centrifuged at 300g at room temperature for 5 minutes, the supernatant was discarded, the cells were resuspended with 200uLPBS, and then genomic DNA extraction was performed using the QIAamp DNA Mini Kit (cat # 51304; Qiagen, Germany). After lysis, the DNA was purified by column chromatography and eluted with low-TE buffer.
3. Determination of the copy number of ERBB2 in both cell lines by ddPCR
The ddPCR experiment was performed using genomic DNA extracted from the cells as a template. ddPCR was performed using a Berry instrument, a commercial probe, and a reaction system. The reaction system comprises the following components: 10ul ddPCR supermix for probes (no dUTP),1ul ERBB2 probe, 1ul EIF2C1 probe, and 20ng of test DNA. After the reaction system was prepared, chyle was generated according to the method of use of the apparatus, and chyle was aspirated to a 96-well PCR plate and Heat-sealed with a pierce Foil Heat Seal. The conditions of the PCR reaction were: activating enzyme at 95 deg.C for 8 min; melting at 94 ℃ for 30s, annealing and extending at 55 ℃ for 1min, and performing 39 cycles; inactivating enzyme at 98 deg.C for 10 min; and preserving heat at 4 ℃. After PCR amplification, the berle droplet reader reads the number of fluorescent droplets in each reaction well. Ultrapure water was used as a negative control for each batch of reactions. Three duplicate wells are made for each DNA to be tested as a technical duplicate.
The copy number of ERBB2 of BEAS-2B was 2.16, while the copy number of ERBB2 of NCI-N87 was 125.
4. Preparation of ERBB2 copy number gradient samples
The two cell lines were mixed in mass percent as in table one, plasma cfDNA samples were simulated, 10 samples were prepared, and the expected ERBB2 copy number was calculated.
Watch 1
|
NCI-N87
|
BEAS-2B
|
Anticipating CNV
|
sample1
|
100%
|
0.000%
|
125
|
sample 2
|
20%
|
80.000%
|
26.6
|
sample 3
|
10%
|
90.000%
|
14.3
|
sample 4
|
5%
|
95.000%
|
8.15
|
sample 5
|
2%
|
98.000%
|
4.46
|
sample 6
|
1%
|
99.000%
|
3.23
|
sample 7
|
0.50%
|
99.500%
|
2.615
|
sample 8
|
0.20%
|
99.800%
|
2.246
|
sample 9
|
0.10%
|
99.900%
|
2.123
|
sample 10
|
0%
|
100%
|
2 |
5. ddPCR of ERBB2 copy number gradient samples
The ddPCR experiments of 10 samples in the above table were performed in the same manner, and the experimental results obtained can be repeated as shown in Table II, using the cell line DNA samples mixed in mass ratio as the template, adding 20ng of sample DNA to each reaction system, and repeating three wells for each sample.
Watch two
6. Library construction, capture and sequencing of ERBB2 copy number gradient samples
The 10 sample DNAs are firstly broken into DNA fragments of about 200bp by covaris ultrasound, after qubit fluorescence quantification, 20ng of the fragmented DNA is taken as an initial amount to construct a library by adopting KAPA hyper prediction kit (Roche corporation), and the library is obtained by end repair, poly A addition at the 3' end, connection sequencing joint, unbiased amplification of 9 cycles and purification. The details are as follows:
6.1, flush ends and addition of A at the 3' end, the reaction system is as shown in Table three:
watch III
Reagent
|
Volume of
|
Fragmented,double-stranded DNA
|
50μL
|
End Repair&A-Tailing Buffer
|
7μL
|
End Repair&A-Tailing Enzyme Mix
|
3μL
|
Total volume
|
60μL |
The Buffer and enzyme should be mixed in the EP tube in advance, and vortex-mixed with DNA and then reacted as follows. The reaction steps are shown in table four:
watch four
This step of operation sets the PCR tube lid temperature to 85 ℃ instead of 105 ℃. If the next experiment is performed immediately after the end of the run, the end temperature should be set at 20 ℃ instead of 4 ℃.
6.2, connecting the linker, and adopting 7.5uM linker for 20ng of DNA according to the guidance of the library construction instruction. The reaction system was prepared as per table five: watch five
Reagent
|
Volume of
|
Reaction product
|
60μL
|
Volume of joint
|
5μL
|
Ultrapure water
|
5μL
|
Connection Buffer
|
30μL
|
DNA ligase
|
10μL
|
Total volume
|
110μL |
Buffer and enzyme were mixed well in an EP tube, vortexed, centrifuged, and incubated at 20 ℃ for 15 minutes.
6.3, purification after connection: 88ul of Agencour AMPure XP purified magnetic beads are added into the reaction system (110ul) in the last step, and the mixture is fully vortexed, shaken and slightly centrifuged. Adsorbing for 5-15 minutes at room temperature, fully combining the DNA and the magnetic beads, placing the EP tube to a magnetic rack for adsorbing, clarifying the liquid, slowly absorbing the supernatant in the EP tube, and discarding. The EP tubes were incubated with 200. mu.L of 80% ethanol for 30 seconds and the ethanol was slowly aspirated and discarded. The ethanol washing of the magnetic beads was repeated once. The EP tube is dried for 3-5 minutes at room temperature until the ethanol is completely volatilized. The EP tube was removed from the magnetic frame, 22. mu.L of ultrapure water was added, vortexed, centrifuged slightly and incubated at room temperature for 2 minutes to elute DNA, the EP tube was placed on the magnetic frame and adsorbed until the liquid was clear, the supernatant was transferred to a new EP tube, 1. mu.L of the supernatant was taken to measure the DNA concentration, and the remainder was amplified.
6.4, PCR amplification: the PCR system was prepared as in Table six.
Watch six
Reagent
|
Volume of
|
KAPA HiFi HotStart ReadyMix(2X)
|
25μL
|
KAPA Library Amplification Primer Mix
|
5μL
|
(10X)*
|
|
Linker ligation libraries
|
20μL
|
Total volume
|
50L |
After sufficient shaking and rapid centrifugation, PCR reactions were performed according to the seven conditions in Table.
Watch seven
Step (ii) of
|
Temperature of
|
Time of day
|
Number of cycles
|
Pre-denaturation
|
98℃
|
45Sec
|
1
|
Denaturation of the material
|
98℃
|
15Sec
|
|
Annealing
|
60℃
|
30Sec
|
9
|
Extension of
|
72℃
|
30Sec
|
|
Finally extend
|
72℃
|
1min
|
1
|
Preservation of
|
4℃
|
∞
|
1 |
6.5, purification after amplification: adding Agencour AMPure XP purified magnetic beads (50ul) with the same volume as that of a PCR reaction system, fully whirling and oscillating, slightly centrifuging, and adsorbing at room temperature for 5-15 minutes to fully combine the DNA with the magnetic beads. The EP tube is placed on a magnetic frame to be adsorbed until the liquid is clear, and the supernatant in the EP tube is slowly sucked and discarded. 200 μ L of 80% ethanol was added to the EP tube and incubated for 30 seconds, and the ethanol in the EP tube was slowly aspirated and discarded. The ethanol washing of the magnetic beads was repeated once. The EP tube is dried for 3-5 minutes at room temperature until the ethanol is completely volatilized. The EP tube was removed from the magnetic stand, 52. mu.L of ultrapure water was added, vortexed, and slightly centrifuged. The DNA was eluted by incubation at room temperature for 2 minutes, the EP tube was placed on a magnetic rack and the solution was clarified, the supernatant was transferred to a new EP tube, and 1. mu.L of the supernatant was taken for DNA concentration determination.
6.6 before sequencing, a probe capture method is adopted, and a Roche NimbleGen probe is used for enriching and further amplifying a target interval containing the ERBB2 gene to obtain a library of the target interval. And performing computer sequencing after q-PCR quantification.
7. Processing offline fastq data into input files usable by each software
After downloading data, firstly processing the downloaded data from the fastq file into a bam file, wherein the specifically used software and steps are as follows:
7.1 comparison
Calling bwa-0.7.12mem to align each pair of fastq files as paired Reads to the hg19 human reference genome sequence, and generating an initial bam file without using other parameter options except-M parameters and the ID of a specified Reads Group;
7.2 ordering
The SortSam module of picard-2.1.0 is invoked to SORT the initial bam files by chromosome position with the parameter set to "SORT.
7.3 screening
And calling samtools-1.3 view to screen the sorted bam files, and adopting '-F0 x 900' as a parameter.
7.4 creating an index
And calling an index module of samtools-1.3 to establish an index for the finally generated bam file, and generating a bai file matched with the bam file after the marking is repeated.
7.5, repeat labeling
Calling a MarkDuplicates module of picard-2.1.0 to mark the repeated sequences in the screened bam file, filtering the repeated sequences during subsequent copy number analysis, and analyzing by using the data after duplication removal;
8. identification of copy number of ERBB2 copy number gradient samples with other software
8.1、CNVkit
30 blood cell samples with no abnormal copy number are selected as a reference sample group, and capture sequencing and preprocessing of sequencing data are carried out in the same manner. Taking the bam files of 30 samples and the bed files for recording the capture interval, the human reference genome sequence fasta files and the bam files of each ERBB2 copy number gradient sample as input files, adopting default parameters, respectively identifying the copy number of each ERBB2 copy number gradient sample, and obtaining the log2Ratio of each sample relative to the copy number of the reference sample group. Since the CNVkit has no gene-based identification result, the log2Ratio of the CNV fragment containing the ERBB2 gene is taken as the identification result. The log2Ratio is converted to a copy number. The results are shown in table eight:
table eight
Compared with the identification result of ddPCR, the identification result of sample 1 is lower, the results of samples 2-6 are relatively accurate, and the results of samples 6-9 are greatly different. For samples with copy numbers below 3, the CNVkit sensitivity was poor.
8.2、Control-FREE
In identifying copy number variation, sample 10 (i.e., the BEAS-2B cell line) was used as a reference sample, and the bam file of this sample, the bed file of the capture interval, the human reference genomic sequence fasta file, and the bam file of each ERBB2 copy number gradient sample were used as input files. And (3) adopting Control-FREE official recommendation parameters aiming at the captured sequencing data for all parameters, and respectively trying two modes of non-correction identification and correction identification according to purity. Since there was no gene-based identification, the CNV fragment containing ERBB2 gene was taken as the identification of ERBB 2. The results are shown in Table nine:
watch nine
9. Compared with the identification result of ddPCR, the results of samples 1-6 are relatively accurate without correction, and one of three replicates of sample7 is unidentified, and neither sample 8 nor sample 9 is identified as CNV. For samples with copy numbers below 3, Control-FREE was less sensitive. In addition, when the purity is estimated and the result is corrected according to the purity, the deviation of the identification of the purity by the Control-FREE from the actual condition is large, the difference between the corrected result and the result of ddPCR is larger than that in the case of no correction, and the reliability is very poor.
Identification of copy number of ERBB2 copy number gradient samples with the present invention
30 blood cell samples with no abnormal copy number are selected as reference human group samples, and capture sequencing and preprocessing of sequencing data are carried out on the blood cell samples in the same mode. The bam files of 30 samples, the bed files of the recording capture interval and the fasta files of the human reference genome sequence are used as input files, and the COV and GCS files of the reference human group are generated by adopting the reference module.
Inputting a bam file of ERBB2 copy number gradient samples and COV and GCS files of reference people groups, respectively identifying the copy number of the gene covered by the captured interval of each sample by adopting the autocall module of the invention, and obtaining the RZ, COV and GCS files of each sample and two final SCNA result files.
According to the SCNA result file, the identification result of the ERBB2 gene copy number of each concentration gradient sample in the invention is shown in the table ten:
watch ten
As can be seen from the comparison of Table eight, Table nine and Table ten, using the protocol provided by the present invention, it was possible to stably identify the ERBB2 copy number amplification in sample 1-9, and the results of sample 2-9 were all very accurate, except for a certain deviation between the results of sample 1 and ddPCR. Even for samples with very low copy numbers, the invention can stably identify the copy numbers, carry out the statistical test on GCS and give reliable and sensitive test results. Compared with other CNV identification algorithms, the method is more stable, has higher sensitivity and lower detection lower limit no matter the CNV identification is carried out on high and low copy numbers, and is more suitable for the CNV identification of ctDNA samples.
Furthermore, a scatter module can be used to perform chromosome partition copy number variation display on CNV identification results of sample 3 and sample 5, fig. 3 is a schematic diagram of CNV identification result display according to an embodiment of the present application, and it can be known from fig. 3 that the ordinate in the diagram represents copy number (copy number), the abscissa represents CNV sample, and CNV of ERBB2 can be clearly seen on chromosome 17, where ERBB2 copy number of sample 3 is about 7, and ERBB2 copy number of sample 5 is about 4.