Method and system for evaluating tumor heterogeneity
Technical Field
The present invention is in the field of biotechnology, and more particularly, the present invention relates to methods and systems for assessing tumor heterogeneity.
Background
Tumors are diseases caused by genetic changes. They usually involve several types of genetic variation, including single nucleotide variations (SNV), short insertions and deletions (indels), copy number variations (CNV), structural variations (SV), and the like. The accumulation of variants begins when the first variant arises in a tumor cell. As the tumor evolves, the deleterious variants that arise first create conditions that favor the retention of later variants, allowing tumor cells to continually acquire or strengthen capabilities such as inhibition of apoptosis, unlimited replication and immune escape, so that tumor cells accumulate variants much faster than normal cells. The tumor that finally forms is in fact a mixture of cell populations with different genetic characteristics: some cells carry only early variants, while others also carry later variants. Within the tumor, the proportion of cells carrying a given variant decreases the later that variant arose; variants that arose in the same cell at the same time are retained or lost together during tumor evolution and are therefore carried by the same proportion of cells. The complexity of the distribution of these variant cell proportions reflects tumor heterogeneity, which is the most direct and important manifestation of tumor complexity and is closely related to patient prognosis and survival time.
At present, tumor heterogeneity is mostly evaluated by multi-site sampling of the same patient combined with high-throughput sequencing: pathological samples are taken from several positions or several lesions of the patient's tissue, the variants in each sample are analyzed by high-throughput sequencing, and the shared variants and the cell proportions corresponding to the variants are described and tallied by level. This approach has the following disadvantages: (1) multi-site clinical sampling is biased, since it represents only the molecular variation of the sampled sites and not the complexity of the whole tumor; (2) it carries a certain clinical risk; (3) some types of metastases, such as pleural/peritoneal metastases, are difficult to obtain; (4) it is inaccurate, because heterogeneity analysis based on shared variants treats all shared variants as belonging to the same level without partitioning them further, which makes part of the analysis results inaccurate (Gerlinger, M. et al. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. The New England Journal of Medicine 366, 883, doi:10.1056/NEJMoa1113205 (2012); Hao, J.J. et al. Spatial intratumoral heterogeneity and temporal clonal evolution in esophageal squamous cell carcinoma. Nature Genetics, doi:10.1038/ng.3683 (2016)). In addition, there are methods that assess tumor heterogeneity solely from the copy number variation results of single-site sampling (Oesper, L., Satas, G. & Raphael, B.J. Quantifying tumor heterogeneity in whole-genome and whole-exome sequencing data. Bioinformatics 30, 3532-3540, doi:10.1093/bioinformatics/btu651 (2014)); besides sampling bias, these have the disadvantage of low population coverage, i.e., they are applicable only to cancer types or patient populations with a large amount of copy number variation.
Therefore, there is a need in the art for analytical methods to more accurately assess tumor heterogeneity to effectively aid in tumor prognosis and treatment planning.
Disclosure of Invention
In order to more accurately evaluate the heterogeneity of tumors, the invention provides a Molecular Clone (mClone) analysis method, which is based on the detection results of multiple types of variation in circulating tumor DNA (ctDNA) by high-throughput sequencing, divides all the variation into different Molecular clones, and evaluates the heterogeneity of tumors by using Molecular Clone levels. The method of the invention realizes tumor heterogeneity assessment based on ctDNA high-throughput variation detection, and effectively assists tumor prognosis and treatment scheme formulation.
Accordingly, in a first aspect, the present invention provides a method of assessing tumor heterogeneity, the method comprising:
1) sequencing (preferably high-throughput sequencing) cell-free DNA (cfDNA) of a patient to obtain sequencing information;
2) determining ctDNA variants from the sequencing information, calculating the variant allelic frequency from the sequencing information and the determined ctDNA variants, determining the actual total copy number of the region in which each variant is located, and calculating the proportion of ctDNA in cfDNA;
3) clustering the ctDNA variants according to the proportion determined in step 2) and the sequencing information and copy number information of the ctDNA variants, wherein each cluster obtained by clustering is determined to be a molecular clone, to obtain the clone level after clustering;
4) assessing the patient's tumor heterogeneity based on the clone level, wherein the more clone levels the patient has, the more heterogeneous the patient's tumor.
In a second aspect, the present invention provides a method of comparing tumor heterogeneity in different patients, the method comprising:
calculating the molecular clone level of each of said patients using steps 1) to 3) of the method of the first aspect of the invention, wherein, among the different patients, the more clone levels a patient has, the more heterogeneous that patient's tumor.
In one embodiment of the first or second aspect of the present invention, step 2) comprises:
2.1) obtaining, from said sequencing information, for each variant V_i (i = 1, …, n; said variant V being selected from SNV, indel and SV) the reference allelic sequencing depth (R_i) and the variant allelic sequencing depth (M_i), and calculating the variant allelic frequency (VAF_i),
Wherein the reference allelic sequencing depth (R_i) is the number of normal sequences in the sequencing result that do not carry the variant at the corresponding site, and the variant allelic sequencing depth (M_i) is the number of sequences in the sequencing result that carry the variant at the corresponding site;
2.2) using the CNV of the region in which variant V_i is located (CNV_i, i = 1, …, n), calculating the reference copy number (rCN_i) and the actual total copy number (CN_i) of that region,
If an accurate CNV detection method (e.g., SNP chip detection) is used in step 1), then for variants not located on the male sex chromosomes, allele-specific copy number variation on the two chromosomes (CNV_{i,major}, CNV_{i,minor}, with CNV_{i,major} ≥ CNV_{i,minor}) is obtained, and from this information the actual allele-specific copy numbers (CN_{i,major}, CN_{i,minor}) are obtained,
2.3) ctDNA proportion assessment: the proportion of ctDNA in cfDNA (CTF) is estimated as the maximum variant allelic frequency,
CTF = max(VAF_i), i = 1, …, n (Equation 5)
In one embodiment, in step 3) of the method of the invention, the variants are clustered by their predicted variant cell proportion, for example using PyClone software (v0.13, the latest version at the time of writing; this is the version referred to below unless otherwise specified).
In one embodiment, in step 3) of the method of the invention, the reference and variant allelic depth data (R_i, M_i) of variant V (SNV/indel/SV) are used together with CTF and CNV to evaluate the variant tumor cell proportion. In one embodiment, in step 3) of the method of the present invention, the proportion of all tumor cells made up by the cell population carrying each variant is predicted using PyClone software, and the software parameters can be set as follows: total tumor cell proportion (CTF) = the highest variant allelic frequency; number of iterations = 20000; other parameters at their defaults.
In one embodiment, in step 3) of the method of the invention, the n detected variants V (SNV/indel/SV) are clustered using PyClone, with default parameters except for the following:
(a) --tumour_contents CTF;
(b) --num_iters 20000;
(c) --prior total_copy_number; when allele-specific CNV data are used as input, this parameter is set to parental_copy_number;
(d) --density pyclone_beta_binomial; this parameter is set to pyclone_binomial when a whole genome sequencing technique with a lower sequencing depth is used in step 1);
(e) --in_files property.tsv, where property.tsv is a tab-delimited file; apart from the header row, each row contains the information of one variant V (SNV/indel/SV); the file comprises six columns, in order: mutation_id, ref_counts, var_counts, normal_cn, minor_cn and major_cn.
In a third aspect, the present invention provides a system for assessing tumor heterogeneity, the system comprising:
1) a module for sequencing (preferably high-throughput sequencing) cfDNA of a patient;
2) means for performing the steps of:
a) receiving sequencing information from module 1);
b) obtaining ctDNA variants in the cfDNA by comparison with the sequence information of a normal gene sequence;
c) calculating the variant allelic frequency from the sequencing information and the ctDNA variants, determining the actual total copy number or the actual allele-specific copy number of the region in which each variant is located, and calculating the proportion of ctDNA in cfDNA;
d) clustering the ctDNA variants according to the proportion determined in step c) and the sequencing information and copy number information of the ctDNA variants to determine molecular clones, and calculating the molecular clone level;
3) a result output module:
outputting a tumor heterogeneity result based on the patient's molecular clone level, wherein the more clone levels the patient has, the higher the patient's tumor heterogeneity.
In a fourth aspect, the present invention provides a system for comparing tumor heterogeneity in different patients, the system comprising:
1) a module for sequencing (preferably high-throughput sequencing) cfDNA of a patient;
2) means for performing the steps of:
a) receiving sequencing information from module 1);
b) obtaining ctDNA variants in the cfDNA by comparison with the sequence information of a normal gene sequence;
c) calculating the variant allelic frequency from the sequencing information and the ctDNA variants, determining the actual total copy number or the actual allele-specific copy number of the region in which each variant is located, and calculating the proportion of ctDNA in cfDNA;
d) clustering the ctDNA variants according to the proportion determined in step c) and the sequencing information and copy number information of the ctDNA variants to determine molecular clones, and calculating the molecular clone level;
3) a result output module:
comparing the molecular clone levels of the different patients and outputting the result of the comparison of their tumor heterogeneity, wherein the more clone levels a patient has, the higher that patient's tumor heterogeneity.
In one embodiment of the third or fourth aspect of the present invention, step c) of module 2) comprises the steps of:
c.1) obtaining, from said sequencing information, for each variant V_i (i = 1, …, n; said variant V being selected from SNV, indel and SV) the reference allelic sequencing depth (R_i) and the variant allelic sequencing depth (M_i), and calculating the variant allelic frequency (VAF_i),
Wherein the reference allelic sequencing depth (R_i) is the number of normal sequences in the sequencing result that do not carry the variant at the corresponding site, and the variant allelic sequencing depth (M_i) is the number of sequences in the sequencing result that carry the variant at the corresponding site;
c.2) using the CNV of the region in which variant V_i is located (CNV_i, i = 1, …, n), calculating the reference copy number (rCN_i) and the actual total copy number (CN_i) of that region,
If an accurate CNV detection method (e.g., SNP chip detection) is used in module 1), then for variants not located on the male sex chromosomes, allele-specific copy number variation on the two chromosomes (CNV_{i,major}, CNV_{i,minor}, with CNV_{i,major} ≥ CNV_{i,minor}) is obtained, and from this information the actual allele-specific copy numbers (CN_{i,major}, CN_{i,minor}) are obtained,
c.3) ctDNA proportion assessment: the proportion of ctDNA in cfDNA (CTF) is estimated as the maximum variant allelic frequency,
CTF = max(VAF_i), i = 1, …, n (Equation 5)
In one embodiment of the third or fourth aspect of the invention, module 2) is a computer readable medium carrying instructions for performing its steps. Module 3) is likewise a computer readable medium carrying instructions for performing its steps.
In one embodiment, in step d) of module 2) of the system of the invention, the reference and variant allelic depth data (R_i, M_i) of variant V (SNV/indel/SV) are used together with CTF and CNV to evaluate the variant tumor cell proportion. In one embodiment, in step d) of module 2) of the system of the invention, the proportion of all tumor cells made up by the cell population carrying each variant is predicted using PyClone software, and the software parameters can be set as follows: total tumor cell proportion (CTF) = the highest variant allelic frequency; number of iterations = 20000; other parameters at their defaults.
In one embodiment, in step d) of module 2) of the system of the invention, the variants are clustered by their predicted variant cell proportion, for example using PyClone software.
In one embodiment, in step d) of module 2) of the system of the invention, the n detected variants V (SNV/indel/SV) are clustered using PyClone, with default parameters except for the following:
(a) --tumour_contents CTF;
(b) --num_iters 20000;
(c) --prior total_copy_number; when allele-specific CNV data are used as input, this parameter is set to parental_copy_number;
(d) --density pyclone_beta_binomial; this parameter is set to pyclone_binomial when a whole genome sequencing technique with a lower sequencing depth is used in module 1);
(e) --in_files property.tsv, where property.tsv is a tab-delimited file; apart from the header row, each row contains the information of one variant V (SNV/indel/SV); the file comprises six columns, in order: mutation_id, ref_counts, var_counts, normal_cn, minor_cn and major_cn.
Based on tumor evolution theory and ctDNA high-throughput variant detection, the invention provides a heterogeneity assessment method that analyzes tumor variants at the clone level and is more consistent with the way tumors arise and develop.
The present invention finds that higher tumor heterogeneity is associated with a higher risk of tumor progression.
Compared with other analysis methods, the advantages of the invention are as follows:
1) comprehensiveness of information: compared with single-site or multi-site tissue sampling, which is subject to sampling bias, ctDNA reflects the molecular characteristics of the tumor more comprehensively;
2) sampling convenience: tissue samples usually come from surgery or puncture; compared with tissue sampling, especially multi-site tissue sampling, ctDNA detection requires only noninvasive blood sampling and is easier and more feasible clinically;
3) high accuracy: based on tumor evolution theory, heterogeneity is evaluated at the clone level rather than at the level of individual variants, making full use of the variant information by covering SNV, indel and SV and by retaining the specific frequency of each variant rather than a binary detected/undetected value.
Owing to these three points, the method and system of the invention can evaluate tumor heterogeneity more accurately and reasonably.
Drawings
The invention is illustrated by the following figures.
Fig. 1 is a flow chart of mClone analysis, with the steps marked by x being performed separately for each patient.
Fig. 2 shows the survival analysis; the left curve is the high-heterogeneity group and the right curve is the low-heterogeneity group.
Detailed Description
In the present invention, genes are named by their Official Symbol in NCBI Gene, and gene mutations and protein mutations are written in the notation commonly used in the art. For example, c.518T>C (p.V173A) denotes a missense mutation in which the T base at position 518 of the coding region is changed to a C base, changing the amino acid at position 173 from valine (V) to alanine (A); c.2235_2249delGGAATTAAGAGAAGC (p.E746_A750del) denotes a small deletion in which the bases GGAATTAAGAGAAGC at positions 2235 to 2249 of the coding region are deleted, resulting in the deletion of the 5 amino acids at positions 746 to 750; c.2663+1G>A denotes a splicing mutation in which the first base of the intron immediately adjacent to the 3' end of the exon containing position 2663 of the coding region is changed from G to A; c.7081C>T (p.Q2361*) denotes a nonsense mutation in which the C base at position 7081 of the coding region is changed to a T base, changing the glutamine (Q) at position 2361 to a stop codon.
In the present invention, the mathematical notation ceil refers to rounding up.
In the present invention, the cfDNA may be sample DNA from blood (plasma), saliva, pleural effusion, urine, feces, and the like.
In the present invention, the tumor is selected from, but not limited to: lung cancer, colorectal cancer, gastric cancer, breast cancer, kidney cancer, pancreatic cancer, ovarian cancer, endometrial cancer, thyroid cancer, cervical cancer, esophageal cancer, and liver cancer. In a specific embodiment, the tumor is lung cancer and the variation is a variation listed in table 1.
The flow chart of the method of the invention is shown in Fig. 1. For each tested patient, after ctDNA variants are detected by high-throughput sequencing, the proportion of ctDNA in cfDNA is estimated from the sequencing results of the ctDNA variants; this proportion, together with the detected variants, is used as input to cluster the variants, and each cluster obtained is determined to be a molecular clone; the clone level is then calculated, and finally the tumor heterogeneity of each patient is evaluated against the clone levels of all patients. The present inventors found that, for lung cancer, patients with a clone level above 3.5 have high heterogeneity and patients with a clone level below 3.5 have low heterogeneity.
The following is a description of the main technical process and principle of the method of the present invention:
1. high throughput sequencing to detect ctDNA variations
First, for a plurality of patients of the same cancer species selected as subjects, mutation detection and parameter calculation were performed for each patient:
1) sequencing cfDNA of a subject by high-throughput sequencing technologies such as whole genome, whole exome or probe capture sequencing and corresponding informatics analysis methods to obtain variations contained in the ctDNA, including SNV, indel, SV, CNV and the like;
2) obtaining, from the sequencing result of step 1), for each variant V_i (i = 1, …, n; variant V being selected from SNV, indel and SV) the reference allelic sequencing depth (R_i) and the variant allelic sequencing depth (M_i), and calculating the variant allelic frequency (VAF_i),
Wherein the reference allelic sequencing depth (R_i) is the number of normal sequences in the sequencing result that do not carry the variant at the corresponding site, and the variant allelic sequencing depth (M_i) is the number of sequences in the sequencing result that carry the variant at the corresponding site;
3) using the CNV of the region in which variant V_i is located (CNV_i, i = 1, …, n), calculating the reference copy number (rCN_i) and the actual total copy number (CN_i) of that region,
If an accurate CNV detection method (e.g., SNP chip detection) is used in 1), then for variants not located on the male sex chromosomes, allele-specific copy number variation on the two chromosomes (CNV_{i,major}, CNV_{i,minor}, with CNV_{i,major} ≥ CNV_{i,minor}) is obtained, and from this information the actual allele-specific copy numbers (CN_{i,major}, CN_{i,minor}) are obtained,
Accurate CNV detection refers to obtaining allele-specific copy number variation of both chromosomes, for example using SNP chip detection.
2. Variant clustering and clone-level computation
Then, for each patient, cluster analysis and clone level calculation are performed on the detected variants using the parameters obtained in 1:
1) ctDNA proportion assessment: the proportion of ctDNA in cfDNA (CTF) is estimated as the maximum variant allelic frequency,
CTF = max(VAF_i), i = 1, …, n (Equation 5)
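As a minimal sketch of the depth-to-CTF calculation, the following Python snippet computes each VAF_i and takes the maximum as CTF (Equation 5). It assumes the usual definition VAF_i = M_i / (R_i + M_i), which is not written out in the text here; the function names and the depth values are hypothetical.

```python
from typing import List, Tuple


def variant_allele_frequency(ref_depth: int, var_depth: int) -> float:
    """VAF_i = M_i / (R_i + M_i); this definition is an assumption."""
    total = ref_depth + var_depth
    if total == 0:
        raise ValueError("site has no sequencing coverage")
    return var_depth / total


def ctdna_fraction(depths: List[Tuple[int, int]]) -> float:
    """CTF = max(VAF_i) over all (R_i, M_i) pairs (Equation 5)."""
    if not depths:
        raise ValueError("at least one variant is required")
    return max(variant_allele_frequency(r, m) for r, m in depths)


# Hypothetical (R_i, M_i) pairs for three variants of one patient.
example_depths = [(900, 100), (750, 250), (980, 20)]
ctf = ctdna_fraction(example_depths)
print(ctf)  # 0.25
```

Taking the maximum means that a single variant with an inflated frequency, for example one in an amplified region, sets the CTF for the whole sample.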
2) Variant clustering:
for any variant (SNV/indel/SV), the source cells of cfDNA fall into three classes: normal cells (N), tumor cells not carrying the variant (C0) and tumor cells carrying the variant (C1). The ratio of tumor cells carrying the variant (C1) to all tumor cells (C1 + C0) is called the variant tumor cell proportion. If two or more variants have similar variant tumor cell proportions, they are taken to have arisen at similar times, are given the same cluster label, and are grouped into one cluster, i.e., one molecular clone.
Therefore, the following data are needed for variant clustering:
a) the reference and variant allelic depth data (R_i, M_i) of variant V (SNV/indel/SV): used together with CTF and CNV to evaluate the variant tumor cell proportion;
b) the reference copy number (rCN_i) and the actual total copy number (CN_i), or the actual allele-specific copy numbers (CN_{i,major}, CN_{i,minor}), from step 1.3): for a given variant, amplification or deletion of the variant allele's copy number causes the estimated variant tumor cell proportion to be spuriously high or low; adding copy number variation data therefore allows the genotype of the C1 cells to be determined more accurately, the variant frequency to be corrected, and the variant tumor cell proportion to be estimated correctly;
c) CTF: used to estimate the composition of the cells from which the cfDNA originates, i.e., the proportion of tumor cells (C0 + C1) among all cells (N + C0 + C1); setting this parameter accurately helps to calculate correctly the ratio of reference alleles from normal cells to reference alleles from tumor cells.
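The two ratios defined above can be made concrete with a small Python sketch. The cell counts are hypothetical and are used only to illustrate the definitions; in practice the classes N, C0 and C1 are not observed directly, and the clustering software infers the variant tumor cell proportion from read depths, copy number and CTF.

```python
from typing import Tuple


def cell_fractions(n_normal: float, n_c0: float, n_c1: float) -> Tuple[float, float]:
    """Return (CTF, variant tumor cell proportion) from cell counts.

    CTF                            = (C0 + C1) / (N + C0 + C1)
    variant tumor cell proportion  = C1 / (C0 + C1)
    """
    tumor = n_c0 + n_c1
    total = n_normal + tumor
    if tumor <= 0:
        raise ValueError("no tumor cells present")
    return tumor / total, n_c1 / tumor


# Hypothetical composition: 60 normal cells, 10 tumor cells without the
# variant (C0) and 30 tumor cells with the variant (C1).
ctf, variant_proportion = cell_fractions(60, 10, 30)
print(ctf, variant_proportion)  # 0.4 0.75
```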
For example, the n detected variants V (SNV/indel/SV) are clustered using PyClone v0.13 (the latest version at the time of writing), with default parameters except for the following:
(a) --tumour_contents CTF;
(b) --num_iters 20000;
(c) --prior total_copy_number; when allele-specific CNV data are used as input, this parameter is set to parental_copy_number;
(d) --density pyclone_beta_binomial; this parameter is set to pyclone_binomial when a whole genome sequencing technique with a low sequencing depth is used in 1.1);
(e) --in_files property.tsv, where property.tsv is a tab-delimited file; apart from the header row, each row contains the information of one variant V (SNV/indel/SV); the file comprises six columns, in order: mutation_id, ref_counts, var_counts, normal_cn, minor_cn and major_cn.
PyClone (Roth, A. et al. PyClone: statistical inference of clonal population structure in cancer. Nature Methods 11, 396-398, doi:10.1038/nmeth.2883 (2014)) estimates, from the variant V (SNV/indel/SV) and CNV information, the proportion of all tumor cells made up by the cells carrying V_i, and assigns each variant a cluster label (C_i, i = 1, …, n, C_i ∈ {1, …, c}, where c is the number of clusters).
Other versions of PyClone or other variant clustering software may also be employed for variant clustering.
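The choice of non-default options described in this section can be sketched as a small helper that assembles the option list. The helper name and its arguments are hypothetical. The PyClone subcommand and any output location are deliberately left out because the text does not specify them, and the values parental_copy_number and pyclone_binomial are used here on the assumption that they are the PyClone v0.13 names for the allele-specific prior and the binomial density.

```python
from typing import List


def pyclone_options(in_file: str, ctf: float,
                    allele_specific_cn: bool = False,
                    low_depth_wgs: bool = False) -> List[str]:
    """Build the non-default PyClone options (a) to (e) for one patient."""
    # (c) the prior depends on whether allele-specific CNV data are available
    prior = "parental_copy_number" if allele_specific_cn else "total_copy_number"
    # (d) the density depends on the sequencing depth of the assay
    density = "pyclone_binomial" if low_depth_wgs else "pyclone_beta_binomial"
    return [
        "--in_files", in_file,            # (e) tab-delimited input file
        "--tumour_contents", str(ctf),    # (a) CTF = max(VAF_i)
        "--num_iters", "20000",           # (b) number of iterations
        "--prior", prior,
        "--density", density,
    ]


# Hypothetical file name and CTF value.
print(" ".join(pyclone_options("patient.tsv", 0.25)))
```

Keeping the two conditional settings in one place makes it harder to run a low-depth whole genome sample with the beta-binomial density by accident.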
3) Clone level calculation:
The clone level is the number c of molecular clones into which the variants are clustered. As a tumor develops, its evolutionary tree grows larger and more complex, the number of molecular clones increases and the clone level rises; the clone level is therefore closely related to tumor heterogeneity.
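Given the cluster labels C_i produced by the clustering step, the clone level is simply the number of distinct labels. A minimal Python sketch, with a hypothetical function name and hypothetical labels:

```python
from typing import Sequence


def clone_level(cluster_labels: Sequence[int]) -> int:
    """Clone level c = number of distinct cluster labels among C_1..C_n."""
    return len(set(cluster_labels))


# Hypothetical labels for five variants that fall into three clusters.
labels = [1, 1, 2, 3, 2]
print(clone_level(labels))  # 3
```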
3. Assessment of tumor heterogeneity
The median of the clone levels of all tested patients is taken as the threshold for classifying each patient's tumor heterogeneity as high or low: patients whose clone level is below this threshold have lower tumor heterogeneity, and patients whose clone level is above it have higher tumor heterogeneity.
Since genomic variation differs significantly between cancer types, comparing heterogeneity across cancer types with the method of the invention is not recommended.
In the method of the present invention, the steps other than the sequencing step may exist as instructions in a computer readable medium; a computing device can read these instructions and perform those steps as long as the sequencing result is input to the computing device. The computing device includes, but is not limited to, a computer, portable computer, tablet, smartphone, smart wristband, etc.
Examples
In this example, 10 lung cancer patients are taken as an example to explain the present invention. It should be noted that the examples are for illustrative purposes only and should not be construed as limiting the present application in any way.
List of variants detected by ctDNA high throughput sequencing
1) Variation V (SNV/indel/SV)
Between 2 and 8 variants were detected per patient in the 10 lung cancer patients; the list of detected variants V (SNV/indel/SV) is shown in Table 1.
Table 1. List of detected variants V (SNV/indel/SV)
2)CNV
Of the 10 lung cancer patients, EGFR amplification was detected only in S5, at a fold change of 1.73, as shown in Table 2. The actual total copy number corresponding to the EGFR deletion variant detected in S5 is therefore estimated to be 4.
Table 2. CNV detection list
Sample number | Gene | Copy number variation status | Copy number variation fold
S5 | EGFR | gain | 1.73
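The estimate of 4 for S5 is consistent with rounding up the product of the fold change and a diploid reference copy number. The Python sketch below assumes CN_i = ceil(fold × rCN_i) with rCN_i = 2; this formula is inferred from the example value and from the ceil notation defined earlier, and is not quoted from the text. The function name is hypothetical.

```python
import math


def actual_total_copy_number(fold: float, reference_cn: int = 2) -> int:
    """Assumed rule: CN_i = ceil(fold change x reference copy number)."""
    return math.ceil(fold * reference_cn)


# EGFR gain in S5 (Table 2): 1.73 x 2 = 3.46, rounded up to 4.
print(actual_total_copy_number(1.73))  # 4
```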
Statistics of mClone analysis results
PyClone clustering
The detected variants were clustered using PyClone v0.13, with default parameters except for the following:
a)--tumour_contents
b)--num_iters 20000
c)--prior total_copy_number
d)--density pyclone_beta_binomial
e)--in_files
Parameters a) and e) specify the CTF and the input file, respectively. The CTF and input file contents for each patient are shown in Table 3:
Table 3. PyClone input data
Here, mutation_id is the variant identifier, ref_counts is the reference allele count, var_counts is the variant allele count, normal_cn is the normal copy number, i.e. rCN_i, minor_cn is the minor copy number, i.e. CN_{i,minor}, and major_cn is the major copy number, i.e. CN_{i,major}.
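A tab-delimited input file with these six columns could be written as in the following Python sketch. The helper name and the two rows are hypothetical placeholders, since the contents of Table 3 are not reproduced here; only the column names and their order come from the text.

```python
import csv
import io
from typing import Dict, Iterable

COLUMNS = ["mutation_id", "ref_counts", "var_counts",
           "normal_cn", "minor_cn", "major_cn"]


def to_pyclone_tsv(rows: Iterable[Dict[str, object]]) -> str:
    """Serialize variant rows as tab-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical rows: one variant in a copy-neutral region and one in a
# region with a total copy number of 4.
rows = [
    {"mutation_id": "var1", "ref_counts": 900, "var_counts": 100,
     "normal_cn": 2, "minor_cn": 1, "major_cn": 1},
    {"mutation_id": "var2", "ref_counts": 750, "var_counts": 250,
     "normal_cn": 2, "minor_cn": 1, "major_cn": 3},
]
tsv_text = to_pyclone_tsv(rows)
print(tsv_text)
```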
The mClone analysis results obtained with the method of the present invention and the subsequent follow-up data are shown in Table 4. The median of all clone levels, i.e., the cut-off, is 3.5; patients with a clone level greater than 3.5 have high heterogeneity and patients with a clone level less than 3.5 have low heterogeneity.
Table 4. mClone analysis results compared with clinical information
Sample number | Clone level | Tumor heterogeneity | Progression-free survival (weeks)
S1 | 2 | Low | 54
S2 | 1 | Low | 49
S3 | 4 | High | 11
S4 | 4 | High | 27
S5 | 6 | High | 9
S6 | 6 | High | 17
S7 | 3 | Low | 17
S8 | 3 | Low | 34
S9 | 5 | High | 22
S10 | 2 | Low | 36
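The cut-off and the grouping in Table 4 can be reproduced from the tabulated clone levels with a short Python sketch. The per-group median progression-free survival printed at the end is a simple descriptive summary added for illustration; it is not the survival analysis of Fig. 2, which would need censoring information that the table does not give.

```python
from statistics import median

# (sample, clone level, progression-free survival in weeks), from Table 4.
TABLE_4 = [
    ("S1", 2, 54), ("S2", 1, 49), ("S3", 4, 11), ("S4", 4, 27),
    ("S5", 6, 9), ("S6", 6, 17), ("S7", 3, 17), ("S8", 3, 34),
    ("S9", 5, 22), ("S10", 2, 36),
]

# Threshold = median clone level over all tested patients.
cutoff = median(level for _, level, _ in TABLE_4)

high = [row for row in TABLE_4 if row[1] > cutoff]
low = [row for row in TABLE_4 if row[1] < cutoff]

print(cutoff)                           # 3.5
print([sample for sample, _, _ in high])  # ['S3', 'S4', 'S5', 'S6', 'S9']

# Descriptive only: median PFS (weeks) in each group.
median_pfs_high = median(pfs for _, _, pfs in high)
median_pfs_low = median(pfs for _, _, pfs in low)
print(median_pfs_high, median_pfs_low)  # 17 36
```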
Survival analysis of this cohort (see Fig. 2) showed that tumor heterogeneity assessed by clone level was a significant predictor of patient prognosis (progression-free survival) (p = 0.044), with higher tumor heterogeneity associated with a higher risk of progression (hazard ratio 9.386). These results verify the effectiveness and accuracy of assessing tumor heterogeneity with the mClone analysis technique.
The molecular clone level obtained by the mClone analysis method can be used to evaluate tumor heterogeneity. Tumor heterogeneity reflects the developmental stage of the tumor: the greater the heterogeneity, the later the stage of the patient's tumor and the more likely the tumor is to progress in the near term. The above experimental data confirm this.