CN112863594A - Tumor purity estimation method and device - Google Patents
- Publication number
- CN112863594A CN202110350647.7A
- Authority
- CN
- China
- Prior art keywords
- variation
- copy number
- region
- single point
- point variation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
A tumor purity estimation method and device, which can be used in the technical field of data science, the technical field of finance or other fields. The method comprises the following steps: obtaining a detection result file from a chromosome variation detection tool, and obtaining variation information data from the detection result file; clustering the normal region single-point variation site set to obtain a clustering result, determining the average read-pair count of the normal region single-point variations according to the clustering result, and determining the clone copy numbers of the copy number variation region corresponding to the multiple clones in the tumor sample; and determining a correction parameter set for each variation segment in the copy number variation region according to the clone copy numbers of the copy number variation region, correcting the differential read-pair counts of the single-point variations in the copy number variation region by using the correction parameter set, and estimating the tumor purity according to the corrected differential read-pair counts. The method effectively corrects the abnormal read-pair counts of single-point variations and achieves accurate estimation of tumor purity at different coverages and different tumor purities.
Description
Technical Field
The invention relates to the technical field of data science, and in particular to a method and a device for estimating tumor purity.
Background
Tumors arise from the accumulation of genomic variations in normal cells, and malignant tumors are one of the major causes of death. Tumor tissue is very complex in composition: it contains not only tumor cells but also important non-cancer cells such as immune cells and fibroblasts, as well as nutrients, chemokines and other factors that promote or inhibit tumor growth. The tumor cells in tumor tissue form a heterogeneous population of polyclonal cells with different genetic mutations. They can be broadly divided into primary variant cells and cells resulting from multiple rounds of selection and clonal expansion, and at the nucleotide level it is unlikely that any two tumor cells are identical. The clone that exhibits the most ancestral features, as estimated from sequencing data, is called the initial clone; its genome accumulates multiple chromosomal variations, including structural variations and single-point variations. Tumor cells in a tumor sample are assumed to satisfy the infinite sites hypothesis during evolution, i.e., a site mutates at most once over the whole course of evolution, and a mutated site cannot revert. Subclones that arise through mutation after early and recent clonal expansion are called child clones. A child clone inherits the chromosomal variations of its parent and also acquires variations that are absent from the parent and advantageous to itself, so different subclones contain different chromosomal variations.
Copy Number Variations (CNVs) are an important class of signature aberrations in the tumor genome and have been studied extensively in order to understand cancer mutations and clonal evolution. Copy number variation biases the genomic information of tumor tissue: the number of read pairs at a single-point variation site on a gene fragment with copy number variation is multiplied compared with a fragment without copy number variation. Moreover, because the tumor is a heterogeneous cell population, the copy number variation differs between clone structures. When a copy number variation appears in a parent clone, a child clone not only inherits the variation of the parent clone but can also generate its own copy number variation. When a copy number variation occurs in a child clone, it may multiply the chromosomal variations that the child clone inherited from the parent clone, and may also multiply the advantageous chromosomal variations that the child clone generated itself.
Tumor purity estimation refers to accurately assessing the proportion of tumor cells from the sequencing data of mixed tumor tissue. The task is extremely complex and differs across cancer types, sequencing types and sampled tissues. Tumor purity can be estimated by pathologists through visual or image-based analysis of tumor cells. With the development of genomic techniques, it can also be inferred by computational methods, such as statistical methods based on linear models, maximum likelihood models and Bayesian methods, using different types of genomic information such as gene expression, copy number variation, somatic variation and DNA methylation. Depending on the data used, tumor purity estimation methods fall roughly into two categories: the first is based on SNP array data, and the second is based on sequencing data. Methods of the first category, including ABSOLUTE and ASCAT, detect chromosomal abnormalities in cells (copy number abnormality, heterozygosity) from the high-throughput data produced by single nucleotide polymorphism microarray experiments, and thereby estimate the purity of tumor tissue. Methods of the second category use cancer sequencing data and include PurityEst, AbsCN-seq, CNAnorm, THetA and PurBayes, among others.
However, neither category performs well, and the following problems remain. First, methods of the first category work only in the single-subclone case and are therefore poorly suited to analysis when multiple subclones are present in the tumor. Second, methods of the second category have limited effect on tumor purity estimation when multiple subclones coexist. Third, when copy number variation coexists with multi-level subclones, both categories perform poorly and cannot accurately estimate the purity of tumor cells.
Disclosure of Invention
In view of the problems in the prior art, embodiments of the present invention provide a method and an apparatus for estimating tumor purity, which achieve accurate estimation of tumor sample purity in the presence of copy number variation.
In order to achieve the above object, an embodiment of the present invention provides a tumor purity estimation method, including:
obtaining, from a chromosome variation detection tool, a detection result file produced by examining a normal sample and a tumor sample, and obtaining variation information data from the detection result file; the variation information data comprises a normal region single-point variation site set of the tumor sample and a copy number variation region single-point variation site set of the tumor sample;
clustering the normal region single-point variation site set to obtain a clustering result, and determining the average read-pair count of the normal region single-point variations according to the clustering result;
determining the clone copy numbers of the copy number variation region corresponding to each of the multiple clones in the tumor sample according to the average read-pair count of the normal region single-point variations and the copy number variation region single-point variation site set;
and determining a correction parameter set for each variation segment in the copy number variation region according to the clone copy numbers of the copy number variation region, correcting, by using the correction parameter set, the differential read-pair counts of the single-point variations in the copy number variation region single-point variation site set, and estimating the tumor purity according to the corrected differential read-pair counts.
Optionally, in an embodiment of the present invention, obtaining the variation information data from the detection result file includes:
extracting a single-point variation set of a normal sample, a single-point variation set of a tumor sample and a copy number variation set of the tumor sample from the detection result file;
determining a copy number variation region single point variation site set of a tumor sample according to the initial position, the termination position and the length of each copy number variation in the copy number variation set;
and determining a normal region single point variation site set of the tumor sample according to the single point variation set of the normal sample and the single point variation set of the tumor sample.
Optionally, in an embodiment of the present invention, clustering the normal region single-point variation site set to obtain a clustering result includes:
extracting the characteristics of the normal region single point variation site set to obtain the characteristics of each single point variation site in the normal region single point variation site set;
and clustering the single point variation sites in the normal region single point variation site set according to the characteristics of the single point variation sites to obtain a clustering result.
Optionally, in an embodiment of the present invention, determining the average read-pair count of the normal region single-point variations according to the clustering result includes:
determining, according to the clustering result, the average value of the features of the single-point variation sites in each class, and taking the average value as the average read-pair count of the normal region single-point variations.
Optionally, in an embodiment of the present invention, determining a correction parameter set for each variation segment in the copy number variation region according to the clone copy numbers of the copy number variation region includes:
determining the single-point variation set corresponding to each variation segment in the copy number variation region according to the clone copy numbers of the copy number variation region;
determining, according to the read-pair counts of the single-point variations in the copy number variation region single-point variation site set, multiple groups of ratio values for the read-pair counts of the single-point variations in the single-point variation set corresponding to each variation segment, to obtain multiple ratio value sets;
determining, for each ratio value set, the sum of the estimated allele frequency errors of the corresponding single-point variations;
and taking the ratio value set for which the sum of the estimated allele frequency errors is smallest as the correction parameter set.
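The selection rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the text does not state how the estimated allele frequency error is computed, so the function name `select_correction_set`, the reference allele frequency `reference_af`, the way the total is adjusted, and the use of an absolute error are all assumptions made for the example.

```python
def select_correction_set(variant_pairs, total_pairs, reference_af, candidate_ratio_sets):
    """Return the candidate ratio set with the smallest summed allele frequency error.

    variant_pairs / total_pairs: observed read-pair counts per site in one segment.
    reference_af: hypothetical expected allele frequency used to score a candidate.
    candidate_ratio_sets: list of ratio lists, one ratio per site.
    """
    best_set, best_error = None, float("inf")
    for ratios in candidate_ratio_sets:
        error = 0.0
        for v, t, r in zip(variant_pairs, total_pairs, ratios):
            corrected_v = v / r
            # Assumption: the duplicated variant pairs are also removed from the total.
            corrected_t = t - (v - corrected_v)
            error += abs(corrected_v / corrected_t - reference_af)
        if error < best_error:
            best_set, best_error = ratios, error
    return best_set, best_error


# Two sites whose variant read pairs were doubled by a copy number gain.
best_set, best_error = select_correction_set(
    variant_pairs=[20, 20],
    total_pairs=[50, 50],
    reference_af=0.25,
    candidate_ratio_sets=[[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
)
```

In this toy input the ratio set `[2.0, 2.0]` restores each site to 10 variant pairs out of 40, which matches the assumed reference frequency, so it is the set that would be kept as the correction parameter set.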
The embodiment of the invention also provides a tumor purity estimation device, which comprises:
a data acquisition module, configured to obtain, from a chromosome variation detection tool, a detection result file produced by examining a normal sample and a tumor sample, and to obtain variation information data from the detection result file; the variation information data comprises a normal region single-point variation site set of the tumor sample and a copy number variation region single-point variation site set of the tumor sample;
an average read-pair count module, configured to cluster the normal region single-point variation site set to obtain a clustering result, and to determine the average read-pair count of the normal region single-point variations according to the clustering result;
a clone copy number module, configured to determine, according to the average read-pair count of the normal region single-point variations and the copy number variation region single-point variation site set, the clone copy numbers of the copy number variation region corresponding to each of the multiple clones in the tumor sample;
and a read-pair count correction module, configured to determine a correction parameter set for each variation segment in the copy number variation region according to the clone copy numbers of the copy number variation region, to correct, by using the correction parameter set, the differential read-pair counts of the single-point variations in the copy number variation region single-point variation site set, and to estimate the tumor purity according to the corrected differential read-pair counts.
Optionally, in an embodiment of the present invention, the data acquisition module includes:
a data extraction unit, configured to extract a single-point variation set of a normal sample, a single-point variation set of a tumor sample, and a copy number variation set of the tumor sample from the detection result file;
a copy number variation region unit, configured to determine a copy number variation region single point variation site set of a tumor sample according to an initial position, a termination position, and a length of each copy number variation in the copy number variation set;
and the normal region unit is used for determining the normal region single point variation site set of the tumor sample according to the single point variation set of the normal sample and the single point variation set of the tumor sample.
Optionally, in an embodiment of the present invention, the average read-pair count module includes:
the characteristic extraction unit is used for extracting the characteristics of the normal region single point variation site set to obtain the characteristics of each single point variation site in the normal region single point variation site set;
and the clustering result unit is used for clustering each single point variation locus in the normal region single point variation locus set according to the characteristics of each single point variation locus to obtain a clustering result.
Optionally, in an embodiment of the present invention, the average read-pair count module is further configured to determine, according to the clustering result, the average value of the features of the single-point variation sites in each class, and to take the average value as the average read-pair count of the normal region single-point variations.
Optionally, in an embodiment of the present invention, the read-pair count correction module includes:
a single-point variation set unit, configured to determine a single-point variation set corresponding to each variation segment in the copy number variation region according to the clone copy number in the copy number variation region;
a ratio value set unit, configured to determine, according to the read-pair counts of the single-point variations in the copy number variation region single-point variation site set, multiple groups of ratio values for the read-pair counts of the single-point variations in the single-point variation set corresponding to each variation segment, to obtain multiple ratio value sets;
an error summation unit, configured to determine, for each ratio value set, the sum of the estimated allele frequency errors of the corresponding single-point variations;
and a correction parameter set unit, configured to take the ratio value set for which the sum of the estimated allele frequency errors is smallest as the correction parameter set.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the above method when executing the program.
The present invention also provides a computer-readable storage medium storing a computer program for executing the above method.
The method corrects the increase in read-pair counts at single-point variation sites caused by copy number variation and uses the corrected data for tumor purity estimation. It thereby effectively corrects the abnormal read-pair counts of single-point variations and achieves accurate estimation of tumor purity at different coverages and different tumor purities.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for estimating tumor purity according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating obtaining variant information data according to an embodiment of the present invention;
FIG. 3 is a flow chart of generating clustering results in an embodiment of the present invention;
FIG. 4 is a flow chart of generating a correction parameter set in an embodiment of the present invention;
FIG. 5 is a flow chart of a method of tumor purity estimation in an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an apparatus for estimating tumor purity according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a data acquisition module according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an average read-pair count module according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a read-pair count correction module according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention provide a tumor purity estimation method and device. They can be used in the financial field or in any other field; the application field of the method and device is not limited.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present invention is based on the following points, which are generally accepted in academia:
1. Current common detection algorithms align the reads generated by second-generation sequencing technology to a reference sequence to obtain read data information, and from it determine the different types of chromosome variation together with information such as variation size and position;
2. Copy number variation biases the read count information: the number of read pairs at a single-point variation site in a copy number variation region is multiplied compared with a site in a normal region, which affects the accuracy of tumor purity estimation.
Fig. 1 is a flowchart illustrating a tumor purity estimation method according to an embodiment of the present invention, wherein the tumor purity estimation method according to an embodiment of the present invention is executed by a computer. The method shown in the figure comprises the following steps:
Step S1, obtaining, from a chromosome variation detection tool, a detection result file produced by examining a normal sample and a tumor sample, and obtaining variation information data from the detection result file; the variation information data comprises a normal region single-point variation site set of the tumor sample and a copy number variation region single-point variation site set of the tumor sample.
An existing variation detection tool is run on tumor samples of different purities to detect copy number variations and single-point variations from second-generation paired-end sequencing data. Because tumor tissue is heterogeneous, it comprises the initial clone and the subclones produced by multiple rounds of selection and expansion, so each tumor sample comprises normal cells and multiple subclones. This yields chromosome variation information data at different purities.
Specifically, the genetic single-point variation set of the normal sample, the single-point variation set of the tumor sample and the copy number variation set of the tumor sample are extracted from the detection result file. According to the start position, end position and length of each copy number variation of the tumor sample, the set of abnormal single-point variation sites lying in the copy number variation region is extracted from all single-point variations; the remaining single-point variation sites form the normal region single-point variation site set.
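The partition described above amounts to an interval-membership test. The following Python sketch illustrates it under simplifying assumptions that are not in the text: positions are plain integers on a single chromosome, each copy number variation is given as a hypothetical `(start, end)` pair with inclusive bounds, and the segments do not overlap. The function name `split_sites_by_cnv` is illustrative only.

```python
from bisect import bisect_right


def split_sites_by_cnv(snv_positions, cnv_segments):
    """Split SNV positions into (sites inside a CNV segment, remaining normal sites).

    cnv_segments: list of (start, end) tuples, inclusive, assumed non-overlapping.
    """
    intervals = sorted(cnv_segments)
    starts = [start for start, _ in intervals]
    in_cnv, normal = [], []
    for pos in snv_positions:
        # Index of the last segment whose start is <= pos, or -1 if there is none.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos <= intervals[i][1]:
            in_cnv.append(pos)
        else:
            normal.append(pos)
    return in_cnv, normal


cnv_sites, normal_sites = split_sites_by_cnv(
    [50, 120, 180, 260, 400], [(100, 199), (250, 299)]
)
```

Sorting the segments once and using binary search keeps the lookup logarithmic per site, which matters when a sample carries many single-point variations.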
Step S2, clustering the normal region single-point variation site set to obtain a clustering result, and determining the average read-pair count of the normal region single-point variations according to the clustering result.
When a normal sample and a pure tumor sample are mixed, the genetic variations of the normal sample provide no effective information for distinguishing normal cells from pure tumor cells, so the variation sites used for estimating tumor purity are the somatic variations of the tumor sample. The invention clusters with a Gaussian mixture model (GMM). Specifically, clustering divides the single-point variations of the normal region of the tumor into four combined genotypes, i.e., the Gaussian mixture model groups them into four classes.
Specifically, the normal region single-point variation site set of the tumor sample is obtained by comparing the single-point variation sets detected in the normal sample and in the tumor sample, and the features of each single-point variation site are extracted. According to the resulting feature set, all single-point variation sites in the normal region are clustered into four classes, and the average value of the single-point variation features in each class is computed from the clustering result, giving the average read-pair counts of the normal region single-point variations for the four combined genotypes.
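As a rough illustration of this step, the sketch below fits a one-dimensional Gaussian mixture with four components by expectation-maximization and then averages the feature within each class. It is a minimal standard-library sketch, not the patented implementation: the text does not specify which features are extracted, so a single feature (the read-pair count of each site) is assumed, and the function names, the quantile initialisation and the iteration count are all choices made for the example.

```python
import math


def fit_gmm_1d(values, k=4, iterations=50):
    """Fit a 1-D Gaussian mixture by EM; return (component means, hard labels)."""
    data = sorted(values)
    n = len(data)
    # Deterministic start: means at evenly spaced quantiles of the data.
    means = [float(data[int((j + 0.5) * n / k)]) for j in range(k)]
    spread = (data[-1] - data[0]) / (2 * k) or 1.0
    variances = [spread ** 2] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [
                weights[j]
                * math.exp(-((x - means[j]) ** 2) / (2 * variances[j]))
                / math.sqrt(2 * math.pi * variances[j])
                for j in range(k)
            ]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-12
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a degenerate component
            weights[j] = nj / n
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return means, labels


def cluster_averages(values, labels, k=4):
    """Average feature value per class (None for an empty class)."""
    sums, counts = [0.0] * k, [0] * k
    for x, label in zip(values, labels):
        sums[label] += x
        counts[label] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Toy read-pair counts drawn around four levels.
read_pairs = [10, 20, 30, 40, 11, 21, 31, 41, 9, 19, 29, 39]
means, labels = fit_gmm_1d(read_pairs, k=4)
averages = cluster_averages(read_pairs, labels, k=4)
```

With real data the features would usually be multi-dimensional and a library implementation of Gaussian mixtures would be the more practical choice; the hand-written EM loop here only serves to make the clustering and per-class averaging explicit.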
Step S3, determining the clone copy numbers of the copy number variation region corresponding to each of the multiple clones in the tumor sample according to the average read-pair count of the normal region single-point variations and the copy number variation region single-point variation site set.
The tumor sample is assumed to contain normal cells and two subclones, S1 and S2. For each single-point variation site in the normal region of the tumor sample, the total number of read pairs aligned to the reference genome sequence consists of the read pairs from the normal cells, from tumor clone S1 and from tumor clone S2. For each single-point variation site in the copy number variation region, the total number of read pairs aligned to the reference genome sequence is larger than that of a normal region site, the increase being the duplicated read pairs caused by the copy number variation. The copy number variation segment set contains multiple copy number variation segments with multiple copy numbers; within each segment, every single-point variation site has the same copy number status, that is, the same copy numbers for subclones S1 and S2. The copy numbers of the subclones are determined in combination with the average read-pair counts of the normal region single-point variations. The copy numbers are solved with multiple linear regression analysis, which gives the clone copy numbers of the copy number variation region for the clones S1 and S2 in the tumor sample.
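The text names multiple linear regression but does not give the regression model. Purely as an assumed illustration, the sketch below treats the observed read-pair count at each site of one segment as `c1 * per_copy_s1 + c2 * per_copy_s2`, where `per_copy_s1` and `per_copy_s2` are hypothetical per-copy contributions of subclones S1 and S2 (for instance derived from the normal region averages), and solves for `c1` and `c2` by ordinary least squares without an intercept.

```python
def solve_two_clone_copy_numbers(observed, per_copy_s1, per_copy_s2):
    """Least-squares fit of observed ~ c1 * per_copy_s1 + c2 * per_copy_s2.

    Solves the 2x2 normal equations directly with Cramer's rule.
    """
    s11 = sum(a * a for a in per_copy_s1)
    s22 = sum(b * b for b in per_copy_s2)
    s12 = sum(a * b for a, b in zip(per_copy_s1, per_copy_s2))
    s1y = sum(a * y for a, y in zip(per_copy_s1, observed))
    s2y = sum(b * y for b, y in zip(per_copy_s2, observed))
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("predictors are collinear; copy numbers are not identifiable")
    c1 = (s1y * s22 - s2y * s12) / det
    c2 = (s2y * s11 - s1y * s12) / det
    return c1, c2


# Toy segment generated with copy numbers 3 (S1) and 2 (S2).
per_copy_s1 = [5, 6, 4, 7]
per_copy_s2 = [3, 2, 4, 1]
observed = [3 * a + 2 * b for a, b in zip(per_copy_s1, per_copy_s2)]
c1, c2 = solve_two_clone_copy_numbers(observed, per_copy_s1, per_copy_s2)
```

On noisy data the fitted values would not be whole numbers; rounding them to the nearest integer copy number is one plausible follow-up, though the text does not say how that is handled.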
Step S4, determining a correction parameter set of each variation segment in the copy number variation region according to the copy number of clone in the copy number variation region, correcting the copy number variation region single point variation pair number in the copy number variation region single point variation site set by using the correction parameter set, and estimating tumor purity according to the corrected copy number variation region single point variation pair number.
The single-point variation copy number set of each copy number variation section can be determined according to the clone copy number of the section, the single-point variation sets of different copy number variation sections are extracted, the correction parameter set corresponding to the single-point variation in each copy number variation section is determined, and abnormal read pair number is corrected.
Further, after the read number correction is completed, the corrected single-point variant loci are mixed with variant loci in a normal region, allele frequency is calculated according to the read number of all loci, data characteristics including the read number and the allele frequency are integrated, and the purity of the polyclonal tumor sample considering copy number variation is estimated by using the existing tumor purity estimation method EMpurity.
As an embodiment of the present invention, as shown in fig. 2, obtaining variant information data by using the detection result file includes:
step S11, extracting a single point variation set of a normal sample, a single point variation set of a tumor sample and a copy number variation set of the tumor sample from the detection result file;
step S12, determining a copy number variation region single point variation site set of the tumor sample according to the initial position, the termination position and the length of each copy number variation in the copy number variation set;
step S13, determining a set of normal region single point variation sites of the tumor sample according to the set of normal sample single point variations and the set of tumor sample single point variations.
Here, the germline single-point variation set of the normal sample, the single-point variation set of the tumor sample and the copy number variation set are extracted from the result file. According to the start position, end position and length of each copy number variation of the tumor sample, the abnormal single-point variation site set of that segment is extracted from all single-point variations; the remaining single-point variation sites form the single-point variation site set of the normal region. Each set contains the read information required later, as follows:
1) Normal_Reads1: the number of read pairs in which the single-point variation site in the normal region of the tumor sample matches the base of the reference genome sequence;
2) Normal_Reads2: the number of read pairs in which the single-point variation site in the normal region of the tumor sample does not match the base of the reference genome sequence;
3) Normal_Sum: the total number of read pairs of the single-point variation site in the normal region of the tumor sample aligned to the reference genome sequence;
4) Tumor_Reads1: the number of read pairs in which the single-point variation site in the copy number variation region of the tumor sample matches the base of the reference genome sequence;
5) Tumor_Reads2: the number of read pairs in which the single-point variation site in the copy number variation region of the tumor sample does not match the base of the reference genome sequence;
6) Tumor_Sum: the total number of read pairs of the single-point variation site in the copy number variation region of the tumor sample aligned to the reference genome sequence.
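The split of single-point variations into a copy number variation region set and a normal region set can be sketched as follows. This is only an illustrative sketch: the names `Snv`, `CnvSegment` and `split_by_cnv` are hypothetical, segment bounds are assumed to be inclusive, and the total is assumed to equal Reads1 + Reads2, which the text does not state.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Snv:
    """One single-point variation site (hypothetical record layout)."""
    chrom: str
    pos: int
    reads_ref: int  # read pairs matching the reference base (Reads1)
    reads_alt: int  # read pairs not matching the reference base (Reads2)

    @property
    def total(self) -> int:
        # Assumption: Sum = Reads1 + Reads2.
        return self.reads_ref + self.reads_alt


@dataclass(frozen=True)
class CnvSegment:
    """One copy number variation segment; start and end are inclusive."""
    chrom: str
    start: int
    end: int


def split_by_cnv(
    snvs: Sequence[Snv], segments: Sequence[CnvSegment]
) -> Tuple[List[Snv], List[Snv]]:
    """Return (sites inside any CNV segment, sites in the normal region)."""
    in_cnv: List[Snv] = []
    normal: List[Snv] = []
    for snv in snvs:
        hit = any(
            seg.chrom == snv.chrom and seg.start <= snv.pos <= seg.end
            for seg in segments
        )
        (in_cnv if hit else normal).append(snv)
    return in_cnv, normal
```

For large inputs an interval index would be preferable to the linear scan used here; the scan is kept for readability.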
As an embodiment of the present invention, as shown in fig. 3, the clustering the set of single point mutation sites in the normal region to obtain a clustering result includes:
step S21, extracting the characteristics of the normal region single point variation site set to obtain the characteristics of each single point variation site in the normal region single point variation site set;
and step S22, clustering the single point variation sites in the normal region single point variation site set according to the characteristics of the single point variation sites to obtain a clustering result.
The single-point variation sets obtained by detecting the normal sample and the tumor sample are compared to obtain the normal-region single-point variation site set of the tumor sample, I = {i | i = 1, 2, …, k}, and the features (a_i, b_i) of each single-point variation site are extracted. According to the resulting variation site feature set, all single-point variation sites in the normal region are clustered into four classes, which gives the clustering result.
In this embodiment, determining the average read-pair count of single-point variations in the normal region according to the clustering result includes: determining the mean of the features of the single-point variation sites in each class according to the clustering result, and taking the mean as the average read-pair count of single-point variations in the normal region.
According to the clustering result, the features of the single-point variation sites in each class are averaged to obtain the mean values (ā, b̄) of the variation sites of the four different joint genotypes, namely the means of Normal_Reads1 and Normal_Reads2, which serve as the average read-pair counts of single-point variations in the normal region.
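The per-class averaging can be illustrated with a short sketch. It assumes that cluster labels have already been produced by some clustering step (for example a Gaussian mixture model) and only shows how the per-class means of the two features are formed; the function name `cluster_means` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def cluster_means(
    features: Sequence[Tuple[float, float]],
    labels: Sequence[Hashable],
) -> Dict[Hashable, Tuple[float, float]]:
    """Average the (Reads1, Reads2) features of the sites in each cluster."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    # Per label: running sum of a, running sum of b, number of sites.
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (a, b), label in zip(features, labels):
        acc = sums[label]
        acc[0] += a
        acc[1] += b
        acc[2] += 1
    return {label: (sa / n, sb / n) for label, (sa, sb, n) in sums.items()}
```

With four joint genotypes the returned dictionary would hold four entries, one mean pair per class.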
Further, the Normal_Sum of each single-point variation site in the normal region of the tumor sample is composed of the numbers of read pairs (n1, n2, n3) from normal cells, tumor clone S1 and tumor clone S2. The Tumor_Sum (denoted n′) of each single-point variation site in the copy number variation region is increased, compared with Normal_Sum, by the number n0 of duplicated read pairs caused by copy number variation. Denote the copy number variation segment set as D = {d_j | j = 1, …, t}, where t is the number of copy number variations. For each copy number variation segment d_j, every single-point variation site in the segment has the same copy number status, i.e. the same copy numbers (c01, c02) of subclones S1 and S2; multiple linear regression analysis is drawn upon when solving for the copy numbers.
Specifically, the tumor sample contains two clones, S1 and S2, and the two clones are genetically related: a copy number variation of clone S1 is passed on to clone S2 with the copy number unchanged, whereas a copy number variation of clone S2 does not affect clone S1. The possibility of copy number variation on each clone therefore needs to be considered. Considering that only one of the two strands of the genome undergoes copy number variation, the relationship between the copy number values of the two clones and the number of read pairs added by the copy number variation is as follows:
Denote the set of the two read-pair count means of the four joint-genotype classes of normal-region single-point variation sites obtained by GMM clustering as [symbol omitted]. Analysis of Normal_Reads1 and Normal_Reads2 of the single-point variation sites of the four genotypes shows that [symbol omitted] represents the number of read pairs of the normal sample, [symbol omitted] represents the number of read pairs of tumor clone S2, and [symbol omitted] represents the number of read pairs of tumor clone S1. From the ratios of the read-pair counts of the two clones to the read-pair count of normal cells, the proportions (φ1, φ2, φ3) of read pairs from the normal sample, tumor clone S1 and tumor clone S2 in the total number of normal-region single-point variation read pairs can be approximately calculated.
The number of read pairs aligned to the reference sequence in the normal sample at the position of a copy number variation region variation site is taken as the total number of read pairs of that site without copy number variation interference, and the number of read pairs of each part is solved:
For each single-point variation site i (i = 1, …, m) in each copy number variation segment d_j, where m is the number of abnormal single-point variations within d_j, the estimated total number of read pairs is [symbol omitted] and the true total number of read pairs is n′_i, i = 1, …, m, which gives the following regression equation:
Segment d_j is extracted and the regression residual S of copy number segment d_j is calculated according to the formula. The values (c01, c02) ∈ (0, 5) are iterated in a loop to find the regression residuals under the different copy numbers; the recorded residuals are compared, and the c01 and c02 at which the regression residual is smallest are the copy numbers sought for the two clones S1 and S2.
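The loop over candidate copy numbers amounts to an exhaustive search for the pair with the smallest residual, which can be sketched as below. The regression equation itself is not reproduced in this text, so the residual is passed in as a callable; the function name `best_copy_numbers` and the choice of integer candidates 0 to 5 are assumptions made for illustration.

```python
from itertools import product
from typing import Callable, Iterable, Tuple


def best_copy_numbers(
    residual: Callable[[int, int], float],
    candidates: Iterable[int] = range(0, 6),  # assumed integer grid 0..5
) -> Tuple[int, int, float]:
    """Return (c01, c02, residual) for the pair with the smallest residual."""
    best = None
    for c01, c02 in product(candidates, repeat=2):
        s = residual(c01, c02)
        # Keep the first pair that reaches the smallest residual seen so far.
        if best is None or s < best[0]:
            best = (s, c01, c02)
    if best is None:
        raise ValueError("no candidate copy numbers were supplied")
    return best[1], best[2], best[0]


# Toy residual for demonstration only; it is NOT the patent's regression residual.
toy_residual = lambda c01, c02: (c01 - 3) ** 2 + (c02 - 1) ** 2
c01, c02, s = best_copy_numbers(toy_residual)
```

In practice the callable would evaluate the regression residual of segment d_j for the given (c01, c02); the search itself stays the same.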
As an embodiment of the present invention, as shown in fig. 4, determining a correction parameter set for each variant segment in the copy number variant region according to the clone copy number of the copy number variant region comprises:
step S41, determining a single point variation set corresponding to each variation section in the copy number variation region according to the clone copy number of the copy number variation region;
step S42, determining multiple groups of proportions of each single-point variation read-pair count in the single-point variation set corresponding to each variation segment, according to the single-point variation read-pair counts in the copy number variation region single-point variation site set, to obtain multiple proportion value sets;
step S43, determining the sum of the estimated allele frequency errors of the corresponding single-point variation according to the ratio value set;
and step S44, when the sum of the estimated allele frequency errors is minimum, using the ratio value set corresponding to the minimum sum of the estimated allele frequency errors as a correction parameter set.
Here, the joint genotypes of the single-point variation sites are divided into four types; combining the two kinds of copy number variation, the single-point variations of the two subclones and the four joint genotypes, eight kinds of single-point variation under copy number variation interference can arise. The two read-pair proportions of each of the two clones are denoted φ21, φ31, φ22 and φ32; φ01 and φ02 denote, within the read pairs added by the copy number variation, the proportion of read pairs supporting the normal allele and the proportion supporting the variant allele, respectively. The possible values of each parameter under the different copy numbers and genotypes are listed in Table 1.
TABLE 1
According to (n1, n2, n3) of a single-point variation i and the actual number of read pairs n′, the number n0 of read pairs added by the copy number variation is determined, and the proportions of the parts (φ′1, φ′2, φ′3, φ0) are solved. The proportion of each part is as follows:
The read-pair counts N′i1 and N′i2 of single-point variation i that support the normal allele and the variant allele are extracted, and the true allele frequency values (P′i1, P′i2), i.e. the allele frequency VAF, are determined:
Based on the read-pair mean values of the different joint genotypes in the normal region, some genotypes are excluded according to (N′i1, N′i2); the values of the remaining parameters in Table 1 are substituted into the following formula to obtain the estimated allele frequency values of the single-point variation site:
Using the idea of mean square error, the error between the estimated allele frequency [symbol omitted] of the single-point variation and the true allele frequency (P′i1, P′i2) is calculated; for the m abnormal single-point variations within copy number variation d_j, the sum of the allele frequency errors of all single-point variations is found:
All possible correction parameter sets are iterated according to the mean square error criterion to obtain the parameters that minimize the error sum S_VAF, i.e. the parameter set sought, (φ01, φ02, φ21, φ22, φ31, φ32); the corresponding genotype is the genotype of single-point variation i.
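Selecting the correction parameter set by the mean square error criterion can be sketched as follows. Because the formula for the estimated allele frequency is not reproduced in this text, the estimator is supplied as a callable; `select_parameter_set`, `sum_sq_vaf_error` and the toy estimator in the example are hypothetical and only illustrate the argmin step.

```python
from typing import Callable, Iterable, Sequence, Tuple

Params = Tuple[float, ...]       # (phi01, phi02, phi21, phi22, phi31, phi32)
Vaf = Tuple[float, float]        # (frequency of normal allele, frequency of variant allele)


def sum_sq_vaf_error(estimated: Vaf, observed: Sequence[Vaf]) -> float:
    """Sum of squared differences between one estimate and all observed VAF pairs."""
    e1, e2 = estimated
    return sum((e1 - o1) ** 2 + (e2 - o2) ** 2 for o1, o2 in observed)


def select_parameter_set(
    candidates: Iterable[Params],
    observed: Sequence[Vaf],
    estimate: Callable[[Params], Vaf],
) -> Params:
    """Return the candidate whose estimated VAF has the smallest error sum."""
    return min(candidates, key=lambda p: sum_sq_vaf_error(estimate(p), observed))


# Toy estimator for demonstration only: it treats the first two parameters
# as the estimated allele frequencies, which is NOT the patent's formula.
toy_estimate = lambda p: (p[0], p[1])
observed = [(0.7, 0.3), (0.8, 0.2)]
candidates = [(0.5, 0.5, 0, 0, 0, 0), (0.75, 0.25, 0, 0, 0, 0)]
best = select_parameter_set(candidates, observed, toy_estimate)
```

The candidate list would in practice be built from the admissible rows of Table 1 for the copy numbers found in the previous step.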
The above steps are performed for the t copy number variations in the copy number variation segment set D, so that the parameters and the corresponding genotype of each single-point variation in the abnormal single-point variation set T are obtained. The read-pair counts of each single-point variation in T are then corrected using the obtained parameters, giving the corrected numbers of read pairs supporting the normal allele and supporting the variant allele.
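One plausible reading of the correction step is sketched below: the read pairs added by the copy number variation are split between the normal-supporting and variant-supporting counts by φ01 and φ02 and then subtracted. The text does not give this formula explicitly, so the subtraction rule, the clamping at zero and the name `correct_read_pairs` are assumptions.

```python
from typing import Tuple


def correct_read_pairs(
    reads_ref: float,
    reads_alt: float,
    n0: float,
    phi01: float,
    phi02: float,
) -> Tuple[float, float]:
    """Remove CNV-induced read pairs from the observed counts.

    reads_ref / reads_alt: observed read pairs supporting normal / variant.
    n0: read pairs added by the copy number variation.
    phi01 / phi02: share of n0 supporting normal / variant.
    """
    if n0 < 0:
        raise ValueError("n0 must be non-negative")
    # Assumed rule: subtract each allele's share of the added read pairs,
    # never letting a corrected count drop below zero.
    corrected_ref = max(reads_ref - n0 * phi01, 0.0)
    corrected_alt = max(reads_alt - n0 * phi02, 0.0)
    return corrected_ref, corrected_alt
```

For example, with 60 and 40 observed read pairs, 20 added read pairs and an even split, the corrected counts would be 50 and 30.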
Further, after the read-pair count correction is completed, the corrected single-point variation sites are pooled with the variation sites in the normal region, the allele frequencies (P1, P2) are calculated from the read-pair counts (N1, N2, N) of all sites, the data features including read-pair counts and allele frequencies are integrated, and the purity of the polyclonal tumor sample, with copy number variation taken into account, is estimated using the existing tumor purity estimation method EMpurity.
As an embodiment of the present invention, as shown in fig. 5, the specific process of estimating the purity of the tumor sample includes:
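The allele frequency calculation from read-pair counts can be sketched as follows, assuming that the total N is simply N1 + N2 and that each frequency is the corresponding count divided by that total. The function name `allele_frequencies` is hypothetical.

```python
from typing import Tuple


def allele_frequencies(n1: float, n2: float) -> Tuple[float, float]:
    """Return (P1, P2) from read pairs supporting normal (n1) and variant (n2)."""
    n = n1 + n2  # assumption: total read pairs N = N1 + N2
    if n <= 0:
        raise ValueError("total read-pair count must be positive")
    return n1 / n, n2 / n
```

The resulting (P1, P2) pairs, together with the read-pair counts, would form the input features passed on to the purity estimator.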
Step S100: data preprocessing. Variation information data are obtained by running an existing variation detection tool on tumor samples of different purities and detecting copy number variations and single-point variations from the second-generation paired-end sequencing data.
Step S200: clustering of normal single-point variation sites. The normal single-point variation sites are clustered, and the average read-pair count of single-point variations in the normal region is determined.
The present invention uses a Gaussian mixture model (GMM) for clustering. Specifically, the clustering divides the single-point variations of the tumor normal region into four joint genotypes, i.e. the Gaussian mixture model clusters them into four classes.
Furthermore, the single-point variation sets obtained by detecting the normal sample and the tumor sample are compared to obtain the normal-region single-point variation site set of the tumor sample, and the features of each single-point variation site are extracted. According to the resulting variation site feature set, all single-point variation sites in the normal region are clustered into four classes, and the features of the single-point variation sites in each class are averaged according to the clustering result to obtain the average read-pair counts of normal-region single-point variations for the variation sites of the four different joint genotypes.
Step S300: analysis of the number of read pairs added at abnormal single-point variation sites. The number of read pairs added by copy number variation is solved from the read-pair counts of the single-point variation and the actual read-pair count.
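Step S300 can be illustrated with a small sketch. Following the description that the actual total n′ exceeds the sum of the three components (n1, n2, n3) by the number n0 of added read pairs, n0 is taken as the difference; expressing each part as a share of n′ is an assumption, since the proportion formula is not reproduced in this text. The name `cnv_added_read_pairs` is hypothetical.

```python
from typing import Tuple


def cnv_added_read_pairs(
    n1: float, n2: float, n3: float, n_actual: float
) -> Tuple[float, Tuple[float, float, float, float]]:
    """Return n0 and the shares (phi1', phi2', phi3', phi0) of the actual total.

    n1, n2, n3: read pairs from normal cells, clone S1 and clone S2.
    n_actual: actual total number of read pairs at the site (n').
    """
    if n_actual <= 0:
        raise ValueError("actual read-pair count must be positive")
    n0 = n_actual - (n1 + n2 + n3)
    if n0 < 0:
        raise ValueError("components exceed the actual read-pair count")
    # Assumed definition: every part is expressed as a fraction of n'.
    shares = (n1 / n_actual, n2 / n_actual, n3 / n_actual, n0 / n_actual)
    return n0, shares
```

For a site with components 40, 30 and 10 and an actual total of 100, this gives 20 added read pairs and shares of 0.4, 0.3, 0.1 and 0.2.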
Step S400: iteration over different clone copy number sets. The copy numbers of the multiple clones are determined by regression residual analysis: the regression residual of the copy number segment is calculated, and the loop iteration yields the regression residuals under the different copy numbers.
Step S500: judging whether the error is minimal. If not, go to step S400; if so, go to step S600. The recorded regression residuals are compared, and the copy numbers at which the residual is smallest are the copy numbers of the corresponding clones.
Step S600: iteration over different genotype parameter sets. Different correction parameter sets, corresponding to different genotypes, are iterated.
Step S700: judging whether the error is minimal. If not, go to step S600; if so, go to step S800. All correction parameter sets are iterated according to the mean square error criterion to obtain the parameters that minimize the error sum, i.e. the correction parameter set sought; the corresponding genotype is the genotype of the single-point variation.
Step S800: correction of the abnormal single-point variation read-pair counts. The read-pair counts of each single-point variation in the abnormal single-point variation set are corrected using the obtained parameters, giving the corrected numbers of read pairs supporting the normal allele and supporting the variant allele. Tumor purity is then estimated using the corrected read-pair counts.
The method estimates the average read-pair count of normal single-point variation sites by clustering, then determines the copy number information of each segment by applying multiple regression analysis to the single-point variation sites of the different copy number variation segments, obtains the genotype of each single-point variation site by the mean square error criterion, and finally corrects the read-pair counts of single-point variation sites that were increased by copy number variation, using the corrected data in an existing method to estimate tumor purity. The invention addresses the bias that copy number variation introduces into the read-count information of tumor cells, the effect of coexisting subclones on tumor purity estimation, and the inaccurate purity estimates caused by different copy number variations on different subclones. The invention effectively corrects the abnormal single-point variation read-pair counts and uses the corrected data to estimate tumor purity accurately under different coverages and different tumor purities.
Fig. 6 is a schematic structural diagram of an apparatus for estimating tumor purity according to an embodiment of the present invention, wherein the apparatus includes:
a data obtaining module 10, configured to obtain a detection result file obtained by examining a normal sample and a tumor sample from a chromosome variation detection tool, and obtain variation information data by using the detection result file; the variation information data comprises a normal region single point variation site set of the tumor sample and a copy number variation region single point variation site set of the tumor sample.
An existing variation detection tool is run on tumor samples of different purities, and copy number variations and single-point variations are detected from the second-generation paired-end sequencing data. Because tumor tissue is heterogeneous, it contains the initial clone and the subclones that arise from it after multiple rounds of selection and expansion; the tumor sample therefore contains normal cells and a plurality of subclones, and chromosome variation information data for the different purities are obtained.
Specifically, the germline single-point variation set of the normal sample, the single-point variation set of the tumor sample and the copy number variation set of the tumor sample are extracted from the detection result file. According to the start position, end position and length of each copy number variation of the tumor sample, the abnormal single-point variation site set of the copy number variation region is extracted from all single-point variations; the remaining single-point variation sites form the single-point variation site set of the normal region.
An average read-pair count module 20, configured to cluster the normal region single-point variation site set to obtain a clustering result, and determine the average read-pair count of single-point variations in the normal region according to the clustering result.
When the normal sample and the pure tumor sample are mixed, the genetic variation of the normal sample cannot provide effective distinguishing information of normal cells and pure tumor cells, so that variation sites used for estimating the tumor purity are the somatic variations of the tumor sample. The invention uses a Gaussian mixture model for clustering. Specifically, the obtained clustering result is that the single point variation of the tumor normal region is divided into four combined genotypes by clustering, and the four combined genotypes can be specifically clustered into four classes by a Gaussian mixture model.
Specifically, the single-point variation sets obtained by detecting the normal sample and the tumor sample are compared to obtain the normal-region single-point variation site set of the tumor sample, and the features of each single-point variation site are extracted. According to the resulting variation site feature set, all single-point variation sites in the normal region are clustered into four classes, and the features of the single-point variation sites in each class are averaged according to the clustering result to obtain the average read-pair counts of normal-region single-point variations for the variation sites of the four different joint genotypes.
A clone copy number module 30, configured to determine, for each of the multiple clones in the tumor sample, the clone copy number in the copy number variation region, according to the average read-pair count of single-point variations in the normal region and the copy number variation region single-point variation site set.
Here the tumor sample is assumed to contain normal cells and two subclones, S1 and S2. For each single-point variation site in the normal region of the tumor sample, the total number of read pairs aligned to the reference genome sequence is composed of the read pairs from normal cells, from tumor clone S1 and from tumor clone S2. The copy number variation region single-point variation site set contains, for each single-point variation site in the copy number variation region of the tumor sample, the total number of read pairs aligned to the reference genome sequence; compared with the total at a normal-region site, this total is increased by the duplicated read pairs caused by the copy number variation. The copy number variation segment set contains a plurality of copy number variation segments; within each segment, every single-point variation site has the same copy number status, i.e. the same copy numbers of subclones S1 and S2. The copy numbers of the subclones are determined in combination with the average read-pair count of single-point variations in the normal region. The copy numbers are solved by drawing on multiple linear regression analysis, which gives the clone copy numbers of clones S1 and S2 in the copy number variation region.
A read-pair count correction module 40, configured to determine a correction parameter set for each variation segment in the copy number variation region according to the clone copy number of the copy number variation region, correct the single-point variation read-pair counts in the copy number variation region single-point variation site set by using the correction parameter set, and estimate tumor purity according to the corrected read-pair counts.
The single-point variation set of each copy number variation segment can be determined according to the clone copy number of that segment; the single-point variation sets of the different copy number variation segments are extracted, the correction parameter set corresponding to the single-point variations in each copy number variation segment is determined, and the abnormal read-pair counts are corrected.
Further, after the read-pair count correction is completed, the corrected single-point variation sites are pooled with the variation sites in the normal region, allele frequencies are calculated from the read-pair counts of all sites, the data features including read-pair counts and allele frequencies are integrated, and the purity of the polyclonal tumor sample, with copy number variation taken into account, is estimated using the existing tumor purity estimation method EMpurity.
As an embodiment of the present invention, as shown in fig. 7, the data acquisition module 10 includes:
a data extracting unit 11, configured to extract a single-point variation set of a normal sample, a single-point variation set of a tumor sample, and a copy number variation set of the tumor sample from the detection result file;
a copy number variation region unit 12, configured to determine a copy number variation region single point variation site set of a tumor sample according to a start position, an end position, and a length of each copy number variation in the copy number variation set;
the normal region unit 13 is configured to determine a normal region single point variation site set of the tumor sample according to the single point variation set of the normal sample and the single point variation set of the tumor sample.
As an embodiment of the present invention, as shown in fig. 8, the average read-pair count module 20 includes:
a feature extraction unit 21, configured to perform feature extraction on the normal region single point variation site set to obtain features of each single point variation site in the normal region single point variation site set;
and a clustering result unit 22, configured to cluster each single point variation site in the normal region single point variation site set according to a characteristic of each single point variation site, so as to obtain a clustering result.
As an embodiment of the present invention, the average read-pair count module 20 is further configured to determine the mean of the features of the single-point variation sites in each class according to the clustering result, and use the mean as the average read-pair count of single-point variations in the normal region.
As an embodiment of the present invention, as shown in fig. 9, the read-pair count correction module 40 includes:
a single point variation set unit 41, configured to determine a single point variation set corresponding to each variation segment in the copy number variation region according to the copy number of the clone in the copy number variation region;
a proportion value set unit 42, configured to determine, according to the single-point variation read-pair counts in the copy number variation region single-point variation site set, multiple groups of proportions of each single-point variation read-pair count in the single-point variation set corresponding to each variation segment, so as to obtain multiple proportion value sets;
an error summation unit 43, configured to determine a sum of estimated allele frequency errors of the corresponding single-point variations according to the ratio set;
and a correction parameter set unit 44, configured to, when the sum of the estimated allele frequency errors is minimum, use a ratio set corresponding to the minimum sum of the estimated allele frequency errors as a correction parameter set.
Based on the same application concept as the tumor purity estimation method, the invention also provides the tumor purity estimation device. Because the principle of solving the problem of the tumor purity estimation device is similar to that of a tumor purity estimation method, the implementation of the tumor purity estimation device can refer to the implementation of the tumor purity estimation method, and repeated details are not repeated.
The method corrects the number of the increased single-point variation site reading pairs caused by copy number variation, uses the corrected data for tumor purity estimation, effectively corrects the number of the abnormal single-point variation reading pairs, and realizes accurate estimation of tumor purity under different coverage degrees and different tumor purities.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method when executing the program.
The present invention also provides a computer-readable storage medium storing a computer program for executing the above method.
As shown in fig. 10, the electronic device 600 may further include: communication module 110, input unit 120, audio processing unit 130, display 160, power supply 170. It is noted that the electronic device 600 does not necessarily include all of the components shown in FIG. 10; furthermore, the electronic device 600 may also comprise components not shown in fig. 10, which may be referred to in the prior art.
As shown in fig. 10, the central processor 100, sometimes referred to as a controller or operational control, may include a microprocessor or other processor device and/or logic device, the central processor 100 receiving input and controlling the operation of the various components of the electronic device 600.
The memory 140 may be, for example, one or more of a buffer, a flash memory, a hard drive, a removable media, a volatile memory, a non-volatile memory, or other suitable device. The information relating to the failure may be stored, and a program for executing the information may be stored. And the central processing unit 100 may execute the program stored in the memory 140 to realize information storage or processing, etc.
The input unit 120 provides input to the central processor 100. The input unit 120 is, for example, a key or a touch input device. The power supply 170 is used to provide power to the electronic device 600. The display 160 is used to display an object to be displayed, such as an image or a character. The display may be, for example, an LCD display, but is not limited thereto.
The memory 140 may be a solid state memory such as Read Only Memory (ROM), Random Access Memory (RAM), a SIM card, or the like. There may also be a memory that holds information even when power is off, can be selectively erased, and is provided with more data, an example of which is sometimes called an EPROM or the like. The memory 140 may also be some other type of device. Memory 140 includes buffer memory 141 (sometimes referred to as a buffer). The memory 140 may include an application/function storage section 142, and the application/function storage section 142 is used to store application programs and function programs or a flow for executing the operation of the electronic device 600 by the central processing unit 100.
The memory 140 may also include a data store 143, the data store 143 for storing data, such as contacts, digital data, pictures, sounds, and/or any other data used by the electronic device. The driver storage portion 144 of the memory 140 may include various drivers of the electronic device for communication functions and/or for performing other functions of the electronic device (e.g., messaging application, address book application, etc.).
The communication module 110 is a transmitter/receiver 110 that transmits and receives signals via an antenna 111. The communication module (transmitter/receiver) 110 is coupled to the central processor 100 to provide an input signal and receive an output signal, which may be the same as in the case of a conventional mobile communication terminal.
Based on different communication technologies, a plurality of communication modules 110, such as a cellular network module, a bluetooth module, and/or a wireless local area network module, may be provided in the same electronic device. The communication module (transmitter/receiver) 110 is also coupled to a speaker 131 and a microphone 132 via an audio processor 130 to provide audio output via the speaker 131 and receive audio input from the microphone 132 to implement general telecommunications functions. Audio processor 130 may include any suitable buffers, decoders, amplifiers and so forth. In addition, an audio processor 130 is also coupled to the central processor 100, so that recording on the local can be enabled through a microphone 132, and so that sound stored on the local can be played through a speaker 131.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Specific embodiments have been used herein to explain the principle and implementation of the invention; the description of the embodiments is intended only to aid understanding of the method and core idea of the invention. A person skilled in the art may, following the idea of the invention, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.
Claims (12)
1. A method of tumor purity estimation, the method comprising:
obtaining, from a chromosome variation detection tool, a detection result file produced by testing a normal sample and a tumor sample, and obtaining variation information data by using the detection result file; the variation information data comprises a normal region single point variation site set of the tumor sample and a copy number variation region single point variation site set of the tumor sample;
clustering the normal region single point variation site set to obtain a clustering result, and determining the normal region single point variation average read logarithm according to the clustering result;
determining, according to the normal region single point variation average read logarithm and the copy number variation region single point variation site set, clone copy numbers of the copy number variation region respectively corresponding to multiple clones in the tumor sample;
and determining a correction parameter set of each variation section in the copy number variation region according to the clone copy numbers of the copy number variation region, correcting the copy number variation region single point variation difference read logarithm in the copy number variation region single point variation site set by using the correction parameter set, and estimating the tumor purity according to the corrected copy number variation region single point variation difference read logarithm.
2. The method of claim 1, wherein obtaining the variation information data by using the detection result file comprises:
extracting a single-point variation set of a normal sample, a single-point variation set of a tumor sample and a copy number variation set of the tumor sample from the detection result file;
determining a copy number variation region single point variation site set of a tumor sample according to the initial position, the termination position and the length of each copy number variation in the copy number variation set;
and determining a normal region single point variation site set of the tumor sample according to the single point variation set of the normal sample and the single point variation set of the tumor sample.
3. The method of claim 1, wherein clustering the normal region single point variation site set to obtain a clustering result comprises:
extracting the characteristics of the normal region single point variation site set to obtain the characteristics of each single point variation site in the normal region single point variation site set;
and clustering the single point variation sites in the normal region single point variation site set according to the characteristics of the single point variation sites to obtain a clustering result.
4. The method of claim 3, wherein determining the normal region single point variation average read logarithm according to the clustering result comprises:
and determining the average value of the characteristics of the single-point variation sites in each class according to the clustering result, and taking the average value as the average read logarithm of the single-point variation in the normal region.
5. The method of claim 1, wherein determining a correction parameter set of each variation section in the copy number variation region according to the clone copy numbers of the copy number variation region comprises:
determining a single point variation set corresponding to each variation section in the copy number variation region according to the clone copy numbers of the copy number variation region;
determining, according to the copy number variation region single point variation read logarithm in the copy number variation region single point variation site set, multiple groups of ratio values of each single point variation read logarithm in the single point variation set corresponding to each variation section, to obtain multiple ratio value sets;
determining, for each ratio value set, the sum of the estimated allele frequency errors of the corresponding single point variations;
and taking the ratio value set for which the sum of the estimated allele frequency errors is smallest as the correction parameter set.
6. An apparatus for estimating tumor purity, the apparatus comprising:
a data acquisition module, configured to acquire, from a chromosome variation detection tool, a detection result file produced by testing a normal sample and a tumor sample, and to obtain variation information data by using the detection result file; the variation information data comprises a normal region single point variation site set of the tumor sample and a copy number variation region single point variation site set of the tumor sample;
the average read logarithm module is used for clustering the normal region single-point variation site set to obtain a clustering result, and determining the average read logarithm of the normal region single-point variation according to the clustering result;
a clone copy number module, configured to determine, according to the average read logarithm of single point variation in the normal region and the set of single point variation sites in the copy number variation region, clone copy numbers of the copy number variation region corresponding to the multiple clones in the tumor sample, respectively;
and the read logarithm correction module is used for determining a correction parameter set of each variation section in the copy number variation region according to the clone copy number of the copy number variation region, correcting the copy number variation region single point variation difference read logarithm in the copy number variation region single point variation site set by using the correction parameter set, and estimating the tumor purity according to the corrected copy number variation region single point variation difference read logarithm.
7. The apparatus of claim 6, wherein the data acquisition module comprises:
a data extraction unit, configured to extract a single-point variation set of a normal sample, a single-point variation set of a tumor sample, and a copy number variation set of the tumor sample from the detection result file;
a copy number variation region unit, configured to determine a copy number variation region single point variation site set of a tumor sample according to an initial position, a termination position, and a length of each copy number variation in the copy number variation set;
and the normal region unit is used for determining the normal region single point variation site set of the tumor sample according to the single point variation set of the normal sample and the single point variation set of the tumor sample.
8. The apparatus of claim 6, wherein the average read logarithm module comprises:
the characteristic extraction unit is used for extracting the characteristics of the normal region single point variation site set to obtain the characteristics of each single point variation site in the normal region single point variation site set;
and the clustering result unit is used for clustering each single point variation locus in the normal region single point variation locus set according to the characteristics of each single point variation locus to obtain a clustering result.
9. The apparatus of claim 8, wherein the average read logarithm module is further configured to determine an average value of the features of the single point variation sites in each class according to the clustering result, and to use the average value as the normal region single point variation average read logarithm.
10. The apparatus of claim 6, wherein the read logarithm correction module comprises:
a single-point variation set unit, configured to determine a single-point variation set corresponding to each variation segment in the copy number variation region according to the clone copy number in the copy number variation region;
a ratio value set unit, configured to determine, according to the copy number variation region single point variation read logarithm in the copy number variation region single point variation site set, multiple groups of ratio values of each single point variation read logarithm in the single point variation set corresponding to each variation segment, to obtain multiple ratio value sets;
the error summation unit is used for determining the sum of the estimated allele frequency errors of the corresponding single-point variation according to the ratio value set;
and a correction parameter set unit, configured to take the ratio value set for which the sum of the estimated allele frequency errors is smallest as the correction parameter set.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 5 when executing the computer program.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for executing the method of any one of claims 1 to 5.
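As an illustration only, the data flow of claims 2 to 5 can be sketched in Python. This is a minimal sketch under stated assumptions, not the claimed implementation. The function names (`split_sites`, `cluster_means`, `pick_correction_set`), the tuple layouts, the use of one scalar feature per site, 1-D k-means as the clustering algorithm, and the caller-supplied `error_sum` function are all hypothetical, because the claims do not fix a clustering algorithm, a feature definition, or an error formula. The normal region set is approximated here as the tumor sites lying outside every copy number variation interval.

```python
from statistics import mean


def split_sites(snv_sites, cnv_segments):
    """Partition SNV sites into CNV-region and normal-region sets (claim 2).

    snv_sites: list of (chrom, pos) tuples (hypothetical layout).
    cnv_segments: list of (chrom, start, end) tuples, inclusive coordinates.
    Assumption: a site is "normal region" if it lies outside every segment.
    """
    by_chrom = {}
    for chrom, start, end in cnv_segments:
        by_chrom.setdefault(chrom, []).append((start, end))
    cnv_set, normal_set = [], []
    for chrom, pos in snv_sites:
        inside = any(s <= pos <= e for s, e in by_chrom.get(chrom, []))
        (cnv_set if inside else normal_set).append((chrom, pos))
    return cnv_set, normal_set


def kmeans_1d(values, k=2, iterations=50):
    """Plain 1-D k-means on a non-empty list; returns k lists of values."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: spread the initial centres across the sorted data.
    centres = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centre.
        updated = [mean(c) if c else centres[i] for i, c in enumerate(clusters)]
        if updated == centres:
            break
        centres = updated
    return clusters


def cluster_means(features, k=2):
    """Mean feature value of each non-empty cluster (claims 3 and 4)."""
    return [mean(c) for c in kmeans_1d(features, k) if c]


def pick_correction_set(candidate_sets, error_sum):
    """Return the candidate ratio set with the smallest summed error (claim 5).

    error_sum is a hypothetical callable mapping a ratio set to the sum of
    estimated allele frequency errors; its formula is not given in the claims.
    """
    return min(candidate_sets, key=error_sum)
```

For example, `split_sites([("chr1", 100), ("chr1", 250)], [("chr1", 200, 300)])` places the site at position 250 in the copy number variation region set and the site at position 100 in the normal region set. The error function is passed in rather than hard-coded so that the selection step stays separate from whatever allele frequency model is actually used.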
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110350647.7A CN112863594A (en) | 2021-03-31 | 2021-03-31 | Tumor purity estimation method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112863594A (en) | 2021-05-28 |
Family
ID=75991938
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110350647.7A Pending CN112863594A (en) | 2021-03-31 | 2021-03-31 | Tumor purity estimation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112863594A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113658638A (en) * | 2021-08-20 | 2021-11-16 | 江苏先声医学诊断有限公司 | Detection method and quality control system for homologous recombination defects based on NGS platform |
CN113990389A (en) * | 2021-12-27 | 2022-01-28 | 北京优迅医疗器械有限公司 | Method and device for deducing tumor purity and ploidy |
CN115404275A (en) * | 2022-08-17 | 2022-11-29 | 中山大学·深圳 | Method for evaluating tumor purity based on nanopore sequencing technology |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106676178A (en) * | 2017-01-19 | 2017-05-17 | 北京吉因加科技有限公司 | System and method for tumor heterogeneity assessment |
CN108733975A (en) * | 2018-03-29 | 2018-11-02 | 深圳裕策生物科技有限公司 | Tumor colonies mutation detection method, device and storage medium based on the sequencing of two generations |
CN109920485A (en) * | 2018-12-29 | 2019-06-21 | 浙江安诺优达生物科技有限公司 | The method and its application of variation simulation are carried out to sequencing sequence |
CN110289047A (en) * | 2019-05-15 | 2019-09-27 | 西安电子科技大学 | Tumour purity and absolute copy number prediction technique and system based on sequencing data |
CN110770838A (en) * | 2017-12-01 | 2020-02-07 | Illumina公司 | Method and system for determining clonality of somatic mutations |
CN111402952A (en) * | 2020-03-27 | 2020-07-10 | 深圳裕策生物科技有限公司 | Method and system for detecting tumor heterogeneity degree |
Non-Patent Citations (3)
Title |
---|
RIESTER ET AL.: "PureCN: copy number calling and SNV classification using targeted short read sequencing", SOURCE CODE FOR BIOLOGY AND MEDICINE, no. 11, 31 December 2016 (2016-12-31), pages 1 - 13 * |
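王先月: "Research on a new method for detecting somatic mutations by integrating tumor purity and DNA copy number profiles", China Master's Theses Full-text Database, Medicine and Health Sciences, no. 05, 15 May 2020 (2020-05-15) * |
闫占正; 李玉双: "Tumor purity estimation based on a Gaussian mixture model", Journal of Zhejiang University (Science Edition), no. 02, 15 March 2020 (2020-03-15) * |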
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||