CN112863594A - Tumor purity estimation method and device

Info

Publication number: CN112863594A
Application number: CN202110350647.7A
Authority: CN
Inventors: 崔佳; 陈永录; 贾建红
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2021-03-31
Filing date: 2021-03-31
Publication date: 2021-05-28

Abstract

A tumor purity estimation method and device, usable in the field of data science, the field of finance, or other fields. The method comprises: obtaining a detection result file from a chromosome variation detection tool and deriving variation information data from it; clustering the normal region single-point variation site set to obtain a clustering result, determining the normal region single-point variation average read-pair count from the clustering result, and determining the copy number variation region clone copy numbers corresponding to each of a plurality of clones in the tumor sample; and determining a correction parameter set for each variation segment in the copy number variation region from those clone copy numbers, correcting the abnormal single-point variation read-pair counts of the copy number variation region with the correction parameter set, and estimating the tumor purity from the corrected read-pair counts. The method effectively corrects abnormal single-point variation read-pair counts and achieves accurate tumor purity estimation across different coverage depths and different tumor purities.

Description

Tumor purity estimation method and device

Technical Field

The invention relates to the technical field of data science, in particular to a method and a device for estimating tumor purity.

Background

Tumors arise from the accumulation of genomic variations in normal cells, and malignant tumors are one of the leading causes of death. Tumor tissue is highly complex in composition: besides tumor cells it contains important non-cancerous cells such as immune cells and fibroblasts, as well as nutrients, chemokines and other factors that promote or inhibit tumor growth. The tumor cells in tumor tissue are a heterogeneous population of multiple clones carrying different genetic variations, which can be broadly divided into the original variant cells and the cells they give rise to through multiple rounds of selection and clonal expansion; at the nucleotide level, it is unlikely that any two tumor cells are identical. The clone estimated from sequencing data to show the most ancestral features is called the initial clone, and its genome accumulates multiple chromosomal variations, including structural variations and single-point variations. Tumor cells in a tumor sample are assumed to satisfy the infinite sites hypothesis during evolution, i.e., a site mutates at most once over the whole evolutionary process and a mutated site cannot revert. Subclones that arise by mutation after early and recent clonal expansions are called daughter clones; a daughter clone inherits the chromosomal variations of its parent and also acquires variations of its own that differ from the parent and benefit itself, so different subclones carry different chromosomal variations.

Copy number variations (CNVs) are an important class of aberrations in the tumor genome and have been studied extensively to understand cancer mutation and clonal evolution. Copy number variation biases the genomic information of tumor tissue: the number of read pairs at a single-point variation site on a gene fragment with copy number variation is multiplied relative to a fragment without it. In addition, because the tumor is a heterogeneous cell population, the copy number variation differs between clone structures. When a copy number variation occurs on a parent clone, the child clone inherits the parent's variation and may also generate copy number variations of its own; when a copy number variation occurs on a child clone, it may multiply the chromosomal variations the child inherited from the parent as well as the chromosomal variations the child generated for its own benefit.

Tumor purity estimation means accurately assessing the proportion of tumor cells from the sequencing data of mixed tumor tissue. The task is extremely complex and differs across cancer types, sequencing types and sampled tissues. Tumor purity can be estimated by pathologists through visual or image analysis of tumor cells; with the development of genomic technology, it can also be inferred computationally, using statistical approaches such as linear models, maximum likelihood models and Bayesian methods, and different types of genomic information such as gene expression, copy number variation, somatic variation and DNA methylation. Depending on the data used, tumor purity estimation methods fall roughly into two categories: those based on SNP array data and those based on sequencing data. Methods in the first category, including ABSOLUTE and ASCAT, use high-throughput data from single nucleotide polymorphism microarray experiments to detect chromosomal abnormalities in cells (copy number abnormality, heterozygosity) and thereby estimate the purity of tumor tissue. Methods in the second category, including PurityEst, AbsCN-seq, CNAnorm, THetA and PurBayes, work from cancer sequencing data.

However, neither category performs well, mainly for the following reasons: first, methods in the first category work only for a single subclone and are therefore poorly suited to tumors in which multiple subclones are present; second, methods in the second category have limited effect on tumor purity estimation when multiple subclones coexist; third, when copy number variation and multi-level subcloning coexist, the performance of both categories is limited and tumor purity cannot be estimated accurately.

Disclosure of Invention

In view of the problems in the prior art, an embodiment of the present invention provides a method and an apparatus for estimating tumor purity, which achieve accurate estimation of tumor sample purity in the presence of copy number variation.

In order to achieve the above object, an embodiment of the present invention provides a tumor purity estimation method, including:

obtaining, from a chromosome variation detection tool, a detection result file produced by analyzing a normal sample and a tumor sample, and obtaining variation information data by using the detection result file; the variation information data comprises a normal region single-point variation site set of the tumor sample and a copy number variation region single-point variation site set of the tumor sample;

clustering the normal region single-point variation site set to obtain a clustering result, and determining the normal region single-point variation average read-pair count according to the clustering result;

determining, according to the normal region single-point variation average read-pair count and the copy number variation region single-point variation site set, the copy number variation region clone copy numbers respectively corresponding to a plurality of clones in the tumor sample;

and determining a correction parameter set of each variation segment in the copy number variation region according to the copy number variation region clone copy numbers, correcting the copy number variation region single-point variation abnormal read-pair counts in the copy number variation region single-point variation site set by using the correction parameter set, and estimating the tumor purity according to the corrected copy number variation region single-point variation abnormal read-pair counts.

Optionally, in an embodiment of the present invention, the obtaining variant information data by using the detection result file includes:

extracting a single-point variation set of a normal sample, a single-point variation set of a tumor sample and a copy number variation set of the tumor sample from the detection result file;

determining a copy number variation region single point variation site set of a tumor sample according to the initial position, the termination position and the length of each copy number variation in the copy number variation set;

and determining a normal region single point variation site set of the tumor sample according to the single point variation set of the normal sample and the single point variation set of the tumor sample.

Optionally, in an embodiment of the present invention, the clustering the set of single point variant loci in the normal region to obtain a clustering result includes:

extracting the characteristics of the normal region single point variation site set to obtain the characteristics of each single point variation site in the normal region single point variation site set;

and clustering the single point variation sites in the normal region single point variation site set according to the characteristics of the single point variation sites to obtain a clustering result.

Optionally, in an embodiment of the present invention, the determining the normal region single-point variation average read-pair count according to the clustering result includes:

and determining the average value of the features of the single-point variation sites in each class according to the clustering result, and taking the average value as the normal region single-point variation average read-pair count.

Optionally, in an embodiment of the present invention, the determining a correction parameter set of each variant segment in the copy number variant region according to the clone copy number of the copy number variant region includes:

determining a single-point variation set corresponding to each variation section in the copy number variation region according to the clone copy number of the copy number variation region;

determining, according to the copy number variation region single-point variation read-pair counts in the copy number variation region single-point variation site set, multiple groups of ratio values for the read-pair counts of each single-point variation in the single-point variation set corresponding to each variation segment, to obtain multiple ratio value sets;

determining, for each ratio value set, the sum of the estimated allele frequency errors of the corresponding single-point variations;

and taking the ratio value set with the smallest sum of estimated allele frequency errors as the correction parameter set.

The embodiment of the invention also provides a tumor purity estimation device, which comprises:

the data acquisition module is used for acquiring detection result files obtained by checking normal samples and tumor samples from the chromosome variation detection tool and obtaining variation information data by using the detection result files; the variation information data comprises a normal region single point variation site set of the tumor sample and a copy number variation region single point variation site set of the tumor sample;

the average read-pair count module is used for clustering the normal region single-point variation site set to obtain a clustering result, and determining the normal region single-point variation average read-pair count according to the clustering result;

a clone copy number module, configured to determine, according to the normal region single-point variation average read-pair count and the copy number variation region single-point variation site set, the copy number variation region clone copy numbers respectively corresponding to the multiple clones in the tumor sample;

and the read-pair count correction module is used for determining a correction parameter set of each variation segment in the copy number variation region according to the copy number variation region clone copy numbers, correcting the copy number variation region single-point variation abnormal read-pair counts in the copy number variation region single-point variation site set by using the correction parameter set, and estimating the tumor purity according to the corrected copy number variation region single-point variation abnormal read-pair counts.

Optionally, in an embodiment of the present invention, the data obtaining module includes:

a data extraction unit, configured to extract a single-point variation set of a normal sample, a single-point variation set of a tumor sample, and a copy number variation set of the tumor sample from the detection result file;

a copy number variation region unit, configured to determine a copy number variation region single point variation site set of a tumor sample according to an initial position, a termination position, and a length of each copy number variation in the copy number variation set;

and the normal region unit is used for determining the normal region single point variation site set of the tumor sample according to the single point variation set of the normal sample and the single point variation set of the tumor sample.

Optionally, in an embodiment of the present invention, the average read-pair count module includes:

the characteristic extraction unit is used for extracting the characteristics of the normal region single point variation site set to obtain the characteristics of each single point variation site in the normal region single point variation site set;

and the clustering result unit is used for clustering each single point variation locus in the normal region single point variation locus set according to the characteristics of each single point variation locus to obtain a clustering result.

Optionally, in an embodiment of the present invention, the average read-pair count module is further configured to determine the average value of the features of the single-point variation sites in each class according to the clustering result, and use the average value as the normal region single-point variation average read-pair count.

Optionally, in an embodiment of the present invention, the read-pair count correction module includes:

a single-point variation set unit, configured to determine a single-point variation set corresponding to each variation segment in the copy number variation region according to the clone copy number in the copy number variation region;

a ratio value set unit, configured to determine, according to the copy number variation region single-point variation read-pair counts in the copy number variation region single-point variation site set, multiple groups of ratio values for the read-pair counts of each single-point variation in the single-point variation set corresponding to each variation segment, to obtain multiple ratio value sets;

the error summation unit is used for determining, for each ratio value set, the sum of the estimated allele frequency errors of the corresponding single-point variations;

and the correction parameter set unit is used for taking the ratio value set with the smallest sum of estimated allele frequency errors as the correction parameter set.

The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method when executing the program.

The present invention also provides a computer-readable storage medium storing a computer program for executing the above method.

The method corrects the increase in single-point variation site read-pair counts caused by copy number variation and uses the corrected data for tumor purity estimation. It thereby effectively corrects abnormal single-point variation read-pair counts and achieves accurate tumor purity estimation across different coverage depths and different tumor purities.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.

FIG. 1 is a flow chart of a method for estimating tumor purity according to an embodiment of the present invention;

FIG. 2 is a flow chart illustrating obtaining variant information data according to an embodiment of the present invention;

FIG. 3 is a flow chart of generating clustering results in an embodiment of the present invention;

FIG. 4 is a flow chart of generating a correction parameter set in an embodiment of the present invention;

FIG. 5 is a flow chart of a method of tumor purity estimation in an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an apparatus for estimating tumor purity according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a data acquisition module according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of an average read-pair count module according to an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of a read-pair count correction module according to an embodiment of the present invention;

FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The embodiment of the invention provides a tumor purity estimation method and device. It should be noted that the method and device can be used in the financial field or in any other field; their field of application is not limited.

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The present invention is based on the following general consensus in academia:

1. current common detection algorithms align the reads generated by second-generation sequencing to a reference sequence to obtain read data information, and from it determine the different types of chromosomal variation together with information such as variation size and position;

2. copy number variation biases the read count information: the number of read pairs at a single-point variation site in a copy number variation region is multiplied relative to the normal region, which affects the accuracy of tumor purity estimation.

Fig. 1 is a flowchart illustrating a tumor purity estimation method according to an embodiment of the present invention, wherein the tumor purity estimation method according to an embodiment of the present invention is executed by a computer. The method shown in the figure comprises the following steps:

step S1, obtaining a detection result file obtained by checking a normal sample and a tumor sample from a chromosome variation detection tool, and obtaining variation information data by using the detection result file; the variation information data comprises a normal region single point variation site set of the tumor sample and a copy number variation region single point variation site set of the tumor sample.

An existing variation detection tool is run on tumor samples of different purities to detect copy number variations and single-point variations from second-generation paired-end sequencing data. Because tumor tissue is heterogeneous, it contains the initial clone and the subclones produced from it by multiple rounds of selection and expansion; a tumor sample therefore contains normal cells and several subclones. This yields chromosomal variation information data at different purities.

Specifically, a germline single-point variation set of the normal sample, a single-point variation set of the tumor sample and a copy number variation set of the tumor sample are extracted from the detection result file. According to the start position, end position and length of each copy number variation of the tumor sample, the abnormal single-point variation site set of the copy number variation region is extracted from all single-point variations; the remaining single-point variation sites form the single-point variation site set of the normal region.
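The split of single-point variation sites into a copy number variation region set and a normal region set can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the tuple layouts `Site` and `Segment`, the function name `split_sites_by_cnv`, and the inclusive 1-based coordinates are all assumptions; when only a start position and a length are available, the end is assumed to be start + length - 1.

```python
from typing import List, Tuple

Site = Tuple[str, int]          # assumed layout: (chromosome, position)
Segment = Tuple[str, int, int]  # assumed layout: (chromosome, start, end), inclusive


def split_sites_by_cnv(
    sites: List[Site], segments: List[Segment]
) -> Tuple[List[Site], List[Site]]:
    """Return (sites inside any CNV segment, sites in the normal region)."""
    cnv_sites: List[Site] = []
    normal_sites: List[Site] = []
    for chrom, pos in sites:
        # A site is "abnormal" if at least one segment on its chromosome covers it.
        inside = any(
            c == chrom and start <= pos <= end for c, start, end in segments
        )
        (cnv_sites if inside else normal_sites).append((chrom, pos))
    return cnv_sites, normal_sites


segments = [("chr1", 1000, 2000), ("chr2", 500, 800)]
sites = [("chr1", 950), ("chr1", 1500), ("chr2", 800), ("chr3", 600)]
cnv_sites, normal_sites = split_sites_by_cnv(sites, segments)
```

The linear scan over segments is chosen for clarity; for genome-scale inputs an interval index would be the more usual choice.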

Step S2, clustering the normal region single-point variation site set to obtain a clustering result, and determining the normal region single-point variation average read-pair count according to the clustering result.

When a normal sample and a pure tumor sample are mixed, the germline variations of the normal sample provide no information that distinguishes normal cells from pure tumor cells, so the variation sites used for estimating tumor purity are the somatic variations of the tumor sample. The present invention clusters with a Gaussian mixture model (GMM). Specifically, the single-point variations of the tumor normal region correspond to four combined genotypes, and the Gaussian mixture model clusters them into four classes accordingly.

Specifically, the normal region single-point variation site set of the tumor sample is obtained by comparing the single-point variation sets detected in the normal sample and in the tumor sample, and the features of each single-point variation site are extracted. According to the resulting variation site feature set, all single-point variation sites in the normal region are clustered into four classes, and the average of the single-point variation features in each class is calculated from the clustering result, giving the normal region single-point variation average read-pair counts of the four combined-genotype variation sites.
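As an illustration of the clustering step, the following pure-Python sketch fits a Gaussian mixture by expectation-maximization and returns a hard class label per site. The text specifies only that a Gaussian mixture model with four classes is used; the diagonal covariance, the farthest-point initialisation, the variance floor, the function name `gmm_labels` and the choice of the two read counts of a site as its features are assumptions made for this example.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # assumed features of one site: (reads1, reads2)


def sq_dist(p: Sequence[float], q: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def gmm_labels(points: Sequence[Point], k: int = 4, iters: int = 50,
               var_floor: float = 1e-3) -> List[int]:
    """Fit a diagonal-covariance Gaussian mixture by EM; return hard labels."""
    n, dim = len(points), len(points[0])
    # Deterministic farthest-point initialisation of the k component means.
    means = [list(points[0])]
    while len(means) < k:
        far = max(points, key=lambda p: min(sq_dist(p, m) for m in means))
        means.append(list(far))
    # Start from one shared variance that is small relative to the mean spacing.
    gaps = [sq_dist(a, b) for a, b in combinations(means, 2)]
    init_var = max(min(gaps) / 16.0, var_floor) if gaps else 1.0
    variances = [[init_var] * dim for _ in range(k)]
    weights = [1.0 / k] * k
    resp: List[List[float]] = []
    for _ in range(iters):
        # E-step: responsibilities, computed in log space for stability.
        resp = []
        for p in points:
            logs = []
            for j in range(k):
                lp = math.log(weights[j])
                for d in range(dim):
                    v = variances[j][d]
                    lp -= 0.5 * (math.log(2 * math.pi * v)
                                 + (p[d] - means[j][d]) ** 2 / v)
                logs.append(lp)
            top = max(logs)
            probs = [math.exp(l - top) for l in logs]
            total = sum(probs)
            resp.append([q / total for q in probs])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-9:
                continue  # leave a (nearly) empty component unchanged
            weights[j] = nj / n
            for d in range(dim):
                mu = sum(r[j] * p[d] for r, p in zip(resp, points)) / nj
                var = sum(r[j] * (p[d] - mu) ** 2
                          for r, p in zip(resp, points)) / nj
                means[j][d], variances[j][d] = mu, max(var, var_floor)
    return [max(range(k), key=lambda j: r[j]) for r in resp]


# Toy data: four groups of sites with different read-count patterns.
toy_sites = [(60, 0), (62, 1), (58, 2), (30, 30), (31, 29), (29, 32),
             (0, 60), (1, 61), (2, 59), (45, 15), (46, 14), (44, 16)]
labels = gmm_labels(toy_sites)
```

In practice a library implementation with full covariance matrices and multiple restarts would likely be preferred; the sketch only shows the shape of the computation.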

Step S3, determining, according to the normal region single-point variation average read-pair count and the copy number variation region single-point variation site set, the copy number variation region clone copy numbers respectively corresponding to a plurality of clones in the tumor sample.

The tumor sample is assumed to contain normal cells and two subclones $S_1$ and $S_2$. For each single-point variation site in the normal region of the tumor sample, the total number of read pairs aligned to the reference genome sequence is made up of the read pairs from the normal cells, from the tumor $S_1$ clone and from the tumor $S_2$ clone. For each single-point variation site in the copy number variation region, the total number of read pairs aligned to the reference genome sequence exceeds the corresponding normal-region total by the number of duplicated read pairs caused by the copy number variation. The copy number variation segment set contains multiple copy number variation segments; within each segment, every single-point variation site has the same copy number state, i.e., the same copy numbers for subclones $S_1$ and $S_2$. The copy numbers of the subclones are determined in combination with the normal region single-point variation average read-pair count: multiple linear regression analysis is used to solve for the copy number variation region clone copy numbers of the clones $S_1$ and $S_2$ in the tumor sample.
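To make the regression step concrete, the sketch below fits a two-predictor linear model without intercept by solving the normal equations. The regression formula itself is not reproduced in this text, so the model used here is purely an assumption for illustration: the extra read pairs at each site of a segment are taken to be a copy-number-weighted sum of hypothetical per-copy contributions `x1` and `x2` of the two subclones. The function name `fit_copy_numbers` is likewise hypothetical.

```python
from typing import Sequence, Tuple


def fit_copy_numbers(
    x1: Sequence[float], x2: Sequence[float], y: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares fit of y ~ c1 * x1 + c2 * x2 (assumed model, no intercept)."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * v for a, v in zip(x1, y))
    s2y = sum(b * v for b, v in zip(x2, y))
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("predictors are collinear; copy numbers not identifiable")
    # Cramer's rule on the 2x2 normal equations.
    c1 = (s1y * s22 - s12 * s2y) / det
    c2 = (s11 * s2y - s12 * s1y) / det
    return c1, c2


# Toy segment: extra read pairs generated with copy numbers (1, 2).
x1 = [10.0, 12.0, 8.0, 15.0]   # hypothetical per-copy contribution of subclone S1
x2 = [5.0, 9.0, 7.0, 4.0]      # hypothetical per-copy contribution of subclone S2
extra = [a * 1 + b * 2 for a, b in zip(x1, x2)]
c01, c02 = fit_copy_numbers(x1, x2, extra)
```

Because copy numbers are integers, the fitted values would presumably be rounded before further use; that rounding is not shown here.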

Step S4, determining a correction parameter set of each variation segment in the copy number variation region according to the copy number variation region clone copy numbers, correcting the copy number variation region single-point variation abnormal read-pair counts in the copy number variation region single-point variation site set by using the correction parameter set, and estimating the tumor purity according to the corrected copy number variation region single-point variation abnormal read-pair counts.

For each copy number variation segment, the copy numbers of its single-point variations can be determined from the clone copy numbers of the segment. The single-point variation sets of the different copy number variation segments are extracted, the correction parameter set for the single-point variations in each segment is determined, and the abnormal read-pair counts are corrected.
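The selection of a correction parameter set, described earlier as choosing the ratio value set with the smallest sum of estimated allele frequency errors, can be sketched as a simple search over candidates. Everything specific in this example is an assumption: a ratio set is modelled as a pair of divisors applied to the two read counts of a site, the error is the absolute difference from a hypothetical target allele frequency `target_af`, and the names `corrected_af` and `select_correction_set` are illustrative only.

```python
from typing import Sequence, Tuple

SiteReads = Tuple[float, float]  # assumed: (Tumor_Reads1, Tumor_Reads2) of one site
RatioSet = Tuple[float, float]   # hypothetical: divisors for Reads1 and Reads2


def corrected_af(reads: SiteReads, ratios: RatioSet) -> float:
    """Allele frequency after dividing each read count by its ratio."""
    r1 = reads[0] / ratios[0]
    r2 = reads[1] / ratios[1]
    return r2 / (r1 + r2)


def select_correction_set(
    sites: Sequence[SiteReads],
    candidates: Sequence[RatioSet],
    target_af: float,
) -> RatioSet:
    """Return the candidate with the smallest summed allele frequency error."""
    def error_sum(ratios: RatioSet) -> float:
        return sum(abs(corrected_af(s, ratios) - target_af) for s in sites)

    return min(candidates, key=error_sum)


# Toy segment in which the variant allele count appears doubled.
segment_sites = [(40.0, 40.0), (42.0, 38.0)]
candidates = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
best = select_correction_set(segment_sites, candidates, target_af=1 / 3)
```

The sketch assumes every site has a positive total read count; a real implementation would need to guard against empty sites.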

Further, after the read-pair count correction is completed, the corrected single-point variation sites are merged with the variation sites of the normal region, the allele frequency is calculated from the read-pair counts of all sites, the data features (read-pair counts and allele frequencies) are combined, and the purity of the polyclonal tumor sample, with copy number variation taken into account, is estimated with the existing tumor purity estimation method EMpurity.
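The feature assembly that precedes purity estimation might look like the sketch below. It assumes that the allele frequency of a site is the count of read pairs not matching the reference divided by the sum of matching and non-matching read pairs; the type names and the function `build_features` are hypothetical, and the call into EMpurity is deliberately omitted because its interface is not described here.

```python
from typing import List, Sequence, Tuple

ReadPair = Tuple[float, float]        # assumed: (reads matching ref, reads not matching)
Feature = Tuple[float, float, float]  # (reads1, reads2, allele frequency)


def allele_frequency(reads1: float, reads2: float) -> float:
    """Assumed definition: non-reference read pairs over all read pairs."""
    total = reads1 + reads2
    return reads2 / total if total > 0 else 0.0


def build_features(
    normal_sites: Sequence[ReadPair], corrected_cnv_sites: Sequence[ReadPair]
) -> List[Feature]:
    """Merge both site sets and attach an allele frequency to each site."""
    merged = list(normal_sites) + list(corrected_cnv_sites)
    return [(r1, r2, allele_frequency(r1, r2)) for r1, r2 in merged]


features = build_features([(30.0, 30.0)], [(40.0, 20.0)])
```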

As an embodiment of the present invention, as shown in fig. 2, obtaining variant information data by using the detection result file includes:

step S11, extracting a single point variation set of a normal sample, a single point variation set of a tumor sample and a copy number variation set of the tumor sample from the detection result file;

step S12, determining a copy number variation region single point variation site set of the tumor sample according to the initial position, the termination position and the length of each copy number variation in the copy number variation set;

step S13, determining a set of normal region single point variation sites of the tumor sample according to the set of normal sample single point variations and the set of tumor sample single point variations.

A germline single-point variation set of the normal sample, a single-point variation set of the tumor sample and a copy number variation set are extracted from the result file. According to the start position, end position and length of each copy number variation of the tumor sample, the abnormal single-point variation site set of that segment is extracted from all single-point variations, and the remaining single-point variation sites form the single-point variation site set of the normal region. Each set includes the read information required later, as follows:

1) Normal_Reads1: the number of read pairs at a single-point variation site in the normal region of the tumor sample whose base matches the reference genome sequence;

2) Normal_Reads2: the number of read pairs at a single-point variation site in the normal region of the tumor sample whose base does not match the reference genome sequence;

3) Normal_Sum: the total number of read pairs aligned to the reference genome sequence at a single-point variation site in the normal region of the tumor sample;

4) Tumor_Reads1: the number of read pairs at a single-point variation site in the copy number variation region of the tumor sample whose base matches the reference genome sequence;

5) Tumor_Reads2: the number of read pairs at a single-point variation site in the copy number variation region of the tumor sample whose base does not match the reference genome sequence;

6) Tumor_Sum: the total number of read pairs aligned to the reference genome sequence at a single-point variation site in the copy number variation region of the tumor sample.

As an embodiment of the present invention, as shown in fig. 3, the clustering the set of single point mutation sites in the normal region to obtain a clustering result includes:

step S21, extracting the characteristics of the normal region single point variation site set to obtain the characteristics of each single point variation site in the normal region single point variation site set;

and step S22, clustering the single point variation sites in the normal region single point variation site set according to the characteristics of the single point variation sites to obtain a clustering result.

The single-point variation sets detected in the normal sample and in the tumor sample are compared to obtain the single-point variation site set $I = \{i \mid i = 1, 2, \dots, k\}$ of the normal region of the tumor sample, and the features $(a^i, b^i)$ of each single-point variation site are extracted. According to the resulting variation site feature set, all single-point variation sites in the normal region are clustered into four classes, giving the clustering result.

In this embodiment, determining the normal region single-point variation average read-pair count according to the clustering result includes: determining the average value of the features of the single-point variation sites in each class according to the clustering result, and taking the average value as the normal region single-point variation average read-pair count.

The features of the single-point variations in each class are averaged according to the clustering result to obtain the mean values $(\bar{a}, \bar{b})$ of the four combined-genotype variation sites, i.e., the means of Normal_Reads1 and Normal_Reads2, which serve as the normal region single-point variation average read-pair counts.
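Given a class label for every site, the per-class averaging can be written in a few lines. The sketch below assumes each site's features are its pair of read counts and that labels are arbitrary integers produced by some clustering step; the function name `class_means` is illustrative.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple


def class_means(
    features: Sequence[Tuple[float, float]], labels: Sequence[int]
) -> Dict[int, Tuple[float, float]]:
    """Average the (a, b) features of the sites in each class."""
    sums: Dict[int, list] = defaultdict(lambda: [0.0, 0.0, 0])
    for (a, b), label in zip(features, labels):
        acc = sums[label]
        acc[0] += a   # running sum of the first feature
        acc[1] += b   # running sum of the second feature
        acc[2] += 1   # number of sites in the class
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}


means = class_means([(60, 0), (62, 2), (30, 30), (32, 28)], [0, 0, 1, 1])
```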

Further, the Normal_Sum of each single-point variation site in the normal region of the tumor sample is made up of the read-pair counts $(n_1, n_2, n_3)$ from the normal cells, the tumor $S_1$ clone and the tumor $S_2$ clone. The Tumor_Sum of each single-point variation site in the copy number variation region (denoted $n'$) exceeds Normal_Sum by the number of duplicated read pairs $n_0$ caused by copy number variation. The copy number variation segment set is denoted $D = \{d_j \mid j = 1, \dots, t\}$, where $t$ is the number of copy number variations. For each copy number variation segment $d_j$, every single-point variation site in the segment has the same copy number state, with the same copy numbers $(c_{01}, c_{02})$ for subclones $S_1$ and $S_2$; multiple linear regression analysis is used to solve for the copy numbers.

Specifically, the tumor sample contains two clones S₁ and S₂ that are genetically related: a copy number variation of clone S₁ is passed on to clone S₂ with the copy number unchanged, whereas a copy number variation of clone S₂ does not affect clone S₁. The possibility of copy number variation on each clone therefore needs to be considered. Considering that only one of the two strands of the genome undergoes copy number variation, the relationship between the copy number values of the two clones and the number of read pairs added by the copy number variation is as follows:

The two sets of mean read-pair numbers of the four classes of combined-genotype single-point variation sites in the normal region, obtained by GMM clustering, are recorded as

Analysis of Normal_Reads1 and Normal_Reads2 of the single-point variation sites of the four genotypes shows that

represents the number of read pairs of the normal sample,

represents the number of read pairs of tumor clone S₂, and

represents the number of read pairs of tumor clone S₁. From the ratios of the read-pair numbers of the normal cells and the two clones, the proportions (φ₁, φ₂, φ₃) of the normal sample, tumor clone S₁ and tumor clone S₂ in the total number of single-point variation read pairs in the normal region can be approximately calculated.

The number of read pairs aligned to the reference sequence in the normal sample, at the position of a variation site of the copy number variation region, is taken as the total number of read pairs of that site without copy number variation interference, and the number of read pairs of each part is solved:

For each single-point variation site i (i = 1, …, m) in each copy number variation segment d_j, where m is the number of abnormal single-point variations within d_j, the estimated total number of read pairs is

The true total number of read pairs is n_i′ (i = 1, …, m), which gives the following regression equation:

For segment d_j, the regression residual S of the copy number segment d_j is calculated according to the formula. Loop iteration over (c₀₁, c₀₂) ∈ (0, 5) gives the regression residuals under different copy numbers. The recorded residuals are compared, and the (c₀₁, c₀₂) at which the residual is minimum are the copy numbers of the two clones S₁ and S₂.
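The loop iteration over candidate copy numbers can be sketched as a small grid search. This is only an illustration under stated assumptions: the regression equation is not reproduced in this text, so `estimated_total` below is a hypothetical stand-in model (each clone contributes in proportion to (2 + c)/2, reflecting a variation on one of the two strands), the candidates are taken to be the integers 0 to 5, and all function and variable names are invented for the example.

```python
from itertools import product

def estimated_total(n, phi, c01, c02):
    """Hypothetical model of the read pairs expected at a site whose
    total without copy number variation interference is n."""
    phi1, phi2, phi3 = phi  # shares of normal cells, clone S1, clone S2
    return n * (phi1 + phi2 * (2 + c01) / 2 + phi3 * (2 + c02) / 2)

def best_copy_numbers(n_undisturbed, n_observed, phi, candidates=range(0, 6)):
    """Return the (c01, c02) pair with the smallest sum of squared residuals."""
    best = None
    for c01, c02 in product(candidates, repeat=2):
        residual = sum(
            (obs - estimated_total(n, phi, c01, c02)) ** 2
            for n, obs in zip(n_undisturbed, n_observed)
        )
        if best is None or residual < best[0]:
            best = (residual, c01, c02)
    return best[1], best[2]

# Synthetic segment generated with copy numbers (2, 3), then recovered.
phi = (0.4, 0.35, 0.25)
n_undisturbed = [40, 52, 60]
n_observed = [estimated_total(n, phi, 2, 3) for n in n_undisturbed]
result = best_copy_numbers(n_undisturbed, n_observed, phi)
```

An exhaustive search is practical here because only 36 candidate pairs exist per segment.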

As an embodiment of the present invention, as shown in fig. 4, determining a correction parameter set for each variant segment in the copy number variant region according to the clone copy number of the copy number variant region comprises:

step S41, determining a single point variation set corresponding to each variation section in the copy number variation region according to the clone copy number of the copy number variation region;

step S42, determining multiple groups of ratios of the single point variation read logarithms in the single point variation set corresponding to each variation segment according to the single point variation read logarithms in the copy number variation region single point variation site set, and obtaining multiple ratio value sets;

step S43, determining the sum of the estimated allele frequency errors of the corresponding single-point variation according to the ratio value set;

and step S44, when the sum of the estimated allele frequency errors is minimum, using the ratio value set corresponding to the minimum sum of the estimated allele frequency errors as a correction parameter set.

The joint genotypes of the single-point variation sites are divided into four types. Combining the two types of copy number variation, the single-point variations of the two subclones and the four joint genotypes, single-point variations under the interference of 8 kinds of copy number variation can occur. The proportions of the two kinds of read pairs of the two clones are denoted φ₂₁, φ₃₁, φ₂₂ and φ₃₂ respectively; φ₀₁ and φ₀₂ denote, within the read pairs added by the copy number variation, the proportion of read pairs supporting normal and the proportion of read pairs supporting variation, respectively. The possible values of each parameter under different copy numbers and genotypes are shown in Table 1.

TABLE 1

From (n₁, n₂, n₃) of single-point variation i and the actual number of read pairs n′, the number of read pairs n₀ added by the copy number variation is determined, and the proportions of the parts (φ′₁, φ′₂, φ′₃, φ₀) are solved as follows:
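A minimal sketch of this decomposition follows. It assumes that n′ = n₁ + n₂ + n₃ + n₀, as described for Tumor_Sum above, and that each proportion is the corresponding part divided by n′; the second point is an assumption, since the formula itself is not reproduced in this text. The function name and the example numbers are hypothetical.

```python
def split_read_pairs(n_parts, n_actual):
    """Return n0 and the proportions (phi1', phi2', phi3', phi0) of the actual total.

    n_parts: (n1, n2, n3) read pairs from normal cells, clone S1 and clone S2.
    n_actual: actual read pairs n' observed at the site.
    """
    n0 = n_actual - sum(n_parts)  # read pairs attributed to the copy number variation
    if n_actual <= 0 or n0 < 0:
        raise ValueError("n' must be positive and at least n1 + n2 + n3")
    proportions = tuple(n / n_actual for n in n_parts) + (n0 / n_actual,)
    return n0, proportions

n0, proportions = split_read_pairs((20, 12, 8), 50)
```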

The numbers of read pairs supporting normal and supporting variation, N′_i1 and N′_i2, are extracted for single-point variation i, and the true allele frequency values (P′_i1, P′_i2), i.e. the allele frequency VAF, are determined:
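For illustration, the allele frequencies can be computed as below. The sketch assumes that P′_i1 and P′_i2 are the shares of normal-supporting and variation-supporting read pairs in their sum, which is the usual definition of a variant allele frequency; the function name is hypothetical.

```python
def allele_frequencies(n_normal, n_variant):
    """Return (P1, P2): the shares of read pairs supporting normal and variation."""
    total = n_normal + n_variant
    if total == 0:
        raise ValueError("the site has no supporting read pairs")
    return n_normal / total, n_variant / total

p1, p2 = allele_frequencies(30, 10)
```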

Based on the mean read-pair numbers of the different joint genotypes in the normal region and on (N′_i1, N′_i2), some genotypes are excluded, and the values of the remaining parameters in Table 1 are substituted into the following formula to obtain the estimated allele frequency value of the single-point variation site:

Using the mean square error, the error between the estimated allele frequency of the single-point variation

and the true allele frequency (P′_i1, P′_i2) is calculated, and for the m abnormal single-point variations within copy number variation d_j the sum of the allele frequency errors of all single-point variations is found:

All possible correction parameter sets are iterated according to the mean square error criterion to obtain the parameters that minimize the error sum S_VAF, i.e. the sought parameter set (φ₀₁, φ₀₂, φ₂₁, φ₂₂, φ₃₁, φ₃₂); the corresponding genotype is the genotype of single-point variation i.
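The selection of a parameter set by minimum error sum can be sketched as follows. Table 1 and the estimation formula are not reproduced in this text, so the candidate sets and the estimator are supplied by the caller; `toy_estimate`, the genotype labels and the `variant_share` field are hypothetical placeholders, not the parameters of Table 1.

```python
def select_parameter_set(sites, candidates, estimate_vaf):
    """Pick the candidate whose estimated allele frequencies best match the observed ones.

    sites: list of (N1, N2) observed read-pair numbers in one segment.
    candidates: dict mapping a genotype label to a parameter set.
    estimate_vaf: callable(site, params) -> (P1_est, P2_est), supplied by the caller.
    """
    best_label, best_error = None, float("inf")
    for label, params in candidates.items():
        error = 0.0
        for n1, n2 in sites:
            p1, p2 = n1 / (n1 + n2), n2 / (n1 + n2)  # observed allele frequencies
            e1, e2 = estimate_vaf((n1, n2), params)
            error += (p1 - e1) ** 2 + (p2 - e2) ** 2  # squared error for this site
        if error < best_error:
            best_label, best_error = label, error
    return best_label, candidates[best_label], best_error

def toy_estimate(site, params):
    """Placeholder estimator: ignores the site and returns a fixed split."""
    share = params["variant_share"]
    return 1.0 - share, share

candidates = {
    "genotype_A": {"variant_share": 0.25},
    "genotype_B": {"variant_share": 0.50},
}
sites = [(30, 10), (29, 11)]
label, params, error = select_parameter_set(sites, candidates, toy_estimate)
```

Passing the estimator as an argument keeps the search loop independent of whichever formula is actually used.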

The above steps are performed for the t copy number variations in the copy number variation segment set D, giving the parameters and the corresponding genotype of each single-point variation in the abnormal single-point variation set T. The read-pair numbers of each single-point variation in T are then corrected with the obtained parameters, giving the corrected numbers of read pairs supporting normal and supporting variation.
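One possible reading of the correction is sketched below. It assumes that correction means subtracting from each observed count the share of the n₀ added read pairs given by φ₀₁ (supporting normal) and φ₀₂ (supporting variation); the text does not state the correction formula, so this is an assumption, and the function name and numbers are hypothetical.

```python
def correct_read_pairs(n1_observed, n2_observed, n0, phi01, phi02):
    """Remove the read pairs attributed to the copy number variation.

    phi01 / phi02: shares of the n0 added read pairs that support
    normal / variation (assumed to sum to 1).
    """
    n1 = max(n1_observed - phi01 * n0, 0.0)  # corrected pairs supporting normal
    n2 = max(n2_observed - phi02 * n0, 0.0)  # corrected pairs supporting variation
    return n1, n2

n1_corrected, n2_corrected = correct_read_pairs(36, 14, n0=10, phi01=0.6, phi02=0.4)
```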

Further, after the read-pair number correction is completed, the corrected single-point variation sites are pooled with the variation sites in the normal region, and the allele frequencies (P₁, P₂) are calculated from the read-pair numbers (N₁, N₂, N) of all sites. By integrating data features including read-pair numbers and allele frequencies, the purity of the polyclonal tumor sample is estimated, with copy number variation taken into account, using the established tumor purity estimation method EMpurity.

As an embodiment of the present invention, the specific process of estimating the purity of the tumor sample shown in fig. 5 includes:

Step S100, data preprocessing. Variation information data are obtained: an existing variation detection tool is run on tumor samples of different purities, and copy number variations and single-point variations are detected from the second-generation paired-end sequencing data.

Step S200, clustering the normal single-point variation sites. The normal single-point variation sites are clustered, and the average read logarithm of the single-point variation in the normal region is determined.

In particular, the present invention uses a Gaussian mixture model (GMM) for clustering: the single-point variations in the normal region of the tumor are clustered by the Gaussian mixture model into four classes, corresponding to four combined genotypes.
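To make the clustering step concrete, here is a minimal sketch of expectation-maximization for a Gaussian mixture, written in plain Python. It is an illustration and not the implementation used by the invention: it assumes diagonal covariances, a fixed number of iterations and caller-supplied starting centres, and the name `fit_gmm` and the example data are hypothetical. In practice a library implementation such as scikit-learn's `GaussianMixture` could be used instead.

```python
import math

def fit_gmm(points, init_means, n_iter=30):
    """EM for a Gaussian mixture with diagonal covariances.

    points: list of (a, b) feature tuples; init_means: one starting centre per class.
    Returns (means, labels).
    """
    k, dim = len(init_means), len(points[0])
    means = [list(m) for m in init_means]
    variances = [[1.0] * dim for _ in range(k)]
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each class for each point (log-space for stability).
        resp = []
        for x in points:
            logp = []
            for c in range(k):
                lp = math.log(weights[c])
                for d in range(dim):
                    var = variances[c][d]
                    lp -= 0.5 * (math.log(2 * math.pi * var) + (x[d] - means[c][d]) ** 2 / var)
                logp.append(lp)
            top = max(logp)
            probs = [math.exp(v - top) for v in logp]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate weights, means and variances from the responsibilities.
        for c in range(k):
            nc = max(sum(r[c] for r in resp), 1e-12)
            weights[c] = max(nc / len(points), 1e-12)
            for d in range(dim):
                means[c][d] = sum(r[c] * x[d] for r, x in zip(resp, points)) / nc
                variances[c][d] = max(
                    sum(r[c] * (x[d] - means[c][d]) ** 2 for r, x in zip(resp, points)) / nc,
                    1e-6,
                )
    labels = [max(range(k), key=lambda c: r[c]) for r in resp]
    return [tuple(m) for m in means], labels

# Hypothetical data: four well-separated groups of (a, b) read-pair features.
centres = [(60, 5), (50, 15), (40, 25), (30, 35)]
offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
points = [(cx + dx, cy + dy) for cx, cy in centres for dx, dy in offsets]
means, labels = fit_gmm(points, init_means=[(59, 5), (49, 15), (39, 25), (29, 35)])
```

The returned `means` correspond to the per-class averages used as the average read logarithm of the normal region.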

Furthermore, the single-point variation sets detected in the normal sample and in the tumor sample are compared to obtain the set of single-point variation sites in the normal region of the tumor sample, and the features of each single-point variation site are extracted. According to the resulting set of variation site features, all single-point variation sites in the normal region are clustered into four classes, and the features of the single-point variation sites in each class are averaged according to the clustering result, giving the average read logarithm of the single-point variation in the normal region for the variation sites of the four different combined genotypes.

Step S300, analyzing the number of read pairs added at the abnormal single-point variation sites. The number of read pairs added by the copy number variation is solved from the read-pair numbers of the single-point variation and the actual number of read pairs.

Step S400, iterating over different sets of clone copy numbers. The copy numbers of the multiple clones are determined by regression residual analysis: the regression residual of the copy number segment is calculated, and loop iteration gives the regression residuals under different copy numbers.

Step S500, judging whether the error is minimum. If not, return to step S400; if yes, go to step S600. The recorded regression residuals are compared, and the copy numbers at which the residual is minimum are the copy numbers of the corresponding clones.

Step S600, iterating over different genotype parameter sets. Different correction parameter sets correspond to different genotypes.

Step S700, judging whether the error is minimum. If not, return to step S600; if yes, go to step S800. All correction parameter sets are iterated according to the mean square error criterion to obtain the parameters that minimize the error sum, i.e. the sought correction parameter set; the corresponding genotype is the genotype of the single-point variation.

Step S800, correcting the abnormal single-point variation read logarithm. The read-pair numbers of each single-point variation in the abnormal single-point variation set are corrected with the obtained parameters, giving the corrected numbers of read pairs supporting normal and supporting variation. Tumor purity is then estimated using the corrected read logarithm.

The method estimates the average read pair number of normal single-point variation sites by a clustering method, then determines the copy number information of the segment by applying multivariate regression analysis to the single-point variation sites of different copy number variation sections, obtains the genotype of the single-point variation sites by a mean square error criterion, finally corrects the increased number of the single-point variation site read pairs caused by copy number variation, and uses the corrected data in the existing method to estimate the tumor purity. The invention solves the problem that the copy number variation affects the deviation of the number information of tumor cells, solves the problem that the coexistence of multiple subclones affects the estimation of tumor purity, and solves the problem that the different copy number variations on different subclones cause the inaccurate estimation of purity. The invention effectively corrects the number of abnormal single-point variation read pairs, and accurately estimates the tumor purity under different coverage degrees and different tumor purities by using the corrected data.

Fig. 6 is a schematic structural diagram of an apparatus for estimating tumor purity according to an embodiment of the present invention, wherein the apparatus includes:

a data obtaining module 10, configured to obtain a detection result file obtained by examining a normal sample and a tumor sample from a chromosome variation detection tool, and obtain variation information data by using the detection result file; the variation information data comprises a normal region single point variation site set of the tumor sample and a copy number variation region single point variation site set of the tumor sample.

And an average read logarithm module 20, configured to cluster the normal region single-point variation site set to obtain a clustering result, and determine a normal region single-point variation average read logarithm according to the clustering result.

When the normal sample and the pure tumor sample are mixed, the genetic variation of the normal sample cannot provide effective information for distinguishing normal cells from pure tumor cells, so the variation sites used for estimating the tumor purity are the somatic variations of the tumor sample. The invention uses a Gaussian mixture model for clustering: the single-point variations in the normal region of the tumor are clustered by the Gaussian mixture model into four classes, corresponding to four combined genotypes.

A clone copy number module 30, configured to determine, according to the average read logarithm of single point variation in the normal region and the set of single point variation sites in the copy number variation region, clone copy numbers of the copy number variation region corresponding to the multiple clones in the tumor sample.

And the log-reading correction module 40 is configured to determine a correction parameter set of each variation section in the copy number variation region according to the copy number variation region clone copy number, correct the copy number variation region single-point variation read-difference log in the copy number variation region single-point variation site set by using the correction parameter set, and estimate tumor purity according to the corrected copy number variation region single-point variation read-difference log.

As an embodiment of the present invention, as shown in fig. 7, the data acquisition module 10 includes:

a data extracting unit 11, configured to extract a single-point variation set of a normal sample, a single-point variation set of a tumor sample, and a copy number variation set of the tumor sample from the detection result file;

a copy number variation region unit 12, configured to determine a copy number variation region single point variation site set of a tumor sample according to a start position, an end position, and a length of each copy number variation in the copy number variation set;

the normal region unit 13 is configured to determine a normal region single point variation site set of the tumor sample according to the single point variation set of the normal sample and the single point variation set of the tumor sample.
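The division of sites performed by units 12 and 13 can be sketched as below. This is a hedged illustration: it assumes that sites also detected in the normal sample are discarded (so that only somatic sites remain), that a copy number variation is given as a start position and a length with an inclusive end, and that sites are identified by chromosome and position. The function name and the data layout are hypothetical and do not reflect the format of any particular detection tool.

```python
def split_sites(tumor_sites, normal_sites, cnv_segments):
    """Split tumor-only sites into normal-region and copy-number-variation-region sets.

    tumor_sites / normal_sites: iterables of (chrom, pos).
    cnv_segments: iterable of (chrom, start, length); the end is start + length - 1.
    """
    seen_in_normal = set(normal_sites)
    segments = list(cnv_segments)
    normal_region, cnv_region = [], []
    for chrom, pos in tumor_sites:
        if (chrom, pos) in seen_in_normal:
            continue  # assumption: sites also found in the normal sample are dropped
        in_cnv = any(
            chrom == c and start <= pos <= start + length - 1
            for c, start, length in segments
        )
        (cnv_region if in_cnv else normal_region).append((chrom, pos))
    return normal_region, cnv_region

tumor = [("chr1", 100), ("chr1", 250), ("chr2", 50), ("chr1", 400)]
normal = [("chr2", 50)]
cnvs = [("chr1", 200, 100)]  # covers chr1:200-299
normal_region, cnv_region = split_sites(tumor, normal, cnvs)
```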

As an embodiment of the present invention, as shown in fig. 8, the average read logarithm module 20 includes:

a feature extraction unit 21, configured to perform feature extraction on the normal region single point variation site set to obtain features of each single point variation site in the normal region single point variation site set;

and a clustering result unit 22, configured to cluster each single point variation site in the normal region single point variation site set according to a characteristic of each single point variation site, so as to obtain a clustering result.

As an embodiment of the present invention, the average read logarithm module is further configured to determine an average value of the features of the single-point mutation sites in each class according to the clustering result, and use the average value as the average read logarithm of the single-point mutation in the normal region.

As an embodiment of the present invention, as shown in fig. 9, the read log correction module 40 includes:

a single point variation set unit 41, configured to determine a single point variation set corresponding to each variation segment in the copy number variation region according to the copy number of the clone in the copy number variation region;

a ratio value set unit 42, configured to determine, according to the copy number variation region single point variation read logarithm in the copy number variation region single point variation site set, multiple groups of ratio values of each single point variation read logarithm in the single point variation set corresponding to each variation segment, so as to obtain multiple ratio value sets;

an error summation unit 43, configured to determine a sum of estimated allele frequency errors of the corresponding single-point variations according to the ratio set;

and a correction parameter set unit 44, configured to, when the sum of the estimated allele frequency errors is minimum, use a ratio set corresponding to the minimum sum of the estimated allele frequency errors as a correction parameter set.

Based on the same inventive concept as the tumor purity estimation method, the invention also provides the tumor purity estimation device described above. Because the device solves the problem on a principle similar to that of the method, the implementation of the device may refer to the implementation of the method, and repeated details are not described again.

As shown in fig. 10, the electronic device 600 may further include: a communication module 110, an input unit 120, an audio processor 130, a display 160 and a power supply 170. It is noted that the electronic device 600 does not necessarily include all of the components shown in fig. 10; furthermore, the electronic device 600 may also comprise components not shown in fig. 10, for which reference may be made to the prior art.

As shown in fig. 10, the central processor 100, sometimes referred to as a controller or operational control, may include a microprocessor or other processor device and/or logic device, the central processor 100 receiving input and controlling the operation of the various components of the electronic device 600.

The memory 140 may be, for example, one or more of a buffer, a flash memory, a hard drive, a removable media, a volatile memory, a non-volatile memory, or other suitable device. The information relating to the failure may be stored, and a program for executing the information may be stored. And the central processing unit 100 may execute the program stored in the memory 140 to realize information storage or processing, etc.

The input unit 120 provides input to the central processor 100. The input unit 120 is, for example, a key or a touch input device. The power supply 170 is used to provide power to the electronic device 600. The display 160 is used to display an object to be displayed, such as an image or a character. The display may be, for example, an LCD display, but is not limited thereto.

The memory 140 may be a solid state memory such as Read Only Memory (ROM), Random Access Memory (RAM), a SIM card, or the like. There may also be a memory that holds information even when power is off, can be selectively erased, and is provided with more data, an example of which is sometimes called an EPROM or the like. The memory 140 may also be some other type of device. Memory 140 includes buffer memory 141 (sometimes referred to as a buffer). The memory 140 may include an application/function storage section 142, and the application/function storage section 142 is used to store application programs and function programs or a flow for executing the operation of the electronic device 600 by the central processing unit 100.

The memory 140 may also include a data store 143, the data store 143 for storing data, such as contacts, digital data, pictures, sounds, and/or any other data used by the electronic device. The driver storage portion 144 of the memory 140 may include various drivers of the electronic device for communication functions and/or for performing other functions of the electronic device (e.g., messaging application, address book application, etc.).

The communication module 110 is a transmitter/receiver 110 that transmits and receives signals via an antenna 111. The communication module (transmitter/receiver) 110 is coupled to the central processor 100 to provide an input signal and receive an output signal, which may be the same as in the case of a conventional mobile communication terminal.

Based on different communication technologies, a plurality of communication modules 110, such as a cellular network module, a Bluetooth module, and/or a wireless local area network module, may be provided in the same electronic device. The communication module (transmitter/receiver) 110 is also coupled to a speaker 131 and a microphone 132 via an audio processor 130 to provide audio output via the speaker 131 and receive audio input from the microphone 132, thereby implementing general telecommunications functions. The audio processor 130 may include any suitable buffers, decoders, amplifiers and so forth. In addition, the audio processor 130 is also coupled to the central processor 100, so that sound can be recorded locally through the microphone 132 and locally stored sound can be played through the speaker 131.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The principle and the implementation mode of the invention are explained by applying specific embodiments in the invention, and the description of the embodiments is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. A method of tumor purity estimation, the method comprising:

2. The method of claim 1, wherein the obtaining variant information data using the test result file comprises:

3. The method of claim 1, wherein the clustering the set of normal region single point mutation sites to obtain a clustering result comprises:

4. The method of claim 3, wherein determining the normal region single point variation average read logarithm according to the clustering result comprises:

5. The method of claim 1, wherein determining a correction parameter set for each variant segment in the copy number variation region based on the copy number variation region clone copy numbers comprises:

6. An apparatus for estimating tumor purity, the apparatus comprising:

7. The apparatus of claim 6, wherein the data acquisition module comprises:

8. The apparatus of claim 6, wherein the average read logarithm module comprises:

9. The apparatus of claim 8, wherein the average read logarithm module is further configured to determine an average value of the features of the single-point variation sites in each class according to the clustering result, and use the average value as the average read logarithm of the single-point variation in the normal region.

10. The apparatus of claim 6, wherein the read log correction module comprises:

11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 5 when executing the computer program.

12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for executing the method of any one of claims 1 to 5.