WO2018201805A1 - Method and device for use in calculating cancer sample purity and chromosome ploidy

Info

Publication number: WO2018201805A1
Application number: PCT/CN2018/078908
Authority: WO
Inventors: 黄宇; 罗志辉; 苏瑶; 范新平
Original assignee: 中国科学院上海药物研究所
Priority date: 2017-05-05
Filing date: 2018-03-14
Publication date: 2018-11-08
Also published as: CN108804876B; CN108804876A

Abstract

A method and a device for calculating cancer sample purity and chromosome ploidy. The provided hierarchical mixed Gaussian model enables rapid and accurate calculation of cancer sample purity and chromosome ploidy, reducing the time and economic cost of purity estimation while improving the accuracy of the results. The present invention has broad application prospects in the calculation of cancer sample purity and chromosome ploidy.

Description

Method and apparatus for calculating cancer sample purity and chromosome ploidy

Technical field

The present invention is in the field of cancer research, and in particular relates to a method and apparatus for calculating cancer cell purity and intracellular chromosome ploidy in a cancer sample.

Background art

Cancer research is an important area of biomedical research and has a major impact on human health. Cancer is a malignant proliferation of cells, and because of its complex pathology it remains difficult to overcome. Second-generation sequencing makes it possible to detect a patient's genetic information quickly. However, sequencing requires samples extracted from patient tissue, and cancer tissue usually does not consist of cancer cells alone; it also has a rich microenvironment. The cancer cell microenvironment refers to the non-cancerous cells surrounding or accompanying the cancer cells. When a cancer sample is extracted, these cells are extracted and sequenced together with the cancer cells [1]. The proportion of cancer cells in a cancer sample is defined as the purity of the cancer sample. The cancer genome usually contains a large number of somatic copy number variations, which are mainly caused by amplification or deletion of genomic fragments. Identifying copy number changes of genomic fragments in a particular tumor genome is an important topic in cancer genome research. Accurately identifying fragment copy numbers is challenging because the copy number observed in a cancer sample is determined by a mixture of two factors: the purity of the cancer sample, that is, the proportion of cancer cells in the sample, and the chromosome ploidy [2, 3]. Traditional methods for determining cancer sample purity and ploidy rely on experimental techniques such as quantitative image analysis [4] or single-cell sequencing [5], but in large projects such approaches cost a great deal of manpower, money, and time.
With the development of sequencing technology, the rapid growth of sequencing data, and the accumulation of sequencing data analysis techniques, various algorithms for estimating cancer sample purity have been proposed and corresponding software has been developed.

Calculation methods based on copy number variation of genomic fragments and methods based on allele frequency (the B-allele frequency at mutation sites) have been proposed. The allele frequency methods include PurityEst [6] and PurBayes [7]; they rely on the fact that allele frequencies differ depending on the purity of the tumor sample and the ploidy of the tumor genome. Methods based on copy number variation include CNAnorm [8], THetA [9], and ABSOLUTE [10]. However, both kinds of method have problems to varying degrees. Methods that use allele frequency have large errors because of the limited amount of data. Methods that use copy number variation are stable, but they cannot separate the mutually compensating effects of sample purity and chromosome ploidy, that is, the identifiability problem. None of the above copy-number-based software solves this problem: CNAnorm tends to choose the solution whose chromosome ploidy is closest to diploid, ABSOLUTE incorporates other empirical data, and THetA directly lists all possible results.

A better solution is to calculate the purity of the tumor sample by combining allele frequency information with fragment copy number information. PyLOH [11] and Patchwork [12] use both the frequency information of heterozygous SNV (single nucleotide variation) sites on the genome and the copy number of genomic fragments. PyLOH solves the identifiability problem to a certain extent and can give a more reasonable solution, but its accuracy is poor, especially when subclones are present in the genome. Patchwork uses both types of information, but its intermediate genotype calculation requires manual identification; manual judgments lack accuracy, and such semi-automated software is inconvenient to apply.

How to make full use of the existing second-generation sequencing data to accurately calculate the purity of cancer samples and the ploidy of cancer cell genome remains a challenging task.

References

1. Junttila M R, de Sauvage F J. Influence of tumour micro-environment heterogeneity on therapeutic response [J]. Nature, 2013, 501 (7467): 346-354.

2. Carter S L, Cibulskis K, Helman E, et al. Absolute quantification of somatic DNA alterations in human cancer [J]. Nature Biotechnology, 2012, 30(5): 413-21.

3. Oesper L, Mahmoody A, Raphael B J. Inferring intra-tumor heterogeneity from high-throughput DNA sequencing data [J]. Genome Biology, 2013, 14(7): R80.

4. Yuan Y, Failmezger H, Rueda O M, et al. Quantitative Image Analysis of Cellular Heterogeneity in Breast Tumors Complements Genomic Profiling [J]. Science Translational Medicine, 2012, 4 (157): 157ra143.

5. Navin N, Kendall J, Troge J, et al. Tumour evolution inferred by single-cell sequencing [J]. Nature, 2011, 472 (7341): 90-4.

6. Su X, Zhang L, Zhang J, et al. PurityEst: estimating purity of human tumor samples using next-generation sequencing data [J]. Bioinformatics, 2012, 28(17): 2265-2266.

7. Larson N B. PurBayes: estimating tumor cellularity and subclonality in next-generation sequencing data [J]. Bioinformatics, 2013, 29(15): 1888-9.

8. Gusnanto A, Wood H M, Pawitan Y, et al. Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data [J]. Bioinformatics, 2012, 28(1): 40-47.

9. Oesper L, Mahmoody A, Raphael B J. Inferring intra-tumor heterogeneity from high-throughput DNA sequencing data [J]. Genome Biology, 2013, 14(7): R80.

10. Carter S L, Cibulskis K, Helman E, et al. Absolute quantification of somatic DNA alterations in human cancer [J]. Nature Biotechnology, 2012, 30(5): 413-21.

11. Li Y, Xie X. Deconvolving tumor purity and ploidy by combining copy number alterations and loss of heterozygosity [J]. Bioinformatics, 2015, 30(4): 2121.

12. Mayrhofer M, Dilorenzo S, Isaksson A. Patchwork: allele-specific copy number analysis of whole-genome sequenced tumor tissue [J]. Genome Biology, 2013, 14(3): R24.

Summary of the invention

Definition of Terms

For a better understanding of the invention, the relevant terms are explained below:

Whole Genome Sequencing (WGS): Whole genome sequencing using second generation sequencing technology.

Read: A sequence fragment generated by a high-throughput sequencing platform.

Sequencing depth: The ratio of the total number of bases (bp) obtained by sequencing to the size of the genome; it is one of the indicators used to evaluate the amount of sequencing.

Window: A genomic fragment of fixed length, where that length is the window size. In this method the window size can be set freely by the user and is usually several hundred bases. A large genomic fragment S can contain a large number of windows.

Tumor Read Enrichment (TRE): The read enrichment $e_s$ of a genomic fragment s in the cancer sample, that is, the ratio of the fraction of cancer-sample reads covering fragment s to the fraction of matched-normal-sample reads covering the same fragment. It is defined by formula (1):

$$e_s = \frac{N_s^t / N_t}{N_s^n / N_n} \tag{1}$$

In formula (1), $N_s^t$ and $N_s^n$ denote the number of reads covering fragment s in the cancer sample and in the matched normal sample, respectively; $N_t$ is the total number of reads obtained by whole genome sequencing of the cancer sample, and $N_n$ is the total number of reads obtained by whole genome sequencing of the matched normal sample.

Heterozygous Germline Single Nucleotide Variants (HGSNV): Single-nucleotide variants that are heterozygous in the germline. Human chromosomes are diploid and somatic cells are derived from embryonic cells, so an HGSNV site has only two base types, A and B, one inherited from the father and the other from the mother.

Major Allele Fraction (MAF): The fraction of the major allele. An HGSNV used in the present invention has only two alleles, one identical to the reference genome and one different from it. The fraction of each allele is the number of reads carrying that allele divided by the total number of reads covering the site, and the MAF is the larger of the two allele fractions. The calculation is given by formula (1.1), where $n^r$ is the number of reads carrying the same allele as the reference genome, $n^a$ is the number of reads carrying the other allele, $n^t$ is the total number of reads covering the HGSNV site, and C is the MAF value of the HGSNV. MAF is defined per HGSNV. In the present invention, "MAF of a fragment" refers to the mean MAF of all HGSNVs in the fragment, and "MAF of a peak" refers to the mean MAF of the HGSNVs contained in all fragments of the peak.

Major allele copy number: The copy number of the major allele in a fragment. For a fragment with copy number i, it is an integer greater than or equal to $i/2$.

Peak: A cluster of TRE values in the TRE distribution of all windows in the genome. In Figure 1, panel A shows the TRE distribution of all windows on the genome before GC content correction, with the vertical axis giving the number of windows at each TRE value; panel B shows the TRE distribution after GC content correction, in which the windows are clearly clustered. This method defines a TRE cluster identified by the autoregressive model as a peak; a peak is essentially the aggregation of the windows within genomic fragments that have the same copy number.

Cancer sample: A cancer tissue taken from an individual with cancer, which contains a portion of cancer cells and a portion of normal cells.

P: The spacing between two adjacent peaks. Since a peak is a cluster, the TRE of a peak is represented by its TRE mean, so P is the difference between the TRE means of two adjacent peaks. Because a peak aggregates the windows of genomic fragments with the same copy number, P is also the difference in TRE between fragments of adjacent copy numbers.
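The MAF definitions above (per site, per fragment) can be illustrated with a minimal Python sketch. It assumes that the total coverage of a site equals the sum of the reads carrying the two alleles, since an HGSNV is stated to have only two alleles; the function names and read counts are hypothetical.

```python
def major_allele_fraction(n_ref, n_alt):
    """MAF of one HGSNV: the larger allele read count over the total coverage.

    Assumption: total coverage = n_ref + n_alt (the site has only two alleles).
    """
    n_total = n_ref + n_alt
    if n_total == 0:
        raise ValueError("site has no covering reads")
    return max(n_ref, n_alt) / n_total


def fragment_maf(sites):
    """MAF of a fragment: mean MAF over all HGSNVs in the fragment.

    `sites` is a list of (n_ref, n_alt) read-count pairs.
    """
    if not sites:
        raise ValueError("fragment contains no HGSNV")
    return sum(major_allele_fraction(r, a) for r, a in sites) / len(sites)


# Hypothetical read counts for two HGSNVs in one fragment.
site_maf = major_allele_fraction(30, 70)
frag_maf = fragment_maf([(30, 70), (60, 40)])
```

Because the MAF always takes the larger of the two fractions, it lies between 0.5 and 1 regardless of which allele matches the reference genome.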

Purpose of the invention

The object of the present invention is to overcome the deficiencies of the prior art and to provide a fully automatic, high efficiency, high accuracy method and apparatus for calculating cancer sample purity and chromosome ploidy. The invention has broad application prospects in the calculation of cancer sample purity and chromosome ploidy.

Technical solutions

To achieve the above object, the technical solution adopted by the present invention is to construct a mixed Gaussian model of the TRE of fragments with different copy numbers and of the MAF distribution of HGSNVs, using whole genome sequencing data of a cancer sample and a matched normal sample, and to calculate from it the purity and chromosome ploidy of the cancer sample.

The invention mainly uses the TRE information of the whole genome sequencing data and the MAF information of the HGSNV. TRE basically reflects the copy number variation of cancer samples, and the MAF information of HGSNV basically reflects the genotype of cancer samples.

Differences in TRE are mainly due to differences in the copy number of genomic fragments: sequencing a high-copy-number fragment yields more reads than sequencing a low-copy-number fragment. Inferring copy number differences between fragments from differences in their read counts is a common approach in genomic copy number detection. Most studies, however, measure the difference by dividing the number of reads of a fragment in the cancer sample by the number of reads of the same fragment in the normal sample. The present invention instead uses the TRE shown in formula (1) to evaluate read-count differences between fragments. The ratio calculated by the traditional method is affected not only by the purity and chromosome ploidy of the cancer sample but also by the sequencing depths of the cancer sample and the normal sample, whereas the TRE is not affected by the sequencing depth of the samples.
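The claim that TRE does not depend on sequencing depth can be checked numerically. The following is a minimal Python sketch, assuming that formula (1) divides each sample's read count in the fragment by that sample's total read count, as the TRE definition describes; the function name and the read counts are hypothetical and chosen only for illustration.

```python
def tumor_read_enrichment(reads_tumor_seg, reads_normal_seg, total_tumor, total_normal):
    """TRE of one fragment or window: tumor read fraction over normal read fraction."""
    if reads_normal_seg == 0:
        raise ValueError("normal sample has no reads in this fragment")
    tumor_fraction = reads_tumor_seg / total_tumor
    normal_fraction = reads_normal_seg / total_normal
    return tumor_fraction / normal_fraction


# Hypothetical counts: 300 of 1,000,000 tumor reads, 200 of 1,000,000 normal reads.
tre = tumor_read_enrichment(300, 200, 1_000_000, 1_000_000)

# Sequencing the tumor twice as deeply doubles both tumor counts (in expectation),
# so the normalized ratio is unchanged, while the raw ratio 600/200 would double.
tre_deeper = tumor_read_enrichment(600, 200, 2_000_000, 1_000_000)
```

The raw read-count ratio changes from 1.5 to 3.0 between the two runs, whereas the normalized value stays the same, which is the property the paragraph above attributes to TRE.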

Read-count differences alone cannot determine the genotype of each copy number fragment and, more importantly, cannot separate the compensating effects of sample purity and sample ploidy. Combining HGSNVs with fragments of different copy numbers provides genotype information and helps resolve the compensation between purity and ploidy. Previous studies, however, did not use HGSNVs efficiently. Most methods enumerate the genotypes that may correspond to fragments of each copy number, evaluate every combination, and select the most reliable result. These methods share long calculation times and poor accuracy, and they perform poorly on samples with high copy numbers or extensive genomic variation. The present invention calculates the purity and chromosome ploidy of the cancer sample from a mixed Gaussian model of the MAF of HGSNVs and the TRE, which significantly reduces calculation time and improves the accuracy of the result.

Assume that the purity of a cancer sample is γ, so that the proportion of normal cells in the cancer sample is 1-γ. Normal cells in the cancer sample have a chromosome ploidy of 2, and cancer cells have a chromosome ploidy of κ. The chromosome ploidy ω of the cancer sample is then given by formula (2).

$$\omega = (1-\gamma)\times 2 + \gamma\times\kappa \tag{2}$$

Assume that the copy number of a certain fragment S in the cancer cells is $C_s$. The copy number $C_t$ of fragment S in the cancer sample is then given by formula (3).

$$C_t = (1-\gamma)\times 2 + \gamma\times C_s \tag{3}$$

For the genomic fragment S, the TRE is calculated as shown in formula (1). The expectation $E(e_s)$ of the TRE is given by formula (4):

$$E(e_s) = \frac{E(N_s^t)/N_t}{E(N_s^n)/N_n} \tag{4}$$

Here $N_s^t$ and $N_s^n$ are the numbers of reads covering fragment S in the cancer sample and in the matched normal sample, and $N_n$ and $N_t$ have the same meanings as in formula (1).

To express $e_s$ further, the method defines several auxiliary parameters: the length of fragment S, $L_S$; the length of the human reference genome, $L_{gw}$; the sequencing depth of the cancer sample, $D_t$; and the sequencing depth of the normal sample, $D_n$. Fragment S is then sequenced in the cancer sample to a depth of $\lambda_S \times D_t \times C_t/\omega$ and in the normal sample to a depth of $\lambda_S \times D_n$.

$\lambda_S$ is a parameter that depends only on the characteristics of fragment S (such as GC content and other factors that bias sequencing depth), so it is the same in the cancer and normal samples and cancels in the ratio. $e_s$ can therefore be expressed in terms of γ, κ and $C_s$, as shown in formula (5):

$$E(e_s) = \frac{C_t}{\omega} = \frac{2(1-\gamma) + \gamma C_s}{2(1-\gamma) + \gamma\kappa} \tag{5}$$

In formula (5), $C_s$ is the copy number of fragment S in the cancer cells. The TRE means $S^i$ and $S^{i+1}$ of fragments with copy numbers i and i+1 are then given by formula (6) and formula (7), respectively:

$$S^i = \frac{2(1-\gamma) + \gamma i}{2(1-\gamma) + \gamma\kappa} \tag{6}$$

$$S^{i+1} = \frac{2(1-\gamma) + \gamma (i+1)}{2(1-\gamma) + \gamma\kappa} \tag{7}$$

From formulas (6) and (7), the difference P between the TRE values of fragments with adjacent copy numbers is given by formula (8). P does not depend on the specific copy number of the fragment; it is determined only by the purity and chromosome ploidy of the cancer sample. Figure 2 shows this directly: the distance between peaks in the TRE distribution is constant.

$$P = S^{i+1} - S^i = \frac{\gamma}{2(1-\gamma) + \gamma\kappa} \tag{8}$$

Further, for fragments with i=2, that is, a copy number of 2, the TRE value Q is given by formula (9). In FIG. 2, the TRE value of the peak corresponding to Q is slightly larger than 1.

$$Q = \frac{2(1-\gamma) + 2\gamma}{2(1-\gamma) + \gamma\kappa} = \frac{2}{2(1-\gamma) + \gamma\kappa} \tag{9}$$

Solving formulas (8) and (9) gives the purity (γ) and ploidy (κ) of the cancer sample:

$$\gamma = \frac{2P}{Q} \tag{10}$$

$$\kappa = \frac{2P - Q + 1}{P} \tag{11}$$

Thus the cancer sample purity γ and chromosome ploidy κ can be calculated once P and Q are determined.
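The relationship between (γ, κ) and (P, Q) can be sketched in Python as a round trip. This is a minimal sketch under one stated assumption: the expected TRE of a fragment with cancer-cell copy number i equals its sample copy number from formula (3) divided by the sample ploidy ω from formula (2). The closed forms for γ and κ below are derived from that assumption; the function names are hypothetical.

```python
def spacing_and_q(gamma, kappa):
    """Forward direction: peak spacing P and copy-number-2 TRE Q from purity and ploidy."""
    omega = 2.0 * (1.0 - gamma) + gamma * kappa  # sample ploidy, formula (2)
    p = gamma / omega   # TRE difference between adjacent copy numbers
    q = 2.0 / omega     # TRE of copy number 2: (2(1-gamma) + 2*gamma) / omega
    return p, q


def purity_and_ploidy(p, q):
    """Inverse direction: solve the two expressions above for gamma and kappa."""
    gamma = 2.0 * p / q
    kappa = (2.0 * p - q + 1.0) / p
    return gamma, kappa


# Round trip with hypothetical values: 60% purity, triploid cancer cells.
p, q = spacing_and_q(0.6, 3.0)
gamma, kappa = purity_and_ploidy(p, q)
```

Since Q equals 2/ω, a Q slightly above 1 corresponds to a sample ploidy slightly below 2.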

As shown in Figure 2, once the TRE distribution of all fragments of the whole genome has been calculated, the spacing between peaks can be calculated to obtain P. Among previous methods, Patchwork [12] used the ratio of the read counts of fragments with adjacent copy numbers to assist in calculating cancer sample purity, but it could not automatically identify the spacing between read-count ratios; the spacing had to be read manually from a plot before the next calculation step, so efficiency and accuracy were relatively low. The present invention uses an autoregressive model to identify the spacing between TRE peaks, as shown in formulas (12) and (13). In formulas (12) and (13), $X_t$ represents a TRE value between 0 and $M_t$; t represents the TRE value multiplied by 1000; $M_t$ represents the maximum TRE value; P represents the interval between two TRE positions; $C(X_t)$ denotes the number of windows at the position where the TRE is $X_t$; $C(X_{t+1000\times P})$ denotes the number of windows at the position where the TRE is $X_{t+1000\times P}$; and Y(P) is the value of the autoregressive model at P. Clearly, Y(P) takes its largest value when P=0, but this P is not the actual spacing between peaks.

All P values between 0 and 1 are traversed at a resolution of 0.001, and Y(P) is computed for each. The distribution of Y(P) is as shown in the figure. As noted for formula (13), Y(P) is largest at P=0, which is not the spacing between peaks. The x-axis value P at which Y(P) reaches its maximum within the second peak of the figure is therefore taken as the spacing P between peaks.
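The search for P can be sketched in Python as follows. This is an illustration under explicit assumptions rather than the patented implementation: it assumes that Y(P) is the lagged product sum of the window-count histogram, that is, the sum over t of C(X_t) × C(X_{t+1000×P}), which is consistent with Y(P) being largest at P=0. The synthetic histogram, the function names, and the simple descend-then-climb rule for locating the second peak of Y(P) are all hypothetical.

```python
import math


def autocorrelation(counts, max_lag):
    """Y[lag] = sum over t of counts[t] * counts[t + lag]."""
    n = len(counts)
    return [sum(counts[t] * counts[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def estimate_peak_spacing(counts, resolution=0.001, max_spacing=1.0):
    """Return the spacing at the top of the second hump of Y(P)."""
    max_lag = min(int(round(max_spacing / resolution)), len(counts) - 1)
    y = autocorrelation(counts, max_lag)
    lag = 0
    # Walk down the trivial peak at lag 0 until Y stops decreasing.
    while lag < max_lag and y[lag + 1] <= y[lag]:
        lag += 1
    # Climb the next hump; its top is taken as the second peak of Y(P).
    while lag < max_lag and y[lag + 1] > y[lag]:
        lag += 1
    return lag * resolution


def synthetic_histogram(centers, heights, sigma_bins, n_bins):
    """Window counts per TRE bin, built from Gaussian-shaped clusters (test data only)."""
    return [sum(h * math.exp(-0.5 * ((t - c) / sigma_bins) ** 2)
                for c, h in zip(centers, heights))
            for t in range(n_bins)]


# Three clusters at TRE 0.600, 0.850 and 1.100 (bin width 0.001), so the true spacing is 0.25.
counts = synthetic_histogram([600, 850, 1100], [1000, 2000, 500], 20, 2001)
spacing = estimate_peak_spacing(counts)
```

On noisy real histograms the descend-then-climb rule would likely need smoothing or a minimum-lag threshold; it is used here only because the synthetic data are noise-free.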

The clustered distribution shown in panel B of Figure 1 arises because the TRE values of genomic fragments with the same copy number (the TRE of a fragment being the mean TRE of all windows in it) are not exactly equal; fragments of the same copy number differ from one another by an error. This error follows a Gaussian distribution, so each cluster in panel B is treated as a Gaussian distribution.

As shown in Figure 2, once P is determined the peaks can be identified, but some genomic fragments do not fall on any identified peak. These are called subclonal fragments. If subclonal fragments were included, they would affect the values of the Gaussian models in formulas (17) and (18) below, and in turn the value of the final mixed Gaussian model. The subsequent analysis of the present invention therefore considers only the fragments that fall at peak positions, which eliminates the interference of subclonal fragments.

In the TRE distribution shown in FIG. 2, the Q position is the TRE value of the fragments with copy number 2. If some fragments in the cancer cell genome have copy number 1 and others have copy number 0, there should be two peaks before the Q position, corresponding to copy numbers 1 and 0. If there are fragments with copy number 0 but none with copy number 1, there is a peak at distance 2P before the Q position, and the number of windows at distance P before the Q position is 0; this is the situation shown in Figure 2. If there are no fragments with copy number 1 or 0, the number of windows at distance P and at distance 2P before the Q position is 0. Therefore the first peak that appears in the TRE distribution, $X_f$, may correspond to copy number 2 (the peaks for copy numbers 1 and 0 contain no windows), to copy number 1 (the peak for copy number 0 contains no windows), or to copy number 0.

It follows that the copy number of the fragments at the first peak in Figure 2, $X_f$, has several possible values, and each of them places Q at a different peak. The present invention uses the mixed Gaussian model to determine the most likely copy number of the fragments at $X_f$, thereby determining the value of Q and finally the purity and chromosome ploidy of the cancer sample. First, all possible copy numbers of the fragments at $X_f$ must be identified. The present invention determines the value of $X_f$ by formula (13.1). In (13.1), $C(X_f+P)$ represents the number of windows at the position where the TRE is $X_f+P$, and n represents the maximum number of peaks within $M_t$. The $X_f$ at which $f(X_f)$ takes its maximum value is the TRE mean of the first peak.

Formula (13.2) then gives the maximum number N of peaks that can lie before $X_f$:

$$N = \mathrm{floor}\left(\frac{X_f}{P}\right) \tag{13.2}$$

Here $X_f$ is the TRE mean of the first peak, P is the spacing between the peaks of fragments with adjacent copy numbers, and floor denotes rounding down. When N=0, there is no peak before $X_f$ and the fragments at $X_f$ have copy number 0; when N=1, there may be at most 1 peak before $X_f$, or none; when N=2, there may be up to 2 peaks before $X_f$, or only one, or none.

For each case in which there are n (0, 1, 2, ..., N) peaks before $X_f$, a corresponding Q value can be calculated by formula (13.3). By definition, Q is the TRE value of the peak of fragments with copy number 2. The TRE value of fragments with copy number 0 is $X_f - n\times P$, where n is the number of peaks before $X_f$ and ranges from 0 to N, P is the spacing between the peaks of fragments with adjacent copy numbers, and $X_f$ has the same meaning as in formula (13.1). Formula (13.3) is as follows, where $Q_n$ is the value of Q when there are n peaks before $X_f$.

$$Q_n = X_f - n\times P + 2\times P = X_f + (2-n)\times P,\quad n \in [0, N] \tag{13.3}$$
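Enumerating the candidates of formula (13.3) can be sketched in Python. The sketch assumes N = floor(X_f / P), which follows from the requirement that the TRE of copy number 0, X_f - n×P, cannot be negative, and it assumes that purity and ploidy follow from P and Q as γ = 2P/Q and κ = (2P - Q + 1)/P, derived from formulas (2) and (3) under the assumption that TRE equals sample copy number divided by sample ploidy. The function name and the input values are hypothetical.

```python
import math


def candidate_solutions(x_f, p):
    """List (n, Q_n, gamma, kappa) for every possible number n of peaks before X_f."""
    n_max = math.floor(x_f / p)  # assumed form of formula (13.2)
    candidates = []
    for n in range(n_max + 1):
        q_n = x_f + (2 - n) * p               # formula (13.3)
        gamma = 2.0 * p / q_n                 # assumed purity relation
        kappa = (2.0 * p - q_n + 1.0) / p     # assumed ploidy relation
        candidates.append((n, q_n, gamma, kappa))
    return candidates


# Hypothetical inputs: first observed peak at TRE 0.55, peak spacing 0.25.
candidates = candidate_solutions(0.55, 0.25)
```

With these inputs N is 2, so three (γ, κ) pairs are produced; choosing among them is what the mixed Gaussian model described in the text is for.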

From the above analysis, there may be (0, 1, 2, 3, ..., N) peaks before $X_f$, and the corresponding values of Q are ($Q_0$, $Q_1$, $Q_2$, $Q_3$, ..., $Q_N$). Since the autoregressive model has already yielded P, the corresponding γ and κ can be calculated for each possible Q by formulas (10) and (11). The present invention uses the mixed Gaussian model to determine the most likely number n of peaks before $X_f$, which fixes the copy number of the fragments at $X_f$, then the value of Q, and finally the purity and chromosome ploidy of the cancer sample. The specific method is as follows.

For each possible value $Q_n$, combined with P, the corresponding γ can be calculated, and from it the theoretical MAF of the HGSNVs in fragments of each copy number. The theoretical calculation is given by formula (14), where $C_{mcp}$ is the major allele copy number, $C_{cp}$ is the overall copy number of the peak, and f is the theoretical value of the MAF within the peak.
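A minimal Python sketch of the theoretical MAF follows. It assumes that formula (14) takes the usual mixture form: cancer cells contribute C_mcp major-allele copies out of C_cp total copies, and normal cells contribute 1 major-allele copy out of 2 at a heterozygous site, weighted by γ and 1-γ. That form is an assumption made for illustration; the function names are hypothetical.

```python
import math


def theoretical_maf(gamma, c_cp, c_mcp):
    """Assumed form of formula (14): major-allele copies over total copies in the sample."""
    numerator = gamma * c_mcp + (1.0 - gamma) * 1.0
    denominator = gamma * c_cp + (1.0 - gamma) * 2.0
    if denominator == 0:
        raise ValueError("no copies of the site in the sample")
    return numerator / denominator


def possible_mafs(gamma, c_cp):
    """Theoretical MAF for every admissible major allele copy number (>= c_cp / 2)."""
    lowest_major = math.ceil(c_cp / 2)
    return {c_mcp: theoretical_maf(gamma, c_cp, c_mcp)
            for c_mcp in range(lowest_major, c_cp + 1)}


# Hypothetical case: 50% purity, copy number 3 (major allele copy number 2 or 3).
mafs_cn3 = possible_mafs(0.5, 3)
```

A copy number therefore maps to several theoretical MAF values, one per admissible major allele copy number.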

In practice, however, when the sequencing depth is relatively low, f deviates considerably from the true (expected) value, so f must be corrected further, as shown in formulas (15) and (16). In formula (15), m is the mean number of reads over all windows in a peak and v is the variance of the number of reads over all windows in the peak. The resulting p is the success probability of the negative binomial distribution and r is its number of failures; the random variable d is the read coverage obtained by sequencing.

At a given sequencing depth d, the MAF follows a binomial distribution with probability f and d trials. The present invention corrects f by formula (16) to obtain the expected value $f_b$ of the MAF. In the formula, k is the count of one allele (A or B) observed at an HGSNV site, and the total number of alleles measured is d (equal to the sequencing depth).
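Why low depth biases the observed MAF can be illustrated with a short Python sketch. It computes the expected value of max(k, d-k)/d when k follows a binomial distribution with d trials and probability f. This covers only a fixed depth d; it does not include the negative binomial depth model of formula (15), and it should be read as an assumed simplification of formula (16), not as the formula itself. The function name is hypothetical.

```python
from math import comb


def expected_observed_maf(f, depth):
    """E[max(k, depth - k) / depth] for k ~ Binomial(depth, f)."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    total = 0.0
    for k in range(depth + 1):
        prob = comb(depth, k) * f ** k * (1.0 - f) ** (depth - k)
        total += prob * max(k, depth - k) / depth
    return total


# A balanced site (f = 0.5) looks unbalanced at low depth and approaches 0.5 at high depth.
maf_depth_2 = expected_observed_maf(0.5, 2)
maf_depth_200 = expected_observed_maf(0.5, 200)
```

Taking the larger allele at every site pushes the observed MAF above f, and the effect shrinks as depth grows, which is why a depth-dependent correction is needed.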

Formula (13.2) indicates that the copy number of each genomic fragment has N possibilities. Formula (14) shows that a fragment of a given copy number can have several major allele copy numbers, so for each peak several values of f, and hence several values of $f_b$, can be calculated; the $f_b$ closest to the mean observed MAF of the peak is taken as the MAF expectation of that peak. The whole genome has multiple peaks, each with its own MAF expectation, giving a set of expected values {$f_b$}. The MAF of the HGSNVs within a peak is subject to error that also follows a Gaussian distribution, and the mean MAF of all HGSNVs in a peak can be calculated directly from the observed data. Assuming that each peak has a definite genotype, the copy number and genotype of the peak can be determined by comparing its observed MAF with its {$f_b$} values. In other words, the TRE value of the peak with copy number 2, that is, the position of Q, can be calculated. To refine P further at the same time, the present invention proposes a mixed Gaussian model that fits the observed TRE and HGSNV data.

As the preceding analysis shows, $X_t$ does not exactly represent the TRE mean of each peak because of the error term $\varepsilon_t$ in formula (12). The Gaussian distribution model of the TRE is given by formula (17):

Here $L(e_s; \gamma, \kappa)$ is the likelihood function of the TRE of genomic fragments. N is the number of windows on the genome. I is the largest copy number of any fragment in the genome. $\sigma_i$ is the standard deviation of the TRE of all fragments with copy number i. $e_s$ is the observed TRE of the s-th window, and $S^i$ is the TRE mean of the i-th peak. $p_i$ is the weight of copy number i for the s-th window; in this formula $p_i$ is 1 for all i. The formula shows that the likelihood depends on the values of $S^i$: the closer $S^i$ is to $e_s$, the larger the likelihood and the closer P is to its true value. A reasonable P can therefore be obtained by maximum likelihood estimation of $L(e_s; \gamma, \kappa)$.

In some cases, however, when P varies within a small interval (such as [-0.005, 0.005]), the corresponding likelihood values may be the same. The present invention therefore refines P and determines Q by also using the Gaussian distribution model of HGSNV shown in formula (18).

Here $L(f_s; \gamma, \kappa)$ is the likelihood function of the HGSNVs. M is the number of HGSNVs in the genome, and s indexes the s-th HGSNV. I is the largest copy number of any fragment in the genome. $F^{i,j}$ is the expected MAF of HGSNVs in a fragment whose copy number is i and whose major allele copy number is j, that is, the $f_b$ calculated by formula (16). $f_s$ is the mean of the observed MAF values within the fragment, and $\sigma_{i,j}$ is the standard deviation of the observed MAF of the HGSNVs within the fragment. $p_{i,j}$ is the weight of the Gaussian distribution when the major allele copy number is j; $p_{i,j}$ is 1 for all i and j. $p_i$ is the weight of the copy number of the fragment containing the s-th HGSNV; $p_i$ is 1 for all i. Formula (18) shows that the likelihood depends on the values of $F^{i,j}$: the closer $F^{i,j}$ is to $f_s$, the larger the likelihood. As formula (14) shows, the more accurate f is, the more reliably the corresponding $C_{cp}$ and $C_{mcp}$ of each fragment can be obtained, which then determines the value of Q. To obtain the most accurate P and Q, the method adds formula (17) and formula (18) to obtain a mixed Gaussian model.

However, the mixed model is prone to overfitting. The present invention therefore applies the Bayesian Information Criterion (BIC) to add a penalty term to the mixed Gaussian model that controls overfitting. The final mixed Gaussian model is shown in formula (19):

$$\mathrm{BIC}(e_s, f_s; \gamma, \kappa) = -2\times\log L(f_s; \gamma, \kappa) - 2\times\log L(e_s; \gamma, \kappa) + I\times\log(N) + J\times\log(M) \tag{19}$$

Here $\mathrm{BIC}(e_s, f_s; \gamma, \kappa)$ is the penalized likelihood value of the mixed model, I is the number of Gaussian distributions in formula (17), J is the number of Gaussian distributions in formula (18), N is the number of windows in the genome, and M is the number of HGSNVs in the genome.
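Formula (19) can be sketched in Python as follows. The BIC function mirrors formula (19) term by term. The mixture log-likelihood is an assumption: it sums Gaussian densities with all component weights equal to 1, as the text states for $p_i$ and $p_{i,j}$, and is not a reproduction of formulas (17) and (18). The function names and all observation values are hypothetical.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_log_likelihood(observations, means, sds):
    """Log-likelihood of a Gaussian mixture whose component weights are all 1 (assumed form)."""
    total = 0.0
    for x in observations:
        density = sum(gaussian_pdf(x, m, s) for m, s in zip(means, sds))
        total += math.log(density)
    return total


def bic_score(loglik_tre, loglik_maf, n_tre_gaussians, n_maf_gaussians, n_windows, n_snvs):
    """Formula (19): -2 log L(f) - 2 log L(e) + I log(N) + J log(M)."""
    return (-2.0 * loglik_maf - 2.0 * loglik_tre
            + n_tre_gaussians * math.log(n_windows)
            + n_maf_gaussians * math.log(n_snvs))


# Hypothetical data: window TREs near two peaks, fragment MAFs near two genotypes.
tre_obs = [0.79, 0.81, 1.04, 1.06]
maf_obs = [0.50, 0.52, 0.66, 0.68]
loglik_maf = mixture_log_likelihood(maf_obs, [0.51, 0.67], [0.02, 0.02])

# Peak means that match the data versus peak means shifted by 0.1.
good = bic_score(mixture_log_likelihood(tre_obs, [0.80, 1.05], [0.02, 0.02]),
                 loglik_maf, 2, 2, len(tre_obs), len(maf_obs))
shifted = bic_score(mixture_log_likelihood(tre_obs, [0.70, 0.95], [0.02, 0.02]),
                    loglik_maf, 2, 2, len(tre_obs), len(maf_obs))
```

Under the usual BIC convention the candidate with the lower score is preferred, so the well-placed peak means score better than the shifted ones.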

By traversing the interval [P-0.02, P+0.02] and performing maximum likelihood estimation over all pairs (P, $Q_n$), the most suitable P and Q are obtained, and the purity and chromosome ploidy of the cancer sample are then calculated by formulas (10) and (11).

Accordingly, one aspect of the invention provides a method for calculating cancer cell purity and chromosome ploidy in a cancer sample, the method comprising the steps of:

Step A:

Obtaining whole genome sequencing (WGS) data from a paired cancer tissue sample and normal tissue sample (from the same cancer patient), and aligning the sequencing data to the reference genome;

Step B:

From the alignment result file obtained in step A, extracting the read position and length information, the HGSNV sites, and the number of reads covering each site, and calculating the MAF of all HGSNVs, wherein the calculation formula is as shown in (1.1):

$$C = \frac{\max(n^r, n^a)}{n^t} \tag{1.1}$$

In formula (1.1), $n^r$ is the number of reads containing the same allele as the reference genome, $n^a$ is the number of reads containing the other allele, $n^t$ is the total number of reads covering the HGSNV site, and C is the MAF value of the HGSNV;

Step C:

According to the read position and length information obtained in step B, counting the number of reads contained in each window, and correcting the read counts of all windows for genomic GC content;

Step D:

Using the read counts corrected in step C, calculating the TRE of each window using formula (1), and then segmenting the genome with the BIC-seq software based on the TRE to obtain genomic fragments divided by copy number:

$$e_s = \frac{N_s^t / N_t}{N_s^n / N_n} \tag{1}$$

In formula (1), $N_s^t$ and $N_s^n$ represent the number of reads covering the fragment s (here a window) in the cancer sample and in the normal sample, respectively, $N_t$ represents the total number of reads of the cancer sample, $N_n$ represents the total number of reads of the corresponding normal sample, and $e_s$ is the TRE value;

Step E:

Based on the genomic fragments produced by BIC-seq in step D, the mean and variance of the TRE and the number of windows are computed over all windows in each fragment; the windows of each fragment are then smoothed according to this mean and variance (smoothing makes the TRE distribution more even), and the smoothed window distributions of all fragments are summed to obtain the genome-wide distribution of window count as a function of TRE; in addition, the mean and variance of the MAF of all HGSNVs in each fragment are calculated, fragment by fragment;

Step F:

Using the autoregressive model shown in formulas (12) and (13), calculate the difference in TRE between adjacent copy number segments, that is, P: traverse a certain range of P and calculate Y(P); in the distribution of Y(P), the P corresponding to the maximum value of Y(P) in the second peak is selected as the calculation result of P:

In formulas (12) and (13), X_t represents a TRE value between 0 and M_t; t represents the TRE value magnified 1000 times; M_t represents the maximum value of TRE; the variable P represents the interval between two TRE sites; C(X_t) represents the number of windows at the position where TRE is X_t; C(X_{t+1000×P}) represents the number of windows at the position where TRE is X_{t+1000×P}; and Y(P) represents the function value of the autoregressive model under the variable P;

Step G:

According to the P obtained in step F, calculate the TRE mean of the first actually observed peak in the TRE distribution, then calculate the maximum number N of theoretical peaks before the first actual peak, and finally calculate the value of Q when there are n theoretical peaks before the first actual peak, denoted Q_n, where step G may include:

G1:

According to the P calculated in step F, using formula (13.1), select the X_f that maximizes formula (13.1) as the TRE mean of the first actually observed peak:

In formula (13.1), i represents the i-th peak, C(X_f + P × i) represents the number of windows at the position where TRE is X_f + P × i, n represents the maximum number of peaks within M_t, and M_t represents the maximum value of TRE;

G2:

Using formula (13.2), calculate the maximum number of peaks N that may exist before X _f based on P calculated in step F and X _f calculated in step G1:

In formula (13.2), X_f represents the mean of the first peak, P represents the spacing between peaks corresponding to adjacent copy number segments, and floor denotes rounding down to an integer;

G3:

Using the value of N calculated in step G2, when n takes an integer between 0 and N, the value of Q _n is calculated using equation (13.3):

Q_n = X_f - n×P + 2×P = X_f + (2-n)×P, n ∈ [0, N]   (13.3)

In formula (13.3), n represents the number of peaks before X_f and takes integer values from 0 to N, P represents the spacing between peaks of adjacent copy number segments, X_f represents the TRE mean of the first actually observed peak, and Q_n represents the Q value when there are theoretically n peaks before X_f;
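As an illustration of formula (13.3), the following minimal Python sketch enumerates the candidate Q_n values for n from 0 to N. The function name and the numeric inputs are hypothetical and chosen only for illustration; they are not taken from the embodiment.

```python
def candidate_q_values(x_f, p, n_max):
    """Return [Q_0, ..., Q_N] with Q_n = X_f + (2 - n) * P, as in formula (13.3)."""
    return [x_f + (2 - n) * p for n in range(n_max + 1)]


# Hypothetical inputs: first observed peak at TRE 0.62, peak spacing 0.33, N = 1.
q_values = candidate_q_values(x_f=0.62, p=0.33, n_max=1)
for n, q in enumerate(q_values):
    print(f"n = {n}: Q_n = {q:.3f}")
```

Each candidate corresponds to a different assumption about how many theoretical peaks lie before the first observed peak; the later steps decide among them.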

Step H:

Using the P calculated in step F and all possible Q _n calculated in step G, the cancer sample purity γ and chromosome ploidy κ are calculated using equations (10), (11):

In formulas (10) and (11), γ represents the purity of the sample and κ represents the ploidy of the chromosome, so that the corresponding (γ, κ) can be obtained for every (P, Q_n);

Step I:

When n takes an integer value in [0, N], formula (13.4) is used to calculate the TRE mean of the i-th peak:

T_i = X_f - n×P + i×P = X_f + (i-n)×P, n ∈ [0, N]   (13.4)

In formula (13.4), n represents the number of peaks before X_f and takes integer values from 0 to N, P represents the spacing between peaks corresponding to adjacent copy number segments, X_f represents the TRE mean of the first actually observed peak, and T_i represents the TRE mean of the i-th peak.

A fragment that falls near T_i is considered to have copy number i; a fragment that does not fall near any T_i is classified as a subclonal fragment, and all subclonal fragments are excluded from subsequent analysis. From the cancer sample purity γ calculated in step H and the copy number of a peak, the expected MAF f_b of that peak can be calculated; the expected MAF differs between peaks, and over all peaks on the genome the set of expected MAFs {f_b} is finally obtained. The TRE mean and variance (or standard deviation) of each peak are calculated at the same time;
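The peak positions of formula (13.4) and the assignment of fragments to copy numbers can be sketched as follows. This is only a sketch under stated assumptions: the text does not define how close "near T_i" is, so the tolerance `tol` is a hypothetical parameter, as are the function names and the values of X_f, P, and n.

```python
def peak_means(x_f, p, n, max_copy):
    """T_i = X_f + (i - n) * P for i = 0..max_copy, as in formula (13.4)."""
    return [x_f + (i - n) * p for i in range(max_copy + 1)]


def assign_copy_number(tre_mean, peaks, tol):
    """Return the copy number whose peak lies within tol of tre_mean.

    Returns None when no peak is close enough; such a fragment would be
    treated as subclonal. The tolerance is an assumed parameter.
    """
    nearest = min(range(len(peaks)), key=lambda i: abs(tre_mean - peaks[i]))
    return nearest if abs(tre_mean - peaks[nearest]) <= tol else None


# Hypothetical values: X_f = 0.56, P = 0.39, one theoretical peak before X_f.
peaks = peak_means(x_f=0.56, p=0.39, n=1, max_copy=4)
calls = [assign_copy_number(t, peaks, tol=0.05) for t in (0.9445, 1.3383, 1.07492)]
```

With these assumed values, the first two fragment means fall on the peaks for copy numbers 2 and 3, while the third lies between peaks and is left unassigned.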

Step J:

According to the P calculated in step F and the {f_b} calculated in step I, a mixed Gaussian distribution model corrected by the Bayesian information criterion, as shown in formula (19), is constructed, and maximum likelihood estimation is then performed on the model; step J may include the following steps:

J1:

Construct a Gaussian distribution model as shown in equation (17) with P calculated in step F:

In formula (17), L(e_s; γ, κ) represents the likelihood function of the genomic fragment TRE, N represents the number of all windows on the genome, I represents the maximum copy number of all fragments in the genome, σ_i represents the standard deviation of the TRE of all fragments with copy number i, obtained from step I, e_s is the observed TRE value of the s-th window, S^i represents the TRE mean of the i-th peak, that is, T_i in step I, and p_i represents the weight for the s-th window having copy number i, where the p_i sum to 1 over all i;

J2:

Construct a Gaussian distribution model as shown in equation (18) with f _b calculated in step I:

In formula (18), L(f_s; γ, κ) represents the likelihood function of the HGSNVs, M represents the number of all HGSNVs in the genome, s denotes the s-th HGSNV, and I represents the maximum copy number of all fragments in the genome; the term with superscript i, j represents the expected MAF of an HGSNV in a fragment whose copy number is i and whose major allele copy number is j, obtained from step I; f_s represents the mean of the observed MAF of all HGSNVs in the fragment, obtained from step E; σ_{i,j} represents the standard deviation of the observed MAF of all HGSNVs in the fragment, obtained from step E; p_{i,j} represents the weight of the Gaussian distribution when the copy number of the major allele is j, where the p_{i,j} sum to 1 over all i and j; and p_i represents the weight for the fragment containing the s-th HGSNV having copy number i, where the p_i sum to 1 over all i;

J3:

Formulas (17) and (18) are added to obtain a mixed Gaussian model, and BIC (Bayesian Information Criterion) correction is then applied to the mixed model to obtain the final mixed model, formula (19):

In formula (19), BIC(e_s, f_s; γ, κ) represents the likelihood function of the mixed model, I represents the maximum copy number of all fragments in the genome, J is the number of values taken by j in formula (18), N is the number of windows in the genome, and M is the number of HGSNVs in the genome.

For each integer value n in the range [0, N], Q_n can be obtained by step G and the set of expected MAFs of all peaks {f_b} can be obtained by step I; a model as shown in formula (19) can be constructed for each pair (P, {f_b}), which in essence means constructing a model as shown in formula (19) for each pair (P, Q_n);

Step K:

With 0.001 as the resolution, steps G to J are repeated for all P values in the interval [P-m, P+m], yielding a series of different (P, Q_n) and their corresponding likelihood function values; the (P, Q_n) corresponding to the maximum function value is taken as the most suitable P and Q values, where m is a value between 0 and 0.5;
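The traversal in step K amounts to a grid search over P with an inner loop over the candidate Q_n values. The sketch below assumes two caller-supplied functions, `q_candidates_for` and `log_likelihood`, which stand in for steps G and J respectively; both names and the toy scoring function are hypothetical and do not reproduce formula (19).

```python
def best_p_q(p0, m, q_candidates_for, log_likelihood, step=0.001):
    """Scan P over [p0 - m, p0 + m] at the given resolution.

    For every P, each candidate Q is scored, and the (P, Q) pair with the
    largest score is returned.
    """
    n_steps = round(2 * m / step)
    best = None
    for k in range(n_steps + 1):
        p = round(p0 - m + k * step, 6)
        for q in q_candidates_for(p):
            score = log_likelihood(p, q)
            if best is None or score > best[0]:
                best = (score, p, q)
    return best[1], best[2]


# Toy stand-ins: Q_n = X_f + (2 - n) * P with X_f = 0.62 and n in {0, 1},
# and a score that peaks at P = 0.335, Q = 0.955.
def toy_candidates(p):
    return [0.62 + (2 - n) * p for n in (0, 1)]


def toy_score(p, q):
    return -((p - 0.335) ** 2) - (q - 0.955) ** 2


best_p, best_q = best_p_q(0.33, 0.02, toy_candidates, toy_score)
```

With m = 0.02 and a resolution of 0.001 the outer loop visits 41 values of P, so the cost is dominated by how expensive one likelihood evaluation is.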

Step L:

Looking up the results of step H, the cancer sample purity and chromosome ploidy corresponding to the (P, Q) obtained in step K can be found.

In another aspect, the invention provides an apparatus for calculating cancer cell purity and chromosome ploidy in a cancer sample, comprising a processor for running a program, the program running the following steps:

Step A:

Obtaining whole-genome sequencing (WGS) data from paired cancer tissue and normal tissue samples (from the same cancer patient), and aligning the sequencing data to a reference genome;

Step B:

Step C:

Step D:

Step E:

Step F:

Step G:

G1:

G2:

G3:

Q_n = X_f - n×P + 2×P = X_f + (2-n)×P, n ∈ [0, N]   (13.3)

In formula (13.3), n represents the number of peaks before X_f and takes integer values from 0 to N, P represents the spacing between peaks corresponding to adjacent copy number segments, X_f represents the TRE mean of the first actually observed peak, and Q_n represents the Q value when there are theoretically n peaks before X_f;

Step H:

Step I:

T_i = X_f - n×P + i×P = X_f + (i-n)×P, n ∈ [0, N]   (13.4)

In formula (13.4), n represents the number of peaks before X_f and takes integer values from 0 to N, P represents the spacing between peaks corresponding to adjacent copy number segments, X_f represents the TRE mean of the first actually observed peak, and T_i represents the TRE mean of the i-th peak.

A fragment that falls near T_i is considered to have copy number i; a fragment that does not fall near any T_i is classified as a subclonal fragment, and all subclonal fragments are excluded from subsequent analysis. From the calculated cancer sample purity γ and the copy number of a peak, the expected MAF f_b of that peak can be calculated; the expected MAF differs between peaks, and over all peaks on the genome the set of expected MAFs {f_b} is obtained. The TRE mean and variance (or standard deviation) of each peak are calculated at the same time;

Step J:

J1:

J2:

J3:

Step K:

Step L:

As a preferred embodiment, in the above method and apparatus for calculating cancer sample purity and chromosome ploidy, in step A, the reference genome hs37d5 (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz) used in phase 3 of the 1000 Genomes Project is adopted as the reference genome of the present invention; it contains all chromosomes and decoy sequences in GRCh37. The alignment software is the Burrows-Wheeler Aligner (BWA), the alignment method is bwa mem, and the result is a bam format file of the alignment results for the cancer and normal samples.

As a preferred embodiment, in the above method and apparatus for calculating cancer sample purity and chromosome ploidy, in step B, the samtools software is used to extract the position and length information of the reads, the HGSNV sites, and the number of reads covering each site. When read information is extracted using the samtools view command, reads whose alignment quality (MAPQ) is lower than 31 are filtered out (parameter -q 31; -q filters out reads with poor alignment quality), and reads that fail to align correctly are filtered out with the parameters -f 0x2 -F 0x18 (-f extracts reads that meet the given requirements, and -F filters out reads that meet the given requirements). When HGSNV information is extracted using the samtools mpileup command, reads with alignment quality (MAPQ) lower than 20 are filtered out (parameter -q 20), and bases with base quality lower than 20 are filtered out (parameter -Q 20; -Q filters out bases of poor quality). When selecting allele frequencies, the present invention uses the -l parameter of samtools mpileup. To use this parameter, a bed format file containing SNP position information must be prepared in advance. The method of the present invention collects in advance the heterozygous allele loci identified by the 1000 Genomes Project (http://www.internationalgenome.org/) from a large number of samples, filters out loci whose B-allele frequency is less than 0.05, and makes the remainder into a bed file. Using the -l parameter greatly accelerates the extraction of HGSNV sites and improves the operating efficiency of the apparatus while ensuring a sufficient number of HGSNV sites.
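The samtools invocations described for step B can be assembled as argument lists, for example for use with Python's `subprocess` module. This is a sketch only: the function names and file names are hypothetical, the site-list option is written as `-l` (lowercase L), which is the samtools mpileup option that takes a BED or position list file, and any further options a real pipeline would need (such as a reference FASTA) are left out.

```python
def samtools_view_args(bam_path):
    """Arguments for extracting reads with MAPQ >= 31 and the stated flag filters."""
    return ["samtools", "view", "-q", "31", "-f", "0x2", "-F", "0x18", bam_path]


def samtools_mpileup_args(bam_path, sites_bed):
    """Arguments for a pileup restricted to the candidate sites in a bed file."""
    return ["samtools", "mpileup", "-q", "20", "-Q", "20", "-l", sites_bed, bam_path]


# Hypothetical file names, shown only to illustrate the resulting command lines.
view_cmd = samtools_view_args("tumor.bam")
pileup_cmd = samtools_mpileup_args("tumor.bam", "candidate_sites.bed")
print(" ".join(view_cmd))
print(" ".join(pileup_cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when file paths contain spaces.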

As a preferred embodiment, in the above method and apparatus for calculating cancer sample purity and chromosome ploidy, step C may include four steps:

C1: divide the whole genome into windows of a certain base length and count, for each window, the number of reads covering it; in the counting, the position of each read is represented by its midpoint;

C2: create an index file for the reference genome to speed up the counting of GC content;

C3: taking the GC content of each window as the independent variable and the number of reads of each window as the dependent variable, fit the function relating read count to GC content;

C4: adjust the genome-wide read counts using the fitted model.
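Step C1 can be sketched as follows for a single chromosome. The sketch assumes 1-based read start coordinates and a 500-base window (the window size used in the example of this description); the function name and the sample reads are hypothetical.

```python
from collections import Counter


def count_reads_per_window(reads, window_size=500):
    """Count reads per window, placing each read at its midpoint.

    reads is an iterable of (start, length) pairs with 1-based starts;
    window 0 covers positions 1..window_size, window 1 the next block, etc.
    """
    counts = Counter()
    for start, length in reads:
        midpoint = start + (length - 1) // 2
        counts[(midpoint - 1) // window_size] += 1
    return counts


# Four hypothetical 100-base reads; their midpoints are 50, 500, 529 and 1039.
window_counts = count_reads_per_window([(1, 100), (451, 100), (480, 100), (990, 100)])
```

Using the midpoint means a read that straddles a window boundary is counted exactly once, in the window that holds most of it.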

As a preferred embodiment, in the above method and apparatus for calculating cancer sample purity and chromosome ploidy, in step C2, the present invention creates GC content index files for the reference genome, recording for each chromosome the cumulative number of guanine (G) and cytosine (C) bases in regions at intervals of 1, 5, 25, and 125 bases. When counting the GC content in a window, a fast algorithm can then extract it as a*125+b*25+c*5+d*1 (where a, b, c, and d are coefficient variables). For example, to count the GC content in a 380 bp region, the region can be decomposed as 3*125+1*5; it is then only necessary to read the GC content of the corresponding three 125-base regions and one 5-base region from the index files. In addition, the present invention stores the index files in binary format, which greatly speeds up the extraction of the GC content of a specific region.
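The decomposition a*125+b*25+c*5+d*1 can be computed greedily from the largest block size downward. The sketch below covers only the decomposition of a region length; how the blocks are aligned to index positions is not described here, and the function name is hypothetical.

```python
def decompose_length(length, block_sizes=(125, 25, 5, 1)):
    """Return (a, b, c, d) such that a*125 + b*25 + c*5 + d*1 == length."""
    coefficients = []
    for size in block_sizes:
        count, length = divmod(length, size)
        coefficients.append(count)
    return tuple(coefficients)


a, b, c, d = decompose_length(380)
print(a, b, c, d)  # 3 0 1 0, i.e. 3*125 + 1*5
```

Because each block size is five times the next, at most four blocks of any one of the three smaller sizes are ever needed, so a lookup touches only a handful of index entries regardless of region length.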

As a preferred embodiment, in the above method and apparatus for calculating cancer sample purity and chromosome ploidy, in step C3, the present invention uses the GC content of each window extracted in steps C1 and C2 to fit how the read count varies with GC content through the following elastic network model. The GC content of the window is taken as the variable x; x, x², x³, x⁴, x⁵, and x⁶ are used as the input variables of the elastic network model, and the number of reads is used as the output variable, to construct the elastic network model shown in formula (20), where y represents the number of reads observed in the window, X represents the input variable matrix, β represents the variable coefficient matrix, j represents the coefficient subscript, P represents the total number of coefficients, and λ₁ and λ₂ represent the penalty coefficients.

As a preferred embodiment, in the above method and apparatus for calculating cancer sample purity and chromosome ploidy, in step C4, the model from step C3 is used to predict the theoretical read count μ_gc of each window; the average GC content of the genome is defined as μ, the number of reads observed in the window is defined as y, and the corrected number of reads in the window is Y. The correction formula is shown in formula (21):

As a preferred embodiment, in the above method and apparatus for calculating cancer sample purity and chromosome ploidy, in step D, the present invention calculates the TRE value of each window using formula (1). The whole genome is then segmented by the BIC-seq software according to the TRE values. The idea of BIC-seq is to use the Bayesian Information Criterion (BIC) to compute a BIC value for adjacent windows; the smaller the value, the more similar the two windows, and windows with BIC values less than 0 are merged. In the end, BIC-seq divides the whole genome into different segments according to differences in fragment copy number. Each segment has a TRE mean different from that of its adjacent segments, that is, it differs from them in copy number.

As a preferred embodiment, in the above method and apparatus for calculating cancer sample purity and chromosome ploidy, in step E, the genomic fragments produced by BIC-seq in step D are used to count the number of windows contained in each fragment and the mean and variance of its TRE. The TRE of each fragment is then smoothed, as shown in formula (22). For each genomic fragment, the TRE mean is taken as the mean μ of a normal distribution and the TRE variance as the variance σ of the normal distribution, and the distribution of the number of windows over TRE in the range [μ-2σ, μ+2σ] is calculated. Here v is defined as the TRE coordinate, with value range [μ-2σ, μ+2σ] and a resolution of 0.000, C_win is the number of windows allocated to site v, and C_T is the total number of windows in the fragment. After the windows of every fragment have been distributed normally over the TRE values in this way, the numbers of windows at each TRE site are summed over all fragments, giving the genome-wide distribution of window count over TRE.
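One possible reading of the smoothing in step E is sketched below. Formula (22) is not reproduced here, so several points are assumptions: the spread is taken as the square root of the TRE variance, the grid step is assumed to be 0.001, and the weights are truncated to two spreads on either side of the mean and rescaled so that each fragment still contributes exactly C_T windows. The function names are hypothetical; the two input fragments use TRE means, variances, and window counts in the style of Table 1.

```python
import math


def smooth_segment(mean, variance, n_windows, step=0.001):
    """Spread n_windows over TRE grid points within two spreads of the mean.

    Grid points are integer multiples of step; weights follow a normal curve
    and are rescaled to sum to n_windows (an assumed normalisation).
    """
    spread = math.sqrt(variance)
    low = round((mean - 2 * spread) / step)
    high = round((mean + 2 * spread) / step)
    weights = {
        k: math.exp(-0.5 * ((k * step - mean) / spread) ** 2)
        for k in range(low, high + 1)
    }
    total = sum(weights.values())
    return {k: n_windows * w / total for k, w in weights.items()}


def genome_histogram(segments, step=0.001):
    """Sum the smoothed window counts of all segments per TRE grid point."""
    histogram = {}
    for mean, variance, n_windows in segments:
        for k, count in smooth_segment(mean, variance, n_windows, step).items():
            histogram[k] = histogram.get(k, 0.0) + count
    return histogram


segments = [(0.944508, 0.0014742, 86128), (1.3383, 0.00114954, 196613)]
histogram = genome_histogram(segments)
tallest = max(histogram, key=histogram.get)  # grid index of the highest bin
```

Keying the histogram by integer grid index rather than by floating-point TRE value avoids rounding mismatches when contributions from different fragments are summed.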

As a preferred embodiment, in the above method and apparatus for calculating cancer sample purity and chromosome ploidy, in step F, all P in the range (0, 1) are traversed with a resolution of 0.001, and the value of Y(P) is calculated using the autoregressive model. Y(P) appears as a multimodal distribution, similar to that shown in Figure 3, where the horizontal axis is P and the vertical axis is Y(P); the present invention takes the P corresponding to the maximum value of Y(P) in the second peak as the calculation result of P. M_t is the maximum value of TRE and is here set to 3.
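The scan over P can be illustrated with a lagged-product score on the window-count histogram. Formulas (12) and (13) are not reproduced here, so the product form used below, Y = Σ_t C(X_t) × C(X_{t+lag}) with lag = 1000 × P, is an assumption made for illustration, as are the function name and the toy histogram.

```python
def autoregressive_scores(counts, max_lag):
    """Return a list Y where Y[lag] = sum over t of counts[t] * counts[t + lag].

    counts[t] is the number of windows at TRE = t / 1000, so lag = 1000 * P.
    """
    n = len(counts)
    return [
        sum(counts[t] * counts[t + lag] for t in range(n - lag))
        for lag in range(max_lag + 1)
    ]


# Toy histogram with three narrow peaks at TRE 0.2, 0.5 and 0.8 (spacing 0.3).
counts = [0] * 1001
for centre in (200, 500, 800):
    counts[centre - 1] = 5
    counts[centre] = 10
    counts[centre + 1] = 5

scores = autoregressive_scores(counts, max_lag=600)
```

In this toy data the score is also high at very small lags, simply because each peak overlaps itself, and the next local maximum appears at lag 300, the true peak spacing.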

As a preferred embodiment, in the above method and apparatus for calculating cancer sample purity and chromosome ploidy, step G includes three steps; in step G1, the TRE interval [0, 1] is used as the value range of the variable X_f, TRE sites where C(X_f) is less than 1000 are filtered out, and the X_f that maximizes formula (13.1) is taken as the mean point of the first peak.

As a preferred embodiment, in the above method and apparatus for calculating cancer sample purity and chromosome ploidy, in step I, the expected MAF f_b of a peak can be calculated from the cancer sample purity γ calculated in step H and the copy number of the peak. Step I may include three steps, I1, I2, and I3.

I1: calculate the theoretical MAF value of the HGSNVs in a peak using formula (14). In formula (14), C_mcp represents the major allele copy number, C_cp represents the overall copy number of the peak, and f represents the resulting theoretical value of the MAF in the peak; it can be seen that when C_cp is large, f has many different possible values.

I2: use the negative binomial distribution to estimate the probability of the total number of reads covering each HGSNV site, and use formula (15) to calculate the probability p and the number of failures r of the negative binomial distribution. In formula (15), m is the mean of the number of reads in all windows in the fragment, and v is the variance of the number of reads in all windows in the fragment. The resulting p is the success probability of the random variable under the negative binomial distribution, r is the number of failures of the random variable, and the random variable is the number of reads covering a given HGSNV.

I3: combine the probability of the number of reads covering a given HGSNV, obtained from the negative binomial distribution, with the fact that, for a given number of reads, the HGSNV has only two genotypes and therefore follows the binomial distribution, and use formula (16) to calculate the corrected value f_b of f (that is, the expectation of f). Within a peak, different C_mcp values yield different f_b; the f_b closest to the mean of the observed MAF of the peak is selected as the f_b of that peak. In formula (16), k represents the number of a given allele (A or B) at an HGSNV site, d is the number of reads covering the HGSNV, r is the number of failures of the random variable, and p is the success probability of the random variable under the negative binomial distribution;

For each Q_n, the copy number of every peak of the genome and the cancer sample purity can be inferred, so that f_b can be obtained for each peak, and hence the set of expected MAF values {f_b} of all peaks.
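The estimation of p and r from the mean m and variance v in step I2 can be illustrated by moment matching. Formula (15) is not reproduced here, so the sketch assumes one common parameterisation of the negative binomial distribution (the count of successes observed before r failures, with success probability p), for which the mean is r×p/(1-p) and the variance is r×p/(1-p)². The function name is hypothetical and the result may differ from formula (15) if a different convention is intended.

```python
def negative_binomial_params(mean, variance):
    """Moment-matched (p, r) under the assumed parameterisation.

    Solving mean = r*p/(1-p) and variance = r*p/(1-p)**2 gives
    p = 1 - mean/variance and r = mean**2 / (variance - mean).
    """
    if variance <= mean:
        raise ValueError("moment matching requires variance greater than mean")
    p = 1.0 - mean / variance
    r = mean * mean / (variance - mean)
    return p, r


# Hypothetical read-depth statistics for one fragment.
p, r = negative_binomial_params(mean=10.0, variance=20.0)
```

The guard reflects that a negative binomial model only applies to overdispersed counts; when the variance does not exceed the mean, no valid (p, r) exists under this parameterisation.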

As a preferred embodiment, in the above method and apparatus for calculating the purity and chromosome ploidy of a cancer sample, in the step K, m is 0.02, and the traversal interval of the P value is [P-0.02, P+0.02].

Through the hierarchical mixed Gaussian model provided by the invention, the rapid and accurate calculation of the purity of the cancer sample is realized, the time and economic cost of the purity estimation are saved, and the accuracy of the calculation result is improved.

DRAWINGS

Figure 1 shows the distribution of the number of windows in the whole genome over TRE. Panel A shows the TRE distribution without GC correction, and panel B shows the TRE distribution after GC content correction.

Fig. 2 shows a model of the TRE distribution in a cancer cell. The figure shows that, after smoothing, the peaks satisfy a distribution with period P, and a small number of small peaks that do not satisfy the periodic distribution are considered to be subclonal fragments. Q represents the peak with a copy number of 2; there is no segment with a copy number of 1, so the number of windows of the peak at a position of about 0.6 is zero.

Figure 3 shows the autoregressive model curve, with P on the horizontal axis and Y(P) on the vertical axis.

Figure 4 shows a flow chart of the method and apparatus of the present invention.

Detailed description

The present invention will be further described with reference to the accompanying drawings and specific embodiments. However, the examples are provided for illustrative purposes only, and the scope of the invention is not limited to the embodiments.

A flow chart for calculating cancer sample purity and chromosome ploidy using the apparatus of the present invention is shown in FIG. 4.

In the examples, the experimental material used was the whole genome sequencing data of the normal tissue TCGA-AD-A5EJ-10A and the cancer tissue TCGA-AD-A5EJ-01A of the sample (TCGA-AD-A5EJ) downloaded from the TCGA (https://cancergenome.nih.gov/) database. The computing platform is ubuntu 16.04, and the method is implemented as C++, Python, and R programs.

Example: the purity and ploidy of the cancer sample were calculated using a hierarchical mixed Gaussian model based on whole-genome sequencing data for the cancer tissue and normal tissue of sample TCGA-CM-4746.

1. Collect sample data: download the whole genome sequencing data of the tumor sample and normal sample of TCGA-CM-4746-01A in TCGA. The cancer sample bam file size is 12.6G, and the normal sample bam file size is 10.1G. The bam files are converted into fastq files using the PICARD software. The fastq files were aligned to the reference genome hs37d5 using bwa mem to obtain new cancer sample and normal sample bam files with file sizes of 12.4G and 9.9G, respectively.

2. Download the vcf files of chromosomes 1 to 22 provided by the 1000 Genomes Project (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/), and use the SelectVariants method of GATK to extract the BIALLELIC sites with an allele frequency greater than 5% in the reference genome hs37d5.fa as potential HGSNV sites, resulting in 5633774 biallelic sites.

3. Extract the read information of the normal sample and the cancer sample, and extract the HGSNV information of the cancer sample. Using samtools to extract the sequence coverage and HGSNVs of the cancer sample, 67732 HGSNVs were obtained. When extracting HGSNVs, the biallelic sites obtained in step 2 above are used as the table of candidate sites, and the samtools view method extracts HGSNVs directly from the candidate table to speed up the extraction.

4. Establish the index files for the GC content of the reference genome, using a 500 bp window: from the reference genome hs37d5.fa file downloaded in step 1 above, GC content index files for the 1, 5, 25, and 125 base intervals are established and stored in binary format.

5. Using 500 bp as the window, count the number of reads in each window genome-wide, and use the index files generated in step 4 to calculate the GC content in each window. The read counts were then corrected for GC content by the elastic network model.

6. For each window, calculate the TRE using the corrected read counts. The genome was segmented by BIC-seq according to TRE. The segmentation results are shown in Table 1. Each row of data gives the position of a genomic fragment, the mean and variance of its TRE, and the number of windows in the fragment.

Table 1 Results of BIC-seq fragmentation of the genome

Chromosome	Start	End	TRE mean	TRE variance	Number of windows
chr1	13001	45265500	0.944508	0.0014742	86128
chr1	45265501	85978000	0.945454	0.00133201	80970
chr1	85978001	86011500	1.27362	0.0981321	68
chr1	86011501	116069000	0.94915	0.00153058	58775
chr1	116069001	120339500	1.01323	0.00437891	8488
chr1	120339501	143744000	1.07492	0.016337	2442
chr1	143744001	144707500	1.36461	0.0514154	469
chr1	144707501	145290000	1.46903	0.0252744	887
chr1	145290001	145833000	1.73666	0.0262944	936
chr1	145982001	148248000	1.33895	0.0131223	2306
chr1	148248001	149200000	1.68214	0.0381542	725
chr1	149200001	249240500	1.3383	0.00114954	196613
chr2	10001	41403000	0.948769	0.00134201	81865
chr2	41403001	51317000	0.563251	0.00183306	19717
chr2	51317001	91701500	0.950325	0.00145809	75550
chr2	91701501	91824500	1.23428	0.0605257	188
chr2	91824501	233387500	0.952652	0.000744377	271202
chr2	233387501	243186500	0.936487	0.00308593	19032
chr3	60001	197956500	0.952134	0.000613478	386567
chr4	10001	6890000	0.959657	0.00417921	13520
chr4	6890001	191044500	0.952703	0.000633657	358360
chr5	11501	12616000	0.55999	0.00180406	24895
chr5	12616001	180901000	0.948663	0.000676533	323787
chr6	124001	24827000	0.941509	0.00171221	49041
chr6	24827001	25982500	0.612532	0.00544556	2311
chr6	25982501	171051000	0.953312	0.000725761	280819
chr7	10001	5630500	0.954414	0.00461892	10976
chr7	5630501	38306500	0.955796	0.00149665	64732
chr7	38306501	38394000	1.24215	0.0368986	176
chr7	38394001	55087500	0.950529	0.00212248	33130
chr7	55087501	73283500	0.947545	0.00272214	27156
chr7	73283501	142340000	0.955659	0.00107445	134626
chr7	142340001	142491000	1.14391	0.0259809	303
chr7	142491001	159128500	0.965812	0.00251871	31914
chr8	11501	46857500	1.32635	0.00175707	84373
chr8	46857501	47744500	1.01383	0.015617	1735
chr8	47744501	146304000	1.33053	0.00112019	194805
chr9	10501	89624000	0.956959	0.00120825	119638
chr9	89624001	90043500	1.25975	0.0167381	836
chr9	90043501	92343000	1.91959	0.010442	4544
chr9	92343001	93352000	3.24803	0.0257886	1498
chr9	93352001	93696000	3.54267	0.0399231	683
chr9	93696001	94232000	2.92273	0.0283735	1068
chr9	94232001	95076500	3.23526	0.0249389	1668
chr9	95076501	95080000	1.47383	0.2343	8
chr9	95080001	95099000	0.580555	0.0461102	39
chr9	95099001	124413000	0.94853	0.00161715	58162
chr9	124413001	124419500	1.57438	0.201191	14
chr9	124419501	141128000	0.944109	0.00248041	32615
chr10	66001	135525000	0.946217	0.000787218	255445
chr11	113001	134946500	0.950948	0.000777276	259377
chr12	60501	93500	1.88481	0.206704	52
chr12	93501	100378000	0.949564	0.000874433	193005
chr12	100378001	133841500	0.943986	0.00155671	65852
chr13	19020501	115110000	0.984116	0.000893338	189923
chr14	19000001	22533500	0.981344	0.00661204	5638
chr14	22533501	23038000	1.12146	0.0151437	1009
chr14	23038001	74253500	0.948794	0.00119784	101769
chr14	74253501	74254500	3.63901	1.12718	3
chr14	74254501	107289500	0.943291	0.00152386	65750
chr15	20000001	29346000	0.940128	0.00397412	13933
chr15	29346001	102521500	0.946193	0.00104706	141616
chr16	60001	32646500	0.950261	0.00184498	60296
chr16	32646501	33796500	0.889339	0.0178243	1172
chr16	33796501	90282000	0.945044	0.00135985	89305
chr17	1	6109000	0.91987	0.00396313	11836
chr17	6109001	6125500	1.97935	0.163898	34
chr17	6125501	16653500	0.922942	0.00286745	20945
chr17	16653501	16738500	0.739932	0.0500593	142
chr17	16738501	22262500	0.959384	0.00557573	9958
chr17	22262501	27097500	1.71012	0.0106513	3647
chr17	27097501	34432500	0.93298	0.00339514	14570
chr17	34432501	34508500	1.28216	0.050695	121
chr17	34509001	36215500	0.91458	0.00729353	2890
chr17	36215501	36415000	1.0867	0.0380984	263
chr17	36415001	39421500	0.949891	0.00582802	5984
chr17	39421501	39432000	0.249188	0.0386297	20
chr17	39432001	51996000	0.937382	0.00264461	24210
chr17	51996001	52037000	1.3725	0.0543202	83
chr17	52037001	55322000	0.938207	0.00465989	6539
chr17	55322001	55632500	0.607987	0.0102	622
chr17	55632501	68609500	0.942895	0.00246878	25575
chr17	68609501	68745000	1.26374	0.0235519	272
chr17	68745001	81195000	0.943562	0.00287259	24285
chr18	10001	9988500	1.32762	0.00351405	19817
chr18	9988501	78017500	0.944828	0.00106534	128567
chr19	89001	23964500	0.940128	0.00211755	46618
chr19	23964501	24028000	0.547803	0.0243662	128
chr19	24028001	24621000	0.915363	0.0118983	1155
chr19	24628501	59119000	1.27099	0.00216296	61940
chr20	60001	29423000	0.944634	0.00166596	51929
chr20	29423001	62965500	1.6413	0.0024125	66170
chr21	9411001	40067000	0.938475	0.00164672	52525
chr21	40067001	40442000	0.605613	0.00897434	747
chr21	40442001	48120000	0.958554	0.00369137	14996
chr22	16050001	51235000	0.944787	0.00174455	66591

7. Step 6 yields the mean and variance of the TRE of each segment, together with the number of windows the segment contains. The TRE mean and variance of each segment are taken as the mean and variance of a normal distribution, and the windows in the segment are smoothed according to that distribution. The window counts at each TRE value are then summed over all segments.
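
The smoothing in step 7 can be sketched as follows. This is a minimal illustration rather than the implementation used in the embodiment: it assumes that the windows of a segment are allocated in proportion to the normal density at TRE sites spaced 0.001 apart within two standard deviations of the mean, and that the standard deviation is the square root of the tabulated variance. The function names `smooth_segment` and `tre_histogram` are hypothetical.

```python
import math
from collections import defaultdict

def smooth_segment(mean, variance, n_windows, step=0.001):
    """Spread one segment's windows over TRE sites within mean +/- 2 sd."""
    sd = math.sqrt(variance)  # assumption: sd is the square root of the variance
    lo = round((mean - 2 * sd) / step)
    hi = round((mean + 2 * sd) / step)
    sites = [round(i * step, 3) for i in range(lo, hi + 1)]
    if sd == 0:
        # Degenerate segment: all windows fall on the single site at the mean.
        weights = [1.0 for _ in sites]
    else:
        # Unnormalised normal density evaluated at each TRE site.
        weights = [math.exp(-0.5 * ((v - mean) / sd) ** 2) for v in sites]
    total = sum(weights)
    # Rescale so the allocated windows add up to the segment's window count.
    return {v: n_windows * w / total for v, w in zip(sites, weights)}

def tre_histogram(segments):
    """Sum the smoothed window counts of all segments at each TRE site."""
    hist = defaultdict(float)
    for mean, variance, n_windows in segments:
        for v, count in smooth_segment(mean, variance, n_windows).items():
            hist[v] += count
    return dict(hist)

# Two chr12 segments from the segmentation table: (TRE mean, variance, windows).
segments = [(0.949564, 0.000874433, 193005), (0.943986, 0.00155671, 65852)]
hist = tre_histogram(segments)
```

Because each segment is rescaled to its own window count, the summed histogram preserves the total number of windows on the genome.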

8. Autoregressive analysis is performed on the smoothed distribution of window counts over TRE, giving P = 0.386.

9. With P = 0.386, the TRE mean of the first actually observed peak is 0.562. At most one theoretical peak can precede the first observed peak, that is, N = 1. The possible values of Q are therefore Q₀ = 1.334 and Q₁ = 0.948. The likelihood function values of the mixed Gaussian model for these two values of Q are 1.77E+07 and 1.78E+07, respectively.
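
The arithmetic in step 9 can be checked with a short sketch. It applies formula (13.3) of the claims, Q_n = X_f + (2 - n) × P, and assumes that N is obtained as floor(X_f / P), which is consistent with N = 1 here. The function name `candidate_q` is hypothetical.

```python
import math

def candidate_q(x_f, p):
    """Return [Q_0, ..., Q_N] using Q_n = X_f + (2 - n) * P."""
    # Assumption: the maximum number of preceding theoretical peaks is floor(X_f / P).
    n_max = math.floor(x_f / p)
    return [round(x_f + (2 - n) * p, 3) for n in range(n_max + 1)]

print(candidate_q(0.562, 0.386))  # [1.334, 0.948]
```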

10. The maximum likelihood of the BIC-corrected mixed Gaussian model is calculated for each P in the range [P-0.02, P+0.02]. The results are shown in Table 2.

Table 2. Results of the mixed Gaussian model over the range of P values

P value	Q value	Likelihood function value	P value	Q value	Likelihood function value
0.366	0.932	1.12E+06	0.386	1.334	1.77E+07
0.366	1.298	1.12E+06	0.387	0.948	1.78E+07
0.367	0.933	849906	0.387	1.335	1.77E+07
0.367	1.3	847125	0.388	0.948	1.84E+07
0.368	0.934	795735	0.388	1.336	1.84E+07
0.368	1.302	792858	0.389	0.948	1.88E+07
0.369	0.935	832922	0.389	1.337	1.87E+07
0.369	1.304	830037	0.39	0.948	1.87E+07
0.37	0.936	1.05E+06	0.39	1.338	1.87E+07
0.37	1.306	1.04E+06	0.391	0.948	1.88E+07
0.371	0.937	1.18E+06	0.391	1.339	1.87E+07
0.371	1.308	1.17E+06	0.392	0.949	1.84E+07
0.372	0.938	1.72E+06	0.392	1.341	1.83E+07
0.372	1.31	1.71E+06	0.393	0.95	1.84E+07
0.373	0.939	3.74E+06	0.393	1.343	1.83E+07
0.373	1.312	3.73E+06	0.394	0.951	1.80E+07
0.374	0.94	5.34E+06	0.394	1.345	1.79E+07
0.374	1.314	5.31E+06	0.395	0.952	1.75E+07
0.375	0.941	7.56E+06	0.395	1.347	1.74E+07
0.375	1.316	7.52E+06	0.396	0.953	1.63E+07
0.376	0.942	8.71E+06	0.396	1.349	1.62E+07
0.376	1.318	8.67E+06	0.397	0.954	1.53E+07
0.377	0.943	1.08E+07	0.397	1.351	1.52E+07
0.377	1.32	1.07E+07	0.398	0.955	1.34E+07
0.378	0.944	1.58E+07	0.398	1.353	1.33E+07
0.378	1.322	1.57E+07	0.399	0.956	1.17E+07
0.379	0.945	1.69E+07	0.399	1.355	1.17E+07
0.379	1.324	1.68E+07	0.4	0.957	8.63E+06
0.38	0.946	1.78E+07	0.4	1.357	8.57E+06
0.38	1.326	1.77E+07	0.401	0.958	6.79E+06
0.381	0.947	1.88E+07	0.401	1.359	6.74E+06
0.381	1.328	1.87E+07	0.402	0.959	1.77E+06
0.382	0.948	1.93E+07	0.402	1.361	1.76E+06
0.382	1.33	1.92E+07	0.403	0.96	831069
0.383	0.948	1.92E+07	0.403	1.363	826993
0.383	1.331	1.91E+07	0.404	0.961	412408
0.384	0.948	1.92E+07	0.404	1.365	411139
0.384	1.332	1.91E+07	0.405	0.962	351299
0.385	0.948	1.88E+07	0.405	1.367	350311
0.385	1.333	1.87E+07	0.406	0.963	352988
0.386	0.948	1.78E+07	0.406	1.369	352002

11. The results of step 10 show that the mixed model reaches its maximum value at P = 0.382, where Q is 0.948. From these values, the purity of the cancer sample is calculated to be 0.80 and the chromosome ploidy of the cancer cells to be 2.14.
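
The traversal and selection in steps 10 and 11 can be illustrated with a small sketch. It builds the grid of P values around the initial estimate 0.386 at a resolution of 0.001 and then picks the row with the largest tabulated value, as the text does. Only a few rows of Table 2 are copied in, and the names `p_grid` and `rows` are hypothetical.

```python
def p_grid(p0, m=0.02, step=0.001):
    """All P values in [p0 - m, p0 + m] at the given resolution."""
    k = round(m / step)
    return [round(p0 + i * step, 3) for i in range(-k, k + 1)]

grid = p_grid(0.386)  # 41 values from 0.366 to 0.406, matching Table 2

# A few (P, Q, likelihood function value) rows copied from Table 2.
rows = [
    (0.381, 0.947, 1.88e7),
    (0.382, 0.948, 1.93e7),
    (0.382, 1.330, 1.92e7),
    (0.383, 0.948, 1.92e7),
    (0.386, 0.948, 1.78e7),
]

# Step 11 keeps the (P, Q) pair with the largest tabulated value.
best_p, best_q, best_value = max(rows, key=lambda row: row[2])
print(best_p, best_q)  # 0.382 0.948
```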

Claims

1. A method for calculating cancer cell purity and chromosome ploidy in a cancer sample, the method comprising the steps of:

Step A:

Obtaining whole-genome sequencing data of paired cancer tissue and normal tissue samples, and aligning the sequencing data to a reference genome;

Step B:

From the alignment result file obtained in step A, the position and length of each read, the HGSNV sites, and the number of reads covering each site are extracted, and the MAF of every HGSNV is calculated according to formula (1.1):

In formula (1.1), n_r is the number of reads carrying the same allele as the reference genome, n_a is the number of reads carrying the other allele, n_t is the total number of reads covering the HGSNV site, and C is the MAF value of the HGSNV;

Step C:

Using the read position and length information obtained in step B, the number of reads contained in each window is counted window by window, and the read counts of all windows are corrected for the genomic GC content;

Step D:

Using the read counts corrected in step C, the TRE of each window is calculated with equation (1), and the genome is then segmented with the BIC-seq software on the basis of TRE to obtain genomic fragments delimited by copy number:

In formula (1), the two per-segment read-count terms denote, respectively, the number of reads covering segment s in the cancer sample and the number of reads covering segment s in the normal sample; N_t is the total number of reads of the cancer sample, N_n is the total number of reads of the corresponding normal sample, and e_s is the TRE value;

Step E:

Taking each genomic fragment produced by BIC-seq in step D as a unit, the mean and variance of the TRE over all windows in the fragment and the number of windows in the fragment are computed; the windows of each fragment are then smoothed according to this mean and variance so that the TRE distribution becomes more even; the smoothed window distributions of all fragments are summed to obtain the genome-wide distribution of window counts as a function of TRE; and the mean and variance of the MAF of all HGSNVs in each fragment are calculated fragment by fragment;

Step F:

Using the autoregressive model shown in equations (12) and (13), the TRE difference between fragments of adjacent copy number, denoted P, is calculated: P is traversed over a given range, Y(P) is calculated for each value, and, in the resulting distribution of Y(P), the P at which Y(P) is maximal within the second peak is selected as the result for P:

In equations (12) and (13), X_t represents a TRE value between 0 and M_t; t represents the TRE value multiplied by 1000; M_t represents the maximum value of TRE; the variable P represents the interval between two TRE sites; C(X_t) represents the number of windows at the position where TRE is X_t; C(X_{t+1000×P}) represents the number of windows at the position where TRE is X_{t+1000×P}; and Y(P) represents the value of the autoregressive model for the variable P;

Step G:

Based on the P obtained in step F, the TRE mean of the first actually observed peak in the TRE distribution is calculated; the maximum number N of theoretical peaks that may precede the first observed peak is then calculated; finally, for the case in which n theoretical peaks precede the first observed peak, the value of Q is calculated and denoted Q_n; step G comprises:

G1:

Based on the P calculated in step F, formula (13.1) is evaluated, and the X_f that maximizes formula (13.1) is selected as the TRE mean of the first actually observed peak:

In formula (13.1), i denotes the i-th peak, C(X_f + P×i) denotes the number of windows at the position where TRE is X_f + P×i, n denotes the maximum number of peaks within M_t, and M_t denotes the maximum value of TRE;

G2:

Using formula (13.2), the maximum number N of peaks that may exist before X_f is calculated from the P calculated in step F and the X_f calculated in step G1:

In formula (13.2), X_f represents the mean of the first peak, P represents the spacing between the peaks of fragments of adjacent copy number, and floor denotes rounding down to an integer;

G3:

Using the value of N calculated in step G2, for each integer n between 0 and N, the value of Q_n is calculated with equation (13.3):

Q_n = X_f - n×P + 2×P = X_f + (2 - n)×P, n ∈ [0, N]   (13.3)

In formula (13.3), n represents the number of peaks before X_f and takes integer values from 0 to N, P represents the spacing between the peaks of fragments of adjacent copy number, X_f represents the TRE mean of the first actually observed peak, and Q_n represents the value of Q when n theoretical peaks precede X_f;

Step H:

Using the P calculated in step F and the Q_n calculated in step G, the cancer sample purity γ and chromosome ploidy k are calculated with equations (10) and (11):

In formulas (10) and (11), γ represents the purity of the sample and k represents the ploidy of the chromosomes, whereby a corresponding (γ, k) is obtained for each (P, Q_n);

Step I:

For each integer n in [0, N], formula (13.4) is used to calculate the TRE mean of the i-th peak:

T_i = X_f - n×P + i×P = X_f + (i - n)×P, n ∈ [0, N]   (13.4)

In formula (13.4), n represents the number of peaks before X_f and takes integer values from 0 to N, P represents the spacing between the peaks of fragments of adjacent copy number, X_f represents the TRE mean of the first actually observed peak, and T_i represents the TRE mean of the i-th peak;

A fragment whose TRE falls near T_i is considered to have copy number i; a fragment that does not fall near any T_i is classified as a subclonal fragment, and all subclonal fragments are excluded from subsequent analysis; from the cancer sample purity γ and the copy number of each peak, the expected MAF f_b of the peak is calculated, the expected MAF differing between peaks; for all peaks on the genome, the set {f_b} of expected MAF values is finally obtained, together with the TRE mean and the variance or standard deviation of each peak;

Step J:

According to P calculated in step F and {f b } calculated in step I, a mixed Gaussian distribution model corrected by "Bayesian information criterion" as shown in formula (19) is constructed, and then the maximum likelihood estimation of the model is performed; Step J includes the following steps:

J1:

Construct a Gaussian distribution model as shown in equation (17) with P calculated in step F:

In equation (17), L(e_s; γ, k) represents the likelihood function of the TRE of the genomic fragments, N represents the number of all windows on the genome, I represents the maximum copy number of all fragments in the genome, σ_i represents the standard deviation of the TRE of all fragments of copy number i, obtained from step I, e_s is the observed TRE of the s-th window, S_i represents the TRE mean of the i-th peak, that is, T_i in step I, and p_i represents the weight for the s-th window having copy number i, with p_i equal to 1 for all i;

J2:

Construct a Gaussian distribution model as shown in equation (18) with f b calculated in step I:

In equation (18), L(f_s; γ, k) represents the likelihood function of the HGSNVs, M represents the number of all HGSNVs in the genome, s denotes the s-th HGSNV, and I represents the maximum copy number of all fragments in the genome; the term with subscript i,j represents the expected MAF of an HGSNV in a fragment whose copy number is i and whose major allele copy number is j, obtained from step I; f_s represents the mean of the observed MAF of all HGSNVs in the fragment, obtained from step E; σ_i,j denotes the standard deviation of the observed MAF of all HGSNVs in the fragment, obtained from step E; p_i,j denotes the weight of the Gaussian distribution when the copy number of the major allele is j, with p_i,j equal to 1 for all i and j; and p_i represents the weight for the fragment containing the s-th HGSNV having copy number i, with p_i equal to 1 for all i;

J3:

Adding (17) and (18) to obtain a mixed Gaussian model, and then performing BIC (Bayesian Information Criterion) correction on the mixed model to obtain the final mixed model as Equation (19):

BIC(e_s, f_s; γ, k) = -2×log L(f_s; γ, k) - 2×log L(e_s; γ, k) + I×log(N) + J×log(M)   (19)

In formula (19), BIC(e_s, f_s; γ, k) represents the likelihood function of the mixed model, I represents the maximum copy number of all fragments in the genome, J is the number of values taken by j in formula (18), N is the number of windows in the genome, and M is the number of HGSNVs in the genome.

For each integer n in the range [0, N], Q_n is obtained from step G and the set {f_b} of expected MAF values of all peaks is obtained from step I, and a model as shown in equation (19) is constructed from the pair (P, {f_b});

Step K:

Steps G to J are repeated for all P values in the interval [P-m, P+m] at a resolution of 0.001, yielding a series of different (P, Q_n) and their corresponding likelihood function values; the (P, Q_n) with the maximum likelihood function value is taken as the most suitable P and Q, where m is a value between 0 and 0.5;

Step L:

The results of step H are looked up to find the cancer sample purity and chromosome ploidy corresponding to the (P, Q) obtained in step K.
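
By way of illustration only, formulas (13.4) and (19) recited above can be written as two small functions. This is a sketch under stated assumptions: the logarithm in formula (19) is taken to be the natural logarithm, the two log-likelihoods are supplied by the caller because formulas (17) and (18) are not reproduced in this text, and the names `peak_means` and `bic_score` are hypothetical. The numerical inputs are those of the worked example in the description (X_f = 0.562, P = 0.386, n = 1).

```python
import math

def peak_means(x_f, p, n, max_copy_number):
    """Formula (13.4): T_i = X_f + (i - n) * P for i = 0..max_copy_number."""
    return [round(x_f + (i - n) * p, 3) for i in range(max_copy_number + 1)]

def bic_score(log_l_tre, log_l_maf, max_copy_number, n_major_values,
              n_windows, n_hgsnv):
    """Formula (19); log is assumed to be the natural logarithm."""
    return (-2.0 * log_l_maf - 2.0 * log_l_tre
            + max_copy_number * math.log(n_windows)
            + n_major_values * math.log(n_hgsnv))

print(peak_means(0.562, 0.386, 1, 3))  # [0.176, 0.562, 0.948, 1.334]
```

With n = 1, the peak of copy number 2 sits at TRE 0.948, which equals Q_1 from formula (13.3).
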
2. An apparatus for calculating cancer cell purity and chromosome ploidy in a cancer sample, comprising a processor for running a program, the program when run performing the following steps:

Step A:

Obtaining whole-genome sequencing data of paired cancer tissue and normal tissue samples, and aligning the sequencing data to a reference genome;

Step B:

From the alignment result file obtained in step A, the position and length of each read, the HGSNV sites, and the number of reads covering each site are extracted, and the MAF of every HGSNV is calculated according to formula (1.1):

In formula (1.1), n_r is the number of reads carrying the same allele as the reference genome, n_a is the number of reads carrying the other allele, n_t is the total number of reads covering the HGSNV site, and C is the MAF value of the HGSNV;

Step C:

Using the read position and length information obtained in step B, the number of reads contained in each window is counted window by window, and the read counts of all windows are corrected for the genomic GC content;

Step D:

Using the read counts corrected in step C, the TRE of each window is calculated with equation (1), and the genome is then segmented with the BIC-seq software on the basis of TRE to obtain genomic fragments delimited by copy number:

In formula (1), the two per-segment read-count terms denote, respectively, the number of reads covering segment s in the cancer sample and the number of reads covering segment s in the normal sample; N_t is the total number of reads of the cancer sample, N_n is the total number of reads of the corresponding normal sample, and e_s is the TRE value;

Step E:

Taking each genomic fragment produced by BIC-seq in step D as a unit, the mean and variance of the TRE over all windows in the fragment and the number of windows in the fragment are computed; the windows of each fragment are then smoothed according to this mean and variance so that the TRE distribution becomes more even; the smoothed window distributions of all fragments are summed to obtain the genome-wide distribution of window counts as a function of TRE; and the mean and variance of the MAF of all HGSNVs in each fragment are calculated fragment by fragment;

Step F:

Using the autoregressive model shown in equations (12) and (13), the TRE difference between fragments of adjacent copy number, denoted P, is calculated: P is traversed over a given range, Y(P) is calculated for each value, and, in the resulting distribution of Y(P), the P at which Y(P) is maximal within the second peak is selected as the result for P:

In equations (12) and (13), X_t represents a TRE value between 0 and M_t; t represents the TRE value multiplied by 1000; M_t represents the maximum value of TRE; the variable P represents the interval between two TRE sites; C(X_t) represents the number of windows at the position where TRE is X_t; C(X_{t+1000×P}) represents the number of windows at the position where TRE is X_{t+1000×P}; and Y(P) represents the value of the autoregressive model for the variable P;

Step G:

Based on the P obtained in step F, the TRE mean of the first actually observed peak in the TRE distribution is calculated; the maximum number N of theoretical peaks that may precede the first observed peak is then calculated; finally, for the case in which n theoretical peaks precede the first observed peak, the value of Q is calculated and denoted Q_n; step G comprises:

G1:

Based on the P calculated in step F, formula (13.1) is evaluated, and the X_f that maximizes formula (13.1) is selected as the TRE mean of the first actually observed peak:

In formula (13.1), i denotes the i-th peak, C(X_f + P×i) denotes the number of windows at the position where TRE is X_f + P×i, n denotes the maximum number of peaks within M_t, and M_t denotes the maximum value of TRE;

G2:

Using formula (13.2), the maximum number N of peaks that may exist before X_f is calculated from the P calculated in step F and the X_f calculated in step G1:

In formula (13.2), X_f represents the mean of the first peak, P represents the spacing between the peaks of fragments of adjacent copy number, and floor denotes rounding down to an integer;

G3:

Using the value of N calculated in step G2, for each integer n between 0 and N, the value of Q_n is calculated with equation (13.3):

Q_n = X_f - n×P + 2×P = X_f + (2 - n)×P, n ∈ [0, N]   (13.3)

In formula (13.3), n represents the number of peaks before X_f and takes integer values from 0 to N, P represents the spacing between the peaks of fragments of adjacent copy number, X_f represents the TRE mean of the first actually observed peak, and Q_n represents the value of Q when n theoretical peaks precede X_f;

Step H:

Using the P calculated in step F and the Q_n calculated in step G, the cancer sample purity γ and chromosome ploidy k are calculated with equations (10) and (11):

In formulas (10) and (11), γ represents the purity of the sample and k represents the ploidy of the chromosomes, whereby a corresponding (γ, k) is obtained for each (P, Q_n);

Step I:

For each integer n in [0, N], formula (13.4) is used to calculate the TRE mean of the i-th peak:

T_i = X_f - n×P + i×P = X_f + (i - n)×P, n ∈ [0, N]   (13.4)

In formula (13.4), n represents the number of peaks before X_f and takes integer values from 0 to N, P represents the spacing between the peaks of fragments of adjacent copy number, X_f represents the TRE mean of the first actually observed peak, and T_i represents the TRE mean of the i-th peak;

A fragment whose TRE falls near T_i is considered to have copy number i; a fragment that does not fall near any T_i is classified as a subclonal fragment, and all subclonal fragments are excluded from subsequent analysis; from the cancer sample purity γ and the copy number of each peak, the expected MAF f_b of the peak is calculated, the expected MAF differing between peaks; for all peaks on the genome, the set {f_b} of expected MAF values is finally obtained, together with the TRE mean and the variance or standard deviation of each peak;

Step J:

According to P calculated in step F and {f b } calculated in step I, a mixed Gaussian distribution model corrected by "Bayesian information criterion" as shown in formula (19) is constructed, and then the maximum likelihood estimation of the model is performed; Step J includes the following steps:

J1:

Construct a Gaussian distribution model as shown in equation (17) with P calculated in step F:

In equation (17), L(e_s; γ, k) represents the likelihood function of the TRE of the genomic fragments, N represents the number of all windows on the genome, I represents the maximum copy number of all fragments in the genome, σ_i represents the standard deviation of the TRE of all fragments of copy number i, obtained from step I, e_s is the observed TRE of the s-th window, S_i represents the TRE mean of the i-th peak, that is, T_i in step I, and p_i represents the weight for the s-th window having copy number i, with p_i equal to 1 for all i;

J2:

Construct a Gaussian distribution model as shown in equation (18) with f b calculated in step I:

In equation (18), L(f_s; γ, k) represents the likelihood function of the HGSNVs, M represents the number of all HGSNVs in the genome, s denotes the s-th HGSNV, and I represents the maximum copy number of all fragments in the genome; the term with subscript i,j represents the expected MAF of an HGSNV in a fragment whose copy number is i and whose major allele copy number is j, obtained from step I; f_s represents the mean of the observed MAF of all HGSNVs in the fragment, obtained from step E; σ_i,j denotes the standard deviation of the observed MAF of all HGSNVs in the fragment, obtained from step E; p_i,j denotes the weight of the Gaussian distribution when the copy number of the major allele is j, with p_i,j equal to 1 for all i and j; and p_i represents the weight for the fragment containing the s-th HGSNV having copy number i, with p_i equal to 1 for all i;

J3:

Adding (17) and (18) to obtain a mixed Gaussian model, and then performing BIC correction on the mixed model to obtain the final mixed model as Equation (19):

BIC(e_s, f_s; γ, k) = -2×log L(f_s; γ, k) - 2×log L(e_s; γ, k) + I×log(N) + J×log(M)   (19)

In formula (19), BIC(e_s, f_s; γ, k) represents the likelihood function of the mixed model, I represents the maximum copy number of all fragments in the genome, J is the number of values taken by j in formula (18), N is the number of windows in the genome, and M is the number of HGSNVs in the genome.

For each integer n in the range [0, N], Q_n is obtained from step G and the set {f_b} of expected MAF values of all peaks is obtained from step I, and a model as shown in equation (19) is constructed from the pair (P, {f_b});

Step K:

Steps G to J are repeated for all P values in the interval [P-m, P+m] at a resolution of 0.001, yielding a series of different (P, Q_n) and their corresponding likelihood function values; the (P, Q_n) with the maximum likelihood function value is taken as the most suitable P and Q, where m is a value between 0 and 0.5;

Step L:

The results of step H are looked up to find the cancer sample purity and chromosome ploidy corresponding to the (P, Q) obtained in step K.
3. The method according to claim 1 or the apparatus according to claim 2, wherein in step A, the reference genome hs37d5 used in phase 3 of the 1000 Genomes Project (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz) is used as the reference genome; and/or the alignment software is the Burrows-Wheeler Aligner (BWA), using its bwa mem alignment method, and the final alignment results for the cancer and normal samples are files in BAM format.
4. The method according to claim 1 or the apparatus according to claim 2, wherein in step B, the position and length of each read, the HGSNV sites, and the number of reads covering each site are extracted with the samtools software; when read information is extracted with the samtools view command, reads with a mapping quality (MAPQ) lower than 31 are filtered out with the parameter -q 31, where q specifies that reads of poor alignment quality are filtered out, and reads that are not correctly aligned are filtered out with the parameters -f 0x2 -F 0x18, where f retains reads that meet the given condition and F filters out reads that meet the given condition; when HGSNV information is extracted with the samtools mpileup command, reads with a mapping quality lower than 20 are filtered out with the parameter -q 20, and bases with a base quality lower than 20 are filtered out with the parameter -Q 20, where Q specifies that bases of poor quality are filtered out; when selecting the allele frequency, the -l parameter of samtools mpileup is used, which requires a BED-format file containing the SNP positions to be prepared in advance.
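
For illustration, the samtools invocations described in this claim can be assembled as argument lists. This is a sketch only: the file names are hypothetical, the commands are built but not executed, and options not named in the claim (for example a reference FASTA for mpileup) are left out.

```python
def samtools_view_args(bam_path):
    """samtools view keeping properly paired reads with MAPQ >= 31."""
    # -q 31: skip alignments with MAPQ below 31
    # -f 0x2: require the "properly paired" flag
    # -F 0x18: exclude reads with flag bits 0x8 or 0x10 set
    return ["samtools", "view", "-q", "31", "-f", "0x2", "-F", "0x18", bam_path]

def samtools_mpileup_args(bam_path, snp_bed_path):
    """samtools mpileup restricted to the SNP positions listed in a BED file."""
    # -q 20: minimum mapping quality; -Q 20: minimum base quality
    # -l: file of positions or regions at which to generate the pileup
    return ["samtools", "mpileup", "-q", "20", "-Q", "20",
            "-l", snp_bed_path, bam_path]

view_cmd = samtools_view_args("tumor.bam")  # hypothetical file name
pileup_cmd = samtools_mpileup_args("tumor.bam", "snp_sites.bed")
# To execute: subprocess.run(view_cmd, check=True) after importing subprocess.
```
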
5. The method according to claim 1 or the apparatus according to claim 2, wherein step C comprises four steps:

C1. The whole genome is divided into windows of a given base length, and the number of reads covering each window is counted, the position of each read being represented by its midpoint;

C2. An index file is created for the reference genome to speed up the counting of GC content;

C3. Taking the GC content of each window as the independent variable and the number of reads in each window as the dependent variable, the read count is fitted as a function of GC content;

C4. The genome-wide read counts are corrected using the fitted model.
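
Step C1 can be sketched as follows, purely as an illustration. The window length of 500 bases and the function name `count_reads_per_window` are assumptions, and reads are given as (1-based start, length) pairs on a single chromosome.

```python
from collections import Counter

def count_reads_per_window(reads, window_size):
    """Count reads per window, placing each read at its midpoint."""
    counts = Counter()
    for start, length in reads:
        midpoint = start + (length - 1) // 2  # 1-based midpoint of the read
        counts[(midpoint - 1) // window_size] += 1  # 0-based window index
    return counts

# The second read starts in window 0 but its midpoint (509) lies in window 1.
reads = [(1, 100), (460, 100), (501, 100)]
counts = count_reads_per_window(reads, window_size=500)
print(dict(counts))  # {0: 1, 1: 2}
```
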
6. The method or apparatus according to claim 5, wherein in step C2, a GC content index file is created for the reference genome by counting, for each chromosome, the cumulative number of guanine (G) and cytosine (C) bases in regions at intervals of 1, 5, 25, and 125 bases; when the GC content of a window is counted, it is extracted with the fast algorithm a*125+b*25+c*5+d*1, where a, b, c, and d are coefficient variables.
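
One reading of the expression a*125+b*25+c*5+d*1 is that a window of a given length is covered by a blocks of 125 bases, b blocks of 25, c blocks of 5, and d single bases, so that its GC count is obtained from a few precomputed block counts. The sketch below shows only that decomposition; the interpretation and the function name `decompose_length` are assumptions.

```python
def decompose_length(n_bases):
    """Split a length into coefficients (a, b, c, d) of 125, 25, 5 and 1."""
    a, rest = divmod(n_bases, 125)
    b, rest = divmod(rest, 25)
    c, d = divmod(rest, 5)
    return a, b, c, d

print(decompose_length(500))  # (4, 0, 0, 0)
print(decompose_length(487))  # (3, 4, 2, 2)
```
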
7. The method or apparatus according to claim 5, wherein in step C3, using the GC content of each window extracted in steps C1 and C2, the read count is fitted against the GC content with the following elastic net model, in which the GC content of a window is the variable x; x, x^2, x^3, x^4, x^5, and x^6 are used as the input variables of the elastic net model and the number of reads as the output variable, and the elastic net model is constructed as shown in equation (20):

In equation (20), y represents the number of reads observed in the window, X represents the input variable matrix, β represents the variable coefficient matrix, j represents the variable coefficient subscript, P represents the total number of coefficients, and λ_1 and λ_2 represent the penalty coefficients.
8. The method or apparatus according to claim 5, wherein in step C4, the theoretical read count μ_gc of each window is predicted using the model of step C3, the average GC content of the genome is defined as μ, the number of reads observed in the window is defined as y, and the corrected number of reads in the window is defined as Y; the correction formula is then as shown in (21):
9. The method according to claim 1 or the apparatus according to claim 2, wherein in step E, using the genomic fragments produced by BIC-seq in step D, the number of windows contained in each fragment and the mean and variance of its TRE are calculated, and the TRE of the fragment is then smoothed as shown in equation (22): for each genomic fragment, the TRE mean is taken as the mean μ of a normal distribution and the TRE variance as the variance σ of the normal distribution, and the distribution of the number of windows over TRE in the range [μ-2σ, μ+2σ] is calculated; v is defined as the TRE coordinate, with value range [μ-2σ, μ+2σ] and resolution 0.001; C_win is the number of windows allocated to site v, and C_T is the total number of windows in the fragment; after all windows of a fragment have been smoothed according to TRE, the window counts within the fragment follow a normal distribution; the window counts at each TRE site are summed over all fragments to obtain the genome-wide distribution of windows as a function of TRE:
10. The method according to claim 1 or the apparatus according to claim 2, wherein in step F, all P in the range [0, 1] are traversed at a resolution of 0.001 and the value of Y(P) is calculated with the autoregressive model; Y(P) shows a multimodal distribution, and the P at which Y(P) is maximal within the second peak is taken as the result for P; M_t is the maximum value of TRE, and M_t is set to 3.
11. The method according to claim 1 or the apparatus according to claim 2, wherein step G comprises three steps; in step G1, the TRE interval [0, 1] is traversed as X_f, TRE sites where C(X_f) is less than 1000 are filtered out, and the X_f at which formula (13.1) is maximal is taken as the mean of the first actually observed peak.
12. The method according to claim 1 or the apparatus according to claim 2, wherein in step I, the expected MAF f_b of each peak is calculated from the cancer sample purity γ calculated in step H and the copy number corresponding to the peak, wherein step I comprises:

I1. Using formula (14), the theoretical MAF value of the HGSNVs in a peak is calculated:

In formula (14), C_mcp represents the copy number of the major allele, C_cp represents the overall copy number of the peak, obtained from step I, and f represents the theoretical MAF in the peak; it can be seen that when C_cp is large, f has several different possible values;

I2. The negative binomial distribution is used to estimate the probability of the total number of reads covering each HGSNV site, and equation (15) is used to calculate the success probability p and the number of failures r of the negative binomial distribution:

In equation (15), m is the mean of the number of reads over all windows in the peak and v is the variance of the number of reads over all windows in the peak; the resulting p is the success probability of the random variable of the negative binomial distribution and r is the number of failures of the random variable, the random variable being the number of reads at a given HGSNV;

I3. The binomial distribution is used to obtain the probability of each allele read count at an HGSNV given the number of reads covering it, since an HGSNV has only two genotypes and therefore obeys the binomial distribution, and formula (16) is used to calculate the corrected value f_b of f; within one peak, different C_mcp values give different f_b, and the f_b closest to the mean observed MAF of the peak is selected as the f_b of that peak:

In formula (16), k represents the count of allele A or B at a given HGSNV site, d is the number of reads covering the HGSNV, r is the number of failures of the random variable, and p is the success probability of the random variable of the negative binomial distribution;

For each Q_n, the copy number corresponding to every peak of the genome and the cancer sample purity can be inferred, so that f_b is obtained for each peak and, from these, the set {f_b} of expected MAF values of all peaks.
13. The method according to claim 1 or the apparatus according to claim 2, wherein in step K, m is taken as 0.02, so that the traversal interval of P is [P-0.02, P+0.02].