WO2018201805A1 - Method and device for use in calculating cancer sample purity and chromosome ploidy - Google Patents

Method and device for use in calculating cancer sample purity and chromosome ploidy Download PDF

Info

Publication number
WO2018201805A1
WO2018201805A1 PCT/CN2018/078908 CN2018078908W WO2018201805A1 WO 2018201805 A1 WO2018201805 A1 WO 2018201805A1 CN 2018078908 W CN2018078908 W CN 2018078908W WO 2018201805 A1 WO2018201805 A1 WO 2018201805A1
Authority
WO
WIPO (PCT)
Prior art keywords
tre
value
peak
genome
calculated
Prior art date
Application number
PCT/CN2018/078908
Other languages
French (fr)
Chinese (zh)
Inventor
黄宇
罗志辉
苏瑶
范新平
Original Assignee
中国科学院上海药物研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院上海药物研究所 filed Critical 中国科学院上海药物研究所
Publication of WO2018201805A1 publication Critical patent/WO2018201805A1/en

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids

Definitions

  • the present invention is in the field of cancer research, and in particular relates to a method and apparatus for calculating cancer cell purity and intracellular chromosome ploidy in a cancer sample.
  • Cancer research is an important research area in life medicine and has a major impact on human health. Cancer is a kind of malignant proliferation of cells. Because of its complicated pathology, humans cannot overcome such diseases. Second generation sequencing provides the possibility to quickly detect patient genetic information. However, sequencing requires the extraction of samples from patient tissue, but usually cancer tissue does not simply contain cancer cells, it also has a very rich microenvironment. Cancer cell microenvironment refers to the environment of non-cancerous cells surrounding or accompanying cancer cells. When the cancer cell samples are extracted, these microenvironments are extracted together with the cancer cells and sequenced along with the cancer cells [1]. The proportion of cancer cells in a cancer sample is defined as the purity of the cancer sample.
  • the cancer genome usually contains a large number of somatic cell copy number variations, which are mainly caused by amplification or deletion of genomic fragments. Identifying changes in the copy number of genomic fragments of a particular tumor genome is an important topic in cancer genome research. Accurate identification of genomic fragment copy number has certain challenges, because the cancer fragment copy number is mainly determined by a mixture of two factors, one is the purity of cancer samples, that is, the proportion of cancer cells in cancer samples, and the other is chromosome ploidy [2, 3 ]. Traditional methods for identifying cancer sample purity and ploidy are using experimental techniques such as quantitative image analysis [4] or single cell sequencing [5]. But in large projects, such an approach can cost a lot of manpower, money, and time. With the development of sequencing technology, the rapid growth of sequencing data, and the accumulation of sequencing data analysis techniques, various cancer sample purity algorithms have been proposed and corresponding software has been developed.
  • PyLOH [11] used frequency information of hybrid SNV (single nucleotide variation) sites on the genome, and copy number of genomic fragments. PyLOH solves the problem of “difficulties in identification” to a certain extent, and can give a more reasonable solution. However, its accuracy is poor, especially in the case of subclone in the genome. Patchwork uses both types of information, but in the intermediate steps of calculating genotypes, manual identification is required, the results of manual judgments lack accuracy, and such semi-automated software brings a lot of inconvenience to the application.
  • Sequencing depth The ratio of the total number of bases (bp) obtained by sequencing to the size of the genome (Genome), which is one of the indicators for evaluating the amount of sequencing.
  • Window A genomic fragment divided by a length that represents the size of the window.
  • the window size can be freely set by the user, and is usually set to several hundred bases.
  • a large genomic fragment S can contain a large number of windows.
  • Tumor Read Enrichment Cancer long fragment reads the enrichment e s, refers to the ratio of the number of read number of segments within a cancerous samples read with a corresponding segment S corresponding normal sample, the following formula is defined:
  • N t represents the total number of reads of the cancer sample whole genome sequencing
  • N n represents the corresponding normal sample whole genome sequencing obtained read The total number.
  • Heterozygous Germline Single Nucleotide Variants Single base variation in heterozygous germline cells. Since human chromosomes are diploid, somatic cells are derived from embryonic cells, while HGSNV sites in germ cells have only two base types. A and B, one of which comes from the father and the other from the mother.
  • Major Allele Fraction The major allele score.
  • the HGSNV used in the present invention has only two alleles, one allele is identical to the reference genome and the other is different from the reference genome.
  • the two allelic scores are calculated by covering the number of reads of an allele divided by the ratio of the total number of reads covering the site, and MAF is the larger of the two allelic scores.
  • the calculation formula is as shown in (1.1), where n r is the number of reads containing the same allele as the reference genome, n a is the number of reads containing another allele, and n t is the total read covering the HGSNV site.
  • the quantity, C is the MAF value of the HGSNV.
  • MAF is a concept relative to HGSNV.
  • “MAF of a fragment” refers to the MAF mean of all HGSNVs in a fragment
  • MAF of peak refers to the MAF mean of HGSNV contained in all fragments in a peak.
  • Major allele copy number the copy number of the major allele, which refers to the value of the copy number of the major allele in the fragment with copy number i, which is greater than or equal to The integer.
  • Peak refers to the TRE clusters that are clustered together in the TRE distribution of all windows in the genome.
  • Figure A shows the TRE distribution of all windows on the genome, and the vertical axis shows the total number of windows corresponding to a TRE site.
  • the figure shows the TRE distribution before the genomic GC content correction.
  • Figure B shows the GC content correction.
  • the TRE distribution, in Figure B, can be seen that the window is obviously clustered.
  • This method defines the TRE cluster identified by the autoregressive model as peak, which is essentially the aggregation of the window within the genome segment with the same copy.
  • Cancer sample A cancer tissue taken from an individual with cancer, which contains a portion of cancer cells and a portion of normal cells.
  • P refers to the spacing between two adjacent peaks. Since peak is a cluster, the TRE of peak here is represented by the TRE mean of peak, so it is actually the difference of the TRE mean of two adjacent peaks. Since peak is an aggregation of windows within the same copy number genome segment, it is also expressed here as the difference of adjacent copy number segments TRE.
  • the object of the present invention is to overcome the deficiencies of the prior art and to provide a fully automatic, high efficiency, high accuracy method and apparatus for calculating cancer sample purity and chromosome ploidy.
  • the invention has broad application prospects in the calculation of cancer sample purity and chromosome ploidy.
  • the technical solution adopted by the present invention is to construct a mixed Gaussian model of the MAF distribution of TRE and HGSNV of different copy number fragments by using whole genome sequencing data of cancer samples and matched normal samples, and calculate the purity of the cancer sample and Chromosome ploidy.
  • the invention mainly uses the TRE information of the whole genome sequencing data and the MAF information of the HGSNV.
  • TRE basically reflects the copy number variation of cancer samples
  • MAF information of HGSNV basically reflects the genotype of cancer samples.
  • the difference in TRE is mainly due to the difference in copy number of genomic fragments.
  • the number of reads obtained by sequencing in high copy number genomic fragments must be greater than the number of reads obtained by sequencing of low copy number genomic fragments.
  • the difference in copy number of fragments is calculated by the difference in the number of reads in the fragments. Common methods in genome copy number detection. However, in most studies, the difference in the number of reads in the cancer sample segment was divided by the ratio of the number of reads in the normal sample to calculate the difference in the number of reads.
  • the present invention uses the TRE shown in the formula (1) to evaluate the difference in the number of reads of different segments of the slice.
  • the calculated ratio of the traditional method is not only affected by the purity and chromosome ploidy of the cancer sample, but also by the depth of sequencing of the cancer sample and the normal sample, and the TRE is not affected by the depth of the sample sequencing.
  • the genotype of each copy number fragment cannot be determined by relying solely on the difference in the number of reads, and more importantly, the compensation effect of sample purity and sample ploidy cannot be distinguished.
  • HGSNV combined with copy number difference fragments can provide genotype information and help to solve the compensation effect of purity and ploidy.
  • the genotypes that may correspond to different copy number fragments are listed one by one, and then the results of the permutation and combination are calculated to select the most reliable results.
  • the common feature of these methods is that the method has a long calculation time and poor accuracy, and the sample with high copy number or large genomic variation has a poor effect.
  • the invention calculates the purity and chromosome ploidy of the cancer sample according to the mixed Gaussian model of MAF and TRE of HGSNV, can significantly reduce the calculation time and improve the accuracy of the calculation result.
  • the proportion of normal cells in the cancer sample is 1- ⁇ .
  • the normal cell has a chromosome ploidy of 2 in the cancer sample, and the cancer cell has a chromosome ploidy of ⁇ .
  • the chromosome ploidy ⁇ of the cancer sample is as shown in the formula (2).
  • the TRE is calculated as shown in equation (1).
  • the derivation formula of the expectation E(e s ) of TRE is as shown in the following formula (4), N n and N t have the same meanings as in formula (1).
  • the method defines a parameter of some help understanding.
  • Length of fragment S L S length of human reference genome L gw , depth of sequencing of cancer samples
  • Sampling depth of normal samples Then the fragment S is sequenced in a cancer sample to a depth of Fragment S is sequenced in a normal sample to a depth of ⁇ S refers to a parameter related to the characteristics of the segment S (such as the GC content and the like which cause the depth of preference of the sequencing), so it is the same in cancer and normal samples.
  • e s is represented by ⁇ , ⁇ , C s as shown in the formula (5).
  • C s represents the copy number of the fragment S in the cancer cell
  • the corresponding TRE mean S i and S i+1 when the copy number of the fragment S is i and i+1 are as shown in the formula (6), respectively.
  • the cancer sample purity ⁇ and chromosome ploidy ⁇ can be calculated by determining P and Q.
  • the peak distribution of the peak shown in Figure B is because the TRE value of the genomic fragment with the same copy number (the mean value of the TRE of all windows in the fragment) is not completely equal, and the copy number fragment TRE is mutually There is an error between them. This error obeys a Gaussian distribution, so the clustered distribution in Figure B is considered to be a Gaussian distribution.
  • the Q site represents the TRE value corresponding to the segment of copy number 2.
  • the number of windows corresponding to the distance P and the distance 2P before the Q site is 0.
  • the distribution map of TRE for X f , that is, the first peak that appears, it may correspond to a copy number of 2 (the copy corresponding to the peak of copy number 1 and 0 is 0), and the corresponding copy number may be 1 (The fragment with a copy number of 0 corresponds to the window of the peak is 0.) It is also possible that the corresponding copy number is 0.
  • the copy number of the first peak in Figure 2 that is, the X f corresponding segment has several different possibilities, and each of them may make Q correspond to a different peak.
  • the present invention calculates the copy number of the most likely X f corresponding segment by mixing the Gaussian model, thereby determining the value of Q, and finally obtaining the purity and chromosome ploidy of the cancer sample.
  • the present invention determines the value of X f by the following formula (13.1).
  • C(X f +P) represents the number of windows corresponding to the position where TRE is X f + P, and n represents the maximum number of peaks within M t .
  • f(X f ) takes the maximum value, X f is the TRE mean of the first peak.
  • Equation (13.2) uses Equation (13.2) to find up to a maximum of several peaks before X f .
  • X f represents a mean peak of TRE
  • P denotes the distance between the copy number of segments corresponding to an adjacent peak
  • floor represents rounded down
  • N 0, indicating no prior peak X f
  • X f corresponds
  • a corresponding Q value can be calculated by the following formula (13.3).
  • Q is the TRE value corresponding to the peak of the fragment with a copy number of 2.
  • the TRE value of the fragment with a copy number of 0 is X f -n ⁇ P, where n represents the number of peaks before X f , the value ranges from 0 to N, and P represents the adjacent copy number.
  • the spacing between the peaks corresponding to the segments, the meaning of X f is the same as in the formula (13.1), and the formula (13.3) is as follows, where Q n represents the value of Q when there are n peaks before X f .
  • the MAF actually obeys the binomial distribution with f as the probability and d as the number of experiments.
  • the present invention corrects f by the following formula (16) to obtain the expected value f b of the MAF.
  • k represents the number of alleles (A or B) at a certain HGSNV site, and the total amount of alleles measured is d (equal to the sequencing depth).
  • Equation (13.2) indicates that there are N possibilities for the copy number of each genomic fragment.
  • Equation (14) shows that a certain copy number fragment can have multiple major allele copy numbers, so for each genome peak, multiple f can be calculated, and multiple f b can be calculated, and the average observed MAF value is taken from the distance peak. The most recent f b is expected as the MAF of the peak. The whole genome has multiple peaks, and the MAF expectation of each peak is different, corresponding to calculating multiple MAF expected values ⁇ f b ⁇ . Considering that there is a certain error in the MAF of HGSNV in a certain peak, but also obeys the Gaussian distribution, the expected value of MAF of all HGSNV in peak can be directly calculated from the actual data.
  • the copy number and genotype of the peak can be judged by comparing the MAF observation of peak with the value of ⁇ f b ⁇ of peak.
  • the TRE value corresponding to the peak of the copy number 2 that is, the position of Q can be calculated.
  • the present invention proposes a hybrid Gaussian model to fit the observation data of TRE and HGSNV.
  • Equation 17 The Gaussian distribution model of TRE is shown in Equation 17:
  • L(e s ; ⁇ , ⁇ ) represents the likelihood function of the genomic fragment TRE.
  • N represents the number of all windows on the genome.
  • I represents the largest copy number of all fragments in the genome.
  • ⁇ i represents the standard deviation of the TRE of all segments of copy number i.
  • e s is the TRE observation of the sth window, and
  • S i represents the TRE mean of the ith peak.
  • p i represents the weight of the copy number of the sth window, i, and all i, p i in this formula take the value 1.
  • the formula shows that the size of the likelihood function is related to the value of S i . When S i and e S are closer, the value of the likelihood function is larger, and the closer the P value is to the true value. A reasonable P value can be calculated using the maximum likelihood estimation of L(e s ; ⁇ , ⁇ ).
  • the present invention further determines the P value and determines the Q value by combining the Gaussian distribution model of HGSNV as shown in the formula (18).
  • L(f s ; ⁇ , ⁇ ) represents the likelihood function of HGSNV.
  • M represents all HGSNV in the genome.
  • S represents the Sth HGSNV.
  • I represents the largest copy number of all fragments in the genome.
  • F i,j is the expected value of MAF of HGSNV in the fragment whose copy number is i and the copy number of the major allele is j, that is, f b calculated by the formula (16).
  • f s represents the mean value of the observed values of the MAF within the segment
  • ⁇ i,j represents the standard deviation of the MAF observations of the HGSNV within the segment.
  • Equation (18) shows that the size of the likelihood function is related to the value of F i,j .
  • F i,j is closer to f s , the larger the likelihood function, the more accurate f s is , and the formula (14) is also shown. The more accurate f is, the corresponding C cp and C mcp of each fragment can be obtained. Then the value of Q is determined.
  • the method adds equation (17) and formula (18) to obtain a mixed Gaussian model.
  • the present invention further uses a Bayesian Information Criterion (BIC) method to give a mixed Gaussian model a penalty function for controlling over-fitting of the model.
  • BIC Bayesian Information Criterion
  • BIC(S s , f s ; ⁇ , ⁇ ) represents the likelihood function of the mixed model
  • I is the number of Gaussian distributions in equation (17)
  • J is the number of Gaussian distributions in equation (18)
  • N is the number of windows in the genome
  • M is the number of HGSNVs in the genome.
  • one aspect of the invention provides a method for calculating cancer cell purity and ploidy ploidy in a cancer sample, the method comprising the steps of:
  • step A From the comparison result file obtained in step A, the read position and length information, the HGSNV site and the read quantity information covering the site are extracted, and the MAF of all HGSNVs is calculated, wherein the calculation formula is as shown in (1.1):
  • n r is the number of reads containing the same allele as the reference genome
  • n a is the number of reads containing another allele
  • n t is the total number of reads covering the HGSNV site
  • C The MAF value of the HGSNV
  • the number of reads contained in each window is counted in units of window, and the number of reads in all windows is corrected by using the genomic GC content;
  • the TRE of each window is calculated using equation (1), and then the genome is fragmented by BIC-seq software using TRE to obtain a genomic fragment divided by copy number:
  • the mean, variance, and number of windows in the window of all windows in the segment are counted, and the number of windows per segment of the genome is smoothed according to the mean and variance (smooth Processing, the distribution of TRE is more uniform, and then the window distribution of all segments after smoothing is summarized to obtain the distribution result of window change with TRE on the genome; and the mean value of MAF of all HGSNVs in the segment is calculated in units of fragments. variance;
  • Interval C(X t ) represents the number of windows corresponding to the position where TRE is X t ; C(X t+1000 ⁇ P ) represents the number of windows corresponding to the position where TRE is X t+1000 ⁇ P ;
  • Y(P) represents the function value of the auto-regressive model under the variable P;
  • step F calculate the TRE mean of the first actual observed peak in the TRE distribution, and then calculate the maximum number of theoretical peaks N before the first actual peak, and finally there are n before the first actual peak.
  • the value of Q is calculated, denoted by Q n , where step G may include:
  • i represents the ith peak
  • C(X f + P ⁇ i) represents the position where TRE is X f + P ⁇ i, the corresponding number of windows
  • n represents the maximum number of peaks within M t
  • M t represents the maximum value of TRE
  • X f represents the mean of the first peak
  • P represents the spacing between peaks corresponding to adjacent copy number segments
  • floor represents an integer below
  • n represents the number of peaks before X f , the value ranges from 0 to N, P represents the spacing between peaks of adjacent copy number segments, and X f represents the first actual observation.
  • the TRE mean of peak, Q n represents the Q value when there are theoretically n peaks before X f ;
  • represents the purity of the sample, and ⁇ represents the ploidy of the chromosome, so that the corresponding ( ⁇ , ⁇ ) can be obtained for all (P, Q N );
  • n represents the number of peaks before X f
  • the value range is an integer between 0 and N
  • P represents the spacing between peaks corresponding to adjacent copy number segments
  • X f represents the first actual observation.
  • the TRE mean of peak T i represents the TRE mean of the ith peak
  • the fragment For a fragment that falls near T i , the fragment is considered to have a copy number i; for a fragment that does not fall near T i , it is classified as a subcloned fragment, and all subcloned fragments are eliminated in subsequent analysis; Calculate the copy number of the cancer sample purity ⁇ and peak, calculate the expected f b of the MAF of peak, and the MAF expectation of different peaks. For all peaks on the genome, finally obtain the desired set of MAF ⁇ f b ⁇ ; TRE mean and variance (or standard deviation) for each peak;
  • Step J can include the following steps:
  • L(e s ; ⁇ , ⁇ ) represents the likelihood function of the genomic fragment TRE
  • N represents the number of all windows on the genome
  • I represents the maximum copy number of all fragments in the genome
  • ⁇ i represents a copy
  • the standard deviation of the TRE of all segments of number i is obtained by step I
  • e s is the TRE observation value of the sth window
  • S i represents the TRE mean value of the i th peak, that is, T i in step I
  • p i represents the first
  • the copy number of s windows is the weight of i, and the value of all i, p i is 1;
  • L(f s ; ⁇ , ⁇ ) represents the likelihood function of HGSNV
  • M represents the number of all HGSNVs in the genome
  • S represents the Sth HGSNV
  • I represents the maximum copy number of all fragments in the genome
  • i, j represents the expected value of the MAF of the HGSNV in the fragment whose copy number is i, the copy number of the major allele is j, obtained from step I
  • f s represents the mean value of the observed value of the MAF of all HGSNVs in the fragment, obtained from step E
  • ⁇ i,j represents the standard deviation of the MAF observations of all HGSNVs in the segment, obtained from step E
  • p i,j represents the weight of the Gaussian distribution when the copy number of the primary allele is j, for all i And j, p i, j have a value of 1
  • p i represents the weight of the copy of the segment where the S H HSNSNV is i, and the value of all
  • BIC(e s , f s ; ⁇ , ⁇ ) represents the likelihood function of the mixed model
  • I represents the maximum copy number of all the fragments in the genome
  • J is the value of j in the formula (18).
  • Number, N is the number of windows in the genome, and M is the number of HGSNVs in the genome.
  • Q n can be obtained by step G, or the desired set of MAFs of all peaks ⁇ f b ⁇ can be obtained by step I, and a pair (P, ⁇ f b ⁇ ) It is possible to construct a model shown in equation (19), essentially for each pair (P, Q n ), to construct a model as shown in equation (19);
  • step H the corresponding cancer sample purity and chromosome ploidy can be found under (P, Q) obtained in step K.
  • the invention provides an apparatus for calculating cancer cell purity and chromosome ploidy in a cancer sample, comprising a processor for running a program, the program running the following steps:
  • step A From the comparison result file obtained in step A, the read position and length information, the HGSNV site and the read quantity information covering the site are extracted, and the MAF of all HGSNVs is calculated, wherein the calculation formula is as shown in (1.1):
  • n r is the number of reads containing the same allele as the reference genome
  • n a is the number of reads containing another allele
  • n t is the total number of reads covering the HGSNV site
  • C The MAF value of the HGSNV
  • the number of reads contained in each window is counted in units of window, and the number of reads in all windows is corrected by using the genomic GC content;
  • the TRE of each window is calculated using equation (1), and then the genome is fragmented by BIC-seq software using TRE to obtain a genomic fragment divided by copy number:
  • the mean, variance, and number of windows in the window of all windows in the segment are counted, and the number of windows per segment of the genome is smoothed according to the mean and variance (smooth Processing, the distribution of TRE is more uniform, and then the window distribution of all segments after smoothing is summarized to obtain the distribution result of window change with TRE on the genome; and the mean value of MAF of all HGSNVs in the segment is calculated in units of fragments. variance;
  • Interval C(X t ) represents the number of windows corresponding to the position where TRE is X t ; C(X t+1000 ⁇ P ) represents the number of windows corresponding to the position where TRE is X t+1000 ⁇ P ;
  • Y(P) represents the function value of the auto-regressive model under the variable P;
  • step F calculate the TRE mean of the first actual observed peak in the TRE distribution, and then calculate the maximum number of theoretical peaks N before the first actual peak, and finally there are n before the first actual peak.
  • the value of Q is calculated, denoted by Q n , where step G may include:
  • i represents the ith peak
  • C(X f + P ⁇ i) represents the position where TRE is X f + P ⁇ i, the corresponding number of windows
  • n represents the maximum number of peaks within M t
  • M t represents the maximum value of TRE
  • X f represents the mean of the first peak
  • P represents the spacing between peaks corresponding to adjacent copy number segments
  • floor represents an integer below
  • n represents the number of peaks before X f
  • the value range is an integer between 0 and N
  • P represents the spacing between peaks corresponding to adjacent copy number segments
  • X f represents the TRE mean of the first actually observed peak
  • Q n represents the theoretical existence before X f Q value at n peaks;
  • represents the purity of the sample, and ⁇ represents the ploidy of the chromosome, so that the corresponding ( ⁇ , ⁇ ) can be obtained for all (P, Q N );
  • n represents the number of peaks before X f
  • the value range is an integer between 0 and N
  • P represents the spacing between peaks corresponding to adjacent copy number segments
  • X f represents the first actual observation.
  • the TRE mean of peak T i represents the TRE mean of the ith peak.
  • the fragment For a fragment that falls near T i , the fragment is considered to have a copy number i; for a fragment that does not fall near T i , it is classified as a subcloned fragment, and all subcloned fragments are eliminated in subsequent analysis; Calculated the copy number of the cancer sample purity ⁇ and peak, the expected f b of the MAF of peak can be calculated, the MAF of different peaks is expected to be different, and for all peaks on the genome, the desired set of MAF ⁇ f b ⁇ ; Calculate the TRE mean and variance (or standard deviation) of each peak at the same time;
  • Step J can include the following steps:
  • L(e s ; ⁇ , ⁇ ) represents the likelihood function of the genomic fragment TRE
  • N represents the number of all windows on the genome
  • I represents the maximum copy number of all fragments in the genome
  • ⁇ i represents a copy
  • the standard deviation of the TRE of all segments of number i is obtained by step I
  • e s is the TRE observation value of the sth window
  • S i represents the TRE mean value of the i th peak, that is, T i in step I
  • p i represents the first
  • the copy number of s windows is the weight of i, and the value of all i, p i is 1;
  • L(f s ; ⁇ , ⁇ ) represents the likelihood function of HGSNV
  • M represents the number of all HGSNVs in the genome
  • S represents the Sth HGSNV
  • I represents the maximum copy number of all fragments in the genome
  • i, j represents the expected value of the MAF of the HGSNV in the fragment whose copy number is i, the copy number of the major allele is j, obtained from step I
  • f s represents the mean value of the observed value of the MAF of all HGSNVs in the fragment, obtained from step E
  • ⁇ i,j represents the standard deviation of the MAF observations of all HGSNVs in the segment, obtained from step E
  • p i,j represents the weight of the Gaussian distribution when the copy number of the primary allele is j, for all i And j, p i, j have a value of 1
  • p i represents the weight of the copy of the segment where the S H HSNSNV is i, and the value of all
  • BIC(e s , f s ; ⁇ , ⁇ ) represents the likelihood function of the mixed model
  • I represents the maximum copy number of all the fragments in the genome
  • J is the value of j in the formula (18).
  • Number, N is the number of windows in the genome, and M is the number of HGSNVs in the genome.
  • Q n can be obtained by step G, or the desired set of MAFs of all peaks ⁇ f b ⁇ can be obtained by step I, and a pair (P, ⁇ f b ⁇ ) It is possible to construct a model shown in equation (19), essentially for each pair (P, Q n ), to construct a model as shown in equation (19);
  • step H the corresponding cancer sample purity and chromosome ploidy can be found under (P, Q) obtained in step K.
  • the reference genome hs37d5 (ftp://ftp) used in the phase 3 project of the 1000 genome project is adopted.
  • the comparison software uses Burrows-Wheeler Aligner (BWA), and the comparison method uses bwa mem, and finally obtains the bam format file of the comparison result of cancer and normal samples.
  • step B samtools software is used to extract the position and length information of the read, the HGSNV site and the read covering the site. Quantity information.
  • the read information is extracted using the samtools view command, the sequence whose sequence alignment quality (MAPQ) is lower than 31 is filtered out (parameter -q 31, q indicates that the sequence with poor sequencing quality is filtered out), and the read that fails to match correctly is filtered out.
  • the parameter -f 0x2 -F 0x18, f indicates that the sequence that meets certain requirements is extracted, and F indicates that the sequence that meets certain requirements is filtered.
  • the method of the present invention collects in advance the 1000 genome (genome) program (http://www.internationalgenome.org/), the heterozygous allele locus based on a large number of samples, and filters out the B-allele frequency ( B-allele frequence) is less than 0.05 and is then made into a bed file.
  • the use of the "-1" parameter greatly accelerates the extraction speed of the HGSNV site and improves the operating efficiency of the device on the basis of ensuring sufficient HGSNV sites.
  • the step C may include four steps:
  • the whole genome is divided according to a window of a certain base length, and the number of reads of the window is covered for each window statistics, and the position of the read is represented by the midpoint of each read in the statistics;
  • the present invention creates a GC content index file for the reference genome.
  • G guanine
  • C cytosine
  • the present invention stores the index file in a binary format, which greatly speeds up the extraction of the GC content of a specific region.
  • the present invention uses the respective window GC contents extracted in steps C1 and C2 to fit the read through the following elastic network model.
  • the amount varies with GC content.
  • the invention uses the GC content of the window as the variable x, uses x, x 2 , x 3 , x 4 , x 5 , x 6 as the input variables of the elastic network model, and uses the number of reads as the output variable to construct an elastic network model such as a formula ( 20) shown.
  • y represents the number of reads observed in the window
  • X represents the input variable matrix
  • represents the variable coefficient matrix
  • j represents the variable coefficient subscript
  • P represents the total number of coefficients
  • ⁇ 1 and ⁇ 2 represent the penalty coefficients.
  • the model in step C3 is used to predict the theoretical read number ⁇ gc of each window, and the average GC content of the genome.
  • the number of reads observed in the window
  • y the number of corrected reads in the window
  • the present invention calculates the TRE value of each window using the formula (1).
  • the whole genome was then segmented using the BIC-seq software using the value of TRE.
  • the idea of BIC-seq is to use the Bayesian Information Criterion (BIC) algorithm to count the BIC values of adjacent windows. The smaller the value, the more similar the two windows are, and then merge the windows with BIC values less than 0. Finally, BIC-seq will follow the fragment.
  • BIC-seq The difference in copy number divides the whole genome into different segments. Each segment has a different TRE mean value than the adjacent segment, that is, there is a difference in copy number.
  • the number of windows included in the fragment is calculated by using the genomic fragment after the BIC-seq processing in the step D, The mean and variance of the TRE.
  • the TRE of the segment is then subjected to smooth processing.
  • the processing method is as shown in formula (22). For each genomic fragment, the mean value of TRE is taken as the mean ⁇ of the normal distribution, and the variance of TRE is taken as the variance ⁇ of the normal distribution, and the distribution of the window number of TRE in the range of [ ⁇ -2 ⁇ , ⁇ +2 ⁇ ] is calculated.
  • v is the TRE coordinate
  • the value range is [ ⁇ -2 ⁇ , ⁇ +2 ⁇ ]
  • the resolution is 0.000
  • C win is the number of windows allocated to the v-site
  • C T is the total number of windows in the segment.
  • step F traversing all Ps in the range of (0, 1) with a resolution of 0.001, using an autoregressive model.
  • Y(P) appears as a multimodal distribution, similar to that shown in Figure 3, where the horizontal axis is P and the vertical axis represents Y(P), and the present invention uses the second peak Y (P)
  • the maximum value corresponds to P as the calculation result of P, and M t is the maximum value of TRE, where M t is set to 3.
  • the step G includes three steps, and in the step G1, the TRE interval of [0, 1] is used as the variable X f .
  • the TRE interval of [0, 1] is used as the variable X f .
  • Step I according to the copy number of the cancer sample purity ⁇ and peak calculated according to the step H, the expectation of the MAF of the peak can be calculated.
  • Step I can include three steps of I1, I2, and I3.
  • Equation (15) m is the mean of the number of reads in all windows in the fragment, and v is the variance of the number of reads in all windows in the fragment.
  • the obtained p is the probability of success of the random variable used for the negative binomial distribution, r For the number of times a random variable fails, the random variable is the number of reads in a certain HGSNV.
  • HGSNV has only two genotypes, subject to the binomial distribution law, and uses equation (16) to calculate the correction value f b of f (ie, the expectation of f).
  • f b the correction value of the MAF observed as the peak of the peak.
  • k represents the number of alleles (A or B) at a certain HGSNV site
  • d is the number of reads covering the HGSNV
  • r is the number of failed random variables
  • p is used for The probability of success of a random variable with a negative binomial distribution
  • m is 0.02
  • the traversal interval of the P value is [P-0.02, P+0.02].
  • the hierarchical mixed Gaussian model provided by the invention, the rapid and accurate calculation of the purity of the cancer sample is realized, the time and economic cost of the purity estimation are saved, and the accuracy of the calculation result is improved.
  • Figure 1 shows the distribution of the number of windows in the whole genome on the TRE.
  • Figure A shows the TRE distribution that has not undergone GC correction
  • Figure B shows the TRE distribution after GC content correction.
  • Fig. 2 shows a model of TRE distribution in a cancer cell.
  • the figure shows that after the smooth treatment, the peak in the figure satisfies the distribution with a period of P, and a small number of small peaks which do not satisfy the periodic distribution are considered to be subcloned fragments.
  • Q represents a peak with a copy number of 2, and there is no segment with a copy number of 1, so the number of windows of the peak at a position of about 0.6 is zero.
  • Figure 3 shows that the horizontal axis is the vertical axis of the P-class autoregressive model.
  • Figure 4 shows a flow chart of the method and apparatus of the present invention.
  • FIG. 1 A flow chart for calculating cancer sample purity and chromosome ploidy using the apparatus of the present invention is shown in FIG.
  • the experimental material used was the normal tissue TCGA-AD-A5EJ-10A and the cancer tissue TCGA-AD of the sample (TCGA-AD-A5EJ) downloaded from the TCGA (https://cancergenome.nih.gov/) database.
  • TCGA-AD-A5EJ cancer tissue TCGA-AD of the sample
  • TCGA-AD-A5EJ cancer tissue TCGA-AD of the sample
  • TCGA-AD-A5EJ whole genome sequencing data.
  • the computing platform is ubuntu 16.04, and the specific implementation of the method is C++, Python, and R programs.
  • EXAMPLES The purity and ploidy of cancer samples were calculated using a hierarchical mixed Gaussian model based on genome-wide sequencing data for cancer tissues and normal tissues of sample TCGA-CM-4746.
  • Each column of data represents the positional information of a genomic fragment and the mean value of the TRE, the variance and the number of windows in the fragment.
  • Chromosome number Start termination TRE mean TRE variance Number of Windows Chr1 13001 45265500 0.944508 0.0014742 86128 Chr1 45265501 85978000 0.945454 0.00133201 80970 Chr1 85978001 86011500 1.27362 0.0981321 68 Chr1 86011501 116069000 0.94915 0.00153058 58775
  • step 6 the mean and variance of the TRE of each segment are obtained, and the number of windows included in the segment.
  • the TRE mean and variance of each segment are used as the mean and variance of the normal distribution, and the windows in the segment are smoothed according to the normal distribution. Summarize the TRE of all clips and the corresponding number of windows.
  • step 10 show that when P is 0.382, the mixed model takes the maximum value, and the Q at this time is 0.948. According to this, the purity of the cancer sample can be calculated to be 0.80, and the chromosome ploidy of the cancer cell is 2.14.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Organic Chemistry (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • Theoretical Computer Science (AREA)
  • Zoology (AREA)
  • Evolutionary Biology (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Wood Science & Technology (AREA)
  • Molecular Biology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A method and a device for use in calculating cancer sample purity and chromosome ploidy. By means of the provided hierarchical mixed Gaussian model, rapid and accurate calculation of cancer sample purity and chromosome ploidy is achieved, thereby saving time in and the economic costs of purity estimation, while improving the accuracy of calculation results; the present invention has broad application prospects in cancer sample purity and chromosome ploidy calculation.

Description

用于计算癌症样本纯度和染色体倍性的方法和装置Method and apparatus for calculating cancer sample purity and chromosome ploidy 技术领域Technical field
本发明属于癌症研究领域,具体涉及用于计算癌症样本中癌症细胞纯度和细胞内染色体倍性的方法和装置。The present invention is in the field of cancer research, and in particular relates to a method and apparatus for calculating cancer cell purity and intracellular chromosome ploidy in a cancer sample.
背景技术Background technique
癌症的研究是生命医学中的重要研究领域,并对人类健康生活有重大影响。癌症是一类细胞恶性增殖的疾病,因其病理十分复杂,人类尚无法攻克这类疾病。二代测序(next generation sequencing)为快速检测病人遗传信息提供了可能。然而测序需要从病人组织中提取样本,但通常癌症组织并不是只单纯地包含癌症细胞,它还有非常丰富的微环境。癌症细胞微环境指包围或伴随癌症细胞的正常细胞(non-cancerous cells)环境。癌细胞样本提取时,这些微环境会和癌细胞一起被提取,并会伴随癌细胞一起被测序[1]。癌症细胞在癌症样本中的比例被定义为癌症样本的纯度。癌症基因组通常包含着大量体细胞序列拷贝数变异,这些变异主要由基因组片段扩增或删除造成。识别特定肿瘤基因组的基因组片段拷贝数变化,是癌症基因组研究的一个重要课题。要准确鉴定基因组片段拷贝数具有一定挑战,因为癌症片段拷贝数主要由两个因素混合决定,一是癌症样本纯度,即癌症细胞在癌症样本中所占比例,二是染色体倍性[2,3]。传统中鉴定癌症样本纯度和染色体倍性的方法是使用实验技术,如定量图像分析[4]或单细胞测序[5]。但是在大型项目中,这样的方法会耗费大量人力、资金和时间。随着测序技术地发展,测序数据地快速增长,以及测序数据分析技术的积累,各种各样的癌症样本纯度算法被提出,并开发出了相对应的软件。Cancer research is an important research area in life medicine and has a major impact on human health. Cancer is a kind of malignant proliferation of cells. Because of its complicated pathology, humans cannot overcome such diseases. Second generation sequencing provides the possibility to quickly detect patient genetic information. However, sequencing requires the extraction of samples from patient tissue, but usually cancer tissue does not simply contain cancer cells, it also has a very rich microenvironment. Cancer cell microenvironment refers to the environment of non-cancerous cells surrounding or accompanying cancer cells. When the cancer cell samples are extracted, these microenvironments are extracted together with the cancer cells and sequenced along with the cancer cells [1]. The proportion of cancer cells in a cancer sample is defined as the purity of the cancer sample. The cancer genome usually contains a large number of somatic cell copy number variations, which are mainly caused by amplification or deletion of genomic fragments. Identifying changes in the copy number of genomic fragments of a particular tumor genome is an important topic in cancer genome research. Accurate identification of genomic fragment copy number has certain challenges, because the cancer fragment copy number is mainly determined by a mixture of two factors, one is the purity of cancer samples, that is, the proportion of cancer cells in cancer samples, and the other is chromosome ploidy [2, 3 ]. Traditional methods for identifying cancer sample purity and ploidy are using experimental techniques such as quantitative image analysis [4] or single cell sequencing [5]. But in large projects, such an approach can cost a lot of manpower, money, and time. With the development of sequencing technology, the rapid growth of sequencing data, and the accumulation of sequencing data analysis techniques, various cancer sample purity algorithms have been proposed and corresponding software has been developed.
基于基因组片段拷贝数变异和基于等位基因频率(突变位点的B-等位基因(B-allele))的计算方法被相继提出。基于等位基因频率的方法有PurityEst[6]和PurBayes[7],主要是依赖于随着肿瘤样本纯度和肿瘤基因组倍性的不同,等位基因的频率会有所不同。基于拷贝数变异的方法有CNAnorm[8]、THetA[9]、和ABSOLUTE[10]等。然而这两种方法都有不同程度的问题,使用等位基因频率的方法由于数据量的问题会有较大的误差,而运用拷贝数变异的方法虽然较稳定,但却无法区分样本纯度和染色体倍性的补偿效应,即存在识别问题。以上基于片段拷贝数的软件都没有解决这一问题,CNAnorm倾向于选择染色体倍性离二倍体最近的解决方案,ABSOLUTE结合了其他的经验数据,THetA则直接将所有可能结果都列出来了。Calculation methods based on genomic fragment copy number variation and based on allele frequency (B-allele of mutation sites) have been proposed. The methods based on allele frequencies are PurityEst [6] and PurBayes [7], mainly depending on the purity of the tumor sample and the ploidy of the tumor genome, the frequency of the alleles will be different. Methods based on copy number variation include CNAnorm [8], THetA [9], and ABSOLUTE [10]. However, both methods have different degrees of problems. The method of using the allele frequency has a large error due to the problem of the amount of data, while the method of using copy number variation is stable, but it cannot distinguish the purity of the sample and the chromosome. The compensation effect of ploidy, that is, the identification problem. None of the above software based on fragment copy number solves this problem. CNAnorm tends to choose the closest solution to chromosome ploidy from diploid. ABSOLUTE combines other empirical data, and THetA directly lists all possible results.
更优的方案应该是结合等位基因频率信息和片段拷贝数信息共同计算肿瘤样本纯度。PyLOH[11],patchwork[12]使用了基因组上杂合SNV(单核苷酸变异)位点的频率信息,和基因组片段的拷贝数。PyLOH一定程度上解决了“识别困难”的问题,可以更合理给出 唯一解决方案。但是其准确性较差,特别是遇到基因组中存在亚克隆(subclone)的情况下。Patchwork同时使用了两种信息,但是在计算基因型的中间步骤中,需要人工识别,人工判断的结果缺乏准确性,并且这种半自动化软件给应用带来很多不便。A better solution would be to calculate the purity of the tumor sample in combination with the allele frequency information and the fragment copy number information. PyLOH [11], patchwork [12] used frequency information of hybrid SNV (single nucleotide variation) sites on the genome, and copy number of genomic fragments. PyLOH solves the problem of “difficulties in identification” to a certain extent, and can give a more reasonable solution. However, its accuracy is poor, especially in the case of subclone in the genome. Patchwork uses both types of information, but in the intermediate steps of calculating genotypes, manual identification is required, the results of manual judgments lack accuracy, and such semi-automated software brings a lot of inconvenience to the application.
如何充分利用现有的二代测序数据准确计算癌症样本纯度和癌症细胞基因组倍性问题仍然是一项具有挑战性的工作。How to make full use of the existing second-generation sequencing data to accurately calculate the purity of cancer samples and the ploidy of cancer cell genome remains a challenging task.
参考文献references
1、Junttila M R,de Sauvage F J.Influence of tumour micro-environment heterogeneity on therapeutic response[J].Nature,2013,501(7467):346-354.1. Junttila M R, de Sauvage F J. Influence of tumour micro-environment heterogeneity on therapeutic response [J]. Nature, 2013, 501 (7467): 346-354.
2、Carter S L,Cibulskis K,Helman E,et al.Absolute quantification of somatic DNA alterations in human cancer[J].Nature Biotechnology,2012,30(5):413-21.2. Carter S L, Cibulskis K, Helman E, et al. Absolute quantification of somatic DNA alterations in human cancer [J]. Nature Biotechnology, 2012, 30(5): 413-21.
3、Oesper L,Mahmoody A,Raphael B J.Inferring intra-tumor heterogeneity from high-throughput DNA sequencing data[J].Genome Biology,2013,14(7):R80.3. Oesper L, Mahmoody A, Raphael B J. Inferring intra-tumor heterogeneity from high-throughput DNA sequencing data [J]. Genome Biology, 2013, 14(7): R80.
4、Yuan Y,Failmezger H,Rueda O M,et al.Quantitative Image Analysis of Cellular Heterogeneity in Breast Tumors Complements Genomic Profiling[J].Science Translational Medicine,2012,4(157):157ra143.4. Yuan Y, Failmezger H, Rueda O M, et al. Quantitative Image Analysis of Cellular Heterogeneity in Breast Tumors Complements Genomic Profiling [J]. Science Translational Medicine, 2012, 4 (157): 157ra143.
5、Navin N,Kendall J,Troge J,et al.Tumour evolution inferred by single-cell sequencing[J].Nature,2011,472(7341):90-4.5, Navin N, Kendall J, Troge J, et al. Tumour evolution inferred by single-cell sequencing [J]. Nature, 2011, 472 (7341): 90-4.
6、Su X,Zhang L,Zhang J,et al.PurityEst:estimating purity of human tumor samples using next-generation sequencing data.[J].Bioinformatics,2012,28(17):2265-2266.6. Su X, Zhang L, Zhang J, et al. Purity Est: Estimating purity of human tumor samples using next-generation sequencing data. [J]. Bioinformatics, 2012, 28(17): 2265-2266.
7、Larson N B.PurBayes:estimating tumor cellularity and subclonality in next-generation sequencing data[J].Bioinformatics,2013,29(15):1888-9.7. Larson N B. Pur Bayes: estimated tumor cellularity and subclonality in next-generation sequencing data [J]. Bioinformatics, 2013, 29(15): 1888-9.
8、Gusnanto A,Wood H M,Pawitan Y,et al.Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data[J].Bioinformatics,2012,28(1):40-47.8. Gusnanto A, Wood H M, Pawitan Y, et al. Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data [J]. Bioinformatics, 2012, 28(1): 40-47.
9、Oesper L,Mahmoody A,Raphael B J.Inferring intra-tumor heterogeneity from high-throughput DNA sequencing data[J].Genome Biology,2013,14(7):R80.9. Oesper L, Mahmoody A, Raphael B J. Inferring intra-tumor heterogeneity from high-throughput DNA sequencing data [J]. Genome Biology, 2013, 14(7): R80.
10、Carter S L,Cibulskis K,Helman E,et al.Absolute quantification of somatic DNA alterations in human cancer[J].Nature Biotechnology,2012,30(5):413-21.10. Carter S L, Cibulskis K, Helman E, et al. Absolute quantification of somatic DNA alterations in human cancer [J]. Nature Biotechnology, 2012, 30(5): 413-21.
11、Li Y,Xie X.Deconvolving tumor purity and ploidy by integrating copy number alterations and loss of heterozygosity[J].Bioinformatics,2015,30(4):2121.11. Li Y, Xie X. Deconvolving tumor purity and ploidy by combining copy number alterations and loss of heterozygosity [J]. Bioinformatics, 2015, 30(4): 2121.
12、Mayrhofer M,Dilorenzo S,Isaksson A.Patchwork:allele-specific copy number analysis of whole-genome sequenced tumor tissue[J].Genome Biology,2013,14(3):R24.12. Mayrhofer M, Dilorenzo S, Isaksson A. Patchwork: allele-specific copy number analysis of whole-genome sequenced tumor tissue [J]. Genome Biology, 2013, 14(3): R24.
发明内容Summary of the invention
术语定义Definition of Terms
为了更好地理解本发明,下面提供相关的解释和说明:For a better understanding of the invention, relevant explanations and explanations are provided below:
Whole Genome Sequencing(WGS):使用二代测序技术的全基因组测序。Whole Genome Sequencing (WGS): Whole genome sequencing using second generation sequencing technology.
read:高通量测序平台产生的测序序列。Read: Sequencing sequence generated by a high-throughput sequencing platform.
测序深度:测序得到的碱基(bp)总量与基因组(Genome)大小的比值,它是评价测序量的指标之一。Sequencing depth: The ratio of the total number of bases (bp) obtained by sequencing to the size of the genome (Genome), which is one of the indicators for evaluating the amount of sequencing.
window(窗口):按照一定长度划分的基因组片段,该长度代表window大小。本方法中window大小可由使用者自由设置,通常设置为几百碱基。一个大基因组片段S可以包含大量window。Window: A genomic fragment divided by a length that represents the size of the window. In this method, the window size can be freely set by the user, and is usually set to several hundred bases. A large genomic fragment S can contain a large number of windows.
Tumor Read Enrichment(TRE):癌症片段读长富集程度e s,指癌症样本中某一片段S内read数量与相应正常样本中对应片段read数量的比值,定义公式如下: Tumor Read Enrichment (TRE): Cancer long fragment reads the enrichment e s, refers to the ratio of the number of read number of segments within a cancerous samples read with a corresponding segment S corresponding normal sample, the following formula is defined:
Figure PCTCN2018078908-appb-000001
Figure PCTCN2018078908-appb-000001
公式(1)中,
Figure PCTCN2018078908-appb-000002
Figure PCTCN2018078908-appb-000003
分别表示在癌症样本中覆盖片段s的read数量和相匹配的正常样本中覆盖片段s的read数量,N t表示癌症样本全基因组测序获得read总数量,N n表示相应正常样本全基因组测序获得read总数量。
In formula (1),
Figure PCTCN2018078908-appb-000002
with
Figure PCTCN2018078908-appb-000003
Respectively indicates the number of reads of the covered fragment s in the cancer sample and the number of reads of the covered fragment s in the matched normal sample, N t represents the total number of reads of the cancer sample whole genome sequencing, and N n represents the corresponding normal sample whole genome sequencing obtained read The total number.
Heterozygous Germline Single Nucleotide Variants(HGSNV):杂合生殖系细胞单碱基变异,由于人类染色体属于二倍体,体细胞均由胚胎细胞发育而来,而生殖细胞中HGSNV位点只有两种碱基类型A和B,其中一种来源于父本,另一种来源于母本。Heterozygous Germline Single Nucleotide Variants (HGSNV): Single base variation in heterozygous germline cells. Since human chromosomes are diploid, somatic cells are derived from embryonic cells, while HGSNV sites in germ cells have only two base types. A and B, one of which comes from the father and the other from the mother.
Major Allele Fraction(MAF):主要等位基因(allele)分数,本发明中使用的HGSNV只有两种等位基因,一种等位基因与参考基因组相同,另一种与参考基因组不同。这两种等位基因型分数的计算方法为覆盖某一等位基因的read数量除以覆盖该位点总read数量的比值,MAF就是两种等位基因分数中的较大值。计算公式如(1.1)所示,n r为包含与参考基因组相同等位基因的read数量,n a为包含另一种等位基因的read的数量,n t表示覆盖该HGSNV位点的总read数量,C为该HGSNV的MAF值。MAF是相对于HGSNV的概念,本发明中“片段的MAF”指片段内所有HGSNV的MAF均值,“peak的MAF”指peak中所有片段包含的HGSNV的MAF均值。 Major Allele Fraction (MAF): The major allele score. The HGSNV used in the present invention has only two alleles, one allele is identical to the reference genome and the other is different from the reference genome. The two allelic scores are calculated by covering the number of reads of an allele divided by the ratio of the total number of reads covering the site, and MAF is the larger of the two allelic scores. The calculation formula is as shown in (1.1), where n r is the number of reads containing the same allele as the reference genome, n a is the number of reads containing another allele, and n t is the total read covering the HGSNV site. The quantity, C is the MAF value of the HGSNV. MAF is a concept relative to HGSNV. In the present invention, "MAF of a fragment" refers to the MAF mean of all HGSNVs in a fragment, and "MAF of peak" refers to the MAF mean of HGSNV contained in all fragments in a peak.
Figure PCTCN2018078908-appb-000004
Figure PCTCN2018078908-appb-000004
major allele copy number:主要等位基因拷贝数,指在拷贝数为i的片段中,主要等位基因拷贝数的取值,它的取值范围为大于等于
Figure PCTCN2018078908-appb-000005
的整数。
Major allele copy number: the copy number of the major allele, which refers to the value of the copy number of the major allele in the fragment with copy number i, which is greater than or equal to
Figure PCTCN2018078908-appb-000005
The integer.
peak:指基因组所有window的TRE分布中,聚集在一起的TRE簇。如附图1所示,图A表示基因组上所有window的TRE分布,纵轴表示对应某TRE位点的window 总数量,该图为基因组GC含量校正之前的TRE分布,图B表示GC含量校正之后的TRE分布,图B中可以看到window明显以簇聚集,本方法将通过类自回归模型鉴定出来的TRE簇定义为peak,本质上是具有相同拷贝的基因组片段内window的聚集。Peak: refers to the TRE clusters that are clustered together in the TRE distribution of all windows in the genome. As shown in Figure 1, Figure A shows the TRE distribution of all windows on the genome, and the vertical axis shows the total number of windows corresponding to a TRE site. The figure shows the TRE distribution before the genomic GC content correction. Figure B shows the GC content correction. The TRE distribution, in Figure B, can be seen that the window is obviously clustered. This method defines the TRE cluster identified by the autoregressive model as peak, which is essentially the aggregation of the window within the genome segment with the same copy.
癌症样本:指从某患癌症的个体身上取下的癌症组织,它包含了一部分癌症细胞和一部分正常细胞。Cancer sample: A cancer tissue taken from an individual with cancer, which contains a portion of cancer cells and a portion of normal cells.
P:指两个相邻peak之间的间距,由于peak是一个簇,这里peak的TRE由peak的TRE均值来表示,所以实际上是两个相邻peak的TRE均值的差。由于peak是具有相同拷贝数基因组片段内window的聚集,所以这里也表述为相邻拷贝数片段TRE的差值。P: refers to the spacing between two adjacent peaks. Since peak is a cluster, the TRE of peak here is represented by the TRE mean of peak, so it is actually the difference of the TRE mean of two adjacent peaks. Since peak is an aggregation of windows within the same copy number genome segment, it is also expressed here as the difference of adjacent copy number segments TRE.
发明目的Purpose of the invention
本发明的目的在于克服现有技术的缺陷,提供一种全自动、高效率、高准确性的计算癌症样本纯度和染色体倍性的方法和装置。本发明在癌症样本纯度和染色体倍性计算上具有广阔的运用前景。The object of the present invention is to overcome the deficiencies of the prior art and to provide a fully automatic, high efficiency, high accuracy method and apparatus for calculating cancer sample purity and chromosome ploidy. The invention has broad application prospects in the calculation of cancer sample purity and chromosome ploidy.
技术方案Technical solutions
为实现上述发明目的,本发明采取的技术方案为:通过癌症样本和匹配的正常样本的全基因组测序数据,对不同拷贝数片段的TRE和HGSNV的MAF分布构建混合高斯模型,计算癌症样本纯度和染色体倍性。In order to achieve the above object, the technical solution adopted by the present invention is to construct a mixed Gaussian model of the MAF distribution of TRE and HGSNV of different copy number fragments by using whole genome sequencing data of cancer samples and matched normal samples, and calculate the purity of the cancer sample and Chromosome ploidy.
本发明主要运用了全基因组测序数据的TRE信息和HGSNV的MAF信息。TRE基本反映了癌症样本拷贝数变异情况,HGSNV的MAF信息基本反映了癌症样本的基因型。The invention mainly uses the TRE information of the whole genome sequencing data and the MAF information of the HGSNV. TRE basically reflects the copy number variation of cancer samples, and the MAF information of HGSNV basically reflects the genotype of cancer samples.
TRE的差别主要来源于基因组片段的拷贝数差异,高拷贝数基因组片段内测序获得的read数量一定大于低拷贝数基因组片段测序获得的read数量,通过片段内read数量差异计算片段拷贝数差异是基因组拷贝数检测中的常用方法。但是大多数研究中,直接使用癌症样本片段内read数量除以正常样本该片段内read数量的比值(ratio)来进行read数量差异评估。本发明使用公式(1)所示的TRE来评估片不同片段的read数量差异。传统方法计算所得ratio不仅受癌症样本纯度与染色体倍性的影响,还受到癌症样本和正常样本测序深度的影响,而TRE不会受到样本测序深度的影响。The difference in TRE is mainly due to the difference in copy number of genomic fragments. The number of reads obtained by sequencing in high copy number genomic fragments must be greater than the number of reads obtained by sequencing of low copy number genomic fragments. The difference in copy number of fragments is calculated by the difference in the number of reads in the fragments. Common methods in genome copy number detection. However, in most studies, the difference in the number of reads in the cancer sample segment was divided by the ratio of the number of reads in the normal sample to calculate the difference in the number of reads. The present invention uses the TRE shown in the formula (1) to evaluate the difference in the number of reads of different segments of the slice. The calculated ratio of the traditional method is not only affected by the purity and chromosome ploidy of the cancer sample, but also by the depth of sequencing of the cancer sample and the normal sample, and the TRE is not affected by the depth of the sample sequencing.
单独依赖read数量差异,无法确定各拷贝数片段的基因型,更重要的是无法区分样本纯度与样本倍性的补偿效应。而结合拷贝数差异片段内的HGSNV可以提供基因型信息,并帮助解决纯度与倍性的补偿效应,然而在前人的研究中,并没有高效利用HGSNV的方法,大部分方法采用枚举的方式将不同拷贝数片段可能对应的基因型一一列出,然后对排列组合的结果进行计算,挑选最可信的结果。这些方法共同的特点是,方法计算时间长,准确性差,对拷贝数很高或基因组变化较大的样本效果很差。本发明依据HGSNV的MAF与TRE的混合高斯模型计算癌症样本纯度和染色体倍性,能显著减少计算时间,并提高计算结果准确率。The genotype of each copy number fragment cannot be determined by relying solely on the difference in the number of reads, and more importantly, the compensation effect of sample purity and sample ploidy cannot be distinguished. HGSNV combined with copy number difference fragments can provide genotype information and help to solve the compensation effect of purity and ploidy. However, in previous studies, there is no efficient use of HGSNV, and most methods use enumeration. The genotypes that may correspond to different copy number fragments are listed one by one, and then the results of the permutation and combination are calculated to select the most reliable results. The common feature of these methods is that the method has a long calculation time and poor accuracy, and the sample with high copy number or large genomic variation has a poor effect. The invention calculates the purity and chromosome ploidy of the cancer sample according to the mixed Gaussian model of MAF and TRE of HGSNV, can significantly reduce the calculation time and improve the accuracy of the calculation result.
假设某癌症样本的纯度为γ,那么癌症样本中的正常细胞比例为1-γ。癌症样本中正常细胞的染色体倍性为2,癌症细胞的染色体倍性为κ。那么癌症样本的染色体倍性ω如下公式(2)所示。Assuming that the purity of a cancer sample is γ, the proportion of normal cells in the cancer sample is 1-γ. The normal cell has a chromosome ploidy of 2 in the cancer sample, and the cancer cell has a chromosome ploidy of κ. Then the chromosome ploidy ω of the cancer sample is as shown in the formula (2).
ω=(1-γ)×2+γ×κ   (2)ω=(1-γ)×2+γ×κ (2)
假设在癌症细胞中某一片段S的拷贝数为C S。那么癌症样本的片段S的拷贝数C t应该为如下公式(3)所示。 It is assumed that the copy number of a certain fragment S in a cancer cell is C S . Then the copy number C t of the fragment S of the cancer sample should be as shown in the following formula (3).
C t=(1-γ)×2+γ×C s   (3) C t =(1-γ)×2+γ×C s (3)
对于基因组片段S,TRE的计算方式如公式(1)所示。而TRE的期望值(expectation)E(e s)的推导公式如下公式(4)所示,式中的
Figure PCTCN2018078908-appb-000006
N n和N t与公式(1)中含义相同。
For the genomic fragment S, the TRE is calculated as shown in equation (1). The derivation formula of the expectation E(e s ) of TRE is as shown in the following formula (4),
Figure PCTCN2018078908-appb-000006
N n and N t have the same meanings as in formula (1).
Figure PCTCN2018078908-appb-000007
Figure PCTCN2018078908-appb-000007
为了更进一步引出e s,本方法定义了一些帮助理解的参数。片段S的长度L S,人类参考基因组的长度L gw,癌症样本的测序深度
Figure PCTCN2018078908-appb-000008
正常样本的测序深度
Figure PCTCN2018078908-appb-000009
那么片段S在癌症样本中测序深度为
Figure PCTCN2018078908-appb-000010
片段S在正常样本中测序深度为
Figure PCTCN2018078908-appb-000011
λ S是指与片段S特性(如GC含量等引起测序深度偏好的特性)有关的参数,所以在癌症和正常样本中是一样的。进一步通过γ,κ,C s来表示e s,如公式(5)所示。
To further drawn e s, the method defines a parameter of some help understanding. Length of fragment S L S , length of human reference genome L gw , depth of sequencing of cancer samples
Figure PCTCN2018078908-appb-000008
Sampling depth of normal samples
Figure PCTCN2018078908-appb-000009
Then the fragment S is sequenced in a cancer sample to a depth of
Figure PCTCN2018078908-appb-000010
Fragment S is sequenced in a normal sample to a depth of
Figure PCTCN2018078908-appb-000011
λ S refers to a parameter related to the characteristics of the segment S (such as the GC content and the like which cause the depth of preference of the sequencing), so it is the same in cancer and normal samples. Further, e s is represented by γ, κ, C s as shown in the formula (5).
Figure PCTCN2018078908-appb-000012
Figure PCTCN2018078908-appb-000012
公式(5)中C s表示在癌症细胞中片段S的拷贝数,那么当片段S的拷贝数为i时和i+1时对应的TRE均值S i和S i+1分别如公式(6)和公式(7)所示: In formula (5), C s represents the copy number of the fragment S in the cancer cell, then the corresponding TRE mean S i and S i+1 when the copy number of the fragment S is i and i+1 are as shown in the formula (6), respectively. And formula (7):
Figure PCTCN2018078908-appb-000013
Figure PCTCN2018078908-appb-000013
Figure PCTCN2018078908-appb-000014
Figure PCTCN2018078908-appb-000014
通过公式(6)和公式(7),对于相邻的拷贝数对应的片段,它们的TRE的差值P如公式(8)所示,可见P值的大小与片段具体拷贝数没有关系,它只决定于癌症样本纯度和染色体倍性。通过附图2可以直观的看到在TRE分布图中,peak之间的距离是恒定的。By formula (6) and formula (7), for the segments corresponding to adjacent copy numbers, the difference P of their TRE is as shown in formula (8), and it can be seen that the magnitude of the P value has nothing to do with the specific copy number of the fragment. It is only determined by the purity and chromosome ploidy of cancer samples. It can be visually seen from Figure 2 that the distance between peaks is constant in the TRE profile.
Figure PCTCN2018078908-appb-000015
Figure PCTCN2018078908-appb-000015
此外,对于i=2即拷贝数为2的片段,它们的TRE值Q如公式9所示。附图2中Q对应的peak的TRE值略大于1。Further, for segments where i=2, that is, a copy number of 2, their TRE values Q are as shown in Equation 9. The TRE value of the peak corresponding to Q in FIG. 2 is slightly larger than 1.
Figure PCTCN2018078908-appb-000016
Figure PCTCN2018078908-appb-000016
通过上述公式(8)和(9),可以解得癌症样本的纯度(γ)和染色体倍性(κ)分别为:Through the above formulas (8) and (9), the purity (γ) and ploidy (κ) of the cancer sample can be solved as follows:
Figure PCTCN2018078908-appb-000017
Figure PCTCN2018078908-appb-000017
Figure PCTCN2018078908-appb-000018
Figure PCTCN2018078908-appb-000018
通过以上分析可以得知,通过确定P和Q可以计算出癌症样本纯度γ和染色体倍性κ。From the above analysis, it can be known that the cancer sample purity γ and chromosome ploidy κ can be calculated by determining P and Q.
如附图2所示,计算出全基因组所有片段的TRE分布后,可以计算peak间的间距得到P。在前人的研究方法中,patchwork[12]使用相邻拷贝数片段的对应的read数量的比值的间距来辅助计算癌症样本纯度,但是该研究无法自动识别read数量比值之间的间距,需要人工识别图像确定read数量比值间距来进行下一步计算,效率和准确性都比较低。本发明开创性的使用类自回归模型鉴定TRE之间的间距,如公式(12)、(13)所示。公式(12)和(13)中,X t表示0到M t之间的TRE值;t表示扩大了1000倍的TRE值;M t表示TRE的最大值;P表示两个TRE位点的间隔;C(X t)表示在TRE为X t的位点,对应的window数量;C(X t+1000×P)表示在TRE为X t+1000×P的位点,对应的window数量;Y(P)表示在P下,类自回归模型的函数值;显而易见当P=0时,Y(P)的取值最大,但这时候的P并不是实际peak之间的间距。 As shown in Figure 2, after calculating the TRE distribution of all fragments of the whole genome, the spacing between peaks can be calculated to obtain P. In previous research methods, patchwork [12] used the ratio of the ratio of the corresponding number of reads of adjacent copy number segments to assist in the calculation of cancer sample purity, but the study could not automatically identify the gap between the read number ratios, requiring manual The recognition image determines the read number ratio interval for the next calculation, and the efficiency and accuracy are relatively low. The inventive inventive use of an autoregressive model to identify the spacing between TREs is shown in equations (12) and (13). In equations (12) and (13), X t represents the TRE value between 0 and M t ; t represents a TRE value that is expanded by 1000 times; M t represents the maximum value of TRE; P represents the interval of two TRE sites. ; C(X t ) denotes the number of windows corresponding to the position where TRE is X t ; C(X t+1000×P ) denotes the position where TRE is X t+1000×P , corresponding to the number of windows; Y (P) represents the function value of the autoregressive model under P; it is obvious that when P=0, Y(P) takes the largest value, but P at this time is not the spacing between actual peaks.
Figure PCTCN2018078908-appb-000019
Figure PCTCN2018078908-appb-000019
Figure PCTCN2018078908-appb-000020
Figure PCTCN2018078908-appb-000020
以0.001为分辨率,遍历0到1之间的所有P值,然后求Y(P)。Y(P)的值分布如附图3所示。根据公式(13)的特点,我们可以知道,当P等于0时,Y(P)的值会是最大的,但此时的P并不是peak之间的间距。我们选择图中第二高峰中Y(P)的最大值对应的x轴坐标值P作为peak之间的间距P的计算结果。With a resolution of 0.001, traverse all P values between 0 and 1, then find Y(P). The value distribution of Y(P) is as shown in FIG. According to the characteristics of formula (13), we can know that when P is equal to 0, the value of Y(P) will be the largest, but P at this time is not the spacing between peaks. We select the x-axis coordinate value P corresponding to the maximum value of Y(P) in the second peak in the figure as the calculation result of the pitch P between peaks.
如图1中B图所示的peak之所以簇状分布,是因为具有相同拷贝数的基因组片段的TRE值(指片段内所有window的TRE的均值)并不完全相等,同拷贝数片段TRE相互之间存在误差。该误差服从高斯分布,所以图B中的簇状分布被认为是高斯分布。The peak distribution of the peak shown in Figure B is because the TRE value of the genomic fragment with the same copy number (the mean value of the TRE of all windows in the fragment) is not completely equal, and the copy number fragment TRE is mutually There is an error between them. This error obeys a Gaussian distribution, so the clustered distribution in Figure B is considered to be a Gaussian distribution.
如附图2所示,P确定以后,peak会被识别出来,但是有部分基因组片段没有落在识别出的peak上,这些片段被称作亚克隆片段(subclone segmentation)。在考虑亚克隆片段的情况下,会对后面公式(17)和(18)所示的高斯模型取值有影响,进而会影响最终混合高斯模型的取值。由于在后续的分析中,本发明只需考虑落在peak位点的片段,由此排除了亚克隆片段的干扰。As shown in Figure 2, after P is determined, peak will be recognized, but some of the genomic fragments do not fall on the recognized peak. These fragments are called subclone segmentation. In the case of considering subcloned fragments, there will be an influence on the values of the Gaussian models shown in the following formulas (17) and (18), which in turn will affect the value of the final mixed Gaussian model. Since in the subsequent analysis, the present invention only needs to consider the fragment falling at the peak position, thereby eliminating the interference of the subcloned fragment.
在如图2所示的TRE分布图中,Q位点表示拷贝数为2的片段对应的TRE值。首先 我们可以推测,如果癌症细胞基因组中,存在部分片段的拷贝数为1,部分片段的拷贝数为0,那么在Q位点之前应该存在两个peak分别对应拷贝数1和0。如果不存在拷贝数为1的片段,只存在拷贝数为0的片段,那么在Q位点之前距离2P的位点存在一个peak,而在Q位点之前距离P的位点peak的window数为0,这也就是图2所示的情形。另一种情况假若拷贝数为1和0的片段都没有,那么在Q位点之前距离P和距离2P的位点对应的window数都为0。那么对于TRE的分布图,对于X f,即第一个出现的peak,它可能对应的拷贝数为2(拷贝数1和0的peak对应的window就是0),也可能对应的拷贝数是1(拷贝数为0的片段对应peak的window为0),也有可能对应的拷贝数是0。 In the TRE profile as shown in FIG. 2, the Q site represents the TRE value corresponding to the segment of copy number 2. First, we can speculate that if the copy number of a partial fragment in the cancer cell genome is 1, and the copy number of the partial fragment is 0, then there should be two peaks corresponding to the copy number 1 and 0 before the Q site. If there is no fragment with a copy number of 1, there is only a fragment with a copy number of 0, then there is a peak at a distance of 2P before the Q site, and the number of windows at the distance peak of the distance P before the Q site is 0, this is the situation shown in Figure 2. In the other case, if there are none of the segments with copy numbers 1 and 0, then the number of windows corresponding to the distance P and the distance 2P before the Q site is 0. Then for the distribution map of TRE, for X f , that is, the first peak that appears, it may correspond to a copy number of 2 (the copy corresponding to the peak of copy number 1 and 0 is 0), and the corresponding copy number may be 1 (The fragment with a copy number of 0 corresponds to the window of the peak is 0.) It is also possible that the corresponding copy number is 0.
通过上述分析,我们可以知道,图2中的第一个peak即X f对应的片段的拷贝数有几种不同的可能,而每一种可能都会使Q对应不同的peak。本发明通过混合高斯模型计算出了最可能的X f对应的片段的拷贝数,从而确定了Q的取值,最终得到了癌症样本纯度和染色体倍性。首先我们需要鉴定出X f对应片段的拷贝数的所有可能值。本发明通过如下公式(13.1)来确定X f的值。(13.1)中,C(X f+P)表示在TRE为X f+P的位点,对应的window数量,n表示M t以内peak的最大数量。当f(X f)取最大值时X f为第一个peak的TRE均值。 From the above analysis, we can know that the copy number of the first peak in Figure 2, that is, the X f corresponding segment has several different possibilities, and each of them may make Q correspond to a different peak. The present invention calculates the copy number of the most likely X f corresponding segment by mixing the Gaussian model, thereby determining the value of Q, and finally obtaining the purity and chromosome ploidy of the cancer sample. First we need to identify all possible values for the copy number of the corresponding fragment of X f . The present invention determines the value of X f by the following formula (13.1). In (13.1), C(X f +P) represents the number of windows corresponding to the position where TRE is X f + P, and n represents the maximum number of peaks within M t . When f(X f ) takes the maximum value, X f is the TRE mean of the first peak.
Figure PCTCN2018078908-appb-000021
Figure PCTCN2018078908-appb-000021
然后使用公式(13.2)求X f之前最多可能有几个peak。其中X f表示第一个peak的TRE均值,P表示相邻拷贝数片段对应的peak之间的间距,floor表示向下取整数,当N=0时,表示X f之前没有peak,X f对应的片段拷贝数为0;当N=1时,表示X f之前最多可能有1个peak,也可能没有peak;当N=2时,表示X f之前最多可能有2个peak,也可能只有一个peak或者没有peak; Then use Equation (13.2) to find up to a maximum of several peaks before X f . Wherein X f represents a mean peak of TRE, P denotes the distance between the copy number of segments corresponding to an adjacent peak, floor represents rounded down, when N = 0, indicating no prior peak X f, X f corresponds The fragment copy number is 0; when N=1, it means that there may be at most 1 peak before X f , or there may be no peak; when N=2, it means there may be up to 2 peaks before X f , or maybe only one Peak or no peak;
Figure PCTCN2018078908-appb-000022
Figure PCTCN2018078908-appb-000022
对于X f之前可能有(1,2,3...N)个peak的情形,每一种情形下都可以通过如下公式(13.3)计算出一个对应的Q值。根据Q的定义,我们知道Q为拷贝数为2的片段的peak对应的TRE值。首先可以推断拷贝数为0的片段对应的TRE值为X f-n×P,其中n表示X f之前peak的个数,取值范围是0到N之间的整数,P表示相邻拷贝数片段对应的peak之间的间距,X f含义与公式(13.1)中相同,公式(13.3)如下所示,其中Q n表示在X f之前存在n个peak时,Q的取值。 For the case where there may be (1, 2, 3...N) peaks before X f , in each case, a corresponding Q value can be calculated by the following formula (13.3). According to the definition of Q, we know that Q is the TRE value corresponding to the peak of the fragment with a copy number of 2. First, it can be inferred that the TRE value of the fragment with a copy number of 0 is X f -n×P, where n represents the number of peaks before X f , the value ranges from 0 to N, and P represents the adjacent copy number. The spacing between the peaks corresponding to the segments, the meaning of X f is the same as in the formula (13.1), and the formula (13.3) is as follows, where Q n represents the value of Q when there are n peaks before X f .
Q n=X f-n×P+2×P=X f+(2-n)×P,n∈[0,N]  (13.3) Q n =X f -n×P+2×P=X f +(2-n)×P,n∈[0,N] (13.3)
根据以上分析,可以得到对于X f之前可能有(0,1,2,3...N)个peak的情形时,Q n取值可能为(Q 0,Q 1,Q 2,Q 3...Q N)。而前面的类自回归模型已经计算出了P,那么对于每一个 可能的Q,我们可以通过公式(10)和(11)计算出相应的γ和κ。本发明通过混合高斯模型计算出X f之前最可能的peak数量n,从而确定了X f对应的片段的拷贝数,进而确定了Q的取值,最终得到了癌症样本纯度和染色体倍性。具体方法如下说明。 According to the above analysis, it can be obtained that there may be ( 0 , 1 , 2 , 3...N) peaks before X f , and the value of Q n may be ( Q 0 , Q 1 , Q 2 , Q 3 . ..Q N ). While the previous autoregressive model has calculated P, then for each possible Q, we can calculate the corresponding γ and κ by equations (10) and (11). The invention calculates the most likely peak number n before X f by mixing the Gaussian model, thereby determining the copy number of the fragment corresponding to X f , and then determining the value of Q, and finally obtaining the purity and chromosome ploidy of the cancer sample. The specific method is explained as follows.
对于Q n的每一个可能的取值,结合P,我们可以计算得到相应的γ,然后计算各个拷贝数片段内HGSNV的MAF的理论值。其理论计算方式如公式(14)所示。C mcp表示主要等位基因的拷贝数(major allele copy number),C cp表示peak的整体拷贝数。f表示该peak内MAF的理论值。 For each possible value of Q n , in combination with P, we can calculate the corresponding γ and then calculate the theoretical value of the MAF of HGSNV in each copy number segment. The theoretical calculation method is shown in formula (14). C mcp represents the major allele copy number, and C cp represents the overall copy number of peak. f represents the theoretical value of the MAF within the peak.
Figure PCTCN2018078908-appb-000023
Figure PCTCN2018078908-appb-000023
但是在实际情况下当测序深度比较低时,f与真实值(期望值)会有较大的误差。这里需要进一步校正f,校正方法如公式(15)、(16)所示。式(15)中m是某peak内所有window中read数量的均值,v是peak内所有window中read数量的方差,所求得的p是用于负二项分布的随机变量成功的概率,r为随机变量失败的次数,这里的随机变量d为测序获取到的测序深度(read coverage)。However, in actual cases, when the sequencing depth is relatively low, f has a large error with the true value (expected value). Here, further correction f is required, and the correction method is as shown in equations (15) and (16). In equation (15), m is the mean of the number of reads in all windows in a peak, and v is the variance of the number of reads in all windows in the peak. The obtained p is the probability of success of the random variable used in the negative binomial distribution, r For the number of times a random variable fails, the random variable d here is the read coverage obtained by sequencing.
Figure PCTCN2018078908-appb-000024
Figure PCTCN2018078908-appb-000024
在某一个测序深度即d下,MAF实际上是服从以f为概率、d为实验次数的二项分布。本发明以如下公式(16)对f进行校正,得到MAF的期望值f b。公式中k表示在某个HGSNV位点,某一种等位基因(A或B)的数量,测得的等位基因总量为d(与测序深度相等)。 At a certain sequencing depth, d, the MAF actually obeys the binomial distribution with f as the probability and d as the number of experiments. The present invention corrects f by the following formula (16) to obtain the expected value f b of the MAF. In the formula, k represents the number of alleles (A or B) at a certain HGSNV site, and the total amount of alleles measured is d (equal to the sequencing depth).
Figure PCTCN2018078908-appb-000025
Figure PCTCN2018078908-appb-000025
公式(13.2)表明每个基因组片段的拷贝数有N种可能。公式(14)表明,某一拷贝数片段可以有多种主要等位基因拷贝数,于是对每个基因组peak,可以算出多个f,也可以算出多个f b,取距离peak实际观测MAF均值最近的f b作为该peak的MAF期望。而全基因组有多个peak,每个peak的MAF期望不同,对应算出多个MAF期望值{f b}。考虑到某一peak内HGSNV的MAF会存在一定误差但也近似服从高斯分布,peak内所有HGSNV的MAF期望值可以由实际数据直接计算获得。假设某peak的基因型一定,那么通过比较peak的MAF观测值与peak的{f b}的值就可以判断该peak的拷贝数和基因型。也就可以计算出拷贝数为2的peak对应的TRE值即Q的位置。同时也为了进一步校正P,本发明提出了一种混合高斯模型对TRE和HGSNV的观测数据进行拟合。 Equation (13.2) indicates that there are N possibilities for the copy number of each genomic fragment. Equation (14) shows that a certain copy number fragment can have multiple major allele copy numbers, so for each genome peak, multiple f can be calculated, and multiple f b can be calculated, and the average observed MAF value is taken from the distance peak. The most recent f b is expected as the MAF of the peak. The whole genome has multiple peaks, and the MAF expectation of each peak is different, corresponding to calculating multiple MAF expected values {f b }. Considering that there is a certain error in the MAF of HGSNV in a certain peak, but also obeys the Gaussian distribution, the expected value of MAF of all HGSNV in peak can be directly calculated from the actual data. Assuming that the genotype of a certain peak is certain, the copy number and genotype of the peak can be judged by comparing the MAF observation of peak with the value of {f b } of peak. In other words, the TRE value corresponding to the peak of the copy number 2, that is, the position of Q can be calculated. At the same time, in order to further correct P, the present invention proposes a hybrid Gaussian model to fit the observation data of TRE and HGSNV.
通过前面的分析可以得知因为公式(12)中ε t的存在,X t并不能十分准确的代表各个peak的TRE均值。TRE的高斯分布模型如公式17所示: It can be seen from the previous analysis that X t does not accurately represent the TRE mean of each peak because of the existence of ε t in equation (12). The Gaussian distribution model of TRE is shown in Equation 17:
Figure PCTCN2018078908-appb-000026
Figure PCTCN2018078908-appb-000026
其中,L(e s;γ,κ)表示基因组片段TRE的似然函数。N表示基因组上的所有window的数量。I表示基因组中所有片段的最大的拷贝数。σ i表示拷贝数为i的所有片段的TRE的标准差。e s为第s个window的TRE观测值,S i表示第i个peak的TRE均值。p i表示第s个window的拷贝数为i的权重,本公式中对所有的i,p i均取值为1。该公式表明似然函数的大小与S i的取值相关,当S i与e S越接近,似然函数的取值越大,同时也表示P值越接近真实值。使用L(e s;γ,κ)的极大似然估计可以计算出较合理的P值。 Where L(e s ; γ, κ) represents the likelihood function of the genomic fragment TRE. N represents the number of all windows on the genome. I represents the largest copy number of all fragments in the genome. σ i represents the standard deviation of the TRE of all segments of copy number i. e s is the TRE observation of the sth window, and S i represents the TRE mean of the ith peak. p i represents the weight of the copy number of the sth window, i, and all i, p i in this formula take the value 1. The formula shows that the size of the likelihood function is related to the value of S i . When S i and e S are closer, the value of the likelihood function is larger, and the closer the P value is to the true value. A reasonable P value can be calculated using the maximum likelihood estimation of L(e s ; γ, κ).
然而在部分情况下,P在较小的区间内波动时(如[-0.005,0.005]),对应的似然函数值可能相同。本发明通过结合HGSNV的高斯分布模型如公式(18)所示,来更近一步确定P值,同时确定Q值。However, in some cases, when P fluctuates within a small interval (such as [-0.005, 0.005]), the corresponding likelihood function values may be the same. The present invention further determines the P value and determines the Q value by combining the Gaussian distribution model of HGSNV as shown in the formula (18).
Figure PCTCN2018078908-appb-000027
Figure PCTCN2018078908-appb-000027
其中,L(f s;γ,κ)表示HGSNV的似然函数。M表示基因组中所有HGSNV。S表示第S个HGSNV。I表示基因组中所有片段的最大的拷贝数。F i,j为拷贝数为i、主要等位基因的拷贝数为j的片段内HGSNV的MAF期望值,即公式(16)计算出的f b。f s表示该片段内MAF的观测值均值,σ i,j表示该片段内HGSNV的MAF观测值的标准差。p i,j表示在主要等位基因的拷贝数为j时,高斯分布的权重,对所有的i和j,p i,j取值均为1。p i表示第S个HGSNV所在片段的拷贝数为i的权重,对所有的i,p i取值均为1。公式(18)表明,似然函数的大小与F i,j值相关,当F i,j与f s越接近,似然函数越大,表明f s越准确,同时表明公式(14)中的f越准确,从而可以得到各片段对应的C cp与C mcp。于是确定了Q的取值。为了得到最准确的P与Q,本方法将公式(17)和公式(18)相加,得到混合高斯模型。 Where L(f s ; γ, κ) represents the likelihood function of HGSNV. M represents all HGSNV in the genome. S represents the Sth HGSNV. I represents the largest copy number of all fragments in the genome. F i,j is the expected value of MAF of HGSNV in the fragment whose copy number is i and the copy number of the major allele is j, that is, f b calculated by the formula (16). f s represents the mean value of the observed values of the MAF within the segment, and σ i,j represents the standard deviation of the MAF observations of the HGSNV within the segment. p i,j denotes the weight of the Gaussian distribution when the copy number of the major allele is j, and the value of all i and j, p i,j is 1. p i represents the weight of the copy number of the segment where the Sth HGSNV is located, and the value of all i, p i is 1. Equation (18) shows that the size of the likelihood function is related to the value of F i,j . When F i,j is closer to f s , the larger the likelihood function, the more accurate f s is , and the formula (14) is also shown. The more accurate f is, the corresponding C cp and C mcp of each fragment can be obtained. Then the value of Q is determined. In order to get the most accurate P and Q, the method adds equation (17) and formula (18) to obtain a mixed Gaussian model.
但对该混合模型统计计算容易发生模型过拟合现象。本发明进一步使用了贝叶斯信息准则(Bayesian Information Criterion,BIC)方法,给混合高斯模型一个罚分函数,用于控制模型的过拟合,最终混合高斯模型如公式(19)所示:However, the statistical calculation of the hybrid model is prone to model overfitting. The present invention further uses a Bayesian Information Criterion (BIC) method to give a mixed Gaussian model a penalty function for controlling over-fitting of the model. The final mixed Gaussian model is shown in equation (19):
BIC(e s,f s;γ,κ)=-2×log L(f s;γ,κ)-2×log L(e s;γ,κ)+I×log(N)+J×log(M)  (19) BIC(e s ,f s ;γ,κ)=-2×log L(f s ;γ,κ)-2×log L(e s ;γ,κ)+I×log(N)+J×log (M) (19)
其中,BIC(S s,f s;γ,κ)表示混合模型的似然函数,I是公式(17)中高斯分布的个数,J是公式(18)中高斯分布的个数。N是基因组中窗口的数量,M是基因组中HGSNV的个数。 Among them, BIC(S s , f s ; γ, κ) represents the likelihood function of the mixed model, I is the number of Gaussian distributions in equation (17), and J is the number of Gaussian distributions in equation (18). N is the number of windows in the genome, and M is the number of HGSNVs in the genome.
通过遍历[P-0.02,P+0.02]的区间,对所有的(P,Q n)求极大似然估计,可以得到最合适的P值与Q值然后根据公式(10)和(11)可计算癌症样本的纯度和染色体倍性。 By traversing the interval of [P-0.02, P+0.02], the maximum likelihood estimation is obtained for all (P, Q n ), and the most suitable P value and Q value can be obtained and then according to formulas (10) and (11). The purity and chromosome ploidy of the cancer sample can be calculated.
因此,本发明一方面提供了一种用于计算癌症样本中癌症细胞纯度和染色体倍性的方法,所述方法包括以下步骤:Accordingly, one aspect of the invention provides a method for calculating cancer cell purity and ploidy ploidy in a cancer sample, the method comprising the steps of:
步骤A:Step A:
获取配对的(来自同一癌症病人的)癌症组织样本和正常组织样本的全基因组测序 (WGS)数据,并将测序数据比对到参考基因组;Obtaining genome-wide sequencing (WGS) data from paired cancer tissue samples (from the same cancer patient) and normal tissue samples, and comparing the sequencing data to the reference genome;
步骤B:Step B:
从步骤A得到的比对结果文件中,提取read位置和长度信息,HGSNV位点和覆盖该位点的read数量信息,计算所有HGSNV的MAF,其中,计算公式如(1.1)所示:From the comparison result file obtained in step A, the read position and length information, the HGSNV site and the read quantity information covering the site are extracted, and the MAF of all HGSNVs is calculated, wherein the calculation formula is as shown in (1.1):
Figure PCTCN2018078908-appb-000028
Figure PCTCN2018078908-appb-000028
公式(1.1)中,n r为包含与参考基因组相同等位基因的read数量,n a为包含另一种等位基因的read的数量,n t表示覆盖该HGSNV位点的总read数量,C为该HGSNV的MAF值; In formula (1.1), n r is the number of reads containing the same allele as the reference genome, n a is the number of reads containing another allele, and n t is the total number of reads covering the HGSNV site, C The MAF value of the HGSNV;
步骤C:Step C:
根据步骤B得到的read位置和长度信息,以window为单位统计各window内包含的read数量,使用基因组GC含量校正所有window内read数量;According to the read position and length information obtained in step B, the number of reads contained in each window is counted in units of window, and the number of reads in all windows is corrected by using the genomic GC content;
步骤D:Step D:
使用步骤C校正后的read数量,使用公式(1)计算每一个window的TRE,然后运用TRE,通过BIC-seq软件对基因组进行片段化,获得以拷贝数划分的基因组片段:Using the number of reads corrected in step C, the TRE of each window is calculated using equation (1), and then the genome is fragmented by BIC-seq software using TRE to obtain a genomic fragment divided by copy number:
Figure PCTCN2018078908-appb-000029
Figure PCTCN2018078908-appb-000029
公式(1)中,
Figure PCTCN2018078908-appb-000030
Figure PCTCN2018078908-appb-000031
分别表示在癌症样本中覆盖片段s(这里指window)的read数量和在正常样本中覆盖片段s的read数量,N t表示癌症样本总read数量,N n表示相应正常样本总read数量,e s为TRE值;
In formula (1),
Figure PCTCN2018078908-appb-000030
with
Figure PCTCN2018078908-appb-000031
Represents the number of reads in the cancer sample covering the segment s (here the window) and the number of reads covering the segment s in the normal sample, N t represents the total number of reads of the cancer sample, and N n represents the total number of reads of the corresponding normal sample, e s Is the TRE value;
步骤E:Step E:
以步骤D中BIC-seq处理后的基因组片段为单位,统计片段内所有window的TRE的均值、方差和该片段内window数量,根据均值和方差对基因组每个片段的window数量进行平滑化(smooth)处理,使TRE的分布更均匀,然后将平滑化处理后所有片段的window分布汇总,得到基因组上window随TRE变化的分布结果;同时以片段为单位,计算片段中所有HGSNV的MAF的均值和方差;Based on the genomic fragments treated by BIC-seq in step D, the mean, variance, and number of windows in the window of all windows in the segment are counted, and the number of windows per segment of the genome is smoothed according to the mean and variance (smooth Processing, the distribution of TRE is more uniform, and then the window distribution of all segments after smoothing is summarized to obtain the distribution result of window change with TRE on the genome; and the mean value of MAF of all HGSNVs in the segment is calculated in units of fragments. variance;
步骤F:Step F:
使用如公式(12)、(13)所示的类自回归模型,计算相邻拷贝数片段内TRE的差值即P,具体方法为遍历一定范围的P,计算Y(P),在Y(P)的分布中,选择第二高峰内Y(P)的最大值对应的P作为P的计算结果:Using the autoregressive model as shown in equations (12) and (13), calculate the difference of TRE in the adjacent copy number segment, that is, P, by traversing a certain range of P, and calculating Y(P), at Y ( In the distribution of P), the P corresponding to the maximum value of Y(P) in the second peak is selected as the calculation result of P:
Figure PCTCN2018078908-appb-000032
Figure PCTCN2018078908-appb-000032
Figure PCTCN2018078908-appb-000033
Figure PCTCN2018078908-appb-000033
公式(12)和(13)中,X t表示0到M t之间的TRE值;t表示扩大了1000倍的TRE值;M t表示TRE的最大值;变量P表示两个TRE位点的间隔;C(X t)表示在TRE为X t的位点,对应的window数量;C(X t+1000×P)表示在TRE为X t+1000×P的位点,对应的window数量;Y(P)表示在变量P下,类自回归模型的函数值; In equations (12) and (13), X t represents the TRE value between 0 and M t ; t represents a TRE value that is expanded by 1000 times; M t represents the maximum value of TRE; and variable P represents the two TRE sites. Interval; C(X t ) represents the number of windows corresponding to the position where TRE is X t ; C(X t+1000×P ) represents the number of windows corresponding to the position where TRE is X t+1000×P ; Y(P) represents the function value of the auto-regressive model under the variable P;
步骤G:Step G:
根据步骤F得到的P,计算TRE分布中第一个实际观测peak的TRE均值,然后计算在第一个实际peak之前最多可能存在理论peak的数量N,最后当第一个实际peak之前存在n个理论peak时,计算Q的值,以Q n表示,其中步骤G可以包括: According to the P obtained in step F, calculate the TRE mean of the first actual observed peak in the TRE distribution, and then calculate the maximum number of theoretical peaks N before the first actual peak, and finally there are n before the first actual peak. In the theoretical peak, the value of Q is calculated, denoted by Q n , where step G may include:
G1:G1:
根据步骤F计算的P,使用公式(13.1),选取使公式(13.1)取最大值的X f作为第一个实际观测peak的TRE均值: According to the P calculated in step F, using formula (13.1), select X f which takes the maximum value of formula (13.1) as the TRE mean of the first actual observed peak:
Figure PCTCN2018078908-appb-000034
Figure PCTCN2018078908-appb-000034
公式(13.1)中,i表示第i个peak,C(X f+P×i)表示在TRE为X f+P×i的位点,对应的window数量,n表示M t以内peak的最大数量,M t表示TRE的最大值; In formula (13.1), i represents the ith peak, C(X f + P × i) represents the position where TRE is X f + P × i, the corresponding number of windows, and n represents the maximum number of peaks within M t , M t represents the maximum value of TRE;
G2:G2:
使用公式(13.2),根据步骤F计算的P和步骤G1计算的X f,计算在X f之前最多可能存在的peak数量N: Using formula (13.2), calculate the maximum number of peaks N that may exist before X f based on P calculated in step F and X f calculated in step G1:
Figure PCTCN2018078908-appb-000035
Figure PCTCN2018078908-appb-000035
公式(13.2)中,X f表示第一个peak的均值,P表示相邻拷贝数片段对应的peak之间的间距,floor表示向下取整数; In formula (13.2), X f represents the mean of the first peak, P represents the spacing between peaks corresponding to adjacent copy number segments, and floor represents an integer below;
G3:G3:
利用步骤G2计算的N值,当n取0到N之间的整数时,使用公式(13.3)计算Q n的值: Using the value of N calculated in step G2, when n takes an integer between 0 and N, the value of Q n is calculated using equation (13.3):
Q n=X f-n×P+2×P=X f+(2-n)×P,n∈[0,N]  (13.3) Q n =X f -n×P+2×P=X f +(2-n)×P,n∈[0,N] (13.3)
公式(13.3)中,n表示X f之前peak的数量,取值范围是0到N之间的整数,P表示相邻拷贝数片段对应的peak之间的间距,X f表示第一个实际观测peak的TRE均值,Q n表示在X f之前理论上存在n个peak时的Q值; In formula (13.3), n represents the number of peaks before X f , the value ranges from 0 to N, P represents the spacing between peaks of adjacent copy number segments, and X f represents the first actual observation. The TRE mean of peak, Q n represents the Q value when there are theoretically n peaks before X f ;
步骤H:Step H:
使用步骤F计算的P与步骤G计算的所有可能的Q n,使用公式(10)、(11)计算癌 症样本纯度γ和染色体倍性κ: Using the P calculated in step F and all possible Q n calculated in step G, the cancer sample purity γ and chromosome ploidy κ are calculated using equations (10), (11):
Figure PCTCN2018078908-appb-000036
Figure PCTCN2018078908-appb-000036
Figure PCTCN2018078908-appb-000037
Figure PCTCN2018078908-appb-000037
公式(10)、(11)中,γ表示样本纯度,κ表示染色体倍性,那么对所有的(P,Q N)都可以得到对应的(γ,κ); In formulas (10) and (11), γ represents the purity of the sample, and κ represents the ploidy of the chromosome, so that the corresponding (γ, κ) can be obtained for all (P, Q N );
步骤I:Step I:
当n取[0,N]之间的某个整数值时,使用公式(13.4)计算第i个peak的TRE均值:When n takes an integer value between [0, N], the formula (13.4) is used to calculate the TRE mean of the i-th peak:
T i=X f-n×P+i×P=X f+(i-n)×P,n∈[0,N]  (13.4) T i =X f -n×P+i×P=X f +(in)×P,n∈[0,N] (13.4)
公式(13.4)中,n表示X f之前peak的数量,取值范围是0到N之间的整数,P表示相邻拷贝数片段对应的peak之间的间距,X f表示第一个实际观测peak的TRE均值,T i表示第i个peak的TRE均值, In formula (13.4), n represents the number of peaks before X f , the value range is an integer between 0 and N, P represents the spacing between peaks corresponding to adjacent copy number segments, and X f represents the first actual observation. The TRE mean of peak, T i represents the TRE mean of the ith peak,
对于落在T i附近的片段,认为该片段具有拷贝数i;对于没有落在T i附近的片段,将其归类为亚克隆片段,在后续分析中剔除所有亚克隆片段;然后根据步骤H计算的癌症样本纯度γ和peak对应的拷贝数,可计算peak的MAF的期望f b,不同peak的MAF期望不同,对基因组上的所有peak,最终得到MAF期望的集合{f b};同时计算各个peak的TRE均值和方差(或标准差); For a fragment that falls near T i , the fragment is considered to have a copy number i; for a fragment that does not fall near T i , it is classified as a subcloned fragment, and all subcloned fragments are eliminated in subsequent analysis; Calculate the copy number of the cancer sample purity γ and peak, calculate the expected f b of the MAF of peak, and the MAF expectation of different peaks. For all peaks on the genome, finally obtain the desired set of MAF {f b }; TRE mean and variance (or standard deviation) for each peak;
步骤J:Step J:
根据步骤F计算的P和步骤I计算的{f b}构建如公式(19)所示的用“贝叶斯信息准则”校正后的混合高斯分布模型,然后对模型极大似然估计;其中,步骤J可以包括如下几步: According to P calculated in step F and {f b } calculated in step I, a mixed Gaussian distribution model corrected by "Bayesian information criterion" as shown in formula (19) is constructed, and then the maximum likelihood estimation of the model is performed; Step J can include the following steps:
J1:J1:
以步骤F计算的P构建如公式(17)所示的高斯分布模型:Construct a Gaussian distribution model as shown in equation (17) with P calculated in step F:
Figure PCTCN2018078908-appb-000038
Figure PCTCN2018078908-appb-000038
公式(17)中,L(e s;γ,κ)表示基因组片段TRE的似然函数,N表示基因组上的所有window的数量,I表示基因组中所有片段的最大的拷贝数,σ i表示拷贝数为i的所有片段的TRE的标准差由步骤I得到,e s为第s个window的TRE观测值,S i表示第i个peak的TRE均值即步骤I中的T i,p i表示第s个window的拷贝数为i的权重,对所有的i,p i均取值为1; In equation (17), L(e s ; γ, κ) represents the likelihood function of the genomic fragment TRE, N represents the number of all windows on the genome, I represents the maximum copy number of all fragments in the genome, and σ i represents a copy The standard deviation of the TRE of all segments of number i is obtained by step I, e s is the TRE observation value of the sth window, and S i represents the TRE mean value of the i th peak, that is, T i in step I, p i represents the first The copy number of s windows is the weight of i, and the value of all i, p i is 1;
J2:J2:
以步骤I计算的f b构建如公式(18)所示的高斯分布模型: Construct a Gaussian distribution model as shown in equation (18) with f b calculated in step I:
Figure PCTCN2018078908-appb-000039
Figure PCTCN2018078908-appb-000039
公式(18)中,L(f s;γ,κ)表示HGSNV的似然函数,M表示基因组中所有HGSNV数量,S表示第S个HGSNV,I表示基因组中所有片段的最大的拷贝数;F i,j表示拷贝数为i,主要等位基因的拷贝数为j的片段内HGSNV的MAF期望值,由步骤I得到;f s表示该片段内所有HGSNV的MAF的观测值均值,由步骤E得到,σ i,j表示该片段内所有HGSNV的MAF观测值的标准差,由步骤E得到;p i,j表示在主要等位基因的拷贝数为j时,高斯分布的权重,对所有的i和j,p i,j取值均为1,p i表示第S个HGSNV所在片段的拷贝数为i的权重,对所有的i,p i取值均为1; In formula (18), L(f s ; γ, κ) represents the likelihood function of HGSNV, M represents the number of all HGSNVs in the genome, S represents the Sth HGSNV, and I represents the maximum copy number of all fragments in the genome; i, j represents the expected value of the MAF of the HGSNV in the fragment whose copy number is i, the copy number of the major allele is j, obtained from step I; f s represents the mean value of the observed value of the MAF of all HGSNVs in the fragment, obtained from step E , σ i,j represents the standard deviation of the MAF observations of all HGSNVs in the segment, obtained from step E; p i,j represents the weight of the Gaussian distribution when the copy number of the primary allele is j, for all i And j, p i, j have a value of 1, p i represents the weight of the copy of the segment where the S H HSNSNV is i, and the value of all i, p i is 1;
J3:J3:
将(17)与(18)相加得到混合高斯模型,然后对混合模型进行BIC(Bayesian Information Criterion)校正得到最终混合模型如公式(19):Adding (17) and (18) to obtain a mixed Gaussian model, and then performing BIC (Bayesian Information Criterion) correction on the mixed model to obtain the final mixed model as Equation (19):
BIC(e s,f s;γ,κ)=-2×log L(f s;γ,κ)-2×log L(e s;γ,κ)+I×log(N)+J×log(M)  (19) BIC(e s ,f s ;γ,κ)=-2×log L(f s ;γ,κ)-2×log L(e s ;γ,κ)+I×log(N)+J×log (M) (19)
公式(19)中,BIC(e s,f s;γ,κ)表示混合模型的似然函数,I表示基因组中所有片段的最大的拷贝数,J是公式(18)中j的取值个数,N是基因组中window的数量,M是基因组中HGSNV的个数, In the formula (19), BIC(e s , f s ; γ, κ) represents the likelihood function of the mixed model, I represents the maximum copy number of all the fragments in the genome, and J is the value of j in the formula (18). Number, N is the number of windows in the genome, and M is the number of HGSNVs in the genome.
对[0,N]范围内的每一个整数值n,可以通过步骤G得到Q n,也可以通过步骤I得到所有peak的MAF期望的集合{f b},而一对(P,{f b})可以构建一个公式(19)所示的模型,实质上是对每一对(P,Q n),可以构建一个公式(19)所示的模型; For each integer value n in the range [0, N], Q n can be obtained by step G, or the desired set of MAFs of all peaks {f b } can be obtained by step I, and a pair (P, {f b }) It is possible to construct a model shown in equation (19), essentially for each pair (P, Q n ), to construct a model as shown in equation (19);
步骤K:Step K:
以0.001为分辨率,对[P-m,P+m]区间的所有P值,重复步骤G~J,可以得到一系列不同的(P,Q n)与对应的似然函数值,取最大的似然函数值对应的(P,Q n)作为最合适的P和Q值,m是0到0.5之间的一个值; With 0.001 as the resolution, repeating steps G to J for all P values in the [Pm, P+m] interval, a series of different (P, Q n ) and corresponding likelihood function values can be obtained, taking the maximum The function value corresponds to (P, Q n ) as the most suitable P and Q values, and m is a value between 0 and 0.5;
步骤L:Step L:
查询步骤H的结果,可以找到在步骤K得到的(P,Q)下,对应的癌症样本纯度和染色体倍性。Looking at the results of step H, the corresponding cancer sample purity and chromosome ploidy can be found under (P, Q) obtained in step K.
另一方面,本发明提供了一种用于计算癌症样本中癌症细胞纯度和染色体倍性的装置,其包括处理器,所述处理器用于运行程序,所述程序运行时执行以下步骤:In another aspect, the invention provides an apparatus for calculating cancer cell purity and chromosome ploidy in a cancer sample, comprising a processor for running a program, the program running the following steps:
步骤A:Step A:
获取配对的(来自同一癌症病人的)癌症组织样本和正常组织样本的全基因组测序(WGS)数据,并将测序数据比对到参考基因组;Obtaining genome-wide sequencing (WGS) data from paired cancer tissue samples (from the same cancer patient) and normal tissue samples, and comparing the sequencing data to a reference genome;
步骤B:Step B:
从步骤A得到的比对结果文件中,提取read位置和长度信息,HGSNV位点和覆盖该位点的read数量信息,计算所有HGSNV的MAF,其中,计算公式如(1.1)所示:From the comparison result file obtained in step A, the read position and length information, the HGSNV site and the read quantity information covering the site are extracted, and the MAF of all HGSNVs is calculated, wherein the calculation formula is as shown in (1.1):
Figure PCTCN2018078908-appb-000040
Figure PCTCN2018078908-appb-000040
公式(1.1)中,n r为包含与参考基因组相同等位基因的read数量,n a为包含另一种等位基因的read的数量,n t表示覆盖该HGSNV位点的总read数量,C为该HGSNV的MAF值; In formula (1.1), n r is the number of reads containing the same allele as the reference genome, n a is the number of reads containing another allele, and n t is the total number of reads covering the HGSNV site, C The MAF value of the HGSNV;
步骤C:Step C:
根据步骤B得到的read位置和长度信息,以window为单位统计各window内包含的read数量,使用基因组GC含量校正所有window内read数量;According to the read position and length information obtained in step B, the number of reads contained in each window is counted in units of window, and the number of reads in all windows is corrected by using the genomic GC content;
步骤D:Step D:
使用步骤C校正后的read数量,使用公式(1)计算每一个window的TRE,然后运用TRE,通过BIC-seq软件对基因组进行片段化,获得以拷贝数划分的基因组片段:Using the number of reads corrected in step C, the TRE of each window is calculated using equation (1), and then the genome is fragmented by BIC-seq software using TRE to obtain a genomic fragment divided by copy number:
Figure PCTCN2018078908-appb-000041
Figure PCTCN2018078908-appb-000041
公式(1)中,
Figure PCTCN2018078908-appb-000042
Figure PCTCN2018078908-appb-000043
分别表示在癌症样本中覆盖片段s(这里指window)的read数量和在正常样本中覆盖片段s的read数量,N t表示癌症样本总read数量,N n表示相应正常样本总read数量,e s为TRE值;
In formula (1),
Figure PCTCN2018078908-appb-000042
with
Figure PCTCN2018078908-appb-000043
Represents the number of reads in the cancer sample covering the segment s (here the window) and the number of reads covering the segment s in the normal sample, N t represents the total number of reads of the cancer sample, and N n represents the total number of reads of the corresponding normal sample, e s Is the TRE value;
步骤E:Step E:
以步骤D中BIC-seq处理后的基因组片段为单位,统计片段内所有window的TRE的均值、方差和该片段内window数量,根据均值和方差对基因组每个片段的window数量进行平滑化(smooth)处理,使TRE的分布更均匀,然后将平滑化处理后所有片段的window分布汇总,得到基因组上window随TRE变化的分布结果;同时以片段为单位,计算片段中所有HGSNV的MAF的均值和方差;Based on the genomic fragments treated by BIC-seq in step D, the mean, variance, and number of windows in the window of all windows in the segment are counted, and the number of windows per segment of the genome is smoothed according to the mean and variance (smooth Processing, the distribution of TRE is more uniform, and then the window distribution of all segments after smoothing is summarized to obtain the distribution result of window change with TRE on the genome; and the mean value of MAF of all HGSNVs in the segment is calculated in units of fragments. variance;
步骤F:Step F:
使用如公式(12)、(13)所示的类自回归模型,计算相邻拷贝数片段内TRE的差值即P,具体方法为遍历一定范围的P,计算Y(P),在Y(P)的分布中,选择第二高峰内Y(P)的最大值对应的P作为P的计算结果:Using the autoregressive model as shown in equations (12) and (13), calculate the difference of TRE in the adjacent copy number segment, that is, P, by traversing a certain range of P, and calculating Y(P), at Y ( In the distribution of P), the P corresponding to the maximum value of Y(P) in the second peak is selected as the calculation result of P:
Figure PCTCN2018078908-appb-000044
Figure PCTCN2018078908-appb-000044
Figure PCTCN2018078908-appb-000045
Figure PCTCN2018078908-appb-000045
公式(12)和(13)中,X t表示0到M t之间的TRE值;t表示扩大了1000倍的TRE值;M t表示TRE的最大值;变量P表示两个TRE位点的间隔;C(X t)表示在TRE为X t的位点,对应的window数量;C(X t+1000×P)表示在TRE为X t+1000×P的位点,对应的window数量;Y(P)表示在变量P下,类自回归模型的函数值; In equations (12) and (13), X t represents the TRE value between 0 and M t ; t represents a TRE value that is expanded by 1000 times; M t represents the maximum value of TRE; and variable P represents the two TRE sites. Interval; C(X t ) represents the number of windows corresponding to the position where TRE is X t ; C(X t+1000×P ) represents the number of windows corresponding to the position where TRE is X t+1000×P ; Y(P) represents the function value of the auto-regressive model under the variable P;
步骤G:Step G:
根据步骤F得到的P,计算TRE分布中第一个实际观测peak的TRE均值,然后计算在第一个实际peak之前最多可能存在理论peak的数量N,最后当第一个实际peak之前存在n个理论peak时,计算Q的值,以Q n表示,其中步骤G可以包括: According to the P obtained in step F, calculate the TRE mean of the first actual observed peak in the TRE distribution, and then calculate the maximum number of theoretical peaks N before the first actual peak, and finally there are n before the first actual peak. In the theoretical peak, the value of Q is calculated, denoted by Q n , where step G may include:
G1:G1:
根据步骤F计算的P,使用公式(13.1),选取使公式(13.1)取最大值的X f作为第一个实际观测peak的TRE均值: According to the P calculated in step F, using formula (13.1), select X f which takes the maximum value of formula (13.1) as the TRE mean of the first actual observed peak:
Figure PCTCN2018078908-appb-000046
Figure PCTCN2018078908-appb-000046
公式(13.1)中,i表示第i个peak,C(X f+P×i)表示在TRE为X f+P×i的位点,对应的window数量,n表示M t以内peak的最大数量,M t表示TRE的最大值; In formula (13.1), i represents the ith peak, C(X f + P × i) represents the position where TRE is X f + P × i, the corresponding number of windows, and n represents the maximum number of peaks within M t , M t represents the maximum value of TRE;
G2:G2:
使用公式(13.2),根据步骤F计算的P和步骤G1计算的X f,计算在X f之前最多可能存在的peak数量N: Using formula (13.2), calculate the maximum number of peaks N that may exist before X f based on P calculated in step F and X f calculated in step G1:
Figure PCTCN2018078908-appb-000047
Figure PCTCN2018078908-appb-000047
公式(13.2)中,X f表示第一个peak的均值,P表示相邻拷贝数片段对应的peak之间的间距,floor表示向下取整数; In formula (13.2), X f represents the mean of the first peak, P represents the spacing between peaks corresponding to adjacent copy number segments, and floor represents an integer below;
G3:G3:
利用步骤G2计算的N值,当n取0到N之间的整数时,使用公式(13.3)计算Q n的值: Using the value of N calculated in step G2, when n takes an integer between 0 and N, the value of Q n is calculated using equation (13.3):
Q n=X f-n×P+2×P=X f+(2-n)×P,n∈[0,N] (13.3) 公式(13.3)中,n表示X f之前peak的数量,取值范围是0到N之间的整数,P表示相邻拷贝数片段对应的peak之间的间距,X f表示第一个实际观测peak的TRE均值,Q n表示在X f之前理论上存在n个peak时的Q值; Q n =X f -n×P+2×P=X f +(2-n)×P,n∈[0,N] (13.3) In the formula (13.3), n represents the number of peaks before X f , The value range is an integer between 0 and N, P represents the spacing between peaks corresponding to adjacent copy number segments, X f represents the TRE mean of the first actually observed peak, and Q n represents the theoretical existence before X f Q value at n peaks;
步骤H:Step H:
使用步骤F计算的P与步骤G计算的所有可能的Q n,使用公式(10)、(11)计算癌症样本纯度γ和染色体倍性κ: Using the P calculated in step F and all possible Q n calculated in step G, the cancer sample purity γ and chromosome ploidy κ are calculated using equations (10), (11):
Figure PCTCN2018078908-appb-000048
Figure PCTCN2018078908-appb-000048
Figure PCTCN2018078908-appb-000049
Figure PCTCN2018078908-appb-000049
公式(10)、(11)中,γ表示样本纯度,κ表示染色体倍性,那么对所有的(P,Q N)都可以得到对应的(γ,κ); In formulas (10) and (11), γ represents the purity of the sample, and κ represents the ploidy of the chromosome, so that the corresponding (γ, κ) can be obtained for all (P, Q N );
步骤I:Step I:
当n取[0,N]之间的某个整数值时,使用公式(13.4)计算第i个peak的TRE均值:When n takes an integer value between [0, N], the formula (13.4) is used to calculate the TRE mean of the i-th peak:
T i=X f-n×P+i×P=X f+(i-n)×P,n∈[0,N]  (13.4) T i =X f -n×P+i×P=X f +(in)×P,n∈[0,N] (13.4)
公式(13.4)中,n表示X f之前peak的数量,取值范围是0到N之间的整数,P表示相邻拷贝数片段对应的peak之间的间距,X f表示第一个实际观测peak的TRE均值,T i表示第i个peak的TRE均值。 In formula (13.4), n represents the number of peaks before X f , the value range is an integer between 0 and N, P represents the spacing between peaks corresponding to adjacent copy number segments, and X f represents the first actual observation. The TRE mean of peak, T i represents the TRE mean of the ith peak.
对于落在T i附近的片段,认为该片段具有拷贝数i;对于没有落在T i附近的片段,将其归类为亚克隆片段,在后续分析中剔除所有亚克隆片段;然后根据步骤H计算的癌症样本纯度γ和peak对应的拷贝数,可计算peak的MAF的期望f b,不同peak的MAF期望不同,对基因组上的所有peak,最终得到MAF期望的集合{f b};。同时计算各个peak的TRE均值和方差(或标准差); For a fragment that falls near T i , the fragment is considered to have a copy number i; for a fragment that does not fall near T i , it is classified as a subcloned fragment, and all subcloned fragments are eliminated in subsequent analysis; Calculated the copy number of the cancer sample purity γ and peak, the expected f b of the MAF of peak can be calculated, the MAF of different peaks is expected to be different, and for all peaks on the genome, the desired set of MAF {f b }; Calculate the TRE mean and variance (or standard deviation) of each peak at the same time;
步骤J:Step J:
根据步骤F计算的P和步骤I计算的{f b}构建如公式(19)所示的用“贝叶斯信息准则”校正后的混合高斯分布模型,然后对模型极大似然估计;其中,步骤J可以包括如下几步: According to P calculated in step F and {f b } calculated in step I, a mixed Gaussian distribution model corrected by "Bayesian information criterion" as shown in formula (19) is constructed, and then the maximum likelihood estimation of the model is performed; Step J can include the following steps:
J1:J1:
以步骤F计算的P构建如公式(17)所示的高斯分布模型:Construct a Gaussian distribution model as shown in equation (17) with P calculated in step F:
Figure PCTCN2018078908-appb-000050
Figure PCTCN2018078908-appb-000050
公式(17)中,L(e s;γ,κ)表示基因组片段TRE的似然函数,N表示基因组上的所有window的数量,I表示基因组中所有片段的最大的拷贝数,σ i表示拷贝数为i的所有片段的TRE的标准差由步骤I得到,e s为第s个window的TRE观测值,S i表示第i个peak的TRE均值即步骤I中的T i,p i表示第s个window的拷贝数为i的权重,对所有的i,p i均取值为1; In equation (17), L(e s ; γ, κ) represents the likelihood function of the genomic fragment TRE, N represents the number of all windows on the genome, I represents the maximum copy number of all fragments in the genome, and σ i represents a copy The standard deviation of the TRE of all segments of number i is obtained by step I, e s is the TRE observation value of the sth window, and S i represents the TRE mean value of the i th peak, that is, T i in step I, p i represents the first The copy number of s windows is the weight of i, and the value of all i, p i is 1;
J2:J2:
以步骤I计算的f b构建如公式(18)所示的高斯分布模型: Construct a Gaussian distribution model as shown in equation (18) with f b calculated in step I:
Figure PCTCN2018078908-appb-000051
Figure PCTCN2018078908-appb-000051
公式(18)中,L(f s;γ,κ)表示HGSNV的似然函数,M表示基因组中所有HGSNV数量,S表示第S个HGSNV,I表示基因组中所有片段的最大的拷贝数;F i,j表示拷贝数为i,主要等位基因的拷贝数为j的片段内HGSNV的MAF期望值,由步骤I得到;f s表示该片段内所有HGSNV的MAF的观测值均值,由步骤E得到,σ i,j表示该片段内所有HGSNV的MAF观测值的标准差,由步骤E得到;p i,j表示在主要等位基因的拷贝数为j时,高斯分布的权重,对所有的i和j,p i,j取值均为1,p i表示第S个HGSNV所在片段 的拷贝数为i的权重,对所有的i,p i取值均为1; In formula (18), L(f s ; γ, κ) represents the likelihood function of HGSNV, M represents the number of all HGSNVs in the genome, S represents the Sth HGSNV, and I represents the maximum copy number of all fragments in the genome; i, j represents the expected value of the MAF of the HGSNV in the fragment whose copy number is i, the copy number of the major allele is j, obtained from step I; f s represents the mean value of the observed value of the MAF of all HGSNVs in the fragment, obtained from step E , σ i,j represents the standard deviation of the MAF observations of all HGSNVs in the segment, obtained from step E; p i,j represents the weight of the Gaussian distribution when the copy number of the primary allele is j, for all i And j, p i, j have a value of 1, p i represents the weight of the copy of the segment where the S H HSNSNV is i, and the value of all i, p i is 1;
J3:J3:
将(17)与(18)相加得到混合高斯模型,然后对混合模型进行BIC(Bayesian Information Criterion)校正得到最终混合模型如公式(19):Adding (17) and (18) to obtain a mixed Gaussian model, and then performing BIC (Bayesian Information Criterion) correction on the mixed model to obtain the final mixed model as Equation (19):
BIC(e s,f s;γ,κ)=-2×log L(f s;γ,κ)-2×log L(e s;γ,κ)+I×log(N)+J×log(M) (19) BIC(e s ,f s ;γ,κ)=-2×log L(f s ;γ,κ)-2×log L(e s ;γ,κ)+I×log(N)+J×log (M) (19)
公式(19)中,BIC(e s,f s;γ,κ)表示混合模型的似然函数,I表示基因组中所有片段的最大的拷贝数,J是公式(18)中j的取值个数,N是基因组中window的数量,M是基因组中HGSNV的个数。 In the formula (19), BIC(e s , f s ; γ, κ) represents the likelihood function of the mixed model, I represents the maximum copy number of all the fragments in the genome, and J is the value of j in the formula (18). Number, N is the number of windows in the genome, and M is the number of HGSNVs in the genome.
对[0,N]范围内的每一个整数值n,可以通过步骤G得到Q n,也可以通过步骤I得到所有peak的MAF期望的集合{f b},而一对(P,{f b})可以构建一个公式(19)所示的模型,实质上是对每一对(P,Q n),可以构建一个公式(19)所示的模型; For each integer value n in the range [0, N], Q n can be obtained by step G, or the desired set of MAFs of all peaks {f b } can be obtained by step I, and a pair (P, {f b }) It is possible to construct a model shown in equation (19), essentially for each pair (P, Q n ), to construct a model as shown in equation (19);
步骤K:Step K:
以0.001为分辨率,对[P-m,P+m]区间的所有P值,重复步骤G~J,可以得到一系列不同的(P,Q n)与对应的似然函数值,取最大的似然函数值对应的(P,Q n)作为最合适的P和Q值,m是0到0.5之间的一个值; With 0.001 as the resolution, repeating steps G to J for all P values in the [Pm, P+m] interval, a series of different (P, Q n ) and corresponding likelihood function values can be obtained, taking the maximum The function value corresponds to (P, Q n ) as the most suitable P and Q values, and m is a value between 0 and 0.5;
步骤L:Step L:
查询步骤H的结果,可以找到在步骤K得到的(P,Q)下,对应的癌症样本纯度和染色体倍性。Looking at the results of step H, the corresponding cancer sample purity and chromosome ploidy can be found under (P, Q) obtained in step K.
作为一种优选方案,上述计算癌症样本纯度和染色体倍性的方法和装置中,所述步骤A中,采用1000基因组计划第三期(phase 3)项目使用的参考基因组hs37d5(ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz)作为本发明的参考基因组,它包含了GRCh37中的所有染色体和零散序列(decoy sequences)。比对软件使用Burrows-Wheeler Aligner(BWA),比对方法使用其中的bwa mem,最终获得癌症和正常样本的比对结果bam格式文件。As a preferred embodiment, in the above method and apparatus for calculating the purity and ploidy of a cancer sample, in the step A, the reference genome hs37d5 (ftp://ftp) used in the phase 3 project of the 1000 genome project is adopted. .1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz) As the reference genome of the present invention, it contains all chromosomes and decoy sequences in GRCh37. The comparison software uses Burrows-Wheeler Aligner (BWA), and the comparison method uses bwa mem, and finally obtains the bam format file of the comparison result of cancer and normal samples.
作为一种优选方案,上述计算癌症样本纯度和染色体倍性的方法和装置中,所述步骤B中,采用了samtools软件提取read的位置和长度信息,HGSNV的位点和覆盖该位点的read数量信息。使用samtools view命令提取read信息时,过滤掉序列比对质量(MAPQ)低于31的序列(参数-q 31,q表示过滤掉测序质量差的序列),同时过滤掉未能正确匹配的read(参数-f 0x2 -F 0x18,f表示提取符合一定要求的序列,F表示过滤符合一定要求的序列)。使用samtools mpileup命令提取HGSNV信息时,过滤掉序列比对质量(MAPQ)低于20的序列(参数-q 20),并过滤掉碱基质量小于20的序列(参数-Q 20,Q表示过滤掉碱基质量差的序列)。选取等位基因频率(allele frequence)时,本发明使用samtools mpileup的-1参数。使用该参数需要提前准备一个包含SNP位点信息的bed格式文件。本 发明方法提前收集了1000基因组(genome)计划(http://www.internationalgenome.org/)中,根据大量样本统计出来的杂合等位基因位点,并且过滤掉B-等位基因频率(B-allele frequence)小于0.05的位点,然后做成bed文件。使用“-1”参数在确保能提供充足的HGSNV位点基础上,大大加快了HGSNV位点的提取速度,提高了装置运行效率。As a preferred embodiment, in the above method and apparatus for calculating cancer sample purity and chromosome ploidy, in step B, samtools software is used to extract the position and length information of the read, the HGSNV site and the read covering the site. Quantity information. When the read information is extracted using the samtools view command, the sequence whose sequence alignment quality (MAPQ) is lower than 31 is filtered out (parameter -q 31, q indicates that the sequence with poor sequencing quality is filtered out), and the read that fails to match correctly is filtered out. The parameter -f 0x2 -F 0x18, f indicates that the sequence that meets certain requirements is extracted, and F indicates that the sequence that meets certain requirements is filtered. When extracting HGSNV information using the samtools mpileup command, filter out the sequence with sequence alignment quality (MAPQ) lower than 20 (parameter -q 20), and filter out the sequence with base quality less than 20 (parameter -Q 20, Q means filtered out) Sequence of poor base quality). When the allele frequence is selected, the present invention uses the -1 parameter of samtools mpileup. To use this parameter, you need to prepare a bed format file containing SNP location information in advance. The method of the present invention collects in advance the 1000 genome (genome) program (http://www.internationalgenome.org/), the heterozygous allele locus based on a large number of samples, and filters out the B-allele frequency ( B-allele frequence) is less than 0.05 and is then made into a bed file. The use of the "-1" parameter greatly accelerates the extraction speed of the HGSNV site and improves the operating efficiency of the device on the basis of ensuring sufficient HGSNV sites.
作为一种优选方案,上述计算癌症样本纯度和染色体倍性的方法和装置中,所述步骤C中,步骤C可以包括4步:As a preferred solution, in the above method and apparatus for calculating the purity and ploidy of a cancer sample, in the step C, the step C may include four steps:
C1、将全基因组按照一定碱基长度的window为单位进行划分,对每个window统计覆盖该window的read数量,统计时以每条read的中点代表该read的位置;C1, the whole genome is divided according to a window of a certain base length, and the number of reads of the window is covered for each window statistics, and the position of the read is represented by the midpoint of each read in the statistics;
C2、对参考基因组创建索引文件,提高GC含量的统计速度;C2. Create an index file for the reference genome to increase the statistical speed of the GC content;
C3、以每个window的GC含量为自变量,以每个window的read数量为因变量,拟合read数量随GC含量变化的函数;C3, taking the GC content of each window as an independent variable, taking the number of reads of each window as the dependent variable, and fitting the function of the read quantity with the GC content;
C4、使用拟合出的模型对全基因组read数量进行调整。C4. Adjust the number of whole genome reads using the fitted model.
作为一种优选方案,上述计算癌症样本纯度和染色体倍性的方法和装置中,所述步骤C2中,本发明为参考基因组创建GC含量索引文件。对每一条染色体分别统计1、5、25、125个碱基间隔的区域内,鸟嘌呤(G)和胞嘧啶(C)的累积数量。那么在统计某一个window中的GC含量时,可以用a*125+b*25+c*5+d*1(其中a,b,c,d表示系数变量)的快速算法提取。例如想要统计某380bp区域内的GC含量时,可以分解为3*125+1*5形式,那么只需要读取特定索引文件的某个区域5碱基中的GC含量和某个区域125碱基中的GC含量即可。同时本发明将索引文件存储为二进制的格式,极大的加快了对特定区域的GC含量的提取。As a preferred embodiment, in the above method and apparatus for calculating cancer sample purity and chromosome ploidy, in the step C2, the present invention creates a GC content index file for the reference genome. The cumulative number of guanine (G) and cytosine (C) in the region of 1, 5, 25, and 125 base intervals for each chromosome. Then, when counting the GC content in a window, you can use the fast algorithm to extract a*125+b*25+c*5+d*1 (where a, b, c, d represent coefficient variables). For example, if you want to count the GC content in a 380 bp region, you can decompose it into 3*125+1*5 format. Then you only need to read the GC content in a certain base of a certain index file and the 125 base in a certain region. The GC content in the base can be. At the same time, the present invention stores the index file in a binary format, which greatly speeds up the extraction of the GC content of a specific region.
作为一种优选方案,上述计算癌症样本纯度和染色体倍性的方法和装置中,所述步骤C3中,本发明使用步骤C1和步骤C2提取的各window GC含量,通过如下弹性网络模型拟合read数量随GC含量变化。本发明使用window的GC含量为变量x,使用x,x 2,x 3,x 4,x 5,x 6作为弹性网络模型的输入变量,以read数量为输出变量,构建弹性网络模型如公式(20)所示。式中,y表示window内观测到的read数量,X表示输入变量矩阵,β表示变量系数矩阵,j表示变量系数下标,P表示系数总数,λ 1和λ 2表示罚分系数。 As a preferred embodiment, in the above method and apparatus for calculating the purity and ploidy of cancer samples, in the step C3, the present invention uses the respective window GC contents extracted in steps C1 and C2 to fit the read through the following elastic network model. The amount varies with GC content. The invention uses the GC content of the window as the variable x, uses x, x 2 , x 3 , x 4 , x 5 , x 6 as the input variables of the elastic network model, and uses the number of reads as the output variable to construct an elastic network model such as a formula ( 20) shown. Where y represents the number of reads observed in the window, X represents the input variable matrix, β represents the variable coefficient matrix, j represents the variable coefficient subscript, P represents the total number of coefficients, and λ 1 and λ 2 represent the penalty coefficients.
Figure PCTCN2018078908-appb-000052
Figure PCTCN2018078908-appb-000052
作为一种优选方案,上述计算癌症样本纯度和染色体倍性的方法和装置中,所述步骤C4中,使用步骤C3中的模型预测每一个window理论上的read数量μ gc,基因组的平均GC含量定义为μ,window内观测到的read数量定义为y,window内校正后的read数量为Y。那么校正公式如下(21)所示: As a preferred embodiment, in the above method and apparatus for calculating cancer sample purity and chromosome ploidy, in the step C4, the model in step C3 is used to predict the theoretical read number μ gc of each window, and the average GC content of the genome. Defined as μ, the number of reads observed in the window is defined as y, and the number of corrected reads in the window is Y. Then the correction formula is as follows (21):
Figure PCTCN2018078908-appb-000053
Figure PCTCN2018078908-appb-000053
作为一种优选方案,上述计算癌症样本纯度和染色体倍性的方法和装置中,所述步骤D中,本发明使用公式(1)计算每个window的TRE取值。然后运用TRE的值,使用BIC-seq软件对全基因组进行片段化(segmentation)。BIC-seq的思路是使用Bayesian Information Criterion(BIC)算法,统计相邻窗口的BIC值,值越小说明两个窗口越相似,然后将BIC值小于0的window合并,最终BIC-seq会按照片段拷贝数的差异,将全基因组分割为不同片段。每一个片段与相邻片段有不同的TRE均值,即拷贝数存在差异。As a preferred embodiment, in the above method and apparatus for calculating cancer sample purity and chromosome ploidy, in the step D, the present invention calculates the TRE value of each window using the formula (1). The whole genome was then segmented using the BIC-seq software using the value of TRE. The idea of BIC-seq is to use the Bayesian Information Criterion (BIC) algorithm to count the BIC values of adjacent windows. The smaller the value, the more similar the two windows are, and then merge the windows with BIC values less than 0. Finally, BIC-seq will follow the fragment. The difference in copy number divides the whole genome into different segments. Each segment has a different TRE mean value than the adjacent segment, that is, there is a difference in copy number.
作为一种优选方案,上述计算癌症样本纯度和染色体倍性的方法和装置中,所述步骤E中,使用步骤D中BIC-seq处理后的基因组片段为单位,计算片段所包含的window数量,TRE的平均值以及方差。然后对片段的TRE进行smooth处理。处理方式如公式(22)所示。针对每一个基因组片段,以TRE的均值作为正态分布的均值μ,以TRE的方差作为正态分布的方差σ,计算出TRE在[μ-2σ,μ+2σ]范围内window数量的分布,定义v为TRE坐标,取值范围为[μ-2σ,μ+2σ],分辨率为0.000,C win为该片段分配到v位点的window数量,C T表示该片段内window的总数。将所有片段的window根据TRE值smooth后,可使片段内的window数量呈现正态分布,对所有片段各TRE位点对应的window数求和汇总,可以得到基因组范围的window随TRE变化的分布。 As a preferred embodiment, in the above method and apparatus for calculating the purity and ploidy of a cancer sample, in the step E, the number of windows included in the fragment is calculated by using the genomic fragment after the BIC-seq processing in the step D, The mean and variance of the TRE. The TRE of the segment is then subjected to smooth processing. The processing method is as shown in formula (22). For each genomic fragment, the mean value of TRE is taken as the mean μ of the normal distribution, and the variance of TRE is taken as the variance σ of the normal distribution, and the distribution of the window number of TRE in the range of [μ-2σ, μ+2σ] is calculated. The definition v is the TRE coordinate, the value range is [μ-2σ, μ+2σ], the resolution is 0.000, C win is the number of windows allocated to the v-site, and C T is the total number of windows in the segment. After the window of all segments is based on the TRE value, the number of windows in the segment is normally distributed, and the number of windows corresponding to each TRE site of all segments is summed and summarized, and the distribution of the window-wide window with the TRE can be obtained.
Figure PCTCN2018078908-appb-000054
Figure PCTCN2018078908-appb-000054
作为一种优选方案,上述计算癌症样本纯度和染色体倍性的方法和装置中,所述步骤F中,以0.001为分辨率,遍历(0,1]范围内的所有P,使用类自回归模型,计算Y(P)的值。Y(P)表现为多峰分布,类似图3所示,图中横轴为P,纵轴表示Y(P),本发明使用第二高峰内Y(P)的最大值对应的P作为P的计算结果,M t是TRE的最大取值,这里将M t设置为3。 As a preferred embodiment, in the above method and apparatus for calculating the purity and ploidy of a cancer sample, in the step F, traversing all Ps in the range of (0, 1) with a resolution of 0.001, using an autoregressive model. Calculate the value of Y(P). Y(P) appears as a multimodal distribution, similar to that shown in Figure 3, where the horizontal axis is P and the vertical axis represents Y(P), and the present invention uses the second peak Y (P) The maximum value corresponds to P as the calculation result of P, and M t is the maximum value of TRE, where M t is set to 3.
作为一种优选方案,上述计算癌症样本纯度和染色体倍性的方法和装置中,所述步骤G中,步骤G包括3个步骤,步骤G1中,[0,1]的TRE区间作为变量X f的取值范围,过滤掉C(X f)小于1000的TRE位点,计算使公式(13.1)取最大值时的X f作为第一个peak的均值点。 As a preferred solution, in the above method and apparatus for calculating the purity and chromosome ploidy of a cancer sample, in the step G, the step G includes three steps, and in the step G1, the TRE interval of [0, 1] is used as the variable X f . For the range of values, filter out the TRE site where C(X f ) is less than 1000, and calculate the X f when the maximum value of the formula (13.1) is taken as the mean point of the first peak.
作为一种优选方案,上述计算癌症样本纯度和染色体倍性的方法和装置中,所述步骤I中,根据步骤H计算的癌症样本纯度γ和peak对应的拷贝数,可计算peak的MAF的期望f b。其中步骤I可以包含I1、I2、I3三个步骤。 As a preferred embodiment, in the above method and apparatus for calculating cancer sample purity and chromosome ploidy, in the step I, according to the copy number of the cancer sample purity γ and peak calculated according to the step H, the expectation of the MAF of the peak can be calculated. f b . Step I can include three steps of I1, I2, and I3.
I1,使用公式(14)计算peak内HGSNV的MAF理论值,公式(14)中,C mcp表示主要等位基因的拷贝数(major allele copy number),C cp表示peak的整体拷贝数,由步骤I得到,f表示该peak内MAF的理论值,可见当C cp较大时,f有多种不同的可能值。 I1, calculate the MAF theoretical value of HGSNV in peak using formula (14). In formula (14), C mcp represents the major allele copy number, and C cp represents the overall copy number of peak. I get, f represents the theoretical value of the MAF in the peak, it can be seen that when C cp is large, f has many different possible values.
Figure PCTCN2018078908-appb-000055
Figure PCTCN2018078908-appb-000055
I2,利用负二项分布估计覆盖每个HGSNV位点的read总数的概率,使用公式(15)计算负二项分布的概率p和失败次数r。公式(15)中,m是片段内所有window中read数量的均值,v是片段内所有window中read数量的方差,所求得的p是用于负二项分布的随机变量成功的概率,r为随机变量失败的次数,随机变量为覆盖某个HGSNV中的read数量。I2, using the negative binomial distribution to estimate the probability of covering the total number of reads per HGSNV site, and using equation (15) to calculate the probability p and the number of failures r of the negative binomial distribution. In equation (15), m is the mean of the number of reads in all windows in the fragment, and v is the variance of the number of reads in all windows in the fragment. The obtained p is the probability of success of the random variable used for the negative binomial distribution, r For the number of times a random variable fails, the random variable is the number of reads in a certain HGSNV.
Figure PCTCN2018078908-appb-000056
Figure PCTCN2018078908-appb-000056
I3,利用二项分布求得的覆盖某个HGSNV的read数的概率。结合在一定read数量下,HGSNV只有两种基因型,服从二项分布规律,利用公式(16)计算f的校正值f b(即f的期望)。同一个peak中,不同的C mcp可以计算得到不同的f b,选择与该peak的MAF观测均值最接近的f b作为该peak的f b。公式(16)中,k表示在某个HGSNV位点,某一种等位基因(A或B)的数量,d为覆盖该HGSNV的read数量,r为随机变量失败的次数,p是用于负二项分布的随机变量成功的概率; I3, the probability of covering the number of reads of a certain HGSNV obtained by the binomial distribution. In combination with a certain number of reads, HGSNV has only two genotypes, subject to the binomial distribution law, and uses equation (16) to calculate the correction value f b of f (ie, the expectation of f). With a peak, different C mcp can calculate different f b, and selecting the closest mean value of the MAF observed as the peak of the peak is f b f b. In formula (16), k represents the number of alleles (A or B) at a certain HGSNV site, d is the number of reads covering the HGSNV, r is the number of failed random variables, and p is used for The probability of success of a random variable with a negative binomial distribution;
Figure PCTCN2018078908-appb-000057
Figure PCTCN2018078908-appb-000057
对每一个Q n,可推断获得基因组所有peak对应的拷贝数和癌症样本纯度,从而对每一个peak可求f b,进而可以得到所有peak的MAF的期望值得集合{f b}。 For each Q n , it can be inferred that the copy number and cancer sample purity corresponding to all peaks of the genome are obtained, so that f b can be obtained for each peak, and then the expected value set {f b } of the MAF of all peaks can be obtained.
作为一种优选方案,上述计算癌症样本纯度和染色体倍性的方法和装置中,所述步骤K中,m取0.02,P值的遍历区间为[P-0.02,P+0.02]。As a preferred embodiment, in the above method and apparatus for calculating the purity and chromosome ploidy of a cancer sample, in the step K, m is 0.02, and the traversal interval of the P value is [P-0.02, P+0.02].
通过本发明提供的层次混合高斯模型,实现了对癌症样本纯度的快速和准确计算,节约了纯度估算的时间和经济成本,同时提高了计算结果的准确性。Through the hierarchical mixed Gaussian model provided by the invention, the rapid and accurate calculation of the purity of the cancer sample is realized, the time and economic cost of the purity estimation are saved, and the accuracy of the calculation result is improved.
附图说明DRAWINGS
图1表示全基因组中window数量在TRE上的分布。其中,图A表示的是未进过GC校正的TRE分布,图B表示经过GC含量校正后的TRE分布图。Figure 1 shows the distribution of the number of windows in the whole genome on the TRE. Among them, Figure A shows the TRE distribution that has not undergone GC correction, and Figure B shows the TRE distribution after GC content correction.
图2表示一种癌症细胞中的TRE分布的模型,图为smooth处理以后,图中的peak满足以P为周期的分布,少量不满足周期性分布的小peak被认为是亚克隆片段。Q表示拷贝数为2的peak,不存在拷贝数为1的片段,所以在大约0.6的位置的peak的window数量为0。Fig. 2 shows a model of TRE distribution in a cancer cell. The figure shows that after the smooth treatment, the peak in the figure satisfies the distribution with a period of P, and a small number of small peaks which do not satisfy the periodic distribution are considered to be subcloned fragments. Q represents a peak with a copy number of 2, and there is no segment with a copy number of 1, so the number of windows of the peak at a position of about 0.6 is zero.
图3表示横轴为P纵轴为类自回归模型计算值得分布。Figure 3 shows that the horizontal axis is the vertical axis of the P-class autoregressive model.
图4表示本发明方法和装置的流程图。Figure 4 shows a flow chart of the method and apparatus of the present invention.
具体实施方式detailed description
为更好地说明本发明的目的、技术方案和优点,下面将结合附图和具体实施例对本发 明作进一步说明。但是提供实施例仅用于说明目的,而本发明的范围不限于实施例。The present invention will be further described with reference to the accompanying drawings and specific embodiments. However, the examples are provided for illustrative purposes only, and the scope of the invention is not limited to the embodiments.
使用本发明所述的装置计算癌症样本纯度和染色体倍性的流程图如图4所示。A flow chart for calculating cancer sample purity and chromosome ploidy using the apparatus of the present invention is shown in FIG.
实施例中,所使用的实验材料为TCGA(https://cancergenome.nih.gov/)数据库下载的样本(TCGA-AD-A5EJ)的正常组织TCGA-AD-A5EJ-10A和癌症组织TCGA-AD-A5EJ-01A的全基因组测序数据。计算平台为ubuntu 16.04,方法的具体实现为C++,python,R程序。In the examples, the experimental material used was the normal tissue TCGA-AD-A5EJ-10A and the cancer tissue TCGA-AD of the sample (TCGA-AD-A5EJ) downloaded from the TCGA (https://cancergenome.nih.gov/) database. -A5EJ-01A whole genome sequencing data. The computing platform is ubuntu 16.04, and the specific implementation of the method is C++, Python, and R programs.
实施例:根据样本TCGA-CM-4746的癌症组织和正常组织的全基因组测序数据,使用层次性混合高斯模型计算癌症样本的纯度和染色体倍性。EXAMPLES: The purity and ploidy of cancer samples were calculated using a hierarchical mixed Gaussian model based on genome-wide sequencing data for cancer tissues and normal tissues of sample TCGA-CM-4746.
一、收集样本数据,在TCGA中下载TCGA-CM-4746-01A的肿瘤样本和正常样本的全基因组测序数据。癌症样本bam文件大小为12.6G,正常样本bam文件大小为10.1G。将bam文件用PICARD软件处理为fastq文件。将fastq使用bwa mem比对到参考基因组hs37d5得到新的癌症样本和正常样本bam文件,文件大小分别为12.4G和9.9G。1. Collect sample data and download the whole genome sequencing data of tumor samples and normal samples of TCGA-CM-4746-01A in TCGA. The cancer sample bam file size is 12.6G, and the normal sample bam file size is 10.1G. The bam file is processed into a fastq file using the PICARD software. The fastq was compared to the reference genome hs37d5 using bwa mem to obtain a new cancer sample and a normal sample bam file with file sizes of 12.4G and 9.9G, respectively.
二、下载1000genome项目提供的1到22号染色体的vcf文件(ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/),使用GATK的SelectVariants方法,提取参考基因组hs37d5.fa中的等位基因频率大于5%的BIALLELIC位点作为潜在的HGSNV位点,最终得到5633774个biallele位点。2. Download the vcf file of chromosomes 1 to 22 provided by the 1000genome project (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/), and use the SelectVariants method of GATK to extract the reference genome hs37d5. The BIALLELIC site with an allele frequency greater than 5% in .fa serves as a potential HGSNV site, resulting in 5633774 biallele sites.
三、提取正常样本和癌症样本的read信息,同时提取癌症样本的HGSNV信息。使用samtools提取癌症样本的序列覆盖度和HGSNV,得到HGSNV 67732个。提取HGSNV时,采用上述步骤一获得的biallele位点作为备选位点表单。使用samtools view方法,直接从备选表单中提取HGSNV,加快提取速度。Third, extract the read information of normal samples and cancer samples, and extract the HGSNV information of the cancer samples. Using samtools to extract the sequence coverage and HGSNV of cancer samples, 67732 HGSNV were obtained. When extracting HGSNV, the biallele site obtained in the above step 1 is used as an alternative site form. Use the samtools view method to extract HGSNV directly from the alternate form to speed up the extraction.
四、以500bp为window对参考基因组建立GC含量的索引文件,对上述步骤一下载得到的参考基因组hs37d5.fa文件,建立1,5,25,125区段中的GC含量索引文件。存储为二进制格式。4. The index file for establishing the GC content of the reference genome by using the 500 bp window, and the reference genome hs37d5.fa file downloaded in the above step 1, the GC content index file in the 1, 5, 25, 125 segment is established. Stored in binary format.
五、以500bp为一个窗口,统计全基因组范围内每个窗口中的read数量。同时使用步骤四中产生的索引文件计算每个窗口中的GC含量。通过弹性网络模型对read数量进行GC含量校正。5. Using 500bp as a window, count the number of reads in each window in the genome-wide range. Also use the index file generated in step four to calculate the GC content in each window. The GC content was corrected for the number of reads by the elastic network model.
六、对每一个窗口,使用矫正后的read数量计算TRE。并依据TRE,通过BIC-seq对基因组进行片段化。片段化的结果如表一所示,每一列数据表示了一个基因组片段的位置信息和TRE的均值,方差和片段内window的数量。6. For each window, calculate the TRE using the corrected number of reads. The genome was fragmented by BIC-seq according to TRE. The results of the fragmentation are shown in Table 1. Each column of data represents the positional information of a genomic fragment and the mean value of the TRE, the variance and the number of windows in the fragment.
表1 BIC-seq对基因组片段化后的结果Table 1 Results of BIC-seq fragmentation of the genome
染色体号Chromosome number 起始Start 终止termination TRE均值TRE mean TRE方差TRE variance Windows数量Number of Windows
chr1Chr1 1300113001 4526550045265500 0.9445080.944508 0.00147420.0014742 8612886128
chr1Chr1 4526550145265501 8597800085978000 0.9454540.945454 0.001332010.00133201 8097080970
chr1Chr1 8597800185978001 8601150086011500 1.273621.27362 0.09813210.0981321 6868
chr1Chr1 8601150186011501 116069000116069000 0.949150.94915 0.001530580.00153058 5877558775
chr1Chr1 116069001116069001 120339500120339500 1.013231.01323 0.004378910.00437891 84888488
chr1Chr1 120339501120339501 143744000143744000 1.074921.07492 0.0163370.016337 24422442
chr1Chr1 143744001143744001 144707500144707500 1.364611.36461 0.05141540.0514154 469469
chr1Chr1 144707501144707501 145290000145290000 1.469031.46903 0.02527440.0252744 887887
chr1Chr1 145290001145290001 145833000145833000 1.736661.73666 0.02629440.0262944 936936
chr1Chr1 145982001145982001 148248000148248000 1.338951.33895 0.01312230.0131223 23062306
chr1Chr1 148248001148248001 149200000149200000 1.682141.68214 0.03815420.0381542 725725
chr1Chr1 149200001149200001 249240500249240500 1.33831.3383 0.001149540.00114954 196613196613
chr2Chr2 1000110001 4140300041403000 0.9487690.948769 0.001342010.00134201 8186581865
chr2Chr2 4140300141403001 5131700051317000 0.5632510.563251 0.001833060.00183306 1971719717
chr2Chr2 5131700151317001 9170150091701500 0.9503250.950325 0.001458090.00145809 7555075550
chr2Chr2 9170150191701501 9182450091824500 1.234281.23428 0.06052570.0605257 188188
chr2Chr2 9182450191824501 233387500233387500 0.9526520.952652 0.0007443770.000744377 271202271202
chr2Chr2 233387501233387501 243186500243186500 0.9364870.936487 0.003085930.00308593 1903219032
chr3Chr3 6000160001 197956500197956500 0.9521340.952134 0.0006134780.000613478 386567386567
chr4Chr4 1000110001 68900006890000 0.9596570.959657 0.004179210.00417921 1352013520
chr4Chr4 68900016890001 191044500191044500 0.9527030.952703 0.0006336570.000633657 358360358360
chr5Chr5 1150111501 1261600012616000 0.559990.55999 0.001804060.00180406 2489524895
chr5Chr5 1261600112616001 180901000180901000 0.9486630.948663 0.0006765330.000676533 323787323787
chr6Chr6 124001124001 2482700024827000 0.9415090.941509 0.001712210.00171221 4904149041
chr6Chr6 2482700124827001 2598250025982500 0.6125320.612532 0.005445560.00544556 23112311
chr6Chr6 2598250125982501 171051000171051000 0.9533120.953312 0.0007257610.000725761 280819280819
chr7Chr7 1000110001 56305005630500 0.9544140.954414 0.004618920.00461892 1097610976
chr7Chr7 56305015630501 3830650038306500 0.9557960.955796 0.001496650.00149665 6473264732
chr7Chr7 3830650138306501 3839400038394000 1.242151.24215 0.03689860.0368986 176176
chr7Chr7 3839400138394001 5508750055087500 0.9505290.950529 0.002122480.00212248 3313033130
chr7Chr7 5508750155087501 7328350073283500 0.9475450.947545 0.002722140.00272214 2715627156
chr7Chr7 7328350173283501 142340000142340000 0.9556590.955659 0.001074450.00107445 134626134626
chr7Chr7 142340001142340001 142491000142491000 1.143911.14391 0.02598090.0259809 303303
chr7Chr7 142491001142491001 159128500159128500 0.9658120.965812 0.002518710.00251871 3191431914
chr8Chr8 1150111501 4685750046857500 1.326351.32635 0.001757070.00175707 8437384373
chr8Chr8 4685750146857501 4774450047744500 1.013831.01383 0.0156170.015617 17351735
chr8Chr8 4774450147744501 146304000146304000 1.330531.33053 0.001120190.00112019 194805194805
chr9Chr9 1050110501 8962400089624000 0.9569590.956959 0.001208250.00120825 119638119638
chr9Chr9 8962400189624001 9004350090043500 1.259751.25975 0.01673810.0167381 836836
chr9Chr9 9004350190043501 9234300092343000 1.919591.91959 0.0104420.010442 45444544
chr9Chr9 9234300192343001 9335200093352000 3.248033.24803 0.02578860.0257886 14981498
chr9Chr9 9335200193352001 9369600093696000 3.542673.54267 0.03992310.0399231 683683
chr9Chr9 9369600193696001 9423200094232000 2.922732.92273 0.02837350.0283735 10681068
chr9Chr9 9423200194232001 9507650095076500 3.235263.23526 0.02493890.0249389 16681668
chr9Chr9 9507650195076501 9508000095080000 1.473831.47383 0.23430.2343 88
chr9Chr9 9508000195080001 9509900095099000 0.5805550.580555 0.04611020.0461102 3939
chr9Chr9 9509900195099001 124413000124413000 0.948530.94853 0.001617150.00161715 5816258162
chr9Chr9 124413001124413001 124419500124419500 1.574381.57438 0.2011910.201191 1414
chr9Chr9 124419501124419501 141128000141128000 0.9441090.944109 0.002480410.00248041 3261532615
chr10Chr10 6600166001 135525000135525000 0.9462170.946217 0.0007872180.000787218 255445255445
chr11Chr11 113001113001 134946500134946500 0.9509480.950948 0.0007772760.000777276 259377259377
chr12Chr12 6050160501 9350093500 1.884811.88481 0.2067040.206704 5252
chr12Chr12 9350193501 100378000100378000 0.9495640.949564 0.0008744330.000874433 193005193005
chr12Chr12 100378001100378001 133841500133841500 0.9439860.943986 0.001556710.00155671 6585265852
chr13Chr13 1902050119020501 115110000115110000 0.9841160.984116 0.0008933380.000893338 189923189923
chr14Chr14 1900000119000001 2253350022533500 0.9813440.981344 0.006612040.00661204 56385638
chr14Chr14 2253350122533501 2303800023038000 1.121461.12146 0.01514370.0151437 10091009
chr14Chr14 2303800123038001 7425350074253500 0.9487940.948794 0.001197840.00119784 101769101769
chr14Chr14 7425350174253501 7425450074254500 3.639013.63901 1.127181.12718 33
chr14Chr14 7425450174254501 107289500107289500 0.9432910.943291 0.001523860.00152386 6575065750
chr15Chr15 2000000120000001 2934600029346000 0.9401280.940128 0.003974120.00397412 1393313933
chr15Chr15 2934600129346001 102521500102521500 0.9461930.946193 0.001047060.00104706 141616141616
chr16Chr16 6000160001 3264650032646500 0.9502610.950261 0.001844980.00184498 6029660296
chr16Chr16 3264650132646501 3379650033796500 0.8893390.889339 0.01782430.0178243 11721172
chr16Chr16 3379650133796501 9028200090282000 0.9450440.945044 0.001359850.00135985 8930589305
chr17Chr17 11 61090006109000 0.919870.91987 0.003963130.00396313 1183611836
chr17Chr17 61090016109001 61255006125500 1.979351.97935 0.1638980.163898 3434
chr17Chr17 61255016125501 1665350016653500 0.9229420.922942 0.002867450.00286745 2094520945
chr17Chr17 1665350116653501 1673850016738500 0.7399320.739932 0.05005930.0500593 142142
chr17Chr17 1673850116738501 2226250022262500 0.9593840.959384 0.005575730.00557573 99589958
chr17Chr17 2226250122262501 2709750027097500 1.710121.71012 0.01065130.0106513 36473647
chr17Chr17 2709750127097501 3443250034432500 0.932980.93298 0.003395140.00339514 1457014570
chr17Chr17 3443250134432501 3450850034508500 1.282161.28216 0.0506950.050695 121121
chr17Chr17 3450900134509001 3621550036215500 0.914580.91458 0.007293530.00729353 28902890
chr17Chr17 3621550136215501 3641500036415000 1.08671.0867 0.03809840.0380984 263263
chr17Chr17 3641500136415001 3942150039421500 0.9498910.949891 0.005828020.00582802 59845984
chr17Chr17 3942150139421501 3943200039432000 0.2491880.249188 0.03862970.0386297 2020
chr17Chr17 3943200139432001 5199600051996000 0.9373820.937382 0.002644610.00264461 2421024210
chr17Chr17 5199600151996001 5203700052037000 1.37251.3725 0.05432020.0543202 8383
chr17Chr17 5203700152037001 5532200055322000 0.9382070.938207 0.004659890.00465989 65396539
chr17Chr17 5532200155322001 5563250055632500 0.6079870.607987 0.01020.0102 622622
chr17Chr17 5563250155632501 6860950068609500 0.9428950.942895 0.002468780.00246878 2557525575
chr17Chr17 6860950168609501 6874500068745000 1.263741.26374 0.02355190.0235519 272272
chr17Chr17 6874500168745001 8119500081195000 0.9435620.943562 0.002872590.00287259 2428524285
chr18Chr18 1000110001 99885009988500 1.327621.32762 0.003514050.00351405 1981719817
chr18Chr18 99885019988501 7801750078017500 0.9448280.944828 0.001065340.00106534 128567128567
chr19Chr19 8900189001 2396450023964500 0.9401280.940128 0.002117550.00211755 4661846618
chr19Chr19 2396450123964501 2402800024028000 0.5478030.547803 0.02436620.0243662 128128
chr19Chr19 2402800124028001 2462100024621000 0.9153630.915363 0.01189830.0118983 11551155
chr19Chr19 2462850124628501 5911900059119000 1.270991.27099 0.002162960.00216296 6194061940
chr20Chr20 6000160001 2942300029423000 0.9446340.944634 0.001665960.00166596 5192951929
chr20Chr20 2942300129423001 6296550062965500 1.64131.6413 0.00241250.0024125 6617066170
chr21Chr21 94110019411001 4006700040067000 0.9384750.938475 0.001646720.00164672 5252552525
chr21Chr21 4006700140067001 4044200040442000 0.6056130.605613 0.008974340.00897434 747747
chr21Chr21 4044200140442001 4812000048120000 0.9585540.958554 0.003691370.00369137 1499614996
chr22Chr22 1605000116050001 5123500051235000 0.9447870.944787 0.001744550.00174455 6659166591
七、通过步骤六得到了每个片段的TRE的均值和方差,以及该片段内包含的窗口数量。使用正态分布的方法,以每个片段的TRE均值和方差作为正态分布的均值和方差,将片段中的窗口按照正态分布进行平滑化。汇总所有片段smooth后的TRE以及对应窗口数的信息。7. Through step 6, the mean and variance of the TRE of each segment are obtained, and the number of windows included in the segment. Using the normal distribution method, the TRE mean and variance of each segment are used as the mean and variance of the normal distribution, and the windows in the segment are smoothed according to the normal distribution. Summarize the TRE of all clips and the corresponding number of windows.
八、对smooth后的TRE的窗口数进行自回归分析,得到P的取值为0.386。8. Autoregressive analysis of the number of TRE windows after smooth, and the value of P is 0.386.
九、P等于0.386时,第一个实际观测peak的TRE均值为0.562,第一个实际观测peak前最多可以存在1个理论peak,即N=1。可能的Q为:Q 0=1.334,Q 1=0.948,这两种Q的混合高斯模型的似然函数值分别为1.77E+07,1.78E+07。 9. When P is equal to 0.386, the TRE average of the first actual observed peak is 0.562. There can be at most one theoretical peak before the first actual observation peak, that is, N=1. The possible Q is: Q 0 = 1.334, Q 1 = 0.948. The likelihood function values of the mixed Gaussian models of the two Qs are 1.77E+07 and 1.78E+07, respectively.
十、计算P在取值范围[P-0.02,P+0.02]内,BIC校正后的混合高斯模型的极大似然值,计算结果如表2所示。X. Calculate the maximum likelihood of the mixed Gaussian model after BIC correction in the value range [P-0.02, P+0.02]. The calculation results are shown in Table 2.
表2在P的取值范围内混合高斯模型的结果Table 2 results of mixing Gaussian models in the range of values of P
P值P value Q值Q value 似然函数值Likelihood function value P值P value Q值Q value 似然函数值Likelihood function value
0.3660.366 0.9320.932 1.12E+061.12E+06 0.3860.386 1.3341.334 1.77E+071.77E+07
0.3660.366 1.2981.298 1.12E+061.12E+06 0.3870.387 0.9480.948 1.78E+071.78E+07
0.3670.367 0.9330.933 849906849906 0.3870.387 1.3351.335 1.77E+071.77E+07
0.3670.367 1.31.3 847125847125 0.3880.388 0.9480.948 1.84E+071.84E+07
0.3680.368 0.9340.934 795735795735 0.3880.388 1.3361.336 1.84E+071.84E+07
0.3680.368 1.3021.302 792858792858 0.3890.389 0.9480.948 1.88E+071.88E+07
0.3690.369 0.9350.935 832922832922 0.3890.389 1.3371.337 1.87E+071.87E+07
0.3690.369 1.3041.304 830037830037 0.390.39 0.9480.948 1.87E+071.87E+07
0.370.37 0.9360.936 1.05E+061.05E+06 0.390.39 1.3381.338 1.87E+071.87E+07
0.370.37 1.3061.306 1.04E+061.04E+06 0.3910.391 0.9480.948 1.88E+071.88E+07
0.3710.371 0.9370.937 1.18E+061.18E+06 0.3910.391 1.3391.339 1.87E+071.87E+07
0.3710.371 1.3081.308 1.17E+061.17E+06 0.3920.392 0.9490.949 1.84E+071.84E+07
0.3720.372 0.9380.938 1.72E+061.72E+06 0.3920.392 1.3411.341 1.83E+071.83E+07
0.3720.372 1.311.31 1.71E+061.71E+06 0.3930.393 0.950.95 1.84E+071.84E+07
0.3730.373 0.9390.939 3.74E+063.74E+06 0.3930.393 1.3431.343 1.83E+071.83E+07
0.3730.373 1.3121.312 3.73E+063.73E+06 0.3940.394 0.9510.951 1.80E+071.80E+07
0.3740.374 0.940.94 5.34E+065.34E+06 0.3940.394 1.3451.345 1.79E+071.79E+07
0.3740.374 1.3141.314 5.31E+065.31E+06 0.3950.395 0.9520.952 1.75E+071.75E+07
0.3750.375 0.9410.941 7.56E+067.56E+06 0.3950.395 1.3471.347 1.74E+071.74E+07
0.3750.375 1.3161.316 7.52E+067.52E+06 0.3960.396 0.9530.953 1.63E+071.63E+07
0.3760.376 0.9420.942 8.71E+068.71E+06 0.3960.396 1.3491.349 1.62E+071.62E+07
0.3760.376 1.3181.318 8.67E+068.67E+06 0.3970.397 0.9540.954 1.53E+071.53E+07
0.3770.377 0.9430.943 1.08E+071.08E+07 0.3970.397 1.3511.351 1.52E+071.52E+07
0.3770.377 1.321.32 1.07E+071.07E+07 0.3980.398 0.9550.955 1.34E+071.34E+07
0.3780.378 0.9440.944 1.58E+071.58E+07 0.3980.398 1.3531.353 1.33E+071.33E+07
0.3780.378 1.3221.322 1.57E+071.57E+07 0.3990.399 0.9560.956 1.17E+071.17E+07
0.3790.379 0.9450.945 1.69E+071.69E+07 0.3990.399 1.3551.355 1.17E+071.17E+07
0.3790.379 1.3241.324 1.68E+071.68E+07 0.40.4 0.9570.957 8.63E+068.63E+06
0.380.38 0.9460.946 1.78E+071.78E+07 0.40.4 1.3571.357 8.57E+068.57E+06
0.380.38 1.3261.326 1.77E+071.77E+07 0.4010.401 0.9580.958 6.79E+066.79E+06
0.3810.381 0.9470.947 1.88E+071.88E+07 0.4010.401 1.3591.359 6.74E+066.74E+06
0.3810.381 1.3281.328 1.87E+071.87E+07 0.4020.402 0.9590.959 1.77E+061.77E+06
0.3820.382 0.9480.948 1.93E+071.93E+07 0.4020.402 1.3611.361 1.76E+061.76E+06
0.3820.382 1.331.33 1.92E+071.92E+07 0.4030.403 0.960.96 831069831069
0.3830.383 0.9480.948 1.92E+071.92E+07 0.4030.403 1.3631.363 826993826993
0.3830.383 1.3311.331 1.91E+071.91E+07 0.4040.404 0.9610.961 412408412408
0.3840.384 0.9480.948 1.92E+071.92E+07 0.4040.404 1.3651.365 411139411139
0.3840.384 1.3321.332 1.91E+071.91E+07 0.4050.405 0.9620.962 351299351299
0.3850.385 0.9480.948 1.88E+071.88E+07 0.4050.405 1.3671.367 350311350311
0.3850.385 1.3331.333 1.87E+071.87E+07 0.4060.406 0.9630.963 352988352988
0.3860.386 0.9480.948 1.78E+071.78E+07 0.4060.406 1.3691.369 352002352002
十一、步骤十中的结果显示,P为0.382时,混合模型取极大值,此时的Q为0.948,据此可计算获得癌症样本纯度为0.80,癌症细胞染色体倍性为2.14。XI. The results in step 10 show that when P is 0.382, the mixed model takes the maximum value, and the Q at this time is 0.948. According to this, the purity of the cancer sample can be calculated to be 0.80, and the chromosome ploidy of the cancer cell is 2.14.

Claims (13)

  1. 一种用于计算癌症样本中癌症细胞纯度和染色体倍性的方法,所述方法包括以下步骤:A method for calculating cancer cell purity and ploidy ploidy in a cancer sample, the method comprising the steps of:
    步骤A:Step A:
    获取配对的癌症组织样本和正常组织样本的全基因组测序数据,并将测序数据比对到参考基因组;Obtaining whole genome sequencing data of paired cancer tissue samples and normal tissue samples, and comparing the sequencing data to a reference genome;
    步骤B:Step B:
    从步骤A得到的比对结果文件中,提取read位置和长度信息,HGSNV位点和覆盖该位点的read数量信息,计算所有HGSNV的MAF,其中,计算公式如(1.1)所示:From the comparison result file obtained in step A, the read position and length information, the HGSNV site and the read quantity information covering the site are extracted, and the MAF of all HGSNVs is calculated, wherein the calculation formula is as shown in (1.1):
    Figure PCTCN2018078908-appb-100001
    Figure PCTCN2018078908-appb-100001
    公式(1.1)中,n r为包含与参考基因组相同等位基因的read数量,n a为包含另一种等位基因的read的数量,n t表示覆盖该HGSNV位点的总read数量,C为该HGSNV的MAF值; In formula (1.1), n r is the number of reads containing the same allele as the reference genome, n a is the number of reads containing another allele, and n t is the total number of reads covering the HGSNV site, C The MAF value of the HGSNV;
    步骤C:Step C:
    根据步骤B得到的read位置和长度信息,以window为单位统计各window内包含的read数量,使用基因组GC含量校正所有window内read数量;According to the read position and length information obtained in step B, the number of reads contained in each window is counted in units of window, and the number of reads in all windows is corrected by using the genomic GC content;
    步骤D:Step D:
    使用步骤C校正后的read数量,使用公式(1)计算每一个window的TRE,然后运用TRE,通过BIC-seq软件对基因组进行片段化,获得以拷贝数划分的基因组片段:Using the number of reads corrected in step C, the TRE of each window is calculated using equation (1), and then the genome is fragmented by BIC-seq software using TRE to obtain a genomic fragment divided by copy number:
    Figure PCTCN2018078908-appb-100002
    Figure PCTCN2018078908-appb-100002
    公式(1)中,
    Figure PCTCN2018078908-appb-100003
    Figure PCTCN2018078908-appb-100004
    分别表示在癌症样本中覆盖片段s的read数量和在正常样本中覆盖片段s的read数量,N t表示癌症样本总read数量,N n表示相应正常样本总read数量,e s为TRE值;
    In formula (1),
    Figure PCTCN2018078908-appb-100003
    with
    Figure PCTCN2018078908-appb-100004
    Respectively indicates the number of reads covering the segment s in the cancer sample and the number of reads covering the segment s in the normal sample, N t represents the total number of reads of the cancer sample, N n represents the total number of reads of the corresponding normal sample, and e s is the TRE value;
    步骤E:Step E:
    以步骤D中BIC-seq处理后的基因组片段为单位,统计片段内所有window的TRE的均值、方差和该片段内window数量,根据均值和方差对基因组每个片段的window数量进行平滑化处理,使TRE的分布更均匀,然后将平滑化处理后所有片段的window分布汇总,得到基因组上window随TRE变化的分布结果;同时以片段为单位,计算片段中所有HGSNV的MAF的均值和方差;Taking the genomic fragments processed by BIC-seq in step D as a unit, the mean value, variance and the number of windows in the window of all the windows in the segment are counted, and the number of windows of each segment of the genome is smoothed according to the mean and variance. The distribution of TRE is more uniform, and then the window distribution of all fragments after smoothing is summarized to obtain the distribution result of window change with TRE on the genome; and the mean and variance of MAF of all HGSNVs in the fragment are calculated in units of fragments;
    步骤F:Step F:
    使用如公式(12)、(13)所示的类自回归模型,计算相邻拷贝数片段内TRE的差值即P,其中,遍历一定范围的P,计算Y(P),在Y(P)的分布中,选择第二高峰内Y(P)的最大值对应的P作为P的计算结果:Using the autoregressive model as shown in equations (12) and (13), calculate the difference of TRE in the adjacent copy number segment, that is, P, which traverses a certain range of P, and calculates Y(P) in Y(P). In the distribution of the ), the P corresponding to the maximum value of Y(P) in the second peak is selected as the calculation result of P:
    Figure PCTCN2018078908-appb-100005
    Figure PCTCN2018078908-appb-100005
    Figure PCTCN2018078908-appb-100006
    Figure PCTCN2018078908-appb-100006
    公式(12)和(13)中,X t表示0到M t之间的TRE值;t表示扩大了1000倍的TRE值;M t表示TRE的最大值;变量P表示两个TRE位点的间隔;C(X t)表示在TRE为X t的位点,对应的window数量;C(X t+1000×P)表示在TRE为X t+1000×P的位点,对应的window数量;Y(P)表示在变量P下,类自回归模型的函数值; In equations (12) and (13), X t represents the TRE value between 0 and M t ; t represents a TRE value that is expanded by 1000 times; M t represents the maximum value of TRE; and variable P represents the two TRE sites. Interval; C(X t ) represents the number of windows corresponding to the position where TRE is X t ; C(X t+1000×P ) represents the number of windows corresponding to the position where TRE is X t+1000×P ; Y(P) represents the function value of the auto-regressive model under the variable P;
    步骤G:Step G:
    根据步骤F得到的P,计算TRE分布中第一个实际观测peak的TRE均值,然后计算在第一个实际peak之前最多可能存在理论peak的数量N,最后当第一个实际peak之前存在n个理论peak时,计算Q的值,以Q n表示,其中步骤G包括: According to the P obtained in step F, calculate the TRE mean of the first actual observed peak in the TRE distribution, and then calculate the maximum number of theoretical peaks N before the first actual peak, and finally there are n before the first actual peak. In the theoretical peak, the value of Q is calculated, denoted by Q n , where step G includes:
    G1:G1:
    根据步骤F计算的P,使用公式(13.1),选取使公式(13.1)取最大值的X f作为第一个实际观测peak的TRE均值: According to the P calculated in step F, using formula (13.1), select X f which takes the maximum value of formula (13.1) as the TRE mean of the first actual observed peak:
    Figure PCTCN2018078908-appb-100007
    Figure PCTCN2018078908-appb-100007
    公式(13.1)中,i表示第i个peak,C(X f+P×i)表示在TRE为X f+P×i的位点,对应的window数量,n表示M t以内peak的最大数量,M t表示TRE的最大值; In formula (13.1), i represents the ith peak, C(X f + P × i) represents the position where TRE is X f + P × i, the corresponding number of windows, and n represents the maximum number of peaks within M t , M t represents the maximum value of TRE;
    G2:G2:
    使用公式(13.2),根据步骤F计算的P和步骤G1计算的X f,计算在X f之前最多可能存在的peak数量N: Using formula (13.2), calculate the maximum number of peaks N that may exist before X f based on P calculated in step F and X f calculated in step G1:
    Figure PCTCN2018078908-appb-100008
    Figure PCTCN2018078908-appb-100008
    公式(13.2)中,X f表示第一个peak的均值,P表示相邻拷贝数片段对应的peak之间的间距,floor表示向下取整数; In formula (13.2), X f represents the mean of the first peak, P represents the spacing between peaks corresponding to adjacent copy number segments, and floor represents an integer below;
    G3:G3:
    利用步骤G2计算的N值,当n取0到N之间的整数时,使用公式(13.3)计算Q n的值: Using the value of N calculated in step G2, when n takes an integer between 0 and N, the value of Q n is calculated using equation (13.3):
    Q n=X f-n×P+2×P=X f+(2-n)×P,n∈[0,N]  (13.3) Q n =X f -n×P+2×P=X f +(2-n)×P,n∈[0,N] (13.3)
    公式(13.3)中,n表示X f之前peak的数量,取值范围是0到N之间的整数,P表示相邻拷贝数片段对应的peak之间的间距,X f表示第一个实际观测peak的TRE均值,Q n表示在X f之前理论上存在n个peak时的Q值; In formula (13.3), n represents the number of peaks before X f , the value ranges from 0 to N, P represents the spacing between peaks of adjacent copy number segments, and X f represents the first actual observation. The TRE mean of peak, Q n represents the Q value when there are theoretically n peaks before X f ;
    步骤H:Step H:
    使用步骤F计算的P与步骤G计算的Q n,使用公式(10)、(11)计算癌症样本纯度γ和染色体倍性k: Using the P calculated in step F and the Q n calculated in step G, the cancer sample purity γ and chromosome ploidy k are calculated using equations (10), (11):
    Figure PCTCN2018078908-appb-100009
    Figure PCTCN2018078908-appb-100009
    Figure PCTCN2018078908-appb-100010
    Figure PCTCN2018078908-appb-100010
    公式(10)、(11)中,γ表示样本纯度,k表示染色体倍性,由此对(P,Q N)得到对应的(γ,k); In the formulas (10) and (11), γ represents the purity of the sample, and k represents the ploidy of the chromosome, whereby the corresponding (γ, k) is obtained for (P, Q N );
    步骤I:Step I:
    当n取[0,N]之间的某个整数值时,使用公式(13.4)计算第i个peak的TRE均值:When n takes an integer value between [0, N], the formula (13.4) is used to calculate the TRE mean of the i-th peak:
    T i=X f-n×P+i×P=X f+(i-n)×P,n∈[0,N]  (13.4) T i =X f -n×P+i×P=X f +(in)×P,n∈[0,N] (13.4)
    公式(13.4)中,n表示X f之前peak的数量,取值范围是0到N之间的整数,P表示相邻拷贝数片段对应的peak之间的间距,X f表示第一个实际观测peak的TRE均值,T i表示第i个peak的TRE均值, In formula (13.4), n represents the number of peaks before X f , the value range is an integer between 0 and N, P represents the spacing between peaks corresponding to adjacent copy number segments, and X f represents the first actual observation. The TRE mean of peak, T i represents the TRE mean of the ith peak,
    对于落在T i附近的片段,认为该片段具有拷贝数i;对于没有落在T i附近的片段,将其归类为亚克隆片段,在后续分析中剔除所有亚克隆片段;然后根据步骤H计算的癌症样本纯度γ和peak对应的拷贝数,计算peak的MAF的期望f b,不同peak的MAF期望不同,对基因组上的所有peak,最终得到MAF期望的集合{f b};同时计算各个peak的TRE均值和方差或标准差; For a fragment that falls near T i , the fragment is considered to have a copy number i; for a fragment that does not fall near T i , it is classified as a subcloned fragment, and all subcloned fragments are eliminated in subsequent analysis; Calculate the copy number of the cancer sample purity γ and peak, calculate the expected f b of the MAF of peak, and the MAF expectation of different peaks. For all peaks on the genome, the desired set of {F b } of the MAF is finally obtained; Peak TRE mean and variance or standard deviation;
    步骤J:Step J:
    根据步骤F计算的P和步骤I计算的{f b}构建如公式(19)所示的用“贝叶斯信息准则”校正后的混合高斯分布模型,然后对模型极大似然估计;其中,步骤J包括如下几步: According to P calculated in step F and {f b } calculated in step I, a mixed Gaussian distribution model corrected by "Bayesian information criterion" as shown in formula (19) is constructed, and then the maximum likelihood estimation of the model is performed; Step J includes the following steps:
    J1:J1:
    以步骤F计算的P构建如公式(17)所示的高斯分布模型:Construct a Gaussian distribution model as shown in equation (17) with P calculated in step F:
    Figure PCTCN2018078908-appb-100011
    Figure PCTCN2018078908-appb-100011
    公式(17)中,L(e s;γ,k)表示基因组片段TRE的似然函数,N表示基因组上的所有window的数量,I表示基因组中所有片段的最大的拷贝数,σ i表示拷贝数为i的所有片段的TRE的标准差由步骤I得到,e s为第s个window的TRE观测值,S i表示第i个peak的TRE均值即步骤I中的T i,p i表示第s个window的拷贝数为i的权重,对所有的i,p i均取 值为1; In equation (17), L(e s ; γ, k) represents the likelihood function of the genomic fragment TRE, N represents the number of all windows on the genome, I represents the maximum copy number of all fragments in the genome, and σ i represents a copy The standard deviation of the TRE of all segments of number i is obtained by step I, e s is the TRE observation value of the sth window, and S i represents the TRE mean value of the i th peak, that is, T i in step I, p i represents the first The copy number of s windows is the weight of i, and the value of all i, p i is 1;
    J2:J2:
    以步骤I计算的f b构建如公式(18)所示的高斯分布模型: Construct a Gaussian distribution model as shown in equation (18) with f b calculated in step I:
    Figure PCTCN2018078908-appb-100012
    Figure PCTCN2018078908-appb-100012
    公式(18)中,L(f s;γ,k)表示HGSNV的似然函数,M表示基因组中所有HGSNV数量,S表示第S个HGSNV,I表示基因组中所有片段的最大的拷贝数;F i,j表示拷贝数为i,主要等位基因的拷贝数为j的片段内HGSNV的MAF期望值,由步骤I得到;f s表示该片段内所有HGSNV的MAF的观测值均值,由步骤E得到;σ i,j表示该片段内所有HGSNV的MAF观测值的标准差,由步骤E得到;p i,j表示在主要等位基因的拷贝数为j时,高斯分布的权重,对所有的i和j,p i,j取值均为1,p i表示第S个HGSNV所在片段的拷贝数为i的权重,对所有的i,p i取值均为1; In equation (18), L(f s ; γ, k) represents the likelihood function of HGSNV, M represents the number of all HGSNVs in the genome, S represents the Sth HGSNV, and I represents the maximum copy number of all fragments in the genome; i, j represents the expected value of the MAF of the HGSNV in the fragment whose copy number is i, the copy number of the major allele is j, obtained from step I; f s represents the mean value of the observed value of the MAF of all HGSNVs in the fragment, obtained from step E ;σ i,j denotes the standard deviation of the MAF observations of all HGSNVs in the segment, obtained from step E; p i,j denotes the weight of the Gaussian distribution when the copy number of the primary allele is j, for all i And j, p i, j have a value of 1, p i represents the weight of the copy of the segment where the S H HSNSNV is i, and the value of all i, p i is 1;
    J3:J3:
    将(17)与(18)相加得到混合高斯模型,然后对混合模型进行BIC(Bayesian Information Criterion)校正得到最终混合模型如公式(19):Adding (17) and (18) to obtain a mixed Gaussian model, and then performing BIC (Bayesian Information Criterion) correction on the mixed model to obtain the final mixed model as Equation (19):
    BIC(e s,f s;γ,k)=-2×log L(f s;γ,k)-2×log L(e s;γ,k)+I×log(N)+J×log(M) (19) BIC(e s ,f s ;γ,k)=-2×log L(f s ;γ,k)-2×log L(e s ;γ,k)+I×log(N)+J×log (M) (19)
    公式(19)中,BIC(e s,f s;γ,k)表示混合模型的似然函数,I表示基因组中所有片段的最大的拷贝数,J是公式(18)中j的取值个数,N是基因组中window的数量,M是基因组中HGSNV的个数, In the formula (19), BIC(e s , f s ; γ, k) represents the likelihood function of the mixed model, I represents the maximum copy number of all the fragments in the genome, and J is the value of j in the formula (18). Number, N is the number of windows in the genome, and M is the number of HGSNVs in the genome.
    对[0,N]范围内的每一个整数值n,通过步骤G得到Q n,或者通过步骤I得到所有peak的MAF期望的集合{f b},由一对(P,{f b})构建一个公式(19)所示的模型; For each integer value n in the range [0, N], Q n is obtained by step G, or a set of MAF expected {f b } of all peaks is obtained by step I, by a pair (P, {f b }) Construct a model as shown in equation (19);
    步骤K:Step K:
    以0.001为分辨率,对[P-m,P+m]区间的所有P值,重复步骤G~J,得到一系列不同的(P,Q n)与对应的似然函数值,取最大的似然函数值对应的(P,Q n)作为最合适的P和Q值,m是0到0.5之间的一个值; Repeat steps G to J for all P values in the [Pm, P+m] interval with a resolution of 0.001, and obtain a series of different (P, Q n ) and corresponding likelihood function values, taking the maximum likelihood. The function value corresponds to (P, Q n ) as the most suitable P and Q values, and m is a value between 0 and 0.5;
    步骤L:Step L:
    查询步骤H的结果,找到在步骤K得到的(P,Q)下,对应的癌症样本纯度和染色体倍性。The results of step H are queried, and the corresponding cancer sample purity and chromosome ploidy are found under (P, Q) obtained in step K.
  2. 一种用于计算癌症样本中癌症细胞纯度和染色体倍性的装置,其包括处理器,所述处理器用于运行程序,所述程序运行时执行以下步骤:An apparatus for calculating cancer cell purity and chromosome ploidy in a cancer sample, comprising a processor for running a program, the program running to perform the following steps:
    步骤A:Step A:
    获取配对的癌症组织样本和正常组织样本的全基因组测序数据,并将测序数据比对到参考基因组;Obtaining whole genome sequencing data of paired cancer tissue samples and normal tissue samples, and comparing the sequencing data to a reference genome;
    步骤B:Step B:
    从步骤A得到的比对结果文件中,提取read位置和长度信息,HGSNV位点和覆盖该位点的read数量信息,计算所有HGSNV的MAF,其中,计算公式如(1.1)所示:From the comparison result file obtained in step A, the read position and length information, the HGSNV site and the read quantity information covering the site are extracted, and the MAF of all HGSNVs is calculated, wherein the calculation formula is as shown in (1.1):
    Figure PCTCN2018078908-appb-100013
    Figure PCTCN2018078908-appb-100013
    公式(1.1)中,n r为包含与参考基因组相同等位基因的read数量,n a为包含另一种等位基因的read的数量,n t表示覆盖该HGSNV位点的总read数量,C为该HGSNV的MAF值; In formula (1.1), n r is the number of reads containing the same allele as the reference genome, n a is the number of reads containing another allele, and n t is the total number of reads covering the HGSNV site, C The MAF value of the HGSNV;
    步骤C:Step C:
    根据步骤B得到的read位置和长度信息,以window为单位统计各window内包含的read数量,使用基因组GC含量校正所有window内read数量;According to the read position and length information obtained in step B, the number of reads contained in each window is counted in units of window, and the number of reads in all windows is corrected by using the genomic GC content;
    步骤D:Step D:
    使用步骤C校正后的read数量,使用公式(1)计算每一个window的TRE,然后运用TRE,通过BIC-seq软件对基因组进行片段化,获得以拷贝数划分的基因组片段:Using the number of reads corrected in step C, the TRE of each window is calculated using equation (1), and then the genome is fragmented by BIC-seq software using TRE to obtain a genomic fragment divided by copy number:
    Figure PCTCN2018078908-appb-100014
    Figure PCTCN2018078908-appb-100014
    公式(1)中,
    Figure PCTCN2018078908-appb-100015
    Figure PCTCN2018078908-appb-100016
    分别表示在癌症样本中覆盖片段s的read数量和在正常样本中覆盖片段s的read数量,N t表示癌症样本总read数量,N n表示相应正常样本总read数量,e s为TRE值;
    In formula (1),
    Figure PCTCN2018078908-appb-100015
    with
    Figure PCTCN2018078908-appb-100016
    Respectively indicates the number of reads covering the segment s in the cancer sample and the number of reads covering the segment s in the normal sample, N t represents the total number of reads of the cancer sample, N n represents the total number of reads of the corresponding normal sample, and e s is the TRE value;
    步骤E:Step E:
    以步骤D中BIC-seq处理后的基因组片段为单位,统计片段内所有window的TRE的均值、方差和该片段内window数量,根据均值和方差对基因组每个片段的window数量进行平滑化处理,使TRE的分布更均匀,然后将平滑化处理后所有片段的window分布汇总,得到基因组上window随TRE变化的分布结果;同时以片段为单位,计算片段中所有HGSNV的MAF的均值和方差;Taking the genomic fragments processed by BIC-seq in step D as a unit, the mean value, variance and the number of windows in the window of all the windows in the segment are counted, and the number of windows of each segment of the genome is smoothed according to the mean and variance. The distribution of TRE is more uniform, and then the window distribution of all fragments after smoothing is summarized to obtain the distribution result of window change with TRE on the genome; and the mean and variance of MAF of all HGSNVs in the fragment are calculated in units of fragments;
    步骤F:Step F:
    使用如公式(12)、(13)所示的类自回归模型,计算相邻拷贝数片段内TRE的差值即P,其中,遍历一定范围的P,计算Y(P),在Y(P)的分布中,选择第二高峰内Y(P)的最大值对应的P作为P的计算结果:Using the autoregressive model as shown in equations (12) and (13), calculate the difference of TRE in the adjacent copy number segment, that is, P, which traverses a certain range of P, and calculates Y(P) in Y(P). In the distribution of the ), the P corresponding to the maximum value of Y(P) in the second peak is selected as the calculation result of P:
    Figure PCTCN2018078908-appb-100017
    Figure PCTCN2018078908-appb-100017
    Figure PCTCN2018078908-appb-100018
    Figure PCTCN2018078908-appb-100018
    公式(12)和(13)中,X t表示0到M t之间的TRE值;t表示扩大了1000倍的TRE值;M t表示TRE的最大值;变量P表示两个TRE位点的间隔;C(X t)表示在TRE为X t的位点,对应的window数量;C(X t+1000×P)表示在TRE为X t+1000×P的位点,对应的window数量;Y(P)表示在变量P下,类自回归模型的函数值; In equations (12) and (13), X t represents the TRE value between 0 and M t ; t represents a TRE value that is expanded by 1000 times; M t represents the maximum value of TRE; and variable P represents the two TRE sites. Interval; C(X t ) represents the number of windows corresponding to the position where TRE is X t ; C(X t+1000×P ) represents the number of windows corresponding to the position where TRE is X t+1000×P ; Y(P) represents the function value of the auto-regressive model under the variable P;
    步骤G:Step G:
    根据步骤F得到的P,计算TRE分布中第一个实际观测peak的TRE均值,然后计算在第一个实际peak之前最多可能存在理论peak的数量N,最后当第一个实际peak之前存在n个理论peak时,计算Q的值,以Q n表示,其中步骤G包括: According to the P obtained in step F, calculate the TRE mean of the first actual observed peak in the TRE distribution, and then calculate the maximum number of theoretical peaks N before the first actual peak, and finally there are n before the first actual peak. In the theoretical peak, the value of Q is calculated, denoted by Q n , where step G includes:
    G1:G1:
    根据步骤F计算的P,使用公式(13.1),选取使公式(13.1)取最大值的X f作为第一个实际观测peak的TRE均值: According to the P calculated in step F, using formula (13.1), select X f which takes the maximum value of formula (13.1) as the TRE mean of the first actual observed peak:
    Figure PCTCN2018078908-appb-100019
    Figure PCTCN2018078908-appb-100019
    公式(13.1)中,i表示第i个peak,C(X f+P×i)表示在TRE为X f+P×i的位点,对应的window数量,n表示M t以内peak的最大数量,M t表示TRE的最大值; In formula (13.1), i represents the ith peak, C(X f + P × i) represents the position where TRE is X f + P × i, the corresponding number of windows, and n represents the maximum number of peaks within M t , M t represents the maximum value of TRE;
    G2:G2:
    使用公式(13.2),根据步骤F计算的P和步骤G1计算的X f,计算在X f之前最多可能存在的peak数量N: Using formula (13.2), calculate the maximum number of peaks N that may exist before X f based on P calculated in step F and X f calculated in step G1:
    Figure PCTCN2018078908-appb-100020
    Figure PCTCN2018078908-appb-100020
    公式(13.2)中,X f表示第一个peak的均值,P表示相邻拷贝数片段对应的peak之间的间距,floor表示向下取整数; In formula (13.2), X f represents the mean of the first peak, P represents the spacing between peaks corresponding to adjacent copy number segments, and floor represents an integer below;
    G3:G3:
    利用步骤G2计算的N值,当n取0到N之间的整数时,使用公式(13.3)计算Q n的值: Using the value of N calculated in step G2, when n takes an integer between 0 and N, the value of Q n is calculated using equation (13.3):
    Q n=X f-n×P+2×P=X f+(2-n)×P,n∈[0,N]  (13.3)公式(13.3)中,n表示X f之前peak的数量,取值范围是0到N之间的整数,P表示相邻拷贝数片段对应的peak之间的间距,X f表示第一个实际观测peak的TRE均值,Q n表示在X f之前理论上存在n个peak时的Q值; Q n =X f -n×P+2×P=X f +(2-n)×P,n∈[0,N] (13.3) In the formula (13.3), n represents the number of peaks before X f , The value range is an integer between 0 and N, P represents the spacing between peaks corresponding to adjacent copy number segments, X f represents the TRE mean of the first actually observed peak, and Q n represents the theoretical existence before X f Q value at n peaks;
    步骤H:Step H:
    使用步骤F计算的P与步骤G计算的Q n,使用公式(10)、(11)计算癌症样本纯度γ和染色体倍性k: Using the P calculated in step F and the Q n calculated in step G, the cancer sample purity γ and chromosome ploidy k are calculated using equations (10), (11):
    Figure PCTCN2018078908-appb-100021
    Figure PCTCN2018078908-appb-100021
    Figure PCTCN2018078908-appb-100022
    Figure PCTCN2018078908-appb-100022
    公式(10)、(11)中,γ表示样本纯度,k表示染色体倍性,由此对(P,Q N)得到对应的(γ,k); In the formulas (10) and (11), γ represents the purity of the sample, and k represents the ploidy of the chromosome, whereby the corresponding (γ, k) is obtained for (P, Q N );
    步骤I:Step I:
    当n取[0,N]之间的某个整数值时,使用公式(13.4)计算第i个peak的TRE均值:When n takes an integer value between [0, N], the formula (13.4) is used to calculate the TRE mean of the i-th peak:
    T i=X f-n×P+i×P=X f+(i-n)×P,n∈[0,N]  (13.4) T i =X f -n×P+i×P=X f +(in)×P,n∈[0,N] (13.4)
    公式(13.4)中,n表示X f之前peak的数量,取值范围是0到N之间的整数,P表示相邻拷贝数片段对应的peak之间的间距,X f表示第一个实际观测peak的TRE均值,T i表示第i个peak的TRE均值, In formula (13.4), n represents the number of peaks before X f , the value range is an integer between 0 and N, P represents the spacing between peaks corresponding to adjacent copy number segments, and X f represents the first actual observation. The TRE mean of peak, T i represents the TRE mean of the ith peak,
    对于落在T i附近的片段,认为该片段具有拷贝数i;对于没有落在T i附近的片段,将其归类为亚克隆片段,在后续分析中剔除所有亚克隆片段;然后根据步骤H计算的癌症样本纯度γ和peak对应的拷贝数,计算peak的MAF的期望f b,不同peak的MAF期望不同,对基因组上的所有peak,最终得到MAF期望的集合{f b};同时计算各个peak的TRE均值和方差或标准差; For a fragment that falls near T i , the fragment is considered to have a copy number i; for a fragment that does not fall near T i , it is classified as a subcloned fragment, and all subcloned fragments are eliminated in subsequent analysis; Calculate the copy number of the cancer sample purity γ and peak, calculate the expected f b of the MAF of peak, and the MAF expectation of different peaks. For all peaks on the genome, the desired set of {F b } of the MAF is finally obtained; Peak TRE mean and variance or standard deviation;
    步骤J:Step J:
    根据步骤F计算的P和步骤I计算的{f b}构建如公式(19)所示的用“贝叶斯信息准则”校正后的混合高斯分布模型,然后对模型极大似然估计;其中,步骤J包括如下几步: According to P calculated in step F and {f b } calculated in step I, a mixed Gaussian distribution model corrected by "Bayesian information criterion" as shown in formula (19) is constructed, and then the maximum likelihood estimation of the model is performed; Step J includes the following steps:
    J1:J1:
    以步骤F计算的P构建如公式(17)所示的高斯分布模型:Construct a Gaussian distribution model as shown in equation (17) with P calculated in step F:
    Figure PCTCN2018078908-appb-100023
    Figure PCTCN2018078908-appb-100023
    公式(17)中,L(e s;γ,k)表示基因组片段TRE的似然函数,N表示基因组上的所有window的数量,I表示基因组中所有片段的最大的拷贝数,σ i表示拷贝数为i的所有片段的TRE的标准差由步骤I得到,e s为第s个window的TRE观测值,S i表示第i个peak的TRE均值即步骤I中的T i,p i表示第s个window的拷贝数为i的权重,对所有的i,p i均取值为1; In equation (17), L(e s ; γ, k) represents the likelihood function of the genomic fragment TRE, N represents the number of all windows on the genome, I represents the maximum copy number of all fragments in the genome, and σ i represents a copy The standard deviation of the TRE of all segments of number i is obtained by step I, e s is the TRE observation value of the sth window, and S i represents the TRE mean value of the i th peak, that is, T i in step I, p i represents the first The copy number of s windows is the weight of i, and the value of all i, p i is 1;
    J2:J2:
    以步骤I计算的f b构建如公式(18)所示的高斯分布模型: Construct a Gaussian distribution model as shown in equation (18) with f b calculated in step I:
    Figure PCTCN2018078908-appb-100024
    Figure PCTCN2018078908-appb-100024
    公式(18)中,L(f s;γ,k)表示HGSNV的似然函数,M表示基因组中所有HGSNV数量,S表示第S个HGSNV,I表示基因组中所有片段的最大的拷贝数;F i,j表示拷贝数 为i,主要等位基因的拷贝数为j的片段内HGSNV的MAF期望值,由步骤I得到;f s表示该片段内所有HGSNV的MAF的观测值均值,由步骤E得到,σ i,j表示该片段内所有HGSNV的MAF观测值的标准差,由步骤E得到;p i,j表示在主要等位基因的拷贝数为j时,高斯分布的权重,对所有的i和j,p i,j取值均为1,p i表示第S个HGSNV所在片段的拷贝数为i的权重,对所有的i,p i取值均为1; In equation (18), L(f s ; γ, k) represents the likelihood function of HGSNV, M represents the number of all HGSNVs in the genome, S represents the Sth HGSNV, and I represents the maximum copy number of all fragments in the genome; i, j represents the expected value of the MAF of the HGSNV in the fragment whose copy number is i, the copy number of the major allele is j, obtained from step I; f s represents the mean value of the observed value of the MAF of all HGSNVs in the fragment, obtained from step E , σ i,j represents the standard deviation of the MAF observations of all HGSNVs in the segment, obtained from step E; p i,j represents the weight of the Gaussian distribution when the copy number of the primary allele is j, for all i And j, p i, j have a value of 1, p i represents the weight of the copy of the segment where the S H HSNSNV is i, and the value of all i, p i is 1;
    J3:J3:
    将(17)与(18)相加得到混合高斯模型,然后对混合模型进行BIC校正得到最终混合模型如公式(19):Adding (17) and (18) to obtain a mixed Gaussian model, and then performing BIC correction on the mixed model to obtain the final mixed model as Equation (19):
    BIC(e s,f s;γ,k)=-2×log L(f s;γ,k)-2×log L(e s;γ,k)+I×log(N)+J×log(M) (19) BIC(e s ,f s ;γ,k)=-2×log L(f s ;γ,k)-2×log L(e s ;γ,k)+I×log(N)+J×log (M) (19)
    公式(19)中,BIC(e s,f s;γ,k)表示混合模型的似然函数,I表示基因组中所有片段的最大的拷贝数,J是公式(18)中j的取值个数,N是基因组中window的数量,M是基因组中HGSNV的个数, In the formula (19), BIC(e s , f s ; γ, k) represents the likelihood function of the mixed model, I represents the maximum copy number of all the fragments in the genome, and J is the value of j in the formula (18). Number, N is the number of windows in the genome, and M is the number of HGSNVs in the genome.
    对[0,N]范围内的每一个整数值n,通过步骤G得到Q n,或者通过步骤I得到所有peak的MAF期望的集合{f b},由一对(P,{f b})构建一个公式(19)所示的模型; For each integer value n in the range [0, N], Q n is obtained by step G, or a set of MAF expected {f b } of all peaks is obtained by step I, by a pair (P, {f b }) Construct a model as shown in equation (19);
    步骤K:Step K:
    以0.001为分辨率,对[P-m,P+m]区间的所有P值,重复步骤G~J,得到一系列不同的(P,Q n)与对应的似然函数值,取最大的似然函数值对应的(P,Q n)作为最合适的P和Q值,m是0到0.5之间的一个值; Repeat steps G to J for all P values in the [Pm, P+m] interval with a resolution of 0.001, and obtain a series of different (P, Q n ) and corresponding likelihood function values, taking the maximum likelihood. The function value corresponds to (P, Q n ) as the most suitable P and Q values, and m is a value between 0 and 0.5;
    步骤L:Step L:
    查询步骤H的结果,找到在步骤K得到的(P,Q)下,对应的癌症样本纯度和染色体倍性。The results of step H are queried, and the corresponding cancer sample purity and chromosome ploidy are found under (P, Q) obtained in step K.
  3. 根据权利要求1所述的方法或者权利要求2所述的装置,其中,所述步骤A中,采用1000基因组计划第三期(phase 3)项目使用的参考基因组hs37d5(ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequ ence/hs37d5.fa.gz)作为所述参考基因组;和/或,比对软件使用Burrows-Wheeler Aligner(BWA),比对方法使用其中的bwa mem,最终获得癌症和正常样本的比对结果bam格式文件。The method according to claim 1 or the apparatus according to claim 2, wherein in the step A, the reference genome hs37d5 (ftp://ftp.1000genomes) used in the 1000th genome phase 3 (phase 3) project is used. .ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz) as the reference genome; and/or, the alignment software uses Burrows-Wheeler Aligner (BWA), the alignment method is used Among them, bwa mem, the final result of the comparison of cancer and normal samples is the bam format file.
  4. 根据权利要求1所述的方法或者权利要求2所述的装置,其中,所述步骤B中,采用samtools软件提取read的位置和长度信息,HGSNV的位点和覆盖该位点的read数量信息,其中,使用samtools view命令提取read信息时,使用参数-q 31过滤掉序列比对质量(MAPQ)低于31的序列,其中q表示过滤掉测序质量差的序列,同时使用参数-f 0x2-F 0x18过滤掉未能正确匹配的read,其中f表示提取符合一定要求的序列,F表示过滤符合一定要求的序列,使用samtools mpileup命令提取HGSNV信息时,使用参数-q  20过滤掉序列比对质量低于20的序列,并使用参数-Q 20过滤掉碱基质量小于20的序列,其中Q表示过滤掉碱基质量差的序列;选取等位基因频率时,使用samtools mpileup的-l参数;使用该参数需要提前准备一个包含SNP位点信息的bed格式文件。The method according to claim 1 or the device according to claim 2, wherein in the step B, the position and length information of the read, the location of the HGSNV and the information on the number of reads covering the location are extracted by using samtools software. When the read information is extracted by using the samtools view command, the sequence with the sequence alignment quality (MAPQ) lower than 31 is filtered out by using the parameter -q 31, where q indicates that the sequence with poor sequencing quality is filtered out, and the parameter -f 0x2-F is used. 0x18 filters out the read that fails to match correctly, where f indicates that the sequence meets the requirements, F indicates that the filtering meets the required sequence, and when the samtools mpileup command is used to extract the HGSNV information, the parameter -q 20 is used to filter out the sequence alignment. Sequence of 20, and using the parameter -Q 20 to filter out sequences with base quality less than 20, where Q indicates that the sequence with poor base quality is filtered out; when selecting the allele frequency, use the -l parameter of samtools mpileup; The parameter needs to prepare a bed format file containing SNP location information in advance.
  5. 根据权利要求1所述的方法或者权利要求2所述的装置,其中,The method according to claim 1 or the device according to claim 2, wherein
    所述步骤C包括4步:The step C includes 4 steps:
    C1、将全基因组按照一定碱基长度的window为单位进行划分,对每个window统计覆盖该window的read数量,统计时以每条read的中点代表该read的位置;C1, the whole genome is divided according to a window of a certain base length, and the number of reads of the window is covered for each window statistics, and the position of the read is represented by the midpoint of each read in the statistics;
    C2、对参考基因组创建索引文件,提高GC含量的统计速度;C2. Create an index file for the reference genome to increase the statistical speed of the GC content;
    C3、以每个window的GC含量为自变量,以每个window的read数量为因变量,拟合read数量随GC含量变化的函数;C3, taking the GC content of each window as an independent variable, taking the number of reads of each window as the dependent variable, and fitting the function of the read quantity with the GC content;
    C4、使用拟合出的模型对全基因组read数量进行调整。C4. Adjust the number of whole genome reads using the fitted model.
  6. 根据权利要求5所述的方法或者装置,其中,所述步骤C2中,为参考基因组创建GC含量索引文件,对每一条染色体分别统计1、5、25、125个碱基间隔的区域内,鸟嘌呤(G)和胞嘧啶(C)的累积数量,其中,在统计某一个window中的GC含量时,用a*125+b*25+c*5+d*1的快速算法提取,其中a,b,c,d表示系数变量。The method or apparatus according to claim 5, wherein in the step C2, a GC content index file is created for the reference genome, and each of the chromosomes is counted in an area of 1, 5, 25, 125 base intervals, the bird The cumulative number of 嘌呤(G) and cytosine (C), which is extracted by a fast algorithm of a*125+b*25+c*5+d*1 when counting the GC content in a window, where a , b, c, d represent coefficient variables.
  7. 根据权利要求5所述的方法或者装置,其中,所述步骤C3中,使用步骤C1和步骤C2提取的各window的GC含量,通过如下弹性网络模型拟合read数量随GC含量变化,其中,使用window的GC含量为变量x,使用x,x 2,x 3,x 4,x 5,x 6作为弹性网络模型的输入变量,以read数量为输出变量,构建弹性网络模型如公式(20)所示: The method or apparatus according to claim 5, wherein in the step C3, using the GC content of each window extracted by the step C1 and the step C2, the amount of read is fitted with the GC content by the following elastic network model, wherein, The GC content of window is variable x, using x, x 2 , x 3 , x 4 , x 5 , x 6 as the input variables of the elastic network model, and the number of reads as the output variable to construct the elastic network model as shown in equation (20). Show:
    Figure PCTCN2018078908-appb-100025
    Figure PCTCN2018078908-appb-100025
    公式(20)中,y表示window内观测到的read数量,X表示输入变量矩阵,β表示变量系数矩阵,j表示变量系数下标,P表示系数总数,λ 1和λ 2表示罚分系数。 In equation (20), y represents the number of reads observed in the window, X represents the input variable matrix, β represents the variable coefficient matrix, j represents the variable coefficient subscript, P represents the total number of coefficients, and λ 1 and λ 2 represent the penalty coefficients.
  8. 根据权利要求5所述的方法或者装置,其中,所述步骤C4中,使用步骤C3中的模型预测每一个window理论上的read数量μ gc,基因组的平均GC含量定义为μ,window内观测到的read数量定义为y,window内校正后的read数量为Y,那么校正公式如下(21)所示: The method or apparatus according to claim 5, wherein in the step C4, the theoretical number of read μ gc of each window is predicted using the model in step C3, and the average GC content of the genome is defined as μ, which is observed in the window. The number of reads is defined as y, and the number of corrected reads in window is Y, then the correction formula is as follows (21):
    Figure PCTCN2018078908-appb-100026
    Figure PCTCN2018078908-appb-100026
  9. 根据权利要求1所述的方法或者权利要求2所述的装置,其中,所述步骤E中,使用步骤D中BIC-seq处理后的基因组片段为单位,计算片段所包含的window数量,TRE的平均值以及方差,然后对片段的TRE进行平滑化处理,处理方式如公式(22)所示,针对每一个基因组片段,以TRE的均值作为正态分布的均值μ,以TRE的方差作为正态分布的方差σ,计算出TRE在[μ-2σ,μ+2σ]范围内window数量的分布,定义v为TRE 坐标,取值范围为[μ-2σ,μ+2σ],分辨率为0.001,C win为该片段分配到v位点的window数量,C T表示该片段内window的总数,将所有片段的window根据TRE值平滑化后,可使片段内的window数量呈现正态分布,对所有片段各TRE位点对应的window数求和汇总,得到基因组范围的window随TRE变化的分布: The method according to claim 1 or the device according to claim 2, wherein in the step E, the number of windows included in the segment is calculated using the genomic segment processed by BIC-seq in step D, and the TRE is The mean value and the variance are then smoothed by the TRE of the segment. The processing is as shown in equation (22). For each genomic segment, the mean value of TRE is taken as the mean μ of the normal distribution, and the variance of TRE is taken as the normal state. The variance σ of the distribution is calculated as the distribution of the window number of the TRE in the range of [μ-2σ, μ+2σ], and the definition v is the TRE coordinate, the value range is [μ-2σ, μ+2σ], and the resolution is 0.001. C win is the number of windows allocated to the v-site, and C T is the total number of windows in the segment. After all the windows of the segment are smoothed according to the TRE value, the number of windows in the segment is normally distributed, for all The sum of the window numbers corresponding to each TRE site of the fragment is summed to obtain the distribution of the genome-wide window with the TRE change:
    Figure PCTCN2018078908-appb-100027
    Figure PCTCN2018078908-appb-100027
  10. 根据权利要求1所述的方法或者权利要求2所述的装置,其中,所述步骤F中,以0.001为分辨率,遍历[0,1]范围内的所有P,使用类自回归模型,计算Y(P)的值,Y(P)表现为多峰分布,使用第二高峰内Y(P)的最大值对应的P作为P的计算结果,M t是TRE的最大取值,这里将M t设置为3。 The method according to claim 1 or the apparatus according to claim 2, wherein in step F, all Ps in the range [0, 1] are traversed at a resolution of 0.001, and an autoregressive model is used to calculate The value of Y(P), Y(P) is expressed as a multimodal distribution, and P corresponding to the maximum value of Y(P) in the second peak is used as the calculation result of P, and M t is the maximum value of TRE, where M is t is set to 3.
  11. 根据权利要求1所述的方法或者权利要求2所述的装置,其中,所述步骤G中,步骤G包括3个步骤,步骤G1中,遍历[0,1]的TRE区间作为X f,过滤掉C(X f)小于1000的TRE位点,计算使公式(13.1)取最大值时的X f作为第一个实际观测peak的均值。 The method according to claim 1 or the device according to claim 2, wherein in the step G, the step G comprises three steps, in the step G1, traversing the TRE interval of [0, 1] as X f , filtering Drop the TRE site where C(X f ) is less than 1000, and calculate the X f when the maximum value of the formula (13.1) is taken as the mean value of the first actual observed peak.
  12. 根据权利要求1所述的方法或者权利要求2所述的装置,其中,所述步骤I中,然后根据步骤H计算的癌症样本纯度γ和peak对应的拷贝数,计算peak的MAF的期望f b,其中步骤I包括: The method according to claim 1 or the apparatus according to claim 2, wherein in the step I, the expected f b of the MAF of the peak is calculated according to the copy number corresponding to the cancer sample purity γ and peak calculated in the step H. Where step I includes:
    I1,使用公式(14)计算peak内HGSNV的MAF理论值:I1, using formula (14) to calculate the theoretical MAF value of HGSNV in peak:
    Figure PCTCN2018078908-appb-100028
    Figure PCTCN2018078908-appb-100028
    公式(14)中,C mcp表示主要等位基因的拷贝数,C cp表示peak的整体拷贝数,由步骤I得到,f表示该peak内MAF的理论值,可见当C cp较大时,f有多种不同的可能值; In formula (14), C mcp represents the copy number of the major allele, C cp represents the overall copy number of peak, which is obtained from step I, and f represents the theoretical value of MAF in the peak. It can be seen that when C cp is large, f There are many different possible values;
    I2,利用负二项分布估计覆盖每个HGSNV位点的read总数的概率,使用公式(15)计算负二项分布的概率p和失败次数r:I2, using the negative binomial distribution to estimate the probability of covering the total number of reads per HGSNV site, and using equation (15) to calculate the probability p and the number of failures r of the negative binomial distribution:
    Figure PCTCN2018078908-appb-100029
    Figure PCTCN2018078908-appb-100029
    公式(15)中,m是peak内所有window中read数量的均值,v是peak内所有window中read数量的方差,所求得的p是用于负二项分布的随机变量成功的概率,r为随机变量失败的次数,随机变量为覆盖某个HGSNV中的read数量;In equation (15), m is the mean of the number of reads in all windows in the peak, and v is the variance of the number of reads in all windows in the peak. The obtained p is the probability of success of the random variable used for the negative binomial distribution, r For the number of times a random variable fails, the random variable is the number of reads in a certain HGSNV;
    I3,利用二项分布求得的覆盖某个HGSNV的read数的概率,结合在一定read数量下,HGSNV只有两种基因型,服从二项分布规律,利用公式(16)计算f的校正值f b,同一个peak中,不同的C mcp计算得到不同的f b,选择与该peak的MAF观测均值最接近的f b作为该peak的f bI3, using the binomial distribution to obtain the probability of covering a certain HGSNV read number, combined with a certain number of read, HGSNV has only two genotypes, obey the binomial distribution law, and use formula (16) to calculate the correction value f of f B, with a peak in a different calculated C mcp different f b, and selecting the peak observed MAF closest to the mean of the peak as a f b f b:
    Figure PCTCN2018078908-appb-100030
    Figure PCTCN2018078908-appb-100030
    公式(16)中,k表示在某个HGSNV位点,某一种等位基因A或B的数量,d为覆盖该HGSNV的read数量,r为随机变量失败的次数,p是用于负二项分布的随机变量成功的概率;In formula (16), k represents the number of alleles A or B at a certain HGSNV site, d is the number of reads covering the HGSNV, r is the number of failed random variables, and p is used for negative two. The probability of success of the random variable of the item distribution;
    对每一个Q n,可推断获得基因组所有peak对应的拷贝数和癌症样本纯度,从而对每一个peak求f b,进而得到所有peak的MAF的期望值得集合{f b}。 For each Q n , it can be inferred that the copy number and cancer sample purity corresponding to all peaks of the genome are obtained, so that f b is obtained for each peak, and then the expected value set {f b } of the MAF of all peaks is obtained.
  13. 根据权利要求1所述的方法或者权利要求2所述的装置,其中,所述步骤K中,m取0.02作为P值得遍历区间为[P-0.02,P+0.02]。The method according to claim 1 or the apparatus according to claim 2, wherein, in the step K, m is taken as 0.02 as the P-worth traversal interval is [P-0.02, P+0.02].
PCT/CN2018/078908 2017-05-05 2018-03-14 Method and device for use in calculating cancer sample purity and chromosome ploidy WO2018201805A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710312237.7A CN108804876B (en) 2017-05-05 2017-05-05 Method and apparatus for calculating purity and chromosome ploidy of cancer sample
CN201710312237.7 2017-05-05

Publications (1)

Publication Number Publication Date
WO2018201805A1 true WO2018201805A1 (en) 2018-11-08

Family

ID=64016930

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/078908 WO2018201805A1 (en) 2017-05-05 2018-03-14 Method and device for use in calculating cancer sample purity and chromosome ploidy

Country Status (2)

Country Link
CN (1) CN108804876B (en)
WO (1) WO2018201805A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110808084A (en) * 2019-09-19 2020-02-18 西安电子科技大学 Copy number variation detection method based on single-sample second-generation sequencing data
CN112767999A (en) * 2021-01-05 2021-05-07 中国科学院上海药物研究所 Analysis method and device for whole genome sequencing data

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111627498B (en) * 2020-05-21 2022-10-04 北京吉因加医学检验实验室有限公司 Method and device for correcting GC bias of sequencing data
CN112216344A (en) * 2020-09-05 2021-01-12 西安翻译学院 Prediction method, system and storage medium of tumor purity and average ploidy information
CN113808009A (en) * 2021-09-24 2021-12-17 熵智科技(深圳)有限公司 Peak initial phase estimation method and device, computer equipment and storage medium
CN115948521B (en) * 2022-12-29 2024-06-25 东北林业大学 Method for detecting aneuploidy deletion chromosome information

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104560697A (en) * 2015-01-26 2015-04-29 上海美吉生物医药科技有限公司 Detection device for instability of genome copy number
US20150337298A1 (en) * 2014-05-23 2015-11-26 Fluidigm Corporation Haploidome determination by digitized transposons
CN106460070A (en) * 2014-04-21 2017-02-22 纳特拉公司 Detecting mutations and ploidy in chromosomal segments

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106520940A (en) * 2016-11-04 2017-03-22 深圳华大基因研究院 Chromosomal aneuploid and copy number variation detecting method and application thereof

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106460070A (en) * 2014-04-21 2017-02-22 纳特拉公司 Detecting mutations and ploidy in chromosomal segments
US20150337298A1 (en) * 2014-05-23 2015-11-26 Fluidigm Corporation Haploidome determination by digitized transposons
CN104560697A (en) * 2015-01-26 2015-04-29 上海美吉生物医药科技有限公司 Detection device for instability of genome copy number

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI, YI ET AL.: "Deconvolving Tumor Purity and Ploidy by Integrating Copy Number Alterations and Loss of Heterozygosity", BIOINFORMATICS, vol. 30, no. 15, 1 August 2014 (2014-08-01), pages 2121 - 2129, XP055546291 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110808084A (en) * 2019-09-19 2020-02-18 西安电子科技大学 Copy number variation detection method based on single-sample second-generation sequencing data
CN110808084B (en) * 2019-09-19 2023-02-28 西安电子科技大学 Copy number variation detection method based on single-sample second-generation sequencing data
CN112767999A (en) * 2021-01-05 2021-05-07 中国科学院上海药物研究所 Analysis method and device for whole genome sequencing data

Also Published As

Publication number Publication date
CN108804876B (en) 2022-03-15
CN108804876A (en) 2018-11-13

Similar Documents

Publication Publication Date Title
WO2018201805A1 (en) Method and device for use in calculating cancer sample purity and chromosome ploidy
Liu et al. DNA methylation-calling tools for Oxford Nanopore sequencing: a survey and human epigenome-wide evaluation
AU2017292854B2 (en) Methods for fragmentome profiling of cell-free nucleic acids
US20210246511A1 (en) Integrated machine-learning framework to estimate homologous recombination deficiency
Shafi et al. A survey of the approaches for identifying differential methylation using bisulfite sequencing data
Hasan et al. Performance evaluation of indel calling tools using real short-read data
US20210174901A1 (en) METHOD FOR SIMULTANEOUS DETECTION OF GENOME-WIDE COPY NUMBER CHANGES, cnLOH, INDELS, AND GENE MUTATIONS
Carmi et al. Sequencing an Ashkenazi reference panel supports population-targeted personal genomics and illuminates Jewish and European origins
JP6725481B2 (en) Non-invasive prenatal molecular karyotype analysis of maternal plasma
KR102447079B1 (en) Methods and processes for non-invasive assessment of genetic variations
JP7299169B2 (en) Methods and systems for determining clonality of somatic mutations
CN107423578B (en) Device for detecting somatic cell mutation
JP2022058904A (en) Methods for multi-resolution analysis of cell-free nucleic acids
CN110800063A (en) Detection of tumor-associated variants using cell-free DNA fragment size
US20190338349A1 (en) Methods and systems for high fidelity sequencing
JP2018500876A (en) Methods and processes for non-invasive assessment of genetic variation
CN108475300B (en) Custom-made drug selection method and system using genomic base sequence mutation information and survival information of cancer patient
IL258999A (en) Methods for detecting copy-number variations in next-generation sequencing
CN112218957A (en) Systems and methods for determining tumor fraction in cell-free nucleic acids
Ritz et al. Detection of recurrent rearrangement breakpoints from copy number data
CN112823391A (en) Quality control metrics based on detection limits
US20160283654A1 (en) Computation pipeline of single-pass multiple variant calls
CN107563152A (en) The data analysis application system that methylates based on biological cloud platform
CN114974432A (en) Screening method of biomarker and related application thereof
CN114974415A (en) Method and device for detecting chromosome copy number abnormality

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18793979

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18793979

Country of ref document: EP

Kind code of ref document: A1