Disclosure of Invention
The application aims to provide a novel method, device and storage medium for evaluating single-sample homologous recombination defects.
To achieve this purpose, the application adopts the following technical scheme:
A first aspect of the present application discloses a method for evaluating a single-sample homologous recombination defect, comprising the following steps:
a somatic SNV mutation site set acquisition step, comprising acquiring the SNV mutation sites of a tumor sample to be tested, together with the mutation frequency and site depth of each SNV mutation site, annotating the SNV mutation sites, and distinguishing germline SNV mutations from somatic SNV mutations in the tumor sample without a control sample, to obtain a somatic SNV mutation site set;
a somatic CNV mutation site set acquisition step, comprising analyzing, from the alignment result file of the tumor sample to be tested, the segments in which CNV mutations occur, the size of each segment, the number of probes contained in each segment and the BAF value of each segment, and annotating the CNV mutation sites, to obtain a somatic CNV mutation site set;
a homologous recombination defect score (HRD score value) acquisition step, comprising calculating the LOH score value, TAI score value and LST score value of the tumor sample to be tested from the obtained somatic SNV mutation site set and somatic CNV mutation site set; the HRD score value is the sum of the LST score value, the TAI score value and the LOH score value.
It should be noted that the key point of the present application is to detect the somatic SNV mutation site set and the somatic CNV mutation site set directly from a single sample, namely the tumor sample to be tested, so as to obtain the LOH score value, the TAI score value and the LST score value, derive the HRD score value from them, and thereby evaluate the homologous recombination defect of the single sample.
With the method of the present application, the homologous recombination defect can be evaluated using only the tumor sample to be tested. This addresses the situation in which a paired blood cell sample cannot be obtained or is difficult to collect, and overcomes the limitation of conventional methods, which can evaluate the homologous recombination defect only after detecting gene mutations with a paired blood cell sample. Evaluating the homologous recombination defect directly from the tumor sample can enlarge the population that benefits from targeted therapy and give more patients the opportunity of targeted therapy.
In one implementation of the application, the somatic SNV mutation site set acquisition step specifically comprises: detecting the SNV mutation sites of the tumor sample to be tested, and the mutation frequency and site depth of each SNV mutation site, using the Mutect2 module of the GATK software; and filtering the detected SNV mutation sites with the mutag software to obtain the somatic SNV mutation site set. Specifically, the mutag filtering takes the mutation detection result of the Mutect2 module as input, removes germline SNV mutations according to the following scheme, and retains somatic SNV mutations: a) filtering out common germline SNV mutations using population databases, which include gnomAD and dbSNP; b) filtering out germline SNV mutations and false "mutations" caused by systematic sequencing errors using a normal-sample baseline database, which includes a PoN (panel of normals).
Preferably, the filtering out of germline SNV mutations further comprises: c) distinguishing somatic mutations from germline mutations by exploiting the fact that the frequency distribution of somatic mutations differs from the frequencies expected for germline mutations under Mendelian inheritance; d) filtering out germline mutations in combination with CNV and tumor purity information.
It should be noted that the above somatic SNV mutation detection scheme is only the detection method specifically adopted in one implementation of the present application; other similar technical schemes may also be adopted.
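As a non-limiting illustration, filtering rules a) to d) can be sketched as follows. This is not the mutag implementation, whose internals are not described here. The `Variant` fields, the thresholds `max_pop_af` and `vaf_tol`, and the expected-frequency formula for a heterozygous germline variant are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    chrom: str
    pos: int
    vaf: float            # variant allele frequency observed in the tumor
    population_af: float  # highest allele frequency in population databases (0.0 if absent)
    in_pon: bool          # True if the site is present in the panel of normals
    total_cn: int = 2     # tumor total copy number at the site (from the CNV analysis)
    minor_cn: int = 1     # tumor minor-allele copy number at the site


def expected_germline_het_vafs(purity, total_cn, minor_cn):
    """VAFs expected for a heterozygous germline variant in an impure tumor.

    Normal cells carry 1 variant copy out of 2; tumor cells carry either the
    minor or the major allele copy number out of total_cn.
    """
    denom = purity * total_cn + 2 * (1 - purity)
    major_cn = total_cn - minor_cn
    return {(purity * m + (1 - purity)) / denom for m in (minor_cn, major_cn)}


def is_likely_somatic(v, purity, max_pop_af=0.001, vaf_tol=0.05):
    # a) common germline variants recorded in population databases
    if v.population_af > max_pop_af:
        return False
    # b) germline variants and systematic sequencing errors seen in normal baselines
    if v.in_pon:
        return False
    # c) + d) VAF compatible with a heterozygous germline variant, given purity and CNV
    for germline_vaf in expected_germline_het_vafs(purity, v.total_cn, v.minor_cn):
        if abs(v.vaf - germline_vaf) <= vaf_tol:
            return False
    # c) VAF close to 1 is compatible with a homozygous germline variant
    if v.vaf >= 1 - vaf_tol:
        return False
    return True
```

For a copy-neutral site at 60% purity, a heterozygous germline variant is expected near a VAF of 0.5 while a clonal heterozygous somatic variant is expected near 0.3, so the two can be separated. As purity approaches 100% the two expectations converge, which is why rules a) and b) remain necessary.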
In one implementation of the present application, the somatic CNV mutation site set acquisition step specifically comprises using the alignment result file of the tumor sample to be tested as the input of the GATK software, which sequentially applies seven modules, namely CalculateTargetCoverage, NormalizeSomaticReadCounts, PerformSegmentation, CallSegments, GetBayesianHetCoverage, AllelicCNV and ConvertACNVResults, to detect the regions of the tumor sample in which somatic CNV mutations occur, so as to obtain the somatic CNV mutation site set.
The GetBayesianHetCoverage module is used to obtain heterozygous sites and their site statistics, so heterozygous site information is obtained without a matched sample. By using the seven modules in combination, the use of matched samples is avoided, and somatic CNV mutation regions can be detected using only baseline samples. It should be understood that the GATK software and its seven modules are likewise the technical solution specifically adopted in one implementation of the present application; other similar software or functional modules may also be adopted, and no specific limitation is imposed here.
In one implementation of the application, the HRD score value acquisition step specifically comprises using the obtained somatic SNV mutation site set and somatic CNV mutation site set as the input of the scarHRD software, which outputs the LOH score value, TAI score value and LST score value of the sample to be tested.
The second aspect of the application discloses a device for evaluating single-sample homologous recombination defects, which comprises a somatic SNV mutation site set acquisition module, a somatic CNV mutation site set acquisition module and an HRD score value acquisition module. The somatic SNV mutation site set acquisition module is configured to acquire the SNV mutation sites of a tumor sample to be tested, together with the mutation frequency and site depth of each SNV mutation site, annotate the SNV mutation sites, and distinguish germline SNV mutations from somatic SNV mutations in the tumor sample without a control sample, so as to obtain a somatic SNV mutation site set. The somatic CNV mutation site set acquisition module is configured to analyze, from the alignment result of the tumor sample to be tested, the segments in which CNV mutations occur, the size of each segment, the number of probes contained in each segment and the BAF value of each segment, and annotate the CNV mutation sites, so as to obtain a somatic CNV mutation site set. The HRD score value acquisition module is configured to calculate the LOH score value, TAI score value and LST score value of the tumor sample to be tested from the obtained somatic SNV mutation site set and somatic CNV mutation site set; the HRD score value is the sum of the LST score value, the TAI score value and the LOH score value.
It should be noted that, the apparatus for evaluating a single-sample homologous recombination defect of the present application actually realizes each step in the method for evaluating a single-sample homologous recombination defect of the present application through each module; therefore, the specific definition of each module can refer to the method for evaluating the single-sample homologous recombination defect in the present application, which is not described herein again.
A third aspect of the present application discloses an apparatus for evaluating a single-sample homologous recombination defect, the apparatus comprising a memory and a processor; the memory is configured to store a program, and the processor is configured to implement the method for evaluating single-sample homologous recombination defects of the present application by executing the program stored in the memory.
A fourth aspect of the present application discloses a computer-readable storage medium having a program stored therein, the program being executable by a processor to implement the method for evaluating a single-sample homologous recombination defect of the present application.
Due to the adoption of the technical scheme, the beneficial effects of the application are as follows:
The method and the device for evaluating the single-sample homologous recombination defect can detect the somatic SNV mutation sites and the somatic CNV mutation sites directly from a single tumor sample to be tested, thereby obtaining the HRD score value and evaluating the single-sample homologous recombination defect. With the method and the device, the homologous recombination defect can be evaluated using only the tumor sample to be tested, which removes the dependence of conventional evaluation methods on a blood cell sample.
Detailed Description
The present application will be described in further detail below with reference to the accompanying drawings by way of specific embodiments. In the following description, numerous details are set forth in order to provide a better understanding of the present application. However, one skilled in the art will readily recognize that some of the features may be omitted or replaced with other devices, materials, or methods in various circumstances. In some instances, certain operations related to the present application have not been shown or described in detail in this specification in order to avoid obscuring the core of the present application from excessive description, and a detailed description of such related operations is not necessary for those skilled in the art, and the related operations will be fully understood from the description in the specification and the general knowledge of the art.
Studies have shown, on the one hand, that although the benefit from PARP inhibitors is generally thought to be associated only with BRCA1/2 germline mutations, PARP inhibitors may also be therapeutically effective in other tumors. For example, tumor cells lacking a BRCA1/2 germline mutation may still be sensitive to PARP inhibitors for other reasons, such as mutations in HRR pathway genes. An ASCO report indicates that HRR pathway genes are more strongly associated with PARP inhibitors, and that evaluating HRD from BRCA1/2 germline mutations alone is not sufficient. On the other hand, for various reasons it is often difficult to obtain a blood cell sample matched to the tumor sample, so germline and somatic mutation sites cannot be accurately distinguished, which affects the calculation of the homologous recombination defect score, i.e. the HRD score value.
Based on the above research and understanding, and addressing both the difficulty of sample collection and the limitation of single-gene testing, the application develops a method for calculating the homologous recombination defect score from a single sample. As shown in figure 1, the method for evaluating single-sample homologous recombination defects comprises a somatic SNV mutation site set acquisition step 11, a somatic CNV mutation site set acquisition step 12 and an HRD score value acquisition step 13.
The somatic SNV mutation site set acquisition step 11 comprises acquiring the SNV mutation sites of the tumor sample to be tested, together with the mutation frequency and site depth of each SNV mutation site, annotating the SNV mutation sites, and distinguishing germline SNV mutations from somatic SNV mutations in the tumor sample without a control sample, to obtain the somatic SNV mutation site set.
In one implementation of the application, specifically, the mutation sites of the tumor sample, together with information such as mutation frequency and site depth, are detected using the Mutect2 module of the GATK software and the mutag software, a vcf file is output, and the somatic mutation sites are annotated against each database to obtain the somatic mutation site set. Specifically, the mutation detection result of the Mutect2 module is used as input, germline SNV mutations are removed according to the following scheme, and somatic SNV mutations are retained: a) filtering out common germline SNV mutations using population databases; b) filtering out germline SNV mutations and false "mutations" caused by systematic sequencing errors using a normal-sample baseline database; c) distinguishing somatic mutations from germline mutations by exploiting the fact that the frequency distribution of somatic mutations differs from the frequencies expected for germline mutations under Mendelian inheritance; d) filtering out germline mutations in combination with CNV and tumor purity information. This germline SNV filtering scheme was developed in-house; with it, somatic SNV mutation sites can be obtained accurately, effectively and directly from the sequencing result of the tumor sample to be tested.
The somatic CNV mutation site set acquisition step 12 comprises analyzing, from the alignment result of the tumor sample to be tested, the segments in which CNV mutations occur, the size of each segment, the number of probes contained in each segment and the BAF value of each segment, and annotating the CNV mutation sites, to obtain the somatic CNV mutation site set.
In one implementation of the application, specifically, the bam file of the tumor sample to be tested is used as the input file of the GATK software to detect somatic CNV mutations; the software sequentially applies seven modules, namely CalculateTargetCoverage, NormalizeSomaticReadCounts, PerformSegmentation, CallSegments, GetBayesianHetCoverage, AllelicCNV and ConvertACNVResults, to detect somatic CNV mutation regions, so as to obtain somatic CNV mutation regions with high reliability.
It should be noted that the above seven modules are all existing modules; however, by using the seven modules in combination, the use of matched samples is avoided, heterozygous site information does not need to be obtained from matched samples, and somatic CNV mutation sites can be obtained effectively using only baseline samples.
The HRD score value acquisition step 13 comprises calculating the LOH score value, TAI score value and LST score value of the tumor sample to be tested from the obtained somatic SNV mutation site set and somatic CNV mutation site set, wherein the HRD score value is the sum of the LST score value, the TAI score value and the LOH score value.
In one implementation of the present application, specifically, the filtered somatic CNV and SNV mutation site sets are used as the input of the open-source software scarHRD, which outputs the three index scores of the tumor sample to be tested: the LOH score value, the TAI score value and the LST score value.
Those skilled in the art will appreciate that all or part of the functions of the above-described methods may be implemented by hardware, or may be implemented by computer programs. When all or part of the functions of the above method are implemented by means of a computer program, the program may be stored in a computer-readable storage medium, and the storage medium may include: a read only memory, a random access memory, a magnetic disk, an optical disk, a hard disk, etc., and the program is executed by a computer to realize the above functions. For example, the program may be stored in a memory of the device, and when the program in the memory is executed by the processor, all or part of the functions described above may be implemented. In addition, when all or part of the functions in the above embodiments are implemented by a computer program, the program may be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk, or a removable hard disk, and may be downloaded or copied to a memory of a local device, or may be version-updated on a system of the local device, and when the program in the memory is executed by a processor, all or part of the functions in the above methods may be implemented.
Therefore, based on the method of the present application, the present application provides a device for evaluating a single-sample homologous recombination defect which, as shown in fig. 2, includes a somatic SNV mutation site set acquisition module 21, a somatic CNV mutation site set acquisition module 22 and an HRD score value acquisition module 23.
The somatic SNV mutation site set acquisition module 21 is configured to acquire the SNV mutation sites of the tumor sample to be tested, together with the mutation frequency and site depth of each SNV mutation site, annotate the SNV mutation sites, and distinguish germline SNV mutations from somatic SNV mutations in the tumor sample without a control sample, so as to obtain a somatic SNV mutation site set. For example, SNV mutation site detection and the analysis of the mutation frequency and site depth of each SNV mutation site are performed with reference to the Mutect2 module of the GATK software.
The somatic CNV mutation site set acquisition module 22 is configured to analyze, from the alignment result of the tumor sample to be tested, the segments in which CNV mutations occur, the size of each segment, the number of probes contained in each segment and the BAF value of each segment, and annotate the CNV mutation sites, so as to obtain a somatic CNV mutation site set. For example, the regions of the tumor sample in which somatic CNV mutations occur are detected with reference to the seven GATK modules CalculateTargetCoverage, NormalizeSomaticReadCounts, PerformSegmentation, CallSegments, GetBayesianHetCoverage, AllelicCNV and ConvertACNVResults.
The HRD score value acquisition module 23 is configured to calculate the LOH score value, TAI score value and LST score value of the tumor sample to be tested from the obtained somatic SNV mutation site set and somatic CNV mutation site set. For example, the LOH score value, TAI score value and LST score value are calculated with reference to the scarHRD software. The HRD score value is the sum of the LST score value, the TAI score value and the LOH score value:
HRD score value = LST score value + TAI score value + LOH score value.
The device can realize the method for evaluating the single-sample homologous recombination defect, particularly realize corresponding steps in the method through the modules of the device, thereby realizing automatic evaluation of the single-sample homologous recombination defect.
The application also provides a device for evaluating single-sample homologous recombination defects, which comprises a memory and a processor; the memory is configured to store a program; the processor is configured to implement the following method by executing the program stored in the memory: a somatic SNV mutation site set acquisition step, comprising acquiring the SNV mutation sites of a tumor sample to be tested, together with the mutation frequency and site depth of each SNV mutation site, annotating the SNV mutation sites, and distinguishing germline SNV mutations from somatic SNV mutations in the tumor sample without a control sample, to obtain a somatic SNV mutation site set; a somatic CNV mutation site set acquisition step, comprising analyzing, from the alignment result file of the tumor sample to be tested, the segments in which CNV mutations occur, the size of each segment, the number of probes contained in each segment and the BAF value of each segment, and annotating the CNV mutation sites, to obtain a somatic CNV mutation site set; and an HRD score value acquisition step, comprising calculating the LOH score value, TAI score value and LST score value of the tumor sample to be tested from the obtained somatic SNV mutation site set and somatic CNV mutation site set; the HRD score value is the sum of the LST score value, the TAI score value and the LOH score value.
In another implementation, there is also provided a computer-readable storage medium including a program, the program being executable by a processor to perform a method comprising: a somatic SNV mutation site set acquisition step, comprising acquiring the SNV mutation sites of a tumor sample to be tested, together with the mutation frequency and site depth of each SNV mutation site, annotating the SNV mutation sites, and distinguishing germline SNV mutations from somatic SNV mutations in the tumor sample without a control sample, to obtain a somatic SNV mutation site set; a somatic CNV mutation site set acquisition step, comprising analyzing, from the alignment result file of the tumor sample to be tested, the segments in which CNV mutations occur, the size of each segment, the number of probes contained in each segment and the BAF value of each segment, and annotating the CNV mutation sites, to obtain a somatic CNV mutation site set; and an HRD score value acquisition step, comprising calculating the LOH score value, TAI score value and LST score value of the tumor sample to be tested from the obtained somatic SNV mutation site set and somatic CNV mutation site set; the HRD score value is the sum of the LST score value, the TAI score value and the LOH score value.
The method and the device for evaluating the homologous recombination defect of a single sample mainly comprise the following parts: 1. generating a sequencing data file in bam format from the raw sequencing data through alignment, sorting, filtering, duplicate marking and similar steps; 2. a somatic SNV mutation detection system, which detects the mutation sites of the tumor sample together with information such as mutation frequency and site depth; 3. a somatic SNV mutation filtering system, which annotates the somatic SNV mutations detected in the tumor sample at the gene level, filters out unreliable mutation sites, and additionally annotates the retained mutation sites against clinically relevant databases; 4. using the aligned bam file of the sample to be tested as the input of the GATK software, analyzing the segments of the sample in which CNV mutations occur, and outputting information such as the size of each segment, the number of probes contained in each segment and the BAF value of each segment; 5. using the obtained CNV and SNV information of the sample to be tested as the input of the scarHRD software to predict the HRD score value of the sample, i.e. to obtain the score values of the three indexes LOH, LST and TAI.
The key techniques of the method and the device for evaluating the homologous recombination defect of a single sample are as follows:
Single-sample detection of somatic SNV mutations: the evaluation of the homologous recombination defect score depends mainly on somatic SNV mutations, and reliable somatic SNV mutations improve the accuracy of the HRD score value calculation; however, when somatic SNV mutations are detected from a single sample, the result generally also contains germline mutation sites. The method builds a complete system, based mainly on the Mutect2 module of the GATK software, the mutag software and a somatic mutation site annotation and filtering module, to detect somatic SNV mutation sites from a single sample.
Single-sample detection of somatic CNV mutations: the homologous recombination defect score assesses genomic instability by combining the three indexes LOH, LST and TAI, and the scores of these three indexes are evaluated by combining somatic SNV mutations and somatic CNV mutations. The method detects somatic CNV mutations using seven GATK modules, including CalculateTargetCoverage and GetBayesianHetCoverage.
In order to eliminate the tendency of the HRD score value to increase with the WGD value, and thereby improve the sensitivity and accuracy of the homologous recombination defect status evaluation, the TAI score value and the LST score value are further corrected using the WGD value, which addresses the problem that samples with whole-genome doubling (WGD) have inflated TAI and LST score values. The specific correction method comprises the following steps:
a whole-genome doubling (WGD) value acquisition step, comprising calculating the WGD value of the sample to be tested under the optimal model using the obtained somatic SNV mutation site set and somatic CNV mutation site set;
a correction step, comprising, for a sample to be tested in which WGD occurs, correcting the LST score value obtained in the HRD score value acquisition step (denoted "original LST score value") using a first correction coefficient k1, and correcting the TAI score value obtained in the HRD score value acquisition step (denoted "original TAI score value") using a second correction coefficient k2; the LOH score value obtained in the HRD score value acquisition step (denoted "original LOH score value") is not corrected and is used as is.
Corrected LST score value = (1 - k1 × WGD value) × original LST score value,
Corrected TAI score value = (1 - k2 × WGD value) × original TAI score value,
Corrected HRD score value = corrected LST score value + corrected TAI score value + LOH score value.
In the WGD value acquisition step, calculating the WGD value of the sample to be tested under the optimal model comprises using the somatic CNV mutations and somatic SNV mutations as the input of the software ABSOLUTE, which outputs the WGD value, purity value and ploidy value simulated under a plurality of models of the sample to be tested; the predicted models are then screened to determine the optimal model, and the WGD value under the optimal model is obtained.
It should be noted that the method for screening and determining the optimal model follows patent 202010567812.X and is as follows: (1) performing quality control on the raw sequencing data of the tumor and normal samples, aligning the quality-controlled data to a reference genome, performing mutation site detection on the paired alignment files of the tumor and normal samples, and annotating the detected mutation sites against population databases; (2) using the data obtained in step (1) as the input file of purity prediction software to obtain purity and copy number information models; (3) further judging whether each purity and copy number information model conforms to the normal distribution by comparing the ploidy probe support number distribution of the model with its whole-genome doubling (WGD) value, and deleting the models that do not conform; specifically, if WGD is 0, the peak of the ploidy probe support number distribution should be at ploidy 2; if WGD is 1, the peaks should be at ploidy 2 and ploidy 4; if WGD is 2, the peaks should be at ploidy 4 and ploidy 8, and so on; a model that does not follow this rule is judged not to conform to the normal distribution and is deleted; (4) screening subclonal regions in the purity and copy number information models that conform to the normal distribution, performing purity screening on the screened subclonal regions, and accumulating them to obtain the high-tumor-cell-fraction subclonal regions; (5) performing consistency statistics on the BAF and the allele1 and allele2 copy numbers calculated by the purity prediction software to obtain the proportion of consistent segments, calculated according to formula I:
formula I: M = f ÷ (f + b); in formula I, M represents the matching rate of BAF with the allele1 and allele2 copy numbers, f represents the probe support number of segments whose BAF matches the allele1 and allele2 copy numbers, and b represents the probe support number of segments whose BAF does not match the allele1 and allele2 copy numbers; BAF is judged to match the allele1 and allele2 copy numbers if BAF is 0.5 and the allele1 copy number is equal to the allele2 copy number, or if BAF is not equal to 0.5 and the allele1 copy number is not equal to the allele2 copy number; all other cases are judged not to match; (6) multiplying the accumulated probe support number of the high-tumor-cell-fraction subclonal regions by the matching rate of BAF with the allele1 and allele2 copy numbers to obtain the final score S, as shown in formula II, the model with the highest score being the optimal model; formula II: S = R × M; in formula II, S represents the final score used for model judgment, R represents the accumulated probe support number of the high-tumor-cell-fraction subclonal regions, and M represents the matching rate of BAF with the allele1 and allele2 copy numbers. All technical content of patent 202010567812.X relating to the screening or determination of the optimal model is incorporated herein by reference.
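As a non-limiting illustration of formulas I and II, the following sketch computes the matching rate M and the final score S for candidate models and selects the highest-scoring one. The `Segment` structure, the tolerance `tol` used to decide whether BAF equals 0.5, and the reading of the balanced case as "allele1 copy number equals allele2 copy number" are assumptions of this example; the referenced patent defines the actual procedure.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Segment:
    baf: float        # B-allele frequency of the segment
    allele1_cn: int   # copy number of allele1
    allele2_cn: int   # copy number of allele2
    probes: int       # probe support number of the segment


def baf_matches(seg: Segment, tol: float = 1e-6) -> bool:
    """Match when BAF and copy numbers are both balanced or both unbalanced."""
    balanced_baf = abs(seg.baf - 0.5) <= tol
    balanced_cn = seg.allele1_cn == seg.allele2_cn
    return balanced_baf == balanced_cn


def matching_rate(segments: List[Segment]) -> float:
    """Formula I: M = f / (f + b), weighted by probe support number."""
    f = sum(s.probes for s in segments if baf_matches(s))
    b = sum(s.probes for s in segments if not baf_matches(s))
    return f / (f + b) if (f + b) else 0.0


def model_score(r: float, segments: List[Segment]) -> float:
    """Formula II: S = R x M, with R the accumulated probe support number of
    the high-tumor-cell-fraction subclonal regions."""
    return r * matching_rate(segments)


def best_model(models: Dict[str, Tuple[float, List[Segment]]]) -> str:
    """Return the name of the model with the highest final score S."""
    return max(models, key=lambda name: model_score(*models[name]))
```

With real data the BAF of a balanced segment is rarely exactly 0.5, so a wider tolerance than the default would normally be needed; the appropriate value is not specified in the text.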
In one implementation of the present application, k1 is 0.1, k2 is 0.4, and the HRD score value threshold is 32.
Corrected HRD score value = (1 - 0.1 × WGD value) × original LST score value + (1 - 0.4 × WGD value) × original TAI score value + LOH score value.
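As a non-limiting illustration, the WGD correction and the comparison against the HRD score threshold can be sketched as follows, using the values of this implementation (k1 = 0.1, k2 = 0.4, threshold 32) as defaults. The function names are chosen for this example only, and the rule that a score greater than or equal to the threshold is called positive is taken as the decision rule.

```python
def corrected_hrd_score(lst, tai, loh, wgd, k1=0.1, k2=0.4):
    """Corrected HRD score: LST and TAI are scaled down by the WGD value,
    while the LOH score value is used uncorrected."""
    corrected_lst = (1 - k1 * wgd) * lst
    corrected_tai = (1 - k2 * wgd) * tai
    return corrected_lst + corrected_tai + loh


def is_hrd_positive(score, threshold=32):
    """A sample is called HRD positive when its score reaches the threshold."""
    return score >= threshold


# Hypothetical sample: LST = 20, TAI = 15, LOH = 10.
uncorrected = corrected_hrd_score(20, 15, 10, wgd=0)  # no WGD: plain sum
corrected = corrected_hrd_score(20, 15, 10, wgd=1)    # one whole-genome doubling
```

For this hypothetical sample the uncorrected sum is 45, while with a WGD value of 1 the corrected score is 0.9 × 20 + 0.6 × 15 + 10 = 37, so the correction lowers the score but the sample remains above the threshold of 32.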
It should be noted that, according to the single-sample homologous recombination defect evaluation performed by the method of the present application, the HRD score value can be directly obtained according to the tumor sample to be detected, and the HRD (homologous recombination defect) state of the tumor sample to be detected can be determined according to the HRD score value without the need of a paired blood cell sample. For example, the homologous recombination defect determining step includes comparing the HRD score value or corrected HRD score value obtained in the homologous duplication defect score HRD score value obtaining step with the HRD score threshold; if the HRD score value or the corrected HRD score value is greater than or equal to the HRD score threshold, judging that the tumor sample to be detected has the homologous recombination defect; otherwise no homologous recombination defect occurs. Wherein the HRD score threshold is a threshold obtained by training using a known sample as a training set. For example, in one implementation of the present application, the HRD score threshold training method includes using an exhaustive method, traversing all values from 0 to 1 with a step size of 0.1 as correction coefficients of the LST score value and the TAI score value, and screening a combination of coefficients with a BRCA positive HRD positive ratio greater than 0.95, a BRCA negative HRD positive ratio less than 0.5, and a P _ value greater than 0.1; wherein P _ value is the HRD score value for which WGD is equal to the 0 sample set and for which WGD is not equal to the 0 sample set; and training to obtain the HRD score threshold value by combining rank sum test of HRD score values of WGD & lt + & gt sample sets under each traversal coefficient after screening and results of HR, 95% confidence interval and four dimensions of prognostic value obtained by utilizing sample existence data of the training set.
It should be further noted that the first correction coefficient k1 and the second correction coefficient k2 are likewise obtained by training with known samples as a training set. For example, in one implementation of the present application, the correction coefficient training method uses an exhaustive search: every value from 0 to 1 in steps of 0.1 is tried as the correction coefficient of the LST score value and of the TAI score value, and the coefficient combinations are screened for those in which the proportion of BRCA-positive samples judged HRD-positive is greater than 0.95, the proportion of BRCA-negative samples judged HRD-positive is less than 0.5, and P_value is greater than 0.1, where P_value is the P value of the comparison between the HRD score values of the sample set with WGD equal to 0 and those of the sample set with WGD not equal to 0. The first correction coefficient k1 and the second correction coefficient k2 are then obtained by combining four dimensions for each screened coefficient combination: the rank sum test of the HRD score values of the WGD- and WGD+ sample sets, and the HR, the 95% confidence interval and the prognostic value obtained from the survival data of the training-set samples.
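The screening part of this exhaustive search can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the sample fields (`lst`, `tai`, `loh`, `wgd`, `brca_positive`) and both function names are hypothetical, the candidate threshold is passed in because the text does not say how the threshold is varied during screening, P_value is taken to be a two-sided rank sum test computed with a normal approximation, and the final selection by HR, 95% confidence interval and prognostic value is not modelled.

```python
import math
from itertools import product


def rank_sum_p_value(xs, ys):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, no tie correction)."""
    n1, n2 = len(xs), len(ys)
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # tied values share the mean of ranks i+1..j
        rank_sum_x += average_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    if sd == 0:
        return 1.0
    z = (rank_sum_x - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2.0))


def screen_coefficients(samples, threshold):
    """Return the (k1, k2) pairs that pass the three screening criteria.

    `samples` is a list of dicts with the hypothetical keys
    lst, tai, loh, wgd and brca_positive; `threshold` is an assumed
    candidate HRD score threshold.
    """
    grid = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    kept = []
    for k1, k2 in product(grid, grid):
        scores = [
            (1 - k1 * s["wgd"]) * s["lst"] + (1 - k2 * s["wgd"]) * s["tai"] + s["loh"]
            for s in samples
        ]
        calls = [score >= threshold for score in scores]
        brca_pos = [c for c, s in zip(calls, samples) if s["brca_positive"]]
        brca_neg = [c for c, s in zip(calls, samples) if not s["brca_positive"]]
        wgd_zero = [x for x, s in zip(scores, samples) if s["wgd"] == 0]
        wgd_nonzero = [x for x, s in zip(scores, samples) if s["wgd"] != 0]
        if not (brca_pos and brca_neg and wgd_zero and wgd_nonzero):
            continue  # a criterion cannot be evaluated on an empty group
        if (
            sum(brca_pos) / len(brca_pos) > 0.95
            and sum(brca_neg) / len(brca_neg) < 0.5
            and rank_sum_p_value(wgd_zero, wgd_nonzero) > 0.1
        ):
            kept.append((k1, k2))
    return kept
```

Requiring P_value to be greater than 0.1 keeps only coefficient pairs under which the corrected scores of WGD and non-WGD samples are no longer distinguishable, which is the stated purpose of the correction.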
In one implementation of the present application, 136 samples, covering ovarian cancer and breast cancer, were tested both with the single-sample method of the present application and with paired double-sample detection, and the results were compared. Among the 99 HRD-positive samples, 95 were detected by the method of the present application consistently with the double-sample detection, a concordance rate of 96%; among the 37 HRD-negative samples, 30 were detected consistently with the double-sample detection, a concordance rate of 81%. The single-sample method of the present application is therefore highly consistent with paired double-sample detection for both HRD-positive and HRD-negative calls, and can evaluate the homologous recombination defect of a single sample using only the tumor sample to be detected.
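The concordance rates quoted above follow directly from the sample counts, as the short check below illustrates. The function name is hypothetical, and the overall rate is a figure derived here from the stated counts rather than one reported in the application.

```python
def concordance(n_concordant, n_total):
    """Fraction of single-sample calls that agree with the double-sample calls."""
    return n_concordant / n_total


positive_rate = concordance(95, 99)           # HRD-positive samples
negative_rate = concordance(30, 37)           # HRD-negative samples
overall_rate = concordance(95 + 30, 99 + 37)  # derived, not stated in the text

print(f"{positive_rate:.0%} {negative_rate:.0%} {overall_rate:.0%}")
```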
The terms and abbreviations used in the present application have the following meanings:
HRD score value: homologous recombination defect score.
LOH: loss of heterozygosity in the genome.
TAI: telomeric allelic imbalance.
LST: large-scale state transitions, i.e., the number of chromosomal breakpoints between adjacent regions each at least 10 Mb long, counted after filtering out regions smaller than 3 Mb.
CNV: copy number variations, i.e., gene copy number variations.
SNV: single nucleotide variations.
WGD: whole genome duplication, also called whole genome doubling; the WGD value is the number of times the whole genome has doubled.
Purity value: the proportion of tumor cells in the sample.
Ploidy value: the mean copy number of the tumor cells.
Examples
The method for evaluating the homologous recombination defect of a single sample in this example comprises the following steps:
A somatic (system) SNV mutation site set obtaining step: obtaining the SNV mutation sites of the tumor sample to be detected, together with the mutation frequency and site depth of each SNV mutation site, annotating the SNV mutation sites, and distinguishing germline SNV mutations from somatic SNV mutations in the tumor sample to be detected without a control sample, to obtain the somatic SNV mutation site set.
In this example, the Mutect2 module of the GATK software is used to detect the mutation sites of the tumor sample, including information such as mutation frequency and site depth, and outputs a VCF file; the mutag software is used to annotate the mutation sites against each database; and a credible somatic SNV mutation site set is screened out according to the filtering rules provided by interpretation. Specifically, germline SNV mutations are filtered out and somatic SNV mutations are retained according to the following scheme:
a) filtering out common germline SNV mutations using a population database;
b) filtering out germline SNV mutations and false "mutations" caused by systematic sequencing errors using a normal human baseline database;
c) distinguishing somatic mutations by exploiting the fact that the distribution of somatic mutation frequencies differs from the germline mutation frequencies expected under Mendelian inheritance;
d) filtering out germline mutations in combination with CNV and tumor purity information.
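The four rules above can be illustrated with the following minimal sketch. It is one possible reading, not the filtering rules actually used in the example: the variant fields, the function names and the 0.01 population-frequency cut-off are hypothetical, and rules c) and d) are combined by comparing the observed mutation frequency with the frequency expected for a germline or a somatic variant at the given tumor purity and local copy number.

```python
def expected_vaf(purity, total_cn, mutant_cn, germline):
    """Expected allele fraction of a variant carried on `mutant_cn` tumor copies.

    A heterozygous germline variant is additionally present on one of the
    two copies in the admixed normal cells; a somatic variant is not.
    """
    normal_mutant_copies = 1 if germline else 0
    numerator = purity * mutant_cn + (1 - purity) * normal_mutant_copies
    denominator = purity * total_cn + (1 - purity) * 2
    return numerator / denominator


def is_likely_somatic(variant, purity, max_population_af=0.01):
    """Apply rules a) to d) to one variant (hypothetical field names)."""
    # a) common variants in a population database are treated as germline
    if variant["population_af"] > max_population_af:
        return False
    # b) sites seen in the normal baseline are germline or systematic errors
    if variant["in_normal_baseline"]:
        return False
    # c) + d) pick the origin whose expected frequency is closest to the
    # observed one; germline is tried first, so ties are resolved as germline
    best = None
    for germline in (True, False):
        for mutant_cn in range(1, variant["total_cn"] + 1):
            expected = expected_vaf(purity, variant["total_cn"], mutant_cn, germline)
            distance = abs(variant["vaf"] - expected)
            if best is None or distance < best[0]:
                best = (distance, germline)
    # a region with zero tumor copies cannot be classified and is not retained
    return best is not None and not best[1]
```

For example, at a purity of 0.6 in a two-copy region, a heterozygous germline variant is expected at a frequency of 0.5 and a single-copy somatic variant at 0.3, which is what allows the two to be separated without a control sample.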
A somatic (system) CNV mutation site set obtaining step: analyzing, from the alignment result file of the tumor sample to be detected, the segments in which CNV mutations occur, the size of each segment, the number of probes contained in each segment and the BAF value of each segment, and annotating the CNV mutation sites to obtain the somatic CNV mutation site set.
In this example, the BAM file of the tumor sample to be detected is used as the input file of the GATK software to detect somatic CNV mutations; the software applies seven modules in sequence, CalculateTargetCoverage, NormalizeSomaticReadCounts, PerformSegmentation, CallSegments, GetBayesianHetCoverage, AllelicCNV and ConvertACNVResults, to detect the somatic CNV mutation regions and obtain the somatic CNV mutation site set.
A homologous recombination defect score (HRD score) value obtaining step: calculating the LOH score value, the TAI score value and the LST score value of the tumor sample to be detected using the obtained somatic SNV mutation site set and somatic CNV mutation site set.
In this example, the somatic CNV and SNV mutation site sets are used as the input of scarHRD, which outputs the three indices of the tumor sample to be detected: the LOH score value, the TAI score value and the LST score value. Because the HRD score value tends to rise as the WGD value rises, the TAI score value and the LST score value are further corrected using the WGD value; this removes the inflation of the TAI and LST score values in samples with whole genome doubling and improves the sensitivity and accuracy of the evaluation of the homologous recombination defect status. The specific correction method comprises the following steps:
a whole genome doubling WGD value obtaining step: calculating the WGD value of the sample to be detected under the optimal model, using the obtained somatic SNV mutation site set and somatic CNV mutation site set;
a correction step: for a sample to be detected in which WGD has occurred, the LST score value obtained in the HRD score value obtaining step (denoted "original LST score value") is corrected with a first correction coefficient k1, the TAI score value obtained in that step (denoted "original TAI score value") is corrected with a second correction coefficient k2, and the LOH score value obtained in that step (denoted "original LOH score value") is used without correction.
Corrected LST score value = (1 - k1 × WGD value) × original LST score value,
corrected TAI score value = (1 - k2 × WGD value) × original TAI score value,
corrected HRD score value = corrected LST score value + corrected TAI score value + original LOH score value.
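The three correction formulas above can be written as a small function. This is an illustrative sketch only; the function and parameter names are hypothetical, and k1 and k2 are left as parameters because they are obtained by training.

```python
def correct_hrd_components(lst, tai, loh, wgd, k1, k2):
    """Apply the WGD correction to the LST and TAI scores.

    Returns (corrected LST, corrected TAI, corrected HRD score).
    The LOH score is not corrected.
    """
    corrected_lst = (1 - k1 * wgd) * lst
    corrected_tai = (1 - k2 * wgd) * tai
    corrected_hrd = corrected_lst + corrected_tai + loh
    return corrected_lst, corrected_tai, corrected_hrd
```

When the WGD value is 0 both factors equal 1 and the original scores are returned unchanged, so the correction only affects samples in which whole genome doubling has occurred. The text does not state how a WGD value large enough to make a factor negative is handled; the sketch leaves such values unclamped.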
In the whole genome doubling WGD value obtaining step, calculating the WGD value of the sample to be detected under the optimal model comprises: using the somatic CNV mutations and somatic SNV mutations as the input of the software ABSOLUTE, which outputs the WGD value, purity value and ploidy value simulated under a plurality of models of the sample; screening the predicted models to determine the optimal model; and taking the WGD value under the optimal model.
In this example, the method for screening and determining the optimal model follows patent 202010567812.X and is as follows:
(1) performing quality control on the raw sequencing data of the tumor and normal samples, aligning the quality-controlled data to a reference genome, performing mutation site detection on the paired alignment files of the tumor and normal samples, and annotating the detected mutation sites with a population database;
(2) using the data obtained in step (1) as the input file of purity prediction software to obtain purity and copy number information models;
(3) judging whether each purity and copy number information model conforms to the normal distribution by comparing the model's ploidy probe support number distribution with its whole genome doubling WGD value, and deleting the models that do not conform; specifically, if WGD is 0, the peak of the ploidy probe support number distribution should be at ploidy 2; if WGD is 1, the peaks should be at ploidy 2 and ploidy 4; if WGD is 2, the peaks should be at ploidy 4 and ploidy 8, and so on; a model that does not follow this rule is judged not to conform to the normal distribution and is deleted;
(4) screening subclone regions in the purity and copy number information models that conform to the normal distribution, screening the selected subclone regions by purity, and accumulating to obtain the high-tumor-cell-fraction subclone regions;
(5) performing consistency statistics on the BAF and the allele1 and allele2 copy numbers calculated by the purity prediction software to obtain the proportion of consistent fragments, according to formula I:
Formula I: M = f ÷ (f + b)
In formula I, M represents the matching rate of BAF with the allele1 and allele2 copy numbers, f represents the probe support number for which BAF matches the allele1 and allele2 copy numbers, and b represents the probe support number for which BAF does not match the allele1 and allele2 copy numbers; BAF is judged to match the allele1 and allele2 copy numbers when BAF is equal to 0.5 and the allele1 copy number is equal to the allele2 copy number, or when BAF is not equal to 0.5 and the allele1 copy number is not equal to the allele2 copy number; all other cases are judged not to match;
(6) multiplying the accumulated probe support number of the high-tumor-cell-fraction subclone regions by the matching rate of BAF with the allele1 and allele2 copy numbers to obtain a final score S according to formula II, the model with the highest score being the optimal model:
Formula II: S = R × M
In formula II, S represents the final score of the model, R represents the accumulated probe support number of the high-tumor-cell-fraction subclone regions, and M represents the matching rate of BAF with the allele1 and allele2 copy numbers.
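Steps (5) and (6) can be sketched as follows. This is a minimal illustration under stated assumptions: the segment and model fields and the function names are hypothetical, the first matching condition is read as "BAF equal to 0.5 and allele1 copy number equal to allele2 copy number", and a small tolerance is used when testing whether BAF equals 0.5 because measured BAF values are rarely exactly 0.5. The model filtering of steps (3) and (4) is assumed to have been done already, with R supplied per model.

```python
def baf_matches(baf, allele1_cn, allele2_cn, tol=1e-9):
    """Matching rule of step (5): balanced BAF with equal allele copy numbers,
    or unbalanced BAF with unequal allele copy numbers."""
    balanced = abs(baf - 0.5) <= tol
    return (balanced and allele1_cn == allele2_cn) or (
        not balanced and allele1_cn != allele2_cn
    )


def matching_rate(segments):
    """Formula I: M = f / (f + b), weighted by probe support numbers."""
    f = sum(
        s["probes"]
        for s in segments
        if baf_matches(s["baf"], s["allele1_cn"], s["allele2_cn"])
    )
    total = sum(s["probes"] for s in segments)  # f + b
    return f / total if total else 0.0


def select_optimal_model(models):
    """Formula II: S = R * M; the model with the highest S is selected."""
    return max(models, key=lambda m: m["R"] * matching_rate(m["segments"]))
```

Weighting by probe support number means that a large, well-covered segment influences the matching rate more than a short one, which is consistent with f and b being defined as probe counts rather than segment counts.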
In this example, 136 samples are used as the training set, and all of them are paired samples; the 136 samples were stored and provided by Shenzhen Generache technology Limited. An exhaustive search is used: every value from 0 to 1 in steps of 0.1 is tried as the correction coefficient of the LST score value and of the TAI score value, and the coefficient combinations are screened for those in which the proportion of BRCA-positive samples judged HRD-positive is greater than 0.95, the proportion of BRCA-negative samples judged HRD-positive is less than 0.5, and P_value is greater than 0.1, where P_value is the P value of the comparison between the HRD score values of the sample set with WGD equal to 0 and those of the sample set with WGD not equal to 0.
Combining, for each screened coefficient combination, the rank sum test of the HRD score values of the WGD- and WGD+ sample sets with the HR, the 95% confidence interval and the prognostic value obtained from patient survival data, the correction coefficients of TAI and LST are finally determined to be 0.4 and 0.1, respectively, and the threshold to be 32.
Therefore, the corrected LST score value, the corrected TAI score value and the corrected HRD score value are calculated as follows:
corrected LST score value = (1 - 0.1 × WGD value) × original LST score value,
corrected TAI score value = (1 - 0.4 × WGD value) × original TAI score value,
corrected HRD score value = (1 - 0.1 × WGD value) × original LST score value + (1 - 0.4 × WGD value) × original TAI score value + original LOH score value.
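A worked example shows how the trained coefficients and the threshold of 32 act together. The sketch below is illustrative only: the function names are hypothetical, and the input scores are made-up numbers chosen to show the effect, not data from the 136 samples.

```python
LST_COEFFICIENT = 0.1  # trained correction coefficient for LST
TAI_COEFFICIENT = 0.4  # trained correction coefficient for TAI
HRD_THRESHOLD = 32     # trained HRD score threshold


def corrected_hrd_score(lst, tai, loh, wgd):
    """Corrected HRD score with the coefficients determined in this example."""
    return (
        (1 - LST_COEFFICIENT * wgd) * lst
        + (1 - TAI_COEFFICIENT * wgd) * tai
        + loh
    )


def has_hr_defect(lst, tai, loh, wgd):
    """A sample is called HRD-positive when its score reaches the threshold."""
    return corrected_hrd_score(lst, tai, loh, wgd) >= HRD_THRESHOLD


# Hypothetical sample: LST = 15, TAI = 12, LOH = 8.
without_wgd = corrected_hrd_score(15, 12, 8, wgd=0)  # 15 + 12 + 8
with_wgd = corrected_hrd_score(15, 12, 8, wgd=1)     # 0.9*15 + 0.6*12 + 8
```

With these hypothetical inputs the uncorrected sum is 35, which would reach the threshold, whereas the corrected score for a WGD value of 1 is about 28.7, which does not; this is the kind of WGD-driven inflation the correction is meant to remove.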
In this example, the HRD score values of 136 tumor samples were analyzed, and the corrected HRD score values were calculated according to the above method. The 136 samples cover ovarian cancer and breast cancer, two cancer types in which HRD occurs frequently. In addition, all 136 tumor samples have paired blood cell samples, but only the tumor samples were used for the single-sample detection.
In double-sample detection, somatic SNV and CNV results are determined with the help of the control sample: germline mutations are removed according to the control sample, and allele-specific CNV is obtained from heterozygous sites. This example obtains somatic mutations and heterozygous sites in a different way from the double-sample approach, thereby removing the dependence of existing homologous recombination defect evaluation methods on blood cell samples.
Part of the single-sample and double-sample homologous recombination defect score detection results are shown in Table 1.
Table 1. Single-sample and double-sample homologous recombination defect score detection results
The results in Table 1 show that the single-sample homologous recombination defect evaluation method is highly consistent with the true negative and positive results: among the 99 HRD-positive samples, 95 were detected consistently with the double-sample detection, a concordance rate of 96%; among the 37 HRD-negative samples, 30 were detected consistently with the double-sample detection, a concordance rate of 81%. These results demonstrate that this example can score the homologous recombination defect of a single sample using only the tumor sample.
The foregoing is a more detailed description of the present application in connection with specific embodiments thereof, and it is not intended to limit the present application to the details thereof. It will be apparent to those skilled in the art from this disclosure that many more simple derivations or substitutions can be made without departing from the spirit of the disclosure.