CN114242170B - Method and device for evaluating homologous recombination repair defects and storage medium - Google Patents

Publication number
CN114242170B
Authority
CN
China
Prior art keywords
sample
whole genome
value
cna
peak
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111572513.6A
Other languages
Chinese (zh)
Other versions
CN114242170A (en)
Inventor
黄毅
朱彬彬
陈华东
刘久成
刘青峰
易鑫
杨玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Guiinga Medical Laboratory
Original Assignee
Shenzhen Guiinga Medical Laboratory
Application filed by Shenzhen Guiinga Medical Laboratory
Priority to CN202111572513.6A
Publication of CN114242170A
Application granted
Publication of CN114242170B
Legal status: Active

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/20 Polymerase chain reaction [PCR]; Primer or probe design; Probe optimisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The application discloses a method, a device and a storage medium for evaluating homologous recombination repair defects (HRD). Low-depth whole genome sequencing data of a sample to be tested are acquired, adapters are removed, the reads are aligned to a reference genome, and PCR duplicate sequences are filtered out; quality control and contamination filtering are performed on the data, and CNV analysis is then performed with ACE software to obtain a total CNA profile; an LST value is calculated from the total CNA profile and corrected according to the whole-genome duplication (WGD) status; and the HRD is judged to be positive or negative according to the BRCA genotype and the corrected LST value. The method calculates the LST value from low-depth whole genome sequencing data and corrects it according to the WGD status; the corrected LST value can be used directly to evaluate the HRD state without other genomic scar markers such as LOH and TAI, which improves the effectiveness and accuracy of predicting PARP inhibitor treatment response and prognosis.

Description

Method and device for evaluating homologous recombination repair defects and storage medium
Technical Field
The present application relates to the field of techniques for evaluating homologous recombination repair defects, and in particular, to a method, an apparatus, and a storage medium for evaluating homologous recombination repair defects.
Background
Homologous recombination repair (HRR) is the preferred repair pathway for DNA double-strand breaks (DSBs). Homologous recombination repair defect (homologous recombination deficiency, HRD) generally refers to a state of HRR dysfunction at the cellular level. It can be caused by many factors, such as germline or somatic mutations of HRR-related genes and epigenetic inactivation; it is often present in various malignant tumors and is particularly prominent in ovarian cancer, breast cancer, pancreatic ductal carcinoma and prostate cancer. HRD produces specific, quantifiable and stable genomic changes, so the state and degree of tumor HRD can be predicted by establishing an evaluation system based on genomic feature analysis. HRD has become a novel biomarker for the clinical use of poly(ADP-ribose) polymerase (PARP) inhibitors in patients with advanced ovarian cancer, and it also has guiding value for the clinical use of PARP inhibitors and platinum drugs in tumors such as breast cancer and prostate cancer. A recent expert consensus on the detection of biomarkers associated with PARP inhibitors in epithelial ovarian cancer recommends HRD detection, including BRCA gene detection, to guide the choice of first-line treatment for newly diagnosed ovarian cancer, and notes that the results have important reference value for efficacy prediction and prognosis of maintenance therapy. HRD detection has thus become an important part of ovarian cancer treatment, with HRD serving as the most important biomarker of a tumor patient's sensitivity to chemotherapy and response to targeted therapy.
Clinical HRD detection describes specific changes in the tumor genome, also known as "genomic scars". Since 2012, loss of heterozygosity (LOH), telomeric allelic imbalance (TAI), large-scale state transitions (LST) and similar measures have been used as genomic scar markers to quantify the extent of genomic scarring. LOH is defined as a region of loss of heterozygosity longer than 15 Mb but shorter than the whole chromosome; TAI is defined as a region of allelic imbalance longer than 11 Mb that extends to one of the subtelomeres without crossing the centromere; LST is defined as a chromosomal breakpoint between two adjacent regions, and the total number of such breakpoints in the tumor genome can be used to describe genomic instability. Here, adjacent regions are two regions that are each at least 10 Mb long and are separated by less than 3 Mb. The three indexes LOH, TAI and LST each have their own definition, and each can describe the degree of cellular HRD to some extent.
In the prior art, gene chip capture sequencing or high-depth whole genome sequencing is generally used to evaluate the genomic scars of homologous recombination repair defects. However, gene chip capture sequencing suffers from large data errors, and whole genome sequencing places high demands on sequencing depth. In addition, the foreign article "ShallowHRD: detection of homologous recombination deficiency from shallow whole genome sequencing" attempts to calculate LST from low-depth whole genome data, but it performs poorly on real clinical samples, with problems such as inflated index scores for samples in which whole-genome duplication has occurred, low sensitivity and low accuracy. Therefore, how to evaluate the genomic scars of homologous recombination repair defects more simply, sensitively and accurately is a problem that remains to be solved.
Disclosure of Invention
The purpose of the application is to provide a new method, a device and a storage medium for evaluating homologous recombination repair defects.
In order to achieve the above purpose, the present application adopts the following technical scheme:
the first aspect of the application discloses an evaluation method for homologous recombination repair defects, which comprises the following steps:
a whole genome sequencing data acquisition and alignment step, comprising acquiring the raw low-depth whole genome sequencing data of a sample to be tested, removing adapters, aligning the reads to a reference genome, and filtering out duplicate sequences generated by PCR according to the alignment to obtain an alignment file;
a whole genome sequencing data quality control step, comprising performing quality analysis on the alignment file to obtain quality information including alignment rate, sequencing depth, GC content and duplication rate, and filtering according to the quality information to obtain qualified sequencing data;
a contamination data filtering step, comprising analyzing the contamination rate of the sequencing data and obtaining sequencing data whose contamination rate is below a contamination rate threshold;
a copy number variation analysis step, comprising determining the tumor purity of the low-depth whole genome sequencing sample using ACE software and generating a total CNA profile, wherein the total CNA profile comprises sample name, chromosome, start position, end position, copy number and segment information;
an LST value calculation step, comprising calculating the LST value of the sample from the total CNA profile, specifically: deleting segments smaller than 3 Mb and drawing a density distribution map of the segments; taking the first local minimum as the CNA cutoff; if there is no local minimum in the interval between 0.025 and 0.45, setting the CNA cutoff to the small inflection point appearing in the first peak of the segment density curve, so that the LST of the sample is increased by lowering the CNA cutoff; merging two adjacent segments when the difference between them is smaller than the CNA cutoff; the rules for calculating the CNA cutoff include: if the first local minimum is greater than 0.025 and less than 0.45, taking it as the CNA cutoff; if the local minimum is less than 0.025, taking 0.025 as the CNA cutoff; if the local minimum is greater than 0.45, taking 0.45 as the CNA cutoff; when the local minimum is greater than 0.45 and a small inflection point, at which the derivative decreases but does not change sign, appears in the density map before 0.45, calculating the derivative of the density map between 0.025 and 0.45 and taking the difference value corresponding to the minimum of the derivative as the CNA cutoff; when the difference between two adjacent segments is greater than the CNA cutoff, the gap between them is less than 3 Mb, and their lengths are greater than 10 Mb, adding 1 to the LST value; it will be appreciated that an LST is defined by two adjacent segments that are less than 3 Mb apart and each longer than 10 Mb; two adjacent segments therefore need a breakpoint between them, namely a height difference, and the CNA cutoff is the height difference used in this calculation; the smaller the required height difference, the more LSTs are found and the larger the LST value; the present application thus increases the LST of the sample by lowering the CNA cutoff;
a whole-genome duplication (WGD) analysis step for the sample to be tested, comprising judging whether WGD has occurred in the sample; if WGD has occurred, the range of the sample's segments is greater than 2, and the number of peaks in the segment density map is greater than 4, subtracting 9 from the LST value to obtain the final LST value;
and a homologous recombination repair defect state evaluation step, comprising judging the homologous recombination repair defect state to be positive or negative according to the BRCA genotype and the final LST value.
In the present application, the local minimum is located by the derivative of the density curve: it is the first point at which the derivative equals 0 and which corresponds to a local minimum.
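The basic cutoff rule can be illustrated with a minimal sketch. It assumes the density curve is already available as two equal-length lists (`xs` for the segment difference values, `ys` for the density); the function names are hypothetical and are not taken from ACE or shallowHRD. Only the "first local minimum, limited to the interval 0.025 to 0.45" rule is covered; the inflection-point fallback is deliberately left out because its exact definition is not fully specified in the text.

```python
from typing import Optional, Sequence

LOWER, UPPER = 0.025, 0.45  # bounds on the CNA cutoff stated in the text


def first_local_minimum(xs: Sequence[float], ys: Sequence[float]) -> Optional[float]:
    """Return the x value of the first interior point where the curve stops falling."""
    for i in range(1, len(ys) - 1):
        if ys[i - 1] > ys[i] <= ys[i + 1]:
            return xs[i]
    return None


def cna_cutoff(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Take the first local minimum and limit it to [LOWER, UPPER]."""
    minimum = first_local_minimum(xs, ys)
    if minimum is None:
        # The text falls back to an inflection point here; not covered by this sketch.
        raise ValueError("no local minimum found on the density curve")
    return min(max(minimum, LOWER), UPPER)
```

A local minimum below 0.025 is raised to 0.025 and one above 0.45 is lowered to 0.45, matching the three threshold rules in the LST value calculation step.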
It should be noted that the method for evaluating homologous recombination repair defects of the present application can calculate the LST value directly from low-depth whole genome sequencing data and correct it according to the whole-genome duplication (WGD) status, so that the corrected LST value, combined with the BRCA genotype alone, can be used directly to evaluate the homologous recombination repair defect state; moreover, because the corrected LST value is calculated more accurately, its effectiveness and accuracy as intermediate reference data for PARP inhibitor treatment and prognosis are improved.
In one implementation of the present application, low-depth whole genome sequencing in the whole genome sequencing data acquisition and alignment step refers to a sequencing depth of no more than 5.
It should be noted that low-depth whole genome sequencing refers to sequencing with a sequencing depth of no more than 5; further, the sequencing depth may be no more than 3.
In one implementation of the present application, the adapter-trimmed sequences are aligned to the reference genome hg19 using bwa-mem2 software.
In one implementation of the present application, in the whole genome sequencing data quality control step, data with an alignment rate greater than 95% and a depth greater than 0.8 are qualified; the GC content and the duplication rate have no threshold and are used only as auxiliary indicators of sample quality.
It is to be understood that the above specific values are merely criteria specifically employed in one implementation of the present application, and that the above specific values may be adjusted as desired under more stringent or relaxed requirements.
In one implementation of the present application, in the contamination data filtering step, the contamination rate of the sample is estimated with a model built from population allele frequencies, and the contamination rate threshold is obtained on that basis.
In one implementation of the present application, the contamination rate threshold is 0.1.
It will be appreciated that the contamination rate threshold of 0.1 is merely a threshold obtained by model evaluation with a specific set of population allele frequencies in one implementation of the present application; under the same inventive concept, the contamination rate thresholds obtained with different populations or models may differ, and the threshold is not specifically limited herein.
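The quality control and contamination filters described above amount to three threshold checks, which can be sketched as follows. The class and field names are hypothetical; the default thresholds are the values given for one implementation (alignment rate above 95%, depth above 0.8, contamination rate below 0.1) and would be adjusted for other settings.

```python
from dataclasses import dataclass


@dataclass
class SampleMetrics:
    alignment_rate: float      # fraction of reads aligned, between 0 and 1
    mean_depth: float          # mean sequencing depth
    gc_content: float          # reported only, no threshold applies
    duplication_rate: float    # reported only, no threshold applies
    contamination_rate: float  # estimated contamination, between 0 and 1


def is_qualified(
    m: SampleMetrics,
    min_alignment_rate: float = 0.95,
    min_depth: float = 0.8,
    max_contamination: float = 0.1,
) -> bool:
    """Apply the strict inequalities from the text; GC and duplication are not used."""
    return (
        m.alignment_rate > min_alignment_rate
        and m.mean_depth > min_depth
        and m.contamination_rate < max_contamination
    )
```

Keeping GC content and duplication rate in the record without testing them mirrors the text, where they only assist a manual judgment of sample quality.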
In one implementation of the present application, in the whole-genome duplication (WGD) analysis step for the sample to be tested, whether WGD has occurred in the sample is judged from the segment density distribution map and the range of the segments, wherein the judgment rules include:
a. when the range of the sample's segments is less than 1, WGD has not occurred in the sample;
b. when the range of the sample's segments is greater than 1 and the number of peaks is less than 3, WGD has not occurred in the sample;
c. when the range of the sample's segments is greater than 1 and the number of peaks is greater than or equal to 3, WGD has occurred in the sample;
d. when the range of the sample's segments is greater than 9 and the number of peaks is greater than or equal to 2, WGD has occurred in the sample.
This specific method can accurately and effectively judge the whole-genome duplication status of a low-depth whole genome sequencing sample, so that the result is better suited to correcting the LST value.
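Rules a to d can be sketched as a small decision function. Several points are assumptions rather than statements from the text: the segment range is read as the maximum minus the minimum segment value, rule d is read as applying to that same range and is checked first so that it takes precedence over rule b, and a range of exactly 1 (which the rules do not cover) is treated as no whole-genome duplication. The function names are hypothetical.

```python
from typing import Sequence


def segment_range(values: Sequence[float]) -> float:
    """Range of the segment values (assumed to mean max minus min)."""
    return max(values) - min(values)


def has_wgd(seg_range: float, n_peaks: int) -> bool:
    """Return True when the rules indicate whole-genome duplication."""
    if seg_range > 9 and n_peaks >= 2:  # rule d, checked first (assumed precedence)
        return True
    if seg_range > 1 and n_peaks >= 3:  # rule c
        return True
    # Rules a and b; a range of exactly 1 also lands here (assumption).
    return False
```

For example, a sample with a range of 2.0 and two peaks is judged as not duplicated under rule b, while the same sample with three peaks is judged as duplicated under rule c.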
In one implementation of the present application, the whole-genome duplication analysis step for the sample to be tested further comprises judging the peaks shown in the segment density distribution map, wherein the criteria include: (1) only peaks higher than 15% of the maximum peak value are counted as peaks; (2) when the preliminary number of peaks is greater than 2, judging according to the following rules: if the distances between the peak and the troughs on both its left and right sides are greater than 4% of the maximum peak value, the peak is included in the peak count; if the distances between the peak and the troughs on both its left and right sides are less than 4% of the maximum peak value, the peak is not included in the peak count; if the distance between the peak and the trough is less than 4% of the maximum peak value on only one of the two sides, and the next peak is in the same situation, it is recorded as one peak.
By identifying and handling these special-case peaks, the above criteria for the peaks shown in the segment density distribution map further improve the accuracy and effectiveness of the whole-genome duplication judgment; they are particularly suitable for judging the whole-genome duplication status of low-depth whole genome sequencing samples.
In one implementation of the present application, in the homologous recombination repair defect state evaluation step, when the BRCA genotype is mutant, the homologous recombination repair defect state is judged to be positive regardless of the LST value; when the BRCA genotype is wild type, the homologous recombination repair defect state is judged to be positive if the LST value is greater than or equal to the HRD biological threshold, and negative otherwise.
In one implementation of the present application, the HRD biological threshold is the threshold at which 95% of the BRCA-mutant samples in the model data are HRD positive;
in one implementation of the present application, the HRD biological threshold is 15.
It is understood that the HRD biological threshold of 15 is merely a threshold obtained from specific model data in one implementation of the present application; under the same inventive concept, the specific HRD biological thresholds obtained with different model data may differ, and the threshold is not specifically limited herein.
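The LST correction from the method steps and the decision rule above can be combined in a short sketch. The function and parameter names are hypothetical; the correction value of 9 and the threshold of 15 are the values given for one implementation and appear here only as defaults.

```python
def final_lst(lst: int, wgd: bool, seg_range: float, n_peaks: int, correction: int = 9) -> int:
    """Subtract the correction only when WGD occurred, range > 2 and peaks > 4."""
    if wgd and seg_range > 2 and n_peaks > 4:
        return lst - correction
    return lst


def hrd_status(brca_mutant: bool, lst: int, threshold: int = 15) -> str:
    """BRCA-mutant samples are positive regardless of LST; wild type uses the threshold."""
    if brca_mutant:
        return "positive"
    return "positive" if lst >= threshold else "negative"
```

As a usage example, a BRCA wild-type sample with a raw LST of 20 that meets all three correction conditions ends up with a final LST of 11 and is reported as negative, whereas without the correction it would have been reported as positive.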
The second aspect of the application discloses a device for evaluating homologous recombination repair defects, which comprises a whole genome sequencing data acquisition and alignment module, a whole genome sequencing data quality control module, a contamination data filtering module, a copy number variation analysis module, an LST value calculation module, a whole-genome duplication (WGD) analysis module for the sample to be tested, and a homologous recombination repair defect state evaluation module;
the whole genome sequencing data acquisition and alignment module is used for acquiring the raw low-depth whole genome sequencing data of a sample to be tested, removing adapters, aligning the reads to a reference genome, and filtering out duplicate sequences generated by PCR according to the alignment to obtain an alignment file;
the whole genome sequencing data quality control module is used for performing quality analysis on the alignment file to obtain quality information including alignment rate, sequencing depth, GC content and duplication rate, and filtering according to the quality information to obtain qualified sequencing data;
the contamination data filtering module is used for analyzing the contamination rate of the sequencing data and obtaining sequencing data whose contamination rate is below a contamination rate threshold;
the copy number variation analysis module is used for determining the tumor purity of the low-depth whole genome sequencing sample using ACE software and generating a total CNA profile, wherein the total CNA profile comprises sample name, chromosome, start position, end position, copy number and segment information;
the LST value calculation module is used for calculating the LST value of the sample from the total CNA profile, specifically: deleting segments smaller than 3 Mb and drawing a density distribution map of the segments; taking the first local minimum as the CNA cutoff; if there is no local minimum in the interval between 0.025 and 0.45, setting the CNA cutoff to the small inflection point appearing in the first peak of the segment density curve, so that the LST of the sample is increased by lowering the CNA cutoff; merging two adjacent segments when the difference between them is smaller than the CNA cutoff; the rules for calculating the CNA cutoff include: if the first local minimum is greater than 0.025 and less than 0.45, taking it as the CNA cutoff; if the local minimum is less than 0.025, taking 0.025 as the CNA cutoff; if the local minimum is greater than 0.45, taking 0.45 as the CNA cutoff; when the local minimum is greater than 0.45 and a small inflection point, at which the derivative decreases but does not change sign, appears in the density map before 0.45, calculating the derivative of the density map between 0.025 and 0.45 and taking the difference value corresponding to the minimum of the derivative as the CNA cutoff; when the difference between two adjacent segments is greater than the CNA cutoff, the gap between them is less than 3 Mb, and their lengths are greater than 10 Mb, adding 1 to the LST value;
the WGD analysis module for the sample to be tested is used for judging whether WGD has occurred in the sample; if WGD has occurred, the range of the sample's segments is greater than 2, and the number of peaks in the segment density map is greater than 4, 9 is subtracted from the LST value and the result is taken as the final LST value;
the homologous recombination repair defect state evaluation module is used for judging the homologous recombination repair defect state to be positive or negative according to the BRCA genotype and the final LST value.
It should be noted that the device for evaluating homologous recombination repair defects of the present application implements, through its modules, the respective steps of the method for evaluating homologous recombination repair defects of the present application; the specific limitations of each module can therefore be found in the method and are not repeated here. For example, in the whole genome sequencing data quality control module, data with an alignment rate greater than 95% and a depth greater than 0.8 are qualified; in the contamination data filtering module, the contamination rate of the sample is estimated with a model built from population allele frequencies. Likewise, the specific rules for judging from the segment range whether whole-genome duplication has occurred in the sample, the criteria for the peaks shown in the segment density distribution map, and the rules for judging the homologous recombination repair defect state can all be found in the method for evaluating homologous recombination repair defects of the present application.
A third aspect of the present application discloses an apparatus for evaluating homologous recombination repair defects, the apparatus comprising a memory and a processor; the memory is used for storing a program; the processor is used for implementing the method for evaluating homologous recombination repair defects of the present application by executing the program stored in the memory.
A fourth aspect of the present application discloses a computer readable storage medium having stored therein a program executable by a processor to implement the method of evaluating a homologous recombination repair defect of the present application.
Due to the adoption of the technical scheme, the beneficial effects of the application are that:
the method and device for evaluating homologous recombination repair defects of the present application calculate the LST value from low-depth whole genome sequencing data and correct it according to the whole-genome duplication (WGD) status; the corrected LST value can be used directly to evaluate the homologous recombination repair defect state without other genomic scar markers such as LOH and TAI. Moreover, because the corrected LST value is calculated more accurately, its effectiveness and accuracy as intermediate reference data for PARP inhibitor treatment and prognosis are improved.
Drawings
FIG. 1 is a flow chart of a method for evaluating a defect in a homologous recombination repair according to an embodiment of the present application;
FIG. 2 is a block diagram of a device for evaluating a defect in a homologous recombination repair according to an embodiment of the present application;
FIG. 3 is a survival curve of 40 samples in an embodiment of the present application.
Detailed Description
The present application is described in further detail below with reference to the accompanying drawings by way of specific embodiments. In the following embodiments, numerous specific details are set forth in order to provide a better understanding of the present application. However, those skilled in the art will readily recognize that some of the features may be omitted, or replaced by other devices, materials or methods, in different situations. In some instances, some operations associated with the present application are not shown or described in the specification, in order to avoid obscuring the core of the present application; a detailed description of these operations is not necessary, since those skilled in the art can fully understand them from the description herein and their general technical knowledge.
At present, there is no accurate and effective method for evaluating homologous recombination repair defects based on low-depth whole genome sequencing. Although a few foreign papers report calculating LST from low-depth whole genome data, the approach performs poorly on real clinical samples, with low sensitivity and low accuracy.
The present application creatively proposes the following. (1) ACE replaces the Control-FREEC used in the prior literature for gene copy number variation (CNV) analysis. (2) The CNA cutoff value is calculated with a new algorithm: the CNA cutoff is set to the small inflection point appearing in the first peak of the density curve, so that the LST of the sample is increased by lowering the CNA cutoff. The CNA cutoff is determined from the segment density map. In the original algorithm, when the CNA cutoff is 0.45, the LST score is smaller, which affects the HRD judgment. The present application optimizes this case: when the local minimum is greater than 0.45, a small inflection point, at which the derivative decreases but does not change sign, appears in the part of the density map before 0.45; the derivative of the density map between 0.025 and 0.45 is then calculated, and the difference value corresponding to the minimum of the derivative is the new CNA cutoff. (3) WGD is judged and the LST is corrected accordingly.
These three optimizations improve the prediction of the effectiveness of PARPi medication. It will be appreciated that, with the above improvements, the copy number variation fitting results for the patient are more accurate, so the LST score is calculated more accurately, and the effectiveness of PARPi medication can ultimately be predicted more accurately.
Based on the above research and understanding, the present application creatively provides a method for evaluating homologous recombination repair defects which, as shown in FIG. 1, comprises a whole genome sequencing data acquisition and alignment step 11, a whole genome sequencing data quality control step 12, a contamination data filtering step 13, a copy number variation analysis step 14, an LST value calculation step 15, a whole-genome duplication (WGD) analysis step 16 for the sample to be tested, and a homologous recombination repair defect state evaluation step 17.
The whole genome sequencing data acquisition and alignment step 11 comprises acquiring the raw low-depth whole genome sequencing data of a sample to be tested, removing adapters, aligning the reads to a reference genome, and filtering out duplicate sequences generated by PCR according to the alignment to obtain an alignment file.
In one implementation of the present application, specifically, bwa-mem2 software is used to align the sequences from which the UMI has been trimmed to the reference genome, generating a SAM file, where the reference genome is hg19. The samtools software package is then used to run fixmate, sort and markdup on the alignment result, specifically: samtools fixmate -m ${sample}.sam ${sample}-fixmate.bam, which takes the SAM file as input and repairs the mate information to obtain the fixmate.bam file; samtools sort ${sample}-fixmate.bam -o ${sample}-sort.bam, which sorts the fixmate.bam file for the next analysis; and samtools markdup -r ${sample}-sort.bam ${sample}-markdup.bam, which removes the duplicate sequences generated by PCR from the sorted BAM file to obtain the final BAM file for further analysis.
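The three samtools invocations can be assembled programmatically, as in the following minimal sketch. It only builds the argument lists and does not run anything; the function name is hypothetical, and the file naming follows the ${sample} pattern used in the text.

```python
from typing import List


def dedup_commands(sample: str) -> List[List[str]]:
    """Build the fixmate, sort and markdup invocations for one sample (not executed here)."""
    sam = f"{sample}.sam"
    fixmate_bam = f"{sample}-fixmate.bam"
    sorted_bam = f"{sample}-sort.bam"
    markdup_bam = f"{sample}-markdup.bam"
    return [
        # -m adds mate score tags, which markdup uses to pick the read to keep
        ["samtools", "fixmate", "-m", sam, fixmate_bam],
        # coordinate sort, required before markdup
        ["samtools", "sort", fixmate_bam, "-o", sorted_bam],
        # -r removes duplicates instead of only flagging them
        ["samtools", "markdup", "-r", sorted_bam, markdup_bam],
    ]
```

Each list could be passed to `subprocess.run(cmd, check=True)` in order, assuming samtools is installed and on the PATH.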
And a quality control step 12 of whole genome sequencing data, which comprises performing quality analysis on the comparison file to obtain quality information including comparison rate, sequencing depth, GC content and repetition rate, and filtering according to the quality information to obtain qualified sequencing data.
In one implementation manner of the present application, the bamdst software is used to perform quality analysis on the bam file finally generated in the previous step, and the software provides the required bam file and the bed file (i.e. the bit information file) to generate quality information of the corresponding sample, including the comparison rate, the sequencing depth, the GC content and the repetition rate. The comparison rate is more than 95 percent, and the depth is more than 0.8 percent, which is qualified. The GC content and the repetition rate have no threshold value, and only the sample quality is judged in an auxiliary way.
The contamination data filtering step 13 comprises performing contamination rate analysis on the sequencing data to assess its contamination, and obtaining the sequencing data whose contamination rate is less than a contamination rate threshold.
In one implementation of the present application, the contamination rate analysis software is VerifyBamID2.0.1. The final bam file is processed with VerifyBamID2.0.1, which evaluates the contamination rate of the sample with a model built from population allele frequencies; a contamination rate of less than 0.1 is qualified.
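The quality and contamination gates described in steps 12 and 13 amount to three threshold checks. A minimal sketch follows, assuming the alignment rate is stored as a fraction, the depth threshold of 0.8 is a plain number, and the metrics have already been read from the bamdst and VerifyBamID2.0.1 outputs; the `SampleQC` class and `passes_qc` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SampleQC:
    alignment_rate: float   # fraction of reads aligned, 0..1
    depth: float            # sequencing depth
    contamination: float    # estimated contamination rate


def passes_qc(qc: SampleQC,
              min_alignment_rate: float = 0.95,
              min_depth: float = 0.8,
              max_contamination: float = 0.1) -> bool:
    """Apply the strict thresholds stated in the text.

    GC content and duplication rate are not checked because the text
    gives them no threshold.
    """
    return (qc.alignment_rate > min_alignment_rate
            and qc.depth > min_depth
            and qc.contamination < max_contamination)
```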
The copy number variation analysis step 14 comprises determining the tumor purity of the low-depth whole genome sequencing sample using ACE software and generating a total CNA spectrum; the total CNA spectrum contains the sample name, chromosome, start position, end position, copy number and segments information.
The LST value calculation step 15 comprises calculating the LST value of the sample from the total CNA spectrum, specifically: deleting segments smaller than 3 Mb and plotting the segments density distribution map; taking the first local minimum as the CNA cutoff; if there is no local minimum within the specified interval of 0.025 to 0.45, setting the CNA cutoff to the small inflection point appearing in the first peak of the segments density curve, which raises the LST of the sample by lowering the CNA cutoff; and merging two adjacent segments when the difference between them is smaller than the CNA cutoff. The rule for calculating the CNA cutoff is: if the first local minimum is greater than 0.025 and less than 0.45, the first local minimum is taken as the CNA cutoff; if the local minimum is less than 0.025, 0.025 is taken as the CNA cutoff; if the local minimum is greater than 0.45, 0.45 is taken as the CNA cutoff; when the local minimum is greater than 0.45 and, before 0.45, the density map has a small inflection point at which the derivative decreases without changing sign, the derivative of the density map is calculated between 0.025 and 0.45, and the difference value corresponding to the minimum of the derivative is taken as the CNA cutoff. When the difference between two adjacent segments is greater than the CNA cutoff, the gap between them is less than 3 Mb, and their lengths are greater than 10 Mb, the LST value is increased by 1. In one implementation of the present application, the LST value is calculated specifically using the shallowHRD software.
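The basic part of the CNA cutoff rule, taking the first local minimum of the density curve and limiting it to the interval 0.025 to 0.45, can be sketched as follows. This is an illustration only and not the shallowHRD code: it assumes the density curve is already sampled as two equal-length lists, the function names are hypothetical, and the derivative-based inflection-point fallback is deliberately not covered.

```python
from typing import Optional, Sequence

LOW, HIGH = 0.025, 0.45  # bounds of the CNA cutoff stated in the text


def first_local_minimum(xs: Sequence[float],
                        density: Sequence[float]) -> Optional[float]:
    """Return the x position of the first interior local minimum, if any."""
    for i in range(1, len(density) - 1):
        if density[i] < density[i - 1] and density[i] <= density[i + 1]:
            return xs[i]
    return None


def clamp_cna_cutoff(local_min: float) -> float:
    """Use the local minimum if it lies in (LOW, HIGH), else the nearer bound."""
    return min(max(local_min, LOW), HIGH)
```

For example, a first local minimum at 0.2 is used unchanged, while one at 0.6 is replaced by 0.45.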
The whole genome duplication analysis step 16 of the sample to be tested comprises judging whether whole genome duplication (WGD) has occurred in the sample to be tested; if whole genome duplication has occurred, the range of the sample's segments values is greater than 2, and the number of peaks in the segments density map is greater than 4, then 9 is subtracted from the LST value to obtain the final LST value.
In one implementation of the present application, the whole genome duplication analysis step of the sample to be tested further comprises judging the peaks shown in the segments density distribution map, with the following criteria: (1) only peaks higher than 15% of the maximum peak value are counted as peaks; (2) when the preliminary number of peaks is greater than 2, the following rules apply: if the distance between the peak and the troughs on both its left and right sides is greater than 4% of the maximum peak value, the peak is counted; if the distance between the peak and the troughs on both sides is less than 4% of the maximum peak value, the peak is not counted; if the distance between the peak and the trough is less than 4% of the maximum peak value on only one side, and the same holds for the next peak, it is recorded as one peak.
Whether whole genome duplication has occurred in the sample to be tested is judged from the segments density distribution map according to the range of the segments values, with the following rules:
a. when the range of the sample's segments values is less than 1, whole genome duplication has not occurred in the sample;
b. when the range of the sample's segments values is greater than 1 and the number of peaks is less than 3, whole genome duplication has not occurred in the sample;
c. when the range of the sample's segments values is greater than 1 and the number of peaks is greater than or equal to 3, whole genome duplication has occurred in the sample;
d. when the sample's segments value is minimally greater than 9 and the number of peaks is greater than or equal to 2, whole genome duplication has occurred in the sample.
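Once the whole genome duplication call is available, the LST correction of step 16 is a single conditional. A minimal sketch, assuming the WGD decision, the range of the segments values and the peak count have been computed beforehand; the function name `final_lst` is hypothetical.

```python
def final_lst(lst: int, wgd: bool, segments_range: float, n_peaks: int) -> int:
    """Subtract 9 from the LST value only when all three conditions hold."""
    if wgd and segments_range > 2 and n_peaks > 4:
        return lst - 9
    return lst
```

A sample with an LST value of 30, a positive WGD call, a range of 2.5 and 5 peaks would receive a final LST value of 21; with only 4 peaks the value stays at 30.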
The homologous recombination repair defect status evaluation step 17 comprises judging whether the homologous recombination repair defect status is positive or negative according to the BRCA genotype and the final LST value.
In one implementation of the present application, the rule for judging the homologous recombination repair defect status is: when the BRCA genotype is mutant, the status is judged positive regardless of the LST value; when the BRCA genotype is wild type, the status is judged positive if the LST value is greater than or equal to the HRD biological threshold, and negative otherwise. The HRD biological threshold is set to 15, the value at which 95% of the BRCA mutant samples in the model data are HRD positive. Thus, when the BRCA genotype is wild type, an LST value greater than or equal to 15 is judged positive and any lower value is judged negative.
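The decision rule above can be written out directly. A minimal sketch, in which the function name `hrd_status` and the string labels are hypothetical and the default threshold of 15 is the value given in the text:

```python
HRD_THRESHOLD = 15  # HRD biological threshold stated in the text


def hrd_status(brca_mutant: bool, final_lst_value: int,
               threshold: int = HRD_THRESHOLD) -> str:
    """Return 'positive' or 'negative' from BRCA genotype and final LST value."""
    if brca_mutant:
        return "positive"          # mutant BRCA is positive at any LST value
    return "positive" if final_lst_value >= threshold else "negative"
```

For example, a BRCA wild-type sample with a final LST value of 10 is negative, while a wild-type sample at exactly 15 is positive.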
Those skilled in the art will appreciate that all or part of the functions of the above-described methods may be implemented by means of hardware, or by means of a computer program. When all or part of the functions of the above method are implemented by means of a computer program, the program may be stored in a computer readable storage medium, and the storage medium may include: read-only memory, random access memory, magnetic disk, optical disk, hard disk, etc., and the program is executed by a computer to realize the above-mentioned functions. For example, the program is stored in the memory of the device, and when the program in the memory is executed by the processor, all or part of the functions described above can be realized. In addition, when all or part of the functions in the above embodiments are implemented by means of a computer program, the program may be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk, or a removable hard disk, and all or part of the functions in the above methods may be implemented by downloading or copying the program into a memory of a local device or updating a version of a system of the local device, and executing the program in the memory by a processor.
Therefore, based on the method for evaluating homologous recombination repair defects of the present application, the present application proposes an apparatus for evaluating homologous recombination repair defects which, as shown in Fig. 2, includes a whole genome sequencing data acquisition and alignment module 21, a whole genome sequencing data quality control module 22, a contamination data filtering module 23, a copy number variation analysis module 24, an LST value calculation module 25, a whole genome duplication analysis module 26 for the sample to be tested, and a homologous recombination repair defect status evaluation module 27.
The whole genome sequencing data acquisition and alignment module 21 is configured to acquire the low-depth whole genome sequencing raw (off-machine) data of a sample to be tested, trim adapters, align the reads to a reference genome, and, based on the sorted alignments, filter out the duplicate sequences generated by PCR to obtain an alignment file. For example, with reference to bwa-mem2 software, the sequences are aligned to the reference genome hg19 to generate a SAM file; with the SAM file as input, the alignment result is processed in turn with fixmate, sort and markdup from the samtools software package to obtain a bam file, with PCR duplicates filtered out, for the next step.
The whole genome sequencing data quality control module 22 is configured to perform quality analysis on the alignment file to obtain quality information including alignment rate, sequencing depth, GC content and duplication rate, and to filter according to the quality information to obtain qualified sequencing data. For example, the quality analysis is performed with reference to the bamdst software, and data with an alignment rate greater than 95% and a depth greater than 0.8 are regarded as qualified.
The contamination data filtering module 23 is configured to perform contamination rate analysis on the sequencing data to assess its contamination, and to obtain the sequencing data whose contamination rate is less than the contamination rate threshold. For example, the contamination rate analysis is performed with reference to the contamination rate analysis software VerifyBamID2.0.1, and a contamination rate of less than 0.1 is regarded as qualified.
The copy number variation analysis module 24 is configured to determine the tumor purity of the low-depth whole genome sequencing sample using ACE software and to generate a total CNA spectrum containing the sample name, chromosome, start position, end position, copy number and segments information.
The LST value calculation module 25 is configured to calculate the LST value of the sample from the total CNA spectrum, specifically: deleting segments smaller than 3 Mb and plotting the segments density distribution map; taking the first local minimum as the CNA cutoff; if there is no local minimum within the specified interval of 0.025 to 0.45, setting the CNA cutoff to the small inflection point appearing in the first peak of the segments density curve, which raises the LST of the sample by lowering the CNA cutoff; and merging two adjacent segments when the difference between them is smaller than the CNA cutoff. The rule for calculating the CNA cutoff is: if the first local minimum is greater than 0.025 and less than 0.45, the first local minimum is taken as the CNA cutoff; if the local minimum is less than 0.025, 0.025 is taken as the CNA cutoff; if the local minimum is greater than 0.45, 0.45 is taken as the CNA cutoff; when the local minimum is greater than 0.45 and, before 0.45, the density map has a small inflection point at which the derivative decreases without changing sign, the derivative of the density map is calculated between 0.025 and 0.45, and the difference value corresponding to the minimum of the derivative is taken as the CNA cutoff. When the difference between two adjacent segments is greater than the CNA cutoff, the gap between them is less than 3 Mb, and their lengths are greater than 10 Mb, the LST value is increased by 1. For example, the LST value is calculated with reference to the shallowHRD software.
The whole genome duplication analysis module 26 for the sample to be tested is configured to judge whether whole genome duplication has occurred in the sample to be tested and, if whole genome duplication has occurred, the range of the sample's segments values is greater than 2, and the number of peaks in the segments density distribution map is greater than 4, to subtract 9 from the LST value to obtain the final LST value.
In one implementation of the present application, the whole genome duplication analysis module 26 is further configured to judge the peaks shown in the segments density distribution map, with the following criteria: (1) only peaks higher than 15% of the maximum peak value are counted as peaks; (2) when the preliminary number of peaks is greater than 2, the following rules apply: if the distance between the peak and the troughs on both its left and right sides is greater than 4% of the maximum peak value, the peak is counted; if the distance between the peak and the troughs on both sides is less than 4% of the maximum peak value, the peak is not counted; if the distance between the peak and the trough is less than 4% of the maximum peak value on only one side, and the same holds for the next peak, it is recorded as one peak. The rules for judging, from the range of the segments values, whether whole genome duplication has occurred in the sample are the same as in the method for evaluating homologous recombination repair defects of the present application.
The homologous recombination repair defect status evaluation module 27 is configured to judge whether the homologous recombination repair defect status is positive or negative according to the BRCA genotype and the final LST value. Likewise, the specific judgment rule is the same as in the method for evaluating homologous recombination repair defects of the present application.
Another implementation of the present application further provides an apparatus for evaluating homologous recombination repair defects, the apparatus comprising a memory and a processor; the memory is used to store a program; the processor is used to execute the program stored in the memory to implement the following method: a whole genome sequencing data acquisition and alignment step, comprising acquiring the low-depth whole genome sequencing raw (off-machine) data of a sample to be tested, trimming adapters, aligning the reads to a reference genome, and, based on the sorted alignments, filtering out the duplicate sequences generated by PCR to obtain an alignment file; a whole genome sequencing data quality control step, comprising performing quality analysis on the alignment file to obtain quality information including alignment rate, sequencing depth, GC content and duplication rate, and filtering according to the quality information to obtain qualified sequencing data; a contamination data filtering step, comprising performing contamination rate analysis on the sequencing data to assess its contamination, and obtaining the sequencing data whose contamination rate is less than a contamination rate threshold; a copy number variation analysis step, comprising determining the tumor purity of the low-depth whole genome sequencing sample using ACE software and generating a total CNA spectrum, wherein the total CNA spectrum comprises the sample name, chromosome, start position, end position, copy number and segments information; an LST value calculation step, comprising calculating the LST value of the sample from the total CNA spectrum, specifically: deleting segments smaller than 3 Mb and plotting the segments density distribution map; taking the first local minimum as the CNA cutoff; if there is no local minimum within the specified interval of 0.025 to 0.45, setting the CNA cutoff to the small inflection point appearing in the first peak of the segments density curve, which raises the LST of the sample by lowering the CNA cutoff; merging two adjacent segments when the difference between them is smaller than the CNA cutoff; the rule for calculating the CNA cutoff being: if the first local minimum is greater than 0.025 and less than 0.45, the first local minimum is taken as the CNA cutoff; if the local minimum is less than 0.025, 0.025 is taken as the CNA cutoff; if the local minimum is greater than 0.45, 0.45 is taken as the CNA cutoff; when the local minimum is greater than 0.45 and, before 0.45, the density map has a small inflection point at which the derivative decreases without changing sign, the derivative of the density map is calculated between 0.025 and 0.45, and the difference value corresponding to the minimum of the derivative is taken as the CNA cutoff; when the difference between two adjacent segments is greater than the CNA cutoff, the gap between them is less than 3 Mb, and their lengths are greater than 10 Mb, the LST value is increased by 1; a whole genome duplication analysis step of the sample to be tested, comprising judging whether whole genome duplication has occurred in the sample to be tested and, if whole genome duplication has occurred, the range of the sample's segments values is greater than 2, and the number of peaks in the segments density map is greater than 4, subtracting 9 from the LST value to obtain the final LST value; and a homologous recombination repair defect status evaluation step, comprising judging whether the homologous recombination repair defect status is positive or negative according to the BRCA genotype and the final LST value.
Another implementation of the present application further provides a computer readable storage medium including a program executable by a processor to implement the following method: a whole genome sequencing data acquisition and alignment step, comprising acquiring the low-depth whole genome sequencing raw (off-machine) data of a sample to be tested, trimming adapters, aligning the reads to a reference genome, and, based on the sorted alignments, filtering out the duplicate sequences generated by PCR to obtain an alignment file; a whole genome sequencing data quality control step, comprising performing quality analysis on the alignment file to obtain quality information including alignment rate, sequencing depth, GC content and duplication rate, and filtering according to the quality information to obtain qualified sequencing data; a contamination data filtering step, comprising performing contamination rate analysis on the sequencing data to assess its contamination, and obtaining the sequencing data whose contamination rate is less than a contamination rate threshold; a copy number variation analysis step, comprising determining the tumor purity of the low-depth whole genome sequencing sample using ACE software and generating a total CNA spectrum, wherein the total CNA spectrum comprises the sample name, chromosome, start position, end position, copy number and segments information; an LST value calculation step, comprising calculating the LST value of the sample from the total CNA spectrum, specifically: deleting segments smaller than 3 Mb and plotting the segments density distribution map; taking the first local minimum as the CNA cutoff; if there is no local minimum within the specified interval of 0.025 to 0.45, setting the CNA cutoff to the small inflection point appearing in the first peak of the segments density curve, which raises the LST of the sample by lowering the CNA cutoff; merging two adjacent segments when the difference between them is smaller than the CNA cutoff; the rule for calculating the CNA cutoff being: if the first local minimum is greater than 0.025 and less than 0.45, the first local minimum is taken as the CNA cutoff; if the local minimum is less than 0.025, 0.025 is taken as the CNA cutoff; if the local minimum is greater than 0.45, 0.45 is taken as the CNA cutoff; when the local minimum is greater than 0.45 and, before 0.45, the density map has a small inflection point at which the derivative decreases without changing sign, the derivative of the density map is calculated between 0.025 and 0.45, and the difference value corresponding to the minimum of the derivative is taken as the CNA cutoff; when the difference between two adjacent segments is greater than the CNA cutoff, the gap between them is less than 3 Mb, and their lengths are greater than 10 Mb, the LST value is increased by 1; a whole genome duplication analysis step of the sample to be tested, comprising judging whether whole genome duplication has occurred in the sample to be tested and, if whole genome duplication has occurred, the range of the sample's segments values is greater than 2, and the number of peaks in the segments density map is greater than 4, subtracting 9 from the LST value to obtain the final LST value; and a homologous recombination repair defect status evaluation step, comprising judging whether the homologous recombination repair defect status is positive or negative according to the BRCA genotype and the final LST value.
Examples
The tumor samples used in this example were provided by Beijing g Ji-Chemicals laboratory Co. There were 40 real clinical samples with PARP inhibitor efficacy data in total, of which 20 were BRCA wild type and 20 were BRCA mutant. The method for evaluating homologous recombination repair defects comprises the following steps:
(1) Nucleic acid extraction
In this example, DNA was extracted from the tissue of each of the 40 samples, libraries were built by TA ligation of adapters, and whole genome sequencing was performed on the Gene+Seq2000 with a data volume of 5G to obtain the raw data.
DNA was extracted from formalin-fixed, paraffin-embedded (FFPE) tissue by the silica membrane method, using the GeneRead DNA FFPE kit with MinElute nucleic acid extraction columns.
After extraction, the DNA was fragmented with a Diagenode Bioruptor Pico sonicator so that it was uniformly sheared to 200-250 bp.
The fragmented DNA underwent end repair, A-tailing, ligation of the Gene+Seq2000 bp UID adapter, Index introduction and 10 cycles of PCR, and was thereby built step by step into a whole genome DNA library recognizable by the sequencer.
After the whole genome DNA libraries were pooled, the linear library underwent heat denaturation, circularization, digestion and reaction termination to form circles.
The circularized library was sequenced on the Gene+Seq2000 platform to a sequencing depth of 5.
(2) Whole genome sequencing data acquisition and alignment
The low-depth whole genome sequencing raw (off-machine) data of the sample to be tested were acquired and the adapters were trimmed. The sequences with the UMI trimmed off were aligned to the reference genome with bwa-mem2 software to generate a SAM file; the reference genome was hg19. The alignment result was then processed with fixmate, sort and markdup from the samtools software package: with the SAM file as input, the header information was repaired to obtain a fixmate.bam file; samtools sort ${sample}-fixmate.bam -o ${sample}-sort.bam sorted the fixmate.bam file for the next step; samtools markdup -r ${sample}-sort.bam ${sample}-markdup.bam filtered the duplicate sequences generated by PCR out of the sorted bam file to obtain the final bam file for further analysis.
(3) Quality control of whole genome sequencing data
Quality analysis was performed with bamdst software on the final bam file generated in the previous step. Given the bam file and a bed file (i.e., the locus information file), the software generates the quality information of the corresponding sample, including alignment rate, sequencing depth, GC content and duplication rate. A sample is qualified when the alignment rate is greater than 95% and the depth is greater than 0.8. The GC content and the duplication rate have no threshold and only assist in judging sample quality. All 40 samples in this example were qualified.
(4) Contamination data filtering
Contamination rate analysis was performed on the data to assess sample contamination; the contamination rate analysis software was VerifyBamID2.0.1. Specifically, the final bam file generated in "(2) Whole genome sequencing data acquisition and alignment" was processed with VerifyBamID2.0.1, which evaluates the contamination rate of a sample with a model built from population allele frequencies. In this example a contamination rate of less than 0.1 is qualified, and all 40 samples were qualified.
(5) Copy number variation analysis
The CNV analysis module uses ACE software. This part is mainly used to determine the tumor purity of the low-depth WGS samples and to generate the total CNA spectrum of each sample. The total CNA spectrum contains information for each 50 kb window, including sample name, chromosome, start position, end position, copy number and segment information (i.e., segments), for further analysis.
The analysis results show that the tumor purity of the 40 samples of this example was between 0.27 and 0.87, as shown in Table 2. Part of the total CNA spectrum information is shown in Table 1.
Table 1 total CNA spectrum information
Sample chr start end copynumbers segments
179008702TD 1 850001 900000 0.531625013 0.546180615
179008702TD 1 900001 950000 0.54031214 0.546180615
179008702TD 1 950001 1000000 0.548481646 0.546180615
179008702TD 1 1000001 1050000 0.600503021 0.546180615
179008702TD 1 1050001 1100000 0.649748317 0.546180615
179008702TD 1 1100001 1150000 0.644205327 0.546180615
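Rows in the layout of Table 1 can be read into typed records before the LST calculation. The following is a minimal sketch assuming whitespace-separated text with the six columns shown above; the `CnaBin` type and `parse_cna_rows` function are hypothetical names and do not reflect the ACE output format beyond what Table 1 shows.

```python
from typing import List, NamedTuple


class CnaBin(NamedTuple):
    sample: str
    chrom: str
    start: int
    end: int
    copynumber: float
    segment: float


def parse_cna_rows(text: str) -> List[CnaBin]:
    """Parse whitespace-separated rows; skip the header and malformed lines."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) != 6 or not fields[2].isdigit():
            continue
        sample, chrom, start, end, copynumber, segment = fields
        rows.append(CnaBin(sample, chrom, int(start), int(end),
                           float(copynumber), float(segment)))
    return rows


example = """Sample chr start end copynumbers segments
179008702TD 1 850001 900000 0.531625013 0.546180615
179008702TD 1 900001 950000 0.54031214 0.546180615"""
bins = parse_cna_rows(example)
```

Each parsed row spans 50,000 positions, matching the 50 kb window described for the total CNA spectrum.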
(6) LST value calculation
The LST value of the sample is calculated from the total CNA spectrum generated in "(5) Copy number variation analysis". In this example the LST is calculated with the shallowHRD software, and segments smaller than 3 Mb are deleted. When the difference between adjacent segments is smaller than the CNA cutoff, the two segments are merged. The CNA cutoff is determined from the segments density map. The first local minimum of the density distribution map is taken as the CNA cutoff; if there is no local minimum within the specified interval of 0.025 to 0.45, the CNA cutoff is set to the small inflection point appearing in the first peak of the segments density curve, which raises the LST of the sample by lowering the CNA cutoff. The rule for calculating the CNA cutoff is as follows: if the first local minimum is greater than 0.025 and less than 0.45, the first local minimum is taken as the CNA cutoff; if the local minimum is less than 0.025, 0.025 is taken as the CNA cutoff; if the local minimum is greater than 0.45, 0.45 is taken as the CNA cutoff; when the local minimum is greater than 0.45 and, before 0.45, the density map has a small inflection point at which the derivative decreases without changing sign, the derivative of the density map is calculated between 0.025 and 0.45, and the difference value corresponding to the minimum of the derivative is taken as the CNA cutoff. When the difference between two adjacent segments is greater than the CNA cutoff, the gap between them is less than 3 Mb, and their lengths are greater than 10 Mb, the LST value is increased by 1. The LST values of the 40 samples of this example are shown in Table 2.
Table 2 Tumor purity and LST values
Sample Tumor purity BRCA status LST value Sample Tumor purity BRCA status LST value
1 0.38 + 40 21 0.43 - 30
2 0.50 + 29 22 0.32 + 32
3 0.58 - 10 23 0.81 - 19
4 0.45 - 20 24 0.29 + 29
5 0.64 - 38 25 0.85 - 24
6 0.36 - 33 26 0.82 + 19
7 0.87 + 19 27 0.78 + 19
8 0.80 - 25 28 0.81 + 24
9 0.38 - 26 29 0.45 - 19
10 0.80 + 28 30 0.48 + 22
11 0.27 - 33 31 0.30 + 17
12 0.34 - 33 32 0.69 - 24
13 0.40 - 0 33 0.50 + 30
14 0.44 - 34 34 0.45 + 28
15 0.61 + 23 35 0.54 + 20
16 0.61 - 23 36 0.87 + 29
17 0.77 + 22 37 0.71 - 21
18 0.85 - 9 38 0.44 + 26
19 0.40 - 22 39 0.70 + 28
20 0.64 + 30 40 0.64 - 28
In Table 2, in the "BRCA status" column, "+" indicates mutant and "-" indicates wild type.
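The LST counting rule of step (6) can be sketched as a pass over consecutive segments. This is an illustration under stated assumptions, not the shallowHRD implementation: segments are assumed to be already filtered and merged, only pairs on the same chromosome are compared, both segments of a pair are required to exceed 10 Mb, and coordinates are in base pairs. The `Segment` type and `count_lst` function are hypothetical names.

```python
from typing import List, NamedTuple

MB = 1_000_000


class Segment(NamedTuple):
    chrom: str
    start: int
    end: int
    value: float   # segment-level copy number value


def count_lst(segments: List[Segment], cna_cutoff: float) -> int:
    """Count transitions between adjacent segments that satisfy all conditions."""
    lst = 0
    for left, right in zip(segments, segments[1:]):
        if left.chrom != right.chrom:
            continue  # assumption: a transition never spans two chromosomes
        big_jump = abs(right.value - left.value) > cna_cutoff
        small_gap = (right.start - left.end) < 3 * MB
        both_long = ((left.end - left.start) > 10 * MB
                     and (right.end - right.start) > 10 * MB)
        if big_jump and small_gap and both_long:
            lst += 1
    return lst
```

Keeping the three conditions as named booleans makes it easy to adjust one of them if a different reading of the rule is intended.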
(7) Whole genome duplication analysis of the sample to be tested
In this example, whether WGD has occurred in a sample is judged from the range of the copy number segment information. Rules are set for identifying the peaks in the density distribution map and for handling special-case peaks; finally, whether WGD has occurred in the sample is judged by combining the number of peaks with the range of the segment information. The details are as follows:
The peaks shown in the segments density distribution map are judged with the following criteria: (1) only peaks higher than 15% of the maximum peak value are counted as peaks; (2) when the preliminary number of peaks is greater than 2, the following rules apply: if the distance between the peak and the troughs on both its left and right sides is greater than 4% of the maximum peak value, the peak is counted; if the distance between the peak and the troughs on both sides is less than 4% of the maximum peak value, the peak is not counted; if the distance between the peak and the trough is less than 4% of the maximum peak value on only one side, and the same holds for the next peak, it is recorded as one peak.
Whether whole genome duplication has occurred in the sample to be tested is judged from the segments density distribution map according to the range of the segments values, with the following rules:
a. when the range of the sample's segments values is less than 1, whole genome duplication has not occurred in the sample;
b. when the range of the sample's segments values is greater than 1 and the number of peaks is less than 3, whole genome duplication has not occurred in the sample;
c. when the range of the sample's segments values is greater than 1 and the number of peaks is greater than or equal to 3, whole genome duplication has occurred in the sample;
d. full genome replication of the sample occurs when the sample segments are minimally greater than 9 and the number of peaks is greater than or equal to 2.
If whole genome duplication has occurred, the range of the sample's segments values is greater than 2, and the number of peaks in the segments density map is greater than 4, then 9 is subtracted from the LST value to obtain the final LST value. The whole genome duplication calls and the final LST values of the 40 samples of this example are shown in Table 3.
Table 3 Whole genome duplication and corrected LST values
[Table 3 appears as images in the original publication: Figure BDA0003424314730000161, Figure BDA0003424314730000171]
In Table 3, 0 represents no WGD, and 1 represents WGD.
(8) Homologous recombination repair defect status assessment
Whether the homologous recombination repair defect status is positive or negative is judged according to the BRCA genotype and the final LST value. The HRD biological threshold is set to 15, the value at which 95% of the BRCA mutant samples in the model data are HRD positive. Specifically, the homologous recombination repair defect status evaluation rule is: when BRCA is mutant, HRD is judged positive regardless of the LST value; when BRCA is wild type, HRD is judged positive if the LST value is greater than or equal to 15, and negative otherwise.
The homologous recombination repair defect status evaluation shows that, of the 40 samples, 5 BRCA wild-type patients were judged HRD negative and 35 patients were judged HRD positive, consistent with the actual situation. Based on the clinically provided patient PFS information, survival curves were drawn with the survival, survminer and dplyr packages in R; as shown in Fig. 3, P-value = 0.0011 and HR = 0.193. This indicates that, when the method for evaluating homologous recombination repair defects of this example is used to predict PARPi efficacy (PFS), there is a statistically significant difference between samples with positive and negative homologous recombination repair defect status; that is, the evaluation of this example can provide a more accurate and effective reference for predicting PARPi efficacy (PFS).
The foregoing is a further detailed description of the present application in connection with the specific embodiments, and it is not intended that the practice of the present application be limited to such descriptions. It will be apparent to those skilled in the art to which the present application pertains that several simple deductions or substitutions may be made without departing from the spirit of the present application.

Claims (18)

1. A method for evaluating homologous recombination repair defects, characterized by comprising the following steps:
a whole genome sequencing data acquisition and alignment step, comprising acquiring raw low-depth whole genome sequencing data of a sample to be tested, removing adapters, aligning the data to a reference genome, and, after sorting the alignments, filtering out duplicate sequences generated by PCR to obtain an alignment file;
a whole genome sequencing data quality control step, comprising performing quality analysis on the alignment file to obtain quality information including mapping rate, sequencing depth, GC content, and duplication rate, and filtering according to the quality information to obtain qualified sequencing data;
a contamination data filtering step, comprising performing contamination rate analysis on the sequencing data and obtaining sequencing data whose contamination rate is less than a contamination rate threshold;
a copy number variation analysis step, comprising determining the tumor purity of the low-depth whole genome sequencing sample using ACE software and generating a total CNA profile comprising sample name, chromosome, start position, end position, copy number, and segments information;
an LST value calculation step, comprising calculating the LST value of the sample from the total CNA profile, specifically: deleting segments smaller than 3 Mb and drawing a segments density distribution plot; taking the first local minimum as the CNA cutoff; if there is no local minimum within the interval 0.025 to 0.45, setting the CNA cutoff to a small inflection point appearing in the first peak of the segments density curve, so that lowering the CNA cutoff increases the LST of the sample; merging two adjacent segments when the difference between them is smaller than the CNA cutoff; the rule for calculating the CNA cutoff being as follows: if the local minimum is greater than 0.025 and less than 0.45, the first local minimum is taken as the CNA cutoff; if the local minimum is less than 0.025, 0.025 is taken as the CNA cutoff; if the local minimum is greater than 0.45, 0.45 is taken as the CNA cutoff; when the local minimum is greater than 0.45 and, before 0.45 on the density plot, there is a small inflection point at which the derivative decreases without changing sign, the derivative of the density plot between 0.025 and 0.45 is calculated, and the difference value corresponding to the minimum of the derivative is taken as the CNA cutoff; when the difference between two adjacent segments is greater than the CNA cutoff, the gap between them is less than 3 Mb, and the length is greater than 10 Mb, the LST value is incremented by 1;
a whole genome duplication analysis step for the sample to be tested, comprising judging whether whole genome duplication has occurred in the sample to be tested and, if whole genome duplication has occurred, the range of the sample's segments values is greater than 2, and the number of peaks in the segments density plot is greater than 4, subtracting 9 from the LST value to obtain the final LST value;
a homologous recombination repair defect state evaluation step, comprising judging whether the homologous recombination repair defect state is positive or negative according to the BRCA genotype and the final LST value;
wherein, in the whole genome duplication analysis step, whether whole genome duplication has occurred in the sample to be tested is judged from the segments density distribution plot and the range of the segments values, the judging rules comprising:
a. when the range of the sample's segments values is less than 1, whole genome duplication has not occurred in the sample;
b. when the range of the sample's segments values is greater than 1 and the number of peaks is less than 3, whole genome duplication has not occurred in the sample;
c. when the range of the sample's segments values is greater than 1 and the number of peaks is greater than or equal to 3, whole genome duplication has occurred in the sample;
d. when the range of the sample's segments values is greater than 9 and the number of peaks is greater than or equal to 2, whole genome duplication has occurred in the sample;
in the homologous recombination repair defect state evaluation step, the rule for judging the homologous recombination repair defect state comprises: when the BRCA genotype is mutant, the state is judged positive regardless of the LST value; when the BRCA genotype is wild type, the state is judged positive if the LST value is greater than or equal to the HRD biological threshold, and negative otherwise;
the HRD biological threshold is the threshold at which 95% of the BRCA-mutant samples in the model data are judged HRD positive.
2. The evaluation method according to claim 1, characterized in that: in the whole genome sequencing data acquisition and alignment step, low-depth whole genome sequencing refers to a sequencing depth of not more than 5.
3. The evaluation method according to claim 1, characterized in that: the adapter-trimmed sequences are aligned to the reference genome hg19 using bwa-mem2 software.
4. The evaluation method according to claim 1, characterized in that: in the whole genome sequencing data quality control step, the data are qualified when the mapping rate is greater than 95% and the depth is greater than 0.8.
5. The evaluation method according to claim 1, characterized in that: in the contamination data filtering step, the contamination rate threshold applies to the contamination rate of the sample as estimated using a model constructed from population allele frequencies.
6. The evaluation method according to claim 1, characterized in that: the contamination rate threshold is 0.1.
7. The evaluation method according to claim 1, characterized in that: the whole genome duplication analysis step for the sample to be tested further comprises identifying the peaks shown in the segments density distribution plot, the criteria comprising: (1) only peaks higher than 15% of the maximum peak value are counted as peaks; (2) when the preliminary number of peaks is greater than 2, judging by the following rules: if the distance between the peak and the troughs on both its left and right sides is greater than 4% of the maximum peak value, the peak is included in the peak count; if the distance between the peak and the troughs on both its left and right sides is less than 4% of the maximum peak value, the peak is not included in the peak count; if the distance between the peak and the trough on only one side is less than 4% of the maximum peak value, and the same holds for the next peak, it is recorded as one peak.
8. The evaluation method according to any one of claims 1 to 7, characterized in that: the HRD biological threshold is 15.
9. A device for evaluating homologous recombination repair defects, characterized in that: the device comprises a whole genome sequencing data acquisition and alignment module, a whole genome sequencing data quality control module, a contamination data filtering module, a copy number variation analysis module, an LST value calculation module, a whole genome duplication analysis module for the sample to be tested, and a homologous recombination repair defect state evaluation module;
the whole genome sequencing data acquisition and alignment module is used for acquiring raw low-depth whole genome sequencing data of a sample to be tested, removing adapters, aligning the data to a reference genome, and, after sorting the alignments, filtering out duplicate sequences generated by PCR to obtain an alignment file;
the whole genome sequencing data quality control module is used for performing quality analysis on the alignment file to obtain quality information including mapping rate, sequencing depth, GC content, and duplication rate, and filtering according to the quality information to obtain qualified sequencing data;
the contamination data filtering module is used for analyzing the contamination rate of the sequencing data and obtaining sequencing data whose contamination rate is less than a contamination rate threshold;
the copy number variation analysis module is used for determining the tumor purity of the low-depth whole genome sequencing sample using ACE software and generating a total CNA profile comprising sample name, chromosome, start position, end position, copy number, and segments information;
the LST value calculation module is used for calculating the LST value of the sample from the total CNA profile, specifically: deleting segments smaller than 3 Mb and drawing a segments density distribution plot; taking the first local minimum as the CNA cutoff; if there is no local minimum within the interval 0.025 to 0.45, setting the CNA cutoff to a small inflection point appearing in the first peak of the segments density curve, so that lowering the CNA cutoff increases the LST of the sample; merging two adjacent segments when the difference between them is smaller than the CNA cutoff; the rule for calculating the CNA cutoff being as follows: if the local minimum is greater than 0.025 and less than 0.45, the first local minimum is taken as the CNA cutoff; if the local minimum is less than 0.025, 0.025 is taken as the CNA cutoff; if the local minimum is greater than 0.45, 0.45 is taken as the CNA cutoff; when the local minimum is greater than 0.45 and, before 0.45 on the density plot, there is a small inflection point at which the derivative decreases without changing sign, the derivative of the density plot between 0.025 and 0.45 is calculated, and the difference value corresponding to the minimum of the derivative is taken as the CNA cutoff; when the difference between two adjacent segments is greater than the CNA cutoff, the gap between them is less than 3 Mb, and the length is greater than 10 Mb, the LST value is incremented by 1;
the whole genome duplication analysis module for the sample to be tested is used for judging whether whole genome duplication has occurred in the sample to be tested and, if whole genome duplication has occurred, the range of the sample's segments values is greater than 2, and the number of peaks in the segments density plot is greater than 4, subtracting 9 from the LST value to obtain the final LST value;
the homologous recombination repair defect state evaluation module is used for judging whether the homologous recombination repair defect state is positive or negative according to the BRCA genotype and the final LST value;
wherein, in the whole genome duplication analysis module, whether whole genome duplication has occurred in the sample to be tested is judged from the segments density distribution plot and the range of the segments values, the judging rules comprising:
a. when the range of the sample's segments values is less than 1, whole genome duplication has not occurred in the sample;
b. when the range of the sample's segments values is greater than 1 and the number of peaks is less than 3, whole genome duplication has not occurred in the sample;
c. when the range of the sample's segments values is greater than 1 and the number of peaks is greater than or equal to 3, whole genome duplication has occurred in the sample;
d. when the range of the sample's segments values is greater than 9 and the number of peaks is greater than or equal to 2, whole genome duplication has occurred in the sample;
in the homologous recombination repair defect state evaluation module, the rule for judging the homologous recombination repair defect state comprises: when the BRCA genotype is mutant, the state is judged positive regardless of the LST value; when the BRCA genotype is wild type, the state is judged positive if the LST value is greater than or equal to the HRD biological threshold, and negative otherwise;
the HRD biological threshold is the threshold at which 95% of the BRCA-mutant samples in the model data are judged HRD positive.
10. The evaluation device according to claim 9, characterized in that: in the whole genome sequencing data acquisition and alignment module, low-depth whole genome sequencing refers to a sequencing depth of not more than 5.
11. The evaluation device according to claim 9, characterized in that: the adapter-trimmed sequences are aligned to the reference genome hg19 using bwa-mem2 software.
12. The evaluation device according to claim 9, characterized in that: in the whole genome sequencing data quality control module, the data are qualified when the mapping rate is greater than 95% and the depth is greater than 0.8.
13. The evaluation device according to claim 9, characterized in that: in the contamination data filtering module, the contamination rate threshold applies to the contamination rate of the sample as estimated using a model constructed from population allele frequencies.
14. The evaluation device according to claim 9, characterized in that: the contamination rate threshold is 0.1.
15. The evaluation device according to claim 9, characterized in that: the whole genome duplication analysis module for the sample to be tested is further used for identifying the peaks shown in the segments density distribution plot, the criteria comprising: (1) only peaks higher than 15% of the maximum peak value are counted as peaks; (2) when the preliminary number of peaks is greater than 2, judging by the following rules: if the distance between the peak and the troughs on both its left and right sides is greater than 4% of the maximum peak value, the peak is included in the peak count; if the distance between the peak and the troughs on both its left and right sides is less than 4% of the maximum peak value, the peak is not included in the peak count; if the distance between the peak and the trough on only one side is less than 4% of the maximum peak value, and the same holds for the next peak, it is recorded as one peak.
16. The evaluation device according to any one of claims 9 to 15, characterized in that: the HRD biological threshold is 15.
17. An apparatus for evaluating homologous recombination repair defects, characterized in that: the apparatus includes a memory and a processor;
the memory is configured to store a program;
the processor is configured to implement the method for evaluating homologous recombination repair defects of any one of claims 1-8 by executing the program stored in the memory.
18. A computer-readable storage medium, characterized in that: the storage medium stores a program executable by a processor to implement the method for evaluating homologous recombination repair defects of any one of claims 1-8.
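
Two of the numeric rules recited in claims 1 and 9 can be illustrated with a short Python sketch: the bounding of the CNA cutoff to the interval 0.025 to 0.45, and the whole genome duplication rules a to d. The function names are hypothetical. The sketch also makes two assumptions that the claims do not settle: rule d is checked before rule b, since the two overlap when the range is greater than 9 and exactly 2 peaks are present, and a range of exactly 1 is treated as no duplication. The inflection-point special cases for the CNA cutoff are not covered.

```python
CNA_CUTOFF_MIN = 0.025
CNA_CUTOFF_MAX = 0.45


def clamp_cna_cutoff(first_local_min):
    """Bound the first local minimum of the segments density plot.

    Values below 0.025 become 0.025, values above 0.45 become 0.45,
    and values in between are used unchanged. Hypothetical helper.
    """
    return min(max(first_local_min, CNA_CUTOFF_MIN), CNA_CUTOFF_MAX)


def has_wgd(segment_range, peak_count):
    """Apply rules a to d to decide whether WGD occurred.

    segment_range -- range (max minus min) of the sample's segments values
    peak_count    -- number of peaks in the segments density plot
    """
    # Rule d, checked first (assumption): very wide range with 2 or more peaks.
    if segment_range > 9 and peak_count >= 2:
        return True
    # Rule c: range above 1 with 3 or more peaks.
    if segment_range > 1 and peak_count >= 3:
        return True
    # Rules a and b, plus the unspecified case of a range equal to 1 (assumption).
    return False


if __name__ == "__main__":
    print(clamp_cna_cutoff(0.6))
    print(has_wgd(1.5, 3))
```

Checking rule d first is only one way to resolve the overlap with rule b; an implementation following the original specification may order the rules differently.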
CN202111572513.6A 2021-12-21 2021-12-21 Method and device for evaluating homologous recombination repair defects and storage medium Active CN114242170B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111572513.6A CN114242170B (en) 2021-12-21 2021-12-21 Method and device for evaluating homologous recombination repair defects and storage medium


Publications (2)

Publication Number Publication Date
CN114242170A CN114242170A (en) 2022-03-25
CN114242170B true CN114242170B (en) 2023-05-09

Family

ID=80760499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111572513.6A Active CN114242170B (en) 2021-12-21 2021-12-21 Method and device for evaluating homologous recombination repair defects and storage medium

Country Status (1)

Country Link
CN (1) CN114242170B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109219852A (en) * 2016-05-01 2019-01-15 基因组研究有限公司 The method for characterizing DNA sample
CN111462823A (en) * 2020-04-08 2020-07-28 西安交通大学 Homologous recombination defect judgment method based on DNA sequencing data
CN113658638A (en) * 2021-08-20 2021-11-16 江苏先声医学诊断有限公司 Detection method and quality control system for homologous recombination defects based on NGS platform

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016025958A1 (en) * 2014-08-15 2016-02-18 Myriad Genetics, Inc. Methods and materials for assessing homologous recombination deficiency
CN111883211B (en) * 2020-08-07 2021-04-23 张哲� Gene scar for representing HRD homologous recombination repair defect and identification method
CN112164423B (en) * 2020-10-14 2021-03-23 深圳吉因加医学检验实验室 Fusion gene detection method, device and storage medium based on RNAseq data
CN112802548B (en) * 2021-01-07 2021-10-22 深圳吉因加医学检验实验室 Method for predicting allele-specific copy number variation of single-sample whole genome
CN114999568B (en) * 2021-06-28 2023-04-18 北京橡鑫生物科技有限公司 Calculation method of telomere allele imbalance TAI




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant