CN114694752B - Method, computing device and medium for predicting homologous recombination repair defects - Google Patents

Method, computing device and medium for predicting homologous recombination repair defects

Info

Publication number
CN114694752B
Authority
CN
China
Prior art keywords
label
data
tag
mutation
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210226275.1A
Other languages
Chinese (zh)
Other versions
CN114694752A (en)
Inventor
王凯
陈丽娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Zhiben Medical Laboratory Co ltd
Origimed Technology Shanghai Co ltd
Original Assignee
Shanghai Zhiben Medical Laboratory Co ltd
Origimed Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Zhiben Medical Laboratory Co ltd, Origimed Technology Shanghai Co ltd
Priority to CN202210226275.1A
Publication of CN114694752A
Application granted
Publication of CN114694752B
Legal status: Active

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G16B20/30 Detection of binding sites or motifs
    • G16B20/50 Mutagenesis
    • G16B45/00 ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks

Abstract

The present disclosure relates to a method, computing device, and computer storage medium for predicting homologous recombination repair defects. The method comprises: generating alignment result data for a sample to be tested; determining mutation sites based on the alignment result data; obtaining, based on each mutation site, the base sequences of the sites within a predetermined range upstream and downstream of the mutated base on a human reference genome, so as to generate background map information; generating label data based on the background map information, the label data comprising a plurality of labels; determining the similarity between the labels and tumor mutation label data of a predetermined database, so as to generate a label matrix; and predicting the homologous recombination repair defect of the sample to be tested based on the generated label matrix. The method can effectively improve the accuracy and comprehensiveness of predicting homologous recombination repair defects.

Description

Method, computing device and medium for predicting homologous recombination repair defects
Technical Field
The present disclosure relates generally to bioinformatics processing, and in particular, to methods, computing devices, and computer storage media for predicting homologous recombination repair defects.
Background
DNA can be damaged by various endogenous and exogenous factors during normal metabolism. If the damage is repaired incompletely or not in time, it can become fixed in the genome as mutations and be replicated continuously, and accumulated damage can cause abnormalities in the organism, such as cancer. Normally, DNA damage is repaired by DNA protection mechanisms, including single-strand repair and double-strand repair. Single-strand repair is accomplished mainly by poly ADP-ribose polymerase (PARP), which recruits repair proteins, while double-strand damage is repaired by either non-homologous end joining or homologous recombination repair (HRR). If both of these repair mechanisms are inhibited, cell proliferative capacity is greatly impaired and apoptosis may be promoted. In particular, in tumors carrying homologous recombination repair defects, a PARP inhibitor can bind the PARP enzyme so that PARP can no longer recruit repair proteins; single-strand damage repair is blocked, the unrepaired single-strand breaks lead to double-strand breaks during replication, and tumor cell apoptosis is promoted. According to previous studies, homologous recombination deficiency (HRD)-positive patients respond better to platinum and PARP inhibitors than HRD-negative patients, and three PARP inhibitors, niraparib, olaparib, and akapraepal, have been approved by the CFDA for maintenance therapy of platinum-sensitive recurrent ovarian cancer. Accurate detection of homologous recombination repair defects is therefore important.
Conventional methods for detecting homologous recombination repair defects include, for example, the following three. The first detects BRCA1/2 or other HRR gene mutations, or loss of heterozygosity (LOH), in tissue samples and blood samples. Although pathogenic BRCA1/2 mutation is currently the accepted and best-documented marker for HRD detection, BRCA1/2 alone is not a sufficient criterion for HRD classification: homologous recombination repair is an important pathway by which normal cells repair damage and involves complex, multi-step signaling pathways, and although BRCA1 and BRCA2 are the most critical proteins, other genes such as PALB2, RAD51, and ATM are also involved. According to studies, BRCA1/2 loss is present in 25% of ovarian cancer patients with homologous recombination defects, and when the gene range is widened this figure rises from 25% to 50%-70%. Notably, HRD is also present in samples that are negative for BRCA1/2 mutation.
The second uses a sequencing chip for identifying HRD-related genes, which performs targeted sequencing of some ten to twenty homologous recombination pathway genes in order to determine the variation status of the HRD genes. However, this method is severely limited because the target range of the chip design is too narrow and only fixed sites can be detected.
The third generates three scores for evaluation, namely loss of heterozygosity, telomeric allelic imbalance, and large-scale state transitions, based on a genome-wide SNP algorithm. However, both of these methods are costly, and both require probes designed specifically for this application that cannot be shared with other products.
In summary, the conventional methods for detecting homologous recombination repair defects have the following disadvantages: they can only detect fixed sites, are not comprehensive, are severely limited, and have a high detection cost.
Disclosure of Invention
The present disclosure provides a method, a computing device, and a computer storage medium for detecting homologous recombination repair defects, which can effectively improve the accuracy and comprehensiveness of predicting homologous recombination repair defects.
According to a first aspect of the present disclosure, a method of detecting a homologous recombination repair defect is provided. The method comprises the following steps: generating alignment result data about the sample to be tested based on the alignment of the sequencing data about the sample to be tested with the human reference genome sequence; determining mutation sites based on the alignment result data; obtaining, based on each mutation site, the base sequences of the sites within a predetermined range upstream and downstream of the mutated base on the human reference genome, so as to generate background map information; generating label data based on the background map information, the label data comprising a plurality of labels; determining the similarity between the labels and tumor mutation label data of a predetermined database, so as to generate a label matrix; and predicting the homologous recombination repair defect of the sample to be tested based on the generated label matrix.
According to a second aspect of the present disclosure, there is also provided a computing device comprising: a memory configured to store one or more computer programs; and a processor coupled to the memory and configured to execute the one or more programs to cause the computing device to perform the method of the first aspect of the disclosure.
According to a third aspect of the present disclosure, there is also provided a non-transitory computer-readable storage medium. The non-transitory computer readable storage medium has stored thereon machine executable instructions which, when executed, cause a machine to perform the method of the first aspect of the disclosure.
In some embodiments, determining the similarity between the labels and the tumor mutation label data of the predetermined database to generate the label matrix comprises: calculating a similarity value between each of the plurality of labels included in the label data and each item of tumor mutation label data in the predetermined database; comparing the calculated similarity values to determine the maximum similarity value associated with each item of tumor mutation label data; and generating the label matrix based on the maximum similarity value associated with each item of tumor mutation label data and the identifier of the sample to be tested, wherein the label matrix indicates the identifier of the sample to be tested and a plurality of label feature values associated with that identifier, and each of the plurality of label feature values indicates the maximum similarity value associated with the corresponding tumor mutation label data.
In some embodiments, predicting the homologous recombination repair defect for the test sample based on the generated tag matrix comprises: generating an input data matrix based on the label matrix and the number of samples to be detected; determining the percentage of the target screening characteristics; generating target input features based on the input data matrix and the target screening feature percentage; and extracting features of the target input features via the trained predictive model to generate a prediction result regarding homologous recombination repair defects of the sample to be tested.
In some embodiments, determining the target screening feature percentage comprises: screening the input data matrix based on a screening feature percentage to obtain candidate label feature values; fitting the candidate label feature values, the input data matrix serving as the test set, and classification data about homologous recombination repair defects to obtain a fitting model; calculating the accuracy of the fitting model; and determining the target fitting model and the target screening feature percentage based on the accuracy of the fitting model and the screening feature percentage.
In some embodiments, determining the target screening feature percentage comprises: determining the screening feature percentage corresponding to the maximum accuracy of the fitting model as the target screening feature percentage.
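The search over screening feature percentages described above can be illustrated with a minimal Python sketch. The disclosure does not name a feature ranking criterion or a model type at this point, so the ranking by absolute difference of class means and the nearest-centroid classifier below are assumptions chosen only to keep the example self-contained; all function names are hypothetical.

```python
import math

def class_means(rows, labels, cls):
    """Column-wise mean of the rows that belong to class `cls`."""
    members = [r for r, y in zip(rows, labels) if y == cls]
    return [sum(col) / len(members) for col in zip(*members)]

def select_columns(rows, labels, percent):
    """Keep the top `percent` of columns, ranked by |mean(HRD+) - mean(HRD-)| (assumed criterion)."""
    pos, neg = class_means(rows, labels, 1), class_means(rows, labels, 0)
    scores = [abs(p - n) for p, n in zip(pos, neg)]
    keep = max(1, math.ceil(len(scores) * percent / 100))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

def predict(row, centroids, columns):
    """Nearest-centroid prediction restricted to the selected columns (assumed model)."""
    def dist(center):
        return sum((row[i] - center[i]) ** 2 for i in columns)
    return min(centroids, key=lambda cls: dist(centroids[cls]))

def choose_target_percentage(train, y_train, test, y_test, percentages):
    """Return (percentage, accuracy, columns) for the percentage with the highest accuracy."""
    best = None
    centroids = {cls: class_means(train, y_train, cls) for cls in (0, 1)}
    for percent in percentages:
        columns = select_columns(train, y_train, percent)
        hits = sum(predict(r, centroids, columns) == y for r, y in zip(test, y_test))
        accuracy = hits / len(test)
        if best is None or accuracy > best[1]:
            best = (percent, accuracy, columns)
    return best
```

When several percentages give the same accuracy, this sketch keeps the first one tried; a different tie-breaking rule would be equally compatible with the text.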
In some embodiments, predicting the homologous recombination repair defect for the test sample based on the generated tag matrix comprises: determining a target tag characteristic value in tag data based on the tumor mutation tag data of a predetermined database and the associated cancer species of a sample to be detected; determining whether the characteristic value of the target label meets a preset condition; in response to the fact that the target label characteristic value meets the preset condition, determining that the to-be-detected sample has the homologous recombination repair defect; and determining that the sample to be tested does not have the homologous recombination repair defect in response to determining that the target label characteristic value does not meet the predetermined condition.
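A minimal sketch of this rule-based variant follows, assuming the predetermined condition is a threshold comparison on the target label feature values. The label names, the threshold values, and the way target labels are chosen for a cancer type are hypothetical placeholders, not values from the disclosure.

```python
def meets_condition(target_values, thresholds):
    """Check the predetermined condition on the target label feature values.

    `thresholds` is either one number shared by every target label or a
    dict that gives each target label its own threshold.
    """
    if isinstance(thresholds, dict):
        return all(value >= thresholds[name] for name, value in target_values.items())
    return all(value >= thresholds for value in target_values.values())

def predict_hrd(label_row, target_labels, thresholds):
    """Return True if the sample is predicted to carry a homologous recombination repair defect."""
    target_values = {name: label_row[name] for name in target_labels}
    return meets_condition(target_values, thresholds)
```

For example, with a label matrix row `{"RefA": 0.9, "RefB": 0.4}` and a shared threshold of 0.7, a sample whose only target label is `RefA` would be called positive, while one that also requires `RefB` would be called negative.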
In some embodiments, determining the mutation sites based on the alignment result data comprises: determining candidate mutation sites based on the alignment result data; and filtering the determined candidate mutation sites based on the number of supporting reads, the quality value of the mutated base, the mutation frequency, and the forward/reverse strand ratio to determine the mutation sites.
In some embodiments, obtaining the base sequences of sites within a predetermined range upstream and downstream of the corresponding mutated base on the human reference genome so as to generate the background map information comprises: positioning the determined mutation sites on a human reference genome sequence so as to obtain the base sequences of sites 1bp respectively upstream and downstream of the mutation bases of the mutation sites; and generating background map information based on the base sequence.
In some embodiments, generating the tag data based on the background map information comprises: generating the tag data via non-negative matrix factorization based on the background map information, the tag data including a plurality of tag feature values, and the background map information indicating mutation spectra of the 96 base sequences.
In some embodiments, the predetermined database is a COSMIC database, and the predetermined database includes 30 tumor mutation signature features.
In some embodiments, determining that the target tag feature value satisfies the predetermined condition comprises: determining that the target tag feature value satisfies the predetermined condition in response to determining either of the following: each target tag feature value in the tag data is greater than or equal to a preset threshold; or the target tag feature values in the tag data are each greater than or equal to their respective preset thresholds.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the disclosure, nor is it intended to be used to limit the scope of the disclosure.
Drawings
FIG. 1 shows a schematic diagram of a system for implementing a method of detecting a homologous recombination repair defect according to an embodiment of the present disclosure.
FIG. 2 shows a flow diagram of a method for detecting homologous recombination repair defects according to an embodiment of the present disclosure.
FIG. 3 shows a flow diagram of a method for generating a tag matrix according to an embodiment of the present disclosure.
FIG. 4 shows a flow chart of a method for predicting homologous recombination repair defects with respect to a test sample according to an embodiment of the present disclosure.
FIG. 5 shows a flow diagram of a method for determining a target screening feature percentage according to an embodiment of the present disclosure.
FIG. 6 schematically illustrates a block diagram of an electronic device suitable for implementing embodiments of the present disclosure.
Like or corresponding reference characters indicate like or corresponding parts throughout the several views.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are illustrated in the accompanying drawings, it is to be understood that the disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The term "include" and variations thereof as used herein is meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object.
As described above, the conventional methods for detecting the homologous recombination repair status have the following disadvantages: they can only detect fixed sites, are not comprehensive, are severely limited, and have a high detection cost.
To address, at least in part, one or more of the above problems, as well as other potential problems, example embodiments of the present disclosure propose a scheme for predicting homologous recombination repair defects. In this scheme, background map information is generated by obtaining, based on the determined mutation sites, the base sequences of the sites within predetermined ranges upstream and downstream of each mutated base on the human reference genome. Because the background base sequences of mutation sites are strongly correlated with the presence of exogenous carcinogens, the generated background map information carries richer variation information associated with such carcinogens. In addition, a label matrix is generated by calculating the similarity between the label data generated from the background map information and the tumor mutation label data of a predetermined database. As a result, the label matrix used to predict the homologous recombination repair status reflects not only the existing knowledge captured by the tumor mutation label features in the predetermined database, but also richer, latent mutation features that are extracted through the similarity values and are not contained in that database. The disclosure can therefore predict homologous recombination repair defects by combining both sources of information, which effectively improves the comprehensiveness and accuracy of the prediction.
FIG. 1 shows a schematic diagram of a system 100 for implementing a method of detecting a homologous recombination repair defect according to an embodiment of the present disclosure. As shown in FIG. 1, system 100 includes, for example, a computing device 110, a sequencing device 130, a messaging server 140, and a network 150. The computing device 110 may interact with the sequencing device 130 and the messaging server 140 in a wired or wireless manner via the network 150.
The sequencing device 130 is used, for example, for sequencing a sample to be tested from a subject so as to obtain sequencing data about the sample to be tested. The sequencing method of the sequencing device 130 is not limited to WES; multi-gene targeted sequencing (number of genes >= 300), WGS, and other sequencing methods may also be used to obtain the sequencing data. The sequencing device 130 is also used to send the sequencing data about the sample to be tested to the computing device 110. In some embodiments, the sequencing data about the sample to be tested may also come from the messaging server 140.
The computing device 110 is used, for example, for generating alignment result data about the sample to be tested and for determining mutation sites based on the alignment result data. The computing device 110 is further configured to obtain, based on each mutation site, the base sequences of the sites within predetermined ranges upstream and downstream of the mutated base on the human reference genome, so as to generate background map information; to generate label data based on the background map information; to determine the similarity between the labels and the tumor mutation label data of a predetermined database, so as to generate a label matrix; and to predict the homologous recombination repair defect of the sample to be tested based on the generated label matrix.
In some embodiments, computing device 110 may have one or more processing units, including special purpose processing units such as GPUs, FPGAs, and ASICs, as well as general purpose processing units such as CPUs. In addition, one or more virtual machines may also be running on each computing device. Computing device 110 includes, for example: an alignment result data generating unit 112, a mutation site determining unit 114, a background map information determining unit 116, a tag data generating unit 118, a tag matrix generating unit 120, and a homologous recombination repair defect predicting unit 122. These units may be configured on one or more computing devices 110.
The alignment result data generating unit 112 is used for generating alignment result data about the sample to be tested based on the alignment of the sequencing data about the sample to be tested with the human reference genome sequence.
The mutation site determining unit 114 is used for determining mutation sites based on the alignment result data.
The background map information determining unit 116 is used for obtaining, based on each mutation site, the base sequences of the sites within predetermined ranges upstream and downstream of the mutated base on the human reference genome, so as to generate background map information.
The tag data generating unit 118 is used for generating tag data based on the background map information, the tag data including a plurality of tags.
The tag matrix generating unit 120 is used for determining the similarity between the tags and the tumor mutation tag data of the predetermined database, so as to generate the tag matrix.
The homologous recombination repair defect predicting unit 122 is used for predicting the homologous recombination repair defect of the sample to be tested based on the generated tag matrix.
A method 200 for detecting homologous recombination repair defects according to an embodiment of the present disclosure will be described below in conjunction with FIG. 2. FIG. 2 shows a flow diagram of the method 200 according to an embodiment of the present disclosure. It should be understood that the method 200 may be performed, for example, at the electronic device 600 depicted in FIG. 6, and may also be performed at the computing device 110 depicted in FIG. 1. It should also be understood that method 200 may include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.
At step 202, the computing device 110 generates alignment result data about the sample to be tested based on the alignment of the sequencing data about the sample to be tested with the human reference genome sequence.
For example, the computing device 110 obtains sequencing data about the sample to be tested (i.e., sequencing reads in FASTQ format) from the sequencing device 130; preprocesses the sequencing data (the preprocessing includes removing reads that contain sequencing adapters using bioinformatics software such as Trimmomatic or Cutadapt); and then aligns the retained reads with a human genome sequence (e.g., the hg19 version of the human reference genome) using the alignment software BWA to generate a BAM file, which is the alignment result data for the sample to be tested.
In some embodiments, the computing device 110 may further sort and index the alignment file with the sambamba software and mark duplicate sequences with the MarkDuplicates tool of the Picard software to optimize the BAM file.
At step 204, the computing device 110 determines mutation sites based on the alignment result data. In some embodiments, the computing device 110 may determine the mutation sites by using the BAM file or the optimized BAM file as the input file and performing mutation analysis with the Mutect2 software (e.g., using default parameters).
The method for determining the mutation sites includes, for example: the computing device 110 determines candidate mutation sites based on the alignment result data, and filters the determined candidate mutation sites based on the number of supporting reads, the quality value of the mutated base, the mutation frequency, and the forward/reverse strand ratio to determine the mutation sites. For example, the computing device 110 performs mutation analysis on the alignment result data (i.e., the BAM file) using mutation detection software (e.g., Mutect2) to obtain candidate mutation sites; filters the obtained sites based on the quality value, the forward/reverse strand ratio, the number of supporting reads, and SNP status; and generates an input file for the mutation sites from the sites retained after filtering (i.e., the determined mutation sites). For example, the filter conditions are: number of supporting reads >= 10, mutated base quality value >= 30, mutation frequency > 1%, and forward/reverse strand ratio between 10% and 90%. Sites that pass the filter are retained to generate the input file (e.g., a VCF file) for the mutation sites.
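The example filter conditions can be sketched in Python as follows. The record layout and field names are hypothetical, and two points are assumptions rather than statements from the disclosure: the strand ratio is taken to be the fraction of supporting reads on the forward strand, and the 10% and 90% bounds are treated as inclusive.

```python
from dataclasses import dataclass

@dataclass
class CandidateVariant:
    chrom: str
    pos: int
    ref: str
    alt: str
    supporting_reads: int    # reads carrying the mutated base
    base_quality: float      # quality value of the mutated base
    allele_frequency: float  # mutation frequency as a fraction (0.05 means 5%)
    forward_reads: int       # supporting reads on the forward strand
    reverse_reads: int       # supporting reads on the reverse strand

def passes_filters(v, min_reads=10, min_quality=30, min_af=0.01,
                   strand_low=0.10, strand_high=0.90):
    """Apply the four example filter conditions to one candidate site."""
    total = v.forward_reads + v.reverse_reads
    if total == 0:
        return False
    strand_ratio = v.forward_reads / total  # assumed definition of the strand ratio
    return (v.supporting_reads >= min_reads
            and v.base_quality >= min_quality
            and v.allele_frequency > min_af
            and strand_low <= strand_ratio <= strand_high)

def filter_variants(candidates):
    """Keep only the candidate sites that satisfy every filter condition."""
    return [v for v in candidates if passes_filters(v)]
```

Keeping the thresholds as keyword arguments makes it easy to adapt the sketch if a different sequencing depth or panel calls for other cut-offs.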
The input file for the mutation sites may be a VCF file or a txt file, and indicates, for example, at least the following information: the identifier of the sample to be tested (such as the sample name), the chromosome, the mutation start position, the mutation end position, the reference base, and the mutated base. Table 1 below schematically shows the input file for the mutation sites.
TABLE 1
(Table provided as image BDA0003539279340000091 in the original publication; not reproduced here.)
At step 206, the computing device 110 obtains, based on the variation site, base sequences of sites within predetermined ranges upstream and downstream of a mutant base at which the variation site is located on the human reference genome, so as to generate background map information.
Regarding the method of generating background map information, it includes, for example: positioning the determined variation site on a human reference genome sequence so as to obtain a base sequence of a site 1bp respectively upstream and downstream of a mutation base where the variation site is located; and generating background map information based on the base sequence.
It is to be understood that for a mutation site such as a single nucleotide mutation, there are 96 possible cases once the background bases (i.e., the bases of the sites 1 bp upstream and 1 bp downstream of the mutated base) are taken into account. For example, with the mutated base denoted by "X", there are 6 mutation combinations, namely: CXG → AXT, CXG → GXC, CXG → TXA, TXA → AXT, TXA → CXG, and TXA → GXC. The first base (i.e., the base of the site 1 bp upstream of the mutated base) represents the 5' end and has four possibilities (A, C, G, T), and the third base (i.e., the base of the site 1 bp downstream of the mutated base) represents the 3' end and likewise has four possibilities (A, C, G, T). The number of possible changes of the base sequence consisting of the mutated base and the bases 1 bp upstream and downstream of it is therefore 4 × 6 × 4 = 96. The background base sequences of these mutation sites are strongly correlated with the presence of exogenous carcinogens. For example, the CXG → AXT conversion is associated with lung cancer samples from smokers, whereas CXG → TXA is frequently over-represented in skin cancers owing to excessive ultraviolet irradiation. It is understood that differences between samples can be characterized using background map information (alternatively referred to as "mutation map information") formed from these background base sequences. Thus, the present disclosure can generate background map information indicating mutation spectra of the 96 base sequences based on the mutated base at each mutation site and the bases of the sites 1 bp upstream and downstream of it.
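The 4 × 6 × 4 = 96 enumeration can be checked with a short Python sketch. It writes each category as `5'[ref>alt]3'` on the pyrimidine strand, which is a common convention for mutation spectra but is an assumption here, since the disclosure uses its own notation; the function names are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")
# Six substitution classes, written from the pyrimidine (C or T) strand.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

def all_channels():
    """Enumerate the 4 x 6 x 4 = 96 categories, e.g. 'A[C>T]G'."""
    return [f"{five}[{sub}]{three}"
            for sub in SUBSTITUTIONS
            for five, three in product(BASES, repeat=2)]

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def channel_for(upstream, ref, alt, downstream):
    """Map one single-nucleotide mutation and its 1 bp flanks to a category."""
    if ref in "AG":  # purine reference: describe the event on the opposite strand
        context = reverse_complement(upstream + ref + downstream)
        upstream, ref, downstream = context[0], context[1], context[2]
        alt = alt.translate(COMPLEMENT)
    return f"{upstream}[{ref}>{alt}]{downstream}"

def count_channels(mutations):
    """Count mutations per category; each mutation is (upstream, ref, alt, downstream)."""
    counts = Counter(channel_for(*m) for m in mutations)
    return [counts.get(channel, 0) for channel in all_channels()]
```

Folding purine references onto the opposite strand is what keeps the count at 96 rather than 192, because a G>A change on one strand is the same event as C>T on the other.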
At step 208, the computing device 110 generates tag data based on the background map information, the tag data including a plurality of tags. For example, the computing device 110 generates the tag data via non-negative matrix factorization (NMF) based on the background map information, the tag data including a plurality of tag feature values and the background map information indicating mutation spectra of the 96 base sequences.
In some embodiments, the computing device 110 maps the determined mutation sites onto the human reference genome sequence to obtain the bases of the sites 1 bp upstream and downstream of the mutated base at each mutation site, and thereby forms background map information indicating the 96 base sequences, e.g., a count matrix of the corresponding 96 base sequences in the sample.
Then, using the NMF algorithm and based on the structural features of the mutation background map information, the computing device 110 decomposes the background map information indicating the mutation spectra of the 96 base sequences into tag data that characterizes the contribution of the background map information in each sample.
The tag data includes, for example, 4 tags, i.e., tag A, tag B, tag C, and tag D. Each of the 4 tags indicates the probabilities of the 96 base sequences, combined in different ratios. The tag data may also comprise other numbers of tags; for example, colorectal cancer carries more mutations, so the tag data generated from its background map information will differ in both the number and the values of the tags.
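The decomposition can be sketched with the classic multiplicative-update form of NMF. This is only one possible NMF variant and is an assumption here, since the disclosure does not name a specific solver; the helper names and the small 6 × 4 toy matrix (standing in for a 96 × n count matrix) are likewise illustrative.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor non-negative v (rows x cols) into w (rows x k) and h (k x cols)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n_rows)]
    h = [[rng.random() + eps for _ in range(n_cols)] for _ in range(k)]
    for _ in range(iterations):
        # h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_cols)]
             for i in range(k)]
        # w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n_rows)]
    return w, h


def squared_error(v, w, h):
    approx = matmul(w, h)
    return sum((x - y) ** 2 for rv, ra in zip(v, approx) for x, y in zip(rv, ra))


def normalize_columns(w):
    """Scale each column (tag) so that its entries sum to 1."""
    sums = [sum(col) for col in zip(*w)]
    return [[x / s if s else 0.0 for x, s in zip(row, sums)] for row in w]


# Toy count matrix: 6 background sequences (rows) x 4 samples (columns).
counts = [[5, 0, 5, 1], [3, 0, 3, 1], [0, 4, 1, 4],
          [0, 2, 0, 2], [1, 1, 2, 2], [2, 0, 2, 0]]
w, h = nmf(counts, k=2)
tags = normalize_columns(w)  # each column is one tag over the background sequences
print(squared_error(counts, w, h))
```

In this sketch each column of `tags` plays the role of one tag, and the rows of `h` give the per-sample contributions.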
At step 210, the computing device 110 determines the similarity between the tags and the tumor mutation tag data of the predetermined database to generate a tag matrix.
It should be understood that if the background map information indicating the mutation spectra of the 96 base sequences were decomposed directly onto the 30 tumor mutation tag features in the COSMIC (Catalogue Of Somatic Mutations In Cancer) database, a problem would arise: the internal base background structure of the input file is not exactly the same as that of the known reference background. Tag features of different classes might then be predicted as a single tag feature, and overfitting might likewise occur, with one tag feature predicted as two or more tag features. The present disclosure instead uses the NMF algorithm to find suitable tags according to the structural features of the mutation background, and these tags represent the contribution in each sample well. It then performs a similarity analysis (cosine similarity) between the new tags and the predetermined database (for example, the 30 COSMIC tag features existing in the COSMIC database) to obtain the similarity between the new tags and the predetermined database, and from this generates the tag matrix. The tag feature values in the tag matrix combine existing tag experience with the latent variation information that is strongly related to the presence of exogenous carcinogens. The disclosure can therefore avoid both predicting tag feature values of different classes as one tag feature value and predicting one tag feature value as two or more tag feature values, and can also avoid overfitting.
The tag matrix indicates, for example, the identifier of the sample to be tested and a plurality of tag feature values associated with that identifier, each of the plurality of tag feature values indicating the maximum similarity value associated with the corresponding tumor mutation tag data.
The predetermined database is, for example, the COSMIC (Catalogue Of Somatic Mutations In Cancer) database, which includes 30 tumor mutation tag features.
The method of generating the tag matrix includes, for example: calculating a similarity value between each of the plurality of tags included in the tag data and each item of tumor mutation tag data in the predetermined database; comparing the calculated similarity values to determine the maximum similarity value associated with each item of tumor mutation tag data; and generating the tag matrix based on the maximum similarity value associated with each item of tumor mutation tag data and the identifier of the sample to be tested, wherein the tag matrix indicates the identifier of the sample to be tested and a plurality of tag feature values associated with that identifier, each of the plurality of tag feature values indicating the maximum similarity value associated with the corresponding tumor mutation tag data. The method 300 for generating the tag matrix is described in detail below with reference to fig. 3 and is not repeated here.
For example, the computing device 110 performs a similarity analysis between the tags determined at step 208 and the tumor mutation tag features in the tumor mutation tag data of the predetermined database (e.g., the 30 tumor mutation tag features in the COSMIC database). The method for performing the similarity analysis is described below with reference to formula (1).
$$\mathrm{cosine}(A, B) = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^{2}}\,\sqrt{\sum_{i} B_i^{2}}} \qquad (1)$$
In the above formula (1), A represents a label matrix, and A_i represents the i-th label in the label matrix. B represents the tumor mutation tag data of the predetermined database, and B_i represents the i-th tumor mutation tag feature in the tumor mutation tag data. cosine(A, B) represents a similarity value indicating the similarity relationship between the tag data and the tumor mutation tag data of the predetermined database.
The computing device 110 then screens the calculated similarity values, determines the largest similarity value associated with each tumor mutation signature feature in the COSMIC database, and generates a signature matrix based on a set of these largest similarity values. In some embodiments, the tag matrix is referred to as an ori-hrd-score matrix, for example.
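The similarity analysis, which the description identifies as cosine similarity, can be sketched as follows. The function name and the choice to return 0.0 for an all-zero vector are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 1.0], [1.0, 0.0]))
```

For non-negative vectors such as mutation spectra, the result lies between 0 and 1.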
At step 212, the computing device 110 predicts a homologous recombination repair defect for the sample under test based on the generated tag matrix.
In some embodiments, the method for predicting homologous recombination repair defects of a test sample includes, for example: determining a target tag characteristic value in tag data based on the tumor mutation tag data of a predetermined database and the associated cancer species of a sample to be detected; determining whether the characteristic value of the target label meets a preset condition; in response to the fact that the target label characteristic value meets the preset condition, determining that the to-be-detected sample has the homologous recombination repair defect; and determining that the sample to be tested does not have the homologous recombination repair defect in response to determining that the target label characteristic value does not meet the predetermined condition. The method 400 for predicting the homologous recombination repair defect of the sample to be tested will be described in detail with reference to fig. 4, and will not be described herein again.
In other embodiments, a method for predicting the repair status of homologous recombination in a test sample comprises: the computing device 110 generates an input data matrix based on the label matrix and the number of samples to be tested; determining the percentage of the target screening characteristics; generating target input features based on the input data matrix and the target screening feature percentage; and extracting features of the target input features via the trained predictive model to generate a prediction result regarding homologous recombination repair defects of the sample to be tested. By employing the above approach, the present disclosure can accurately and comprehensively predict HRD for pan-cancer or single-cancer species.
The input data matrix is, for example, an n × 30 matrix generated by the computing device 110 based on the tag matrix (e.g., the ori-hrd-score matrix) generated at step 210 and the number n of samples to be tested. The input data matrix indicates, for example, the tag matrices of a plurality of samples to be tested.
Table 2 below shows, for example, a portion of the data in the 13 × 30 input data matrix. Wherein the number of samples to be tested n =13, and the input data matrix comprises 30 label eigenvalues for each of the 13 samples to be tested. It should be understood that table 2 only schematically shows 7 of the 30 feature values for each of the 13 samples.
TABLE 2
Figure BDA0003539279340000131
Figure BDA0003539279340000141
The method of determining the target screening feature percentage includes, for example: the computing device 110 screens the input data matrix based on a screening feature percentage to obtain candidate feature values; fits the candidate feature values, the tag matrix of the training set, and the classification data on homologous recombination repair defects to obtain a fitting model; calculates the accuracy of the fitting model; and determines the target fitting model and the target screening feature percentage based on the accuracy of the fitting model and the screening feature percentage. The method 500 for determining the target screening feature percentage is described in detail below with reference to fig. 5 and is not repeated here.
For example, the computing device 110 randomly splits the n × 30 input data matrix according to a predetermined ratio (e.g., 2:1) into an input data matrix serving as the training set and an input data matrix serving as the test set, and performs cross-validation on the training set so as to screen out the target input features.
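A random 2:1 split of this kind can be sketched as below. The function name, the fixed seed, the rounding rule, and the placeholder rows are assumptions made for illustration.

```python
import random


def split_samples(rows, train_parts=2, test_parts=1, seed=0):
    """Shuffle the rows and split them train:test in the given ratio."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_train = round(len(rows) * train_parts / (train_parts + test_parts))
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test


# 13 hypothetical samples, each with 30 placeholder tag feature values.
rows = [(f"sample_{i}", [0.0] * 30) for i in range(1, 14)]
train, test = split_samples(rows)
print(len(train), len(test))
```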
The method of generating the target input features includes, for example: the computing device 110 screens the input data matrix based on a screening feature percentage (the screening feature percentage is stepped at a fixed interval, for example in increments of 2% between 1 and 100) to obtain candidate tag feature values; fits the candidate tag feature values, the input data matrix serving as the test set, and the classification data on homologous recombination repair defects to obtain a fitting model; calculates the accuracy of the fitting model; and determines the target fitting model and the target screening feature percentage based on the accuracy of the fitting model and the screening feature percentage.
Determining the target screening feature percentage includes: determining the screening feature percentage corresponding to the maximum accuracy of the fitting model as the target screening feature percentage. After determining the target screening feature percentage, the computing device 110 may generate the target input features based on the input data matrix and the target screening feature percentage. For example, the accuracies of 50 training models and the corresponding screening feature percentages can be obtained and plotted against each other, showing how the accuracy of the fitting model changes with the feature screening proportion; the screening value x% at which the accuracy peaks reflects the optimal model accuracy. For example, the most accurate model is achieved when the screening feature percentage is 5%, with an accuracy of 0.88, and the feature values corresponding to this 5% are the optimal combination of tag feature values, e.g., the combination of tag feature value 3 (mark 3), tag feature value 10 (mark 10), tag feature value 13 (mark 13), tag feature value 15 (mark 15), and tag feature value 26 (mark 26) in the input data matrix. The target input features finally generated by the computing device 110 are then, for example, this combination of tag feature values 3, 10, 13, 15, and 26 in the input data matrix.
The prediction model is obtained, for example, by training the target fitting model on samples of the screened target input features (e.g., samples carrying classification labels indicating "HRD present" or "HRD not present"), thereby generating the trained prediction model.
In some embodiments, the method 200 further comprises evaluating the prediction model. For example, the computing device 110 inputs the target input features serving as the test set into the prediction model and determines the consistency between the HRD classification indicated by the prediction result and the HRD classification label carried by the target input features, thereby determining the test accuracy on the test set. The computing device 110 then determines whether the deviation between the peak of the test accuracy and the peak of the training accuracy obtained on the training set is greater than or equal to a predetermined deviation threshold. If the deviation is less than the predetermined deviation threshold, indicating that the two peaks are close, the prediction model is determined to meet the predetermined condition. For example, if the peak test accuracy is 0.87, which is close to the peak training accuracy of 0.88, the prediction model meets the predetermined condition.
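The evaluation described above reduces to an accuracy computation and a comparison of two peaks. In the sketch below, the function names and the deviation threshold of 0.05 are assumptions; the disclosure gives only the example peaks 0.88 and 0.87.

```python
def accuracy(predicted, actual):
    """Fraction of samples whose predicted HRD class matches the label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def model_meets_condition(train_peak, test_peak, max_deviation):
    """True when the two accuracy peaks differ by less than the threshold."""
    return abs(train_peak - test_peak) < max_deviation


# Peaks from the example in the text; 0.05 is a hypothetical threshold.
print(model_meets_condition(0.88, 0.87, max_deviation=0.05))
```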
In the above scheme, based on the determined mutation sites, the base sequences of the sites within predetermined ranges upstream and downstream of the mutated base at each mutation site are obtained on the human reference genome so as to generate the background map information. Because the background base sequence of a mutation site is strongly related to the presence of exogenous carcinogens, the generated background map information includes richer mutation information strongly related to the presence of exogenous carcinogens. In addition, the tag matrix is generated by calculating the similarity between the tag data generated from the background map information and the tumor mutation tag data of the predetermined database. The tag matrix used for predicting the homologous recombination repair state therefore includes not only the existing experience indicated by the tumor mutation tag features in the predetermined database, but also richer, latent mutation features that are not contained in the predetermined database and are extracted through the similarity values.
A method 300 for generating a tag matrix according to an embodiment of the present disclosure will be described below in conjunction with fig. 3. Fig. 3 shows a flow diagram of a method 300 for generating a tag matrix according to an embodiment of the present disclosure. It should be understood that the method 300 may be performed, for example, at the electronic device 600 depicted in fig. 6, and may also be performed at the computing device 110 depicted in fig. 1. It should be understood that the method 300 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.
At step 302, the computing device 110 calculates a similarity value between each of the plurality of tags included in the tag data and each item of tumor mutation tag data in the predetermined database.
As for the tag data, it is generated, for example, via the following steps: the computing device 110 maps the variation sites determined based on the input file (VCF file) for the variation sites generated in the method 200 onto the human reference genomic sequence to obtain site base sequences 1bp each upstream and downstream of the corresponding mutated base to form background map information indicative of 96 base sequences, e.g., a count matrix indicative of the corresponding 96 base sequences in the sample. For example, the computing device 110 employs an NMF algorithm to decompose the background map information into label data. The tag data includes, for example, 4 tags, i.e., tag a, tag B, tag C, and tag D. Each of the 4 tags indicates the probability of 96 base sequences being combined based on different ratios.
The computing device 110 performs a similarity analysis between each of the plurality of tags included in the tag data (e.g., tag A, tag B, tag C, and tag D) and each of the 30 tumor mutation tag features in the COSMIC database to obtain similarity values (e.g., 4 × 30 = 120 similarity values). Each similarity value lies between 0 and 1. It should be understood that the closer a similarity value is to 1, the more similar the corresponding tag is to the corresponding tumor mutation tag feature in the COSMIC database.
At step 304, the computing device 110 compares the calculated similarities to determine a maximum similarity value associated with each of the tumor mutation signature data.
At step 306, the computing device 110 generates a tag matrix based on the maximum similarity value associated with each tumor mutation tag data and the sample under test identification, the tag matrix indicating the sample under test identification and a plurality of tag feature values associated with the sample under test identification, each of the plurality of tag feature values indicating the maximum similarity value associated with the corresponding tumor mutation tag data.
For example, the computing device 110 forms a label matrix (e.g., an ori-hrd-score matrix) based on the set of maximum similarity values for the sample under test, such as the 30 label feature values associated with the sample under test identification in table 2 above (only 7 label feature values are schematically shown in table 2). The 30 signature feature values in the signature matrix indicate the largest similarity values associated with the 30 tumor mutation signature features in the COSMIC database, respectively.
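Steps 302 to 306 can be sketched as follows: compute every tag-to-reference similarity, keep the maximum per reference feature, and store the result under the sample identifier. The function names and the three-dimensional toy vectors (standing in for 96-dimensional spectra and 30 reference features) are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def tag_feature_values(tags, reference_features):
    """For each reference feature, keep its largest similarity to any tag."""
    return [
        max(cosine_similarity(tag, ref) for tag in tags)
        for ref in reference_features
    ]


def build_tag_matrix(samples, reference_features):
    """Map each sample identifier to its row of tag feature values."""
    return {
        sample_id: tag_feature_values(tags, reference_features)
        for sample_id, tags in samples.items()
    }


# Toy data: 2 tags for one sample and 3 reference features.
samples = {"sample_1": [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]}
reference_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
tag_matrix = build_tag_matrix(samples, reference_features)
print(tag_matrix["sample_1"])
```

With 4 tags and 30 reference features, each row would be the 30 maxima taken over 4 × 30 = 120 similarity values.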
By adopting the above means, the tag matrix generated by the method disclosed by the disclosure can utilize experience indicated by tumor mutation tag characteristics in a known predetermined database, and can extract potential characteristics which are not included in the known predetermined database through the similarity value, so that a tag set reflecting more comprehensive information can be generated, and the comprehensiveness and accuracy of predicting homologous recombination repair defects can be improved.
A method 400 for predicting homologous recombination repair defects with respect to a sample to be tested according to an embodiment of the present disclosure will be described below in conjunction with fig. 4. Fig. 4 shows a flow diagram of a method 400 for predicting homologous recombination repair defects with respect to a sample to be tested, according to an embodiment of the present disclosure. It should be understood that the method 400 may be performed, for example, at the electronic device 600 depicted in fig. 6, and may also be performed at the computing device 110 depicted in fig. 1. It should be understood that the method 400 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.
At step 402, the computing device 110 determines a target tag feature value in the tag data based on the tumor mutation tag data of the predetermined database and the associated cancer species of the sample to be tested.
For example, if the computing device 110 identifies tumor mutation tag data in the predetermined database as being related to the cancer species associated with the sample to be tested, the tag feature values in the tag data that correspond to that tumor mutation tag data are determined as the target tag feature values. For example, the sample to be tested is of the ovarian cancer sample type; for ovarian cancer, the COSMIC database contains COSMIC tumor tag 1, COSMIC tumor tag 3, and COSMIC tumor tag 5, and COSMIC tumor tag 1 is present in all cancer species.
For example, the computing device 110 determines tag feature value 3 (i.e., mark 3) and tag feature value 5 (i.e., mark 5), which are included in the tag matrix (e.g., the ori-hrd-score matrix) and correspond to COSMIC tumor tag 3 and COSMIC tumor tag 5, as the target tag feature values.
At step 404, the computing device 110 determines whether the target tag feature value satisfies a predetermined condition.
The method of determining that the target tag feature value satisfies the predetermined condition includes, for example: determining that the target tag feature value satisfies the predetermined condition in response to determining that either of the following holds: each target tag feature value in the tag data is greater than or equal to a single predetermined threshold; or each target tag feature value in the tag data is greater than or equal to its own corresponding predetermined threshold. For example, the computing device 110 determines whether tag feature value 3 (i.e., mark 3) and tag feature value 5 (i.e., mark 5) in the ori-hrd-score matrix are both greater than or equal to a predetermined threshold (i.e., a cutoff, e.g., 0.7).
At step 406, if the computing device 110 determines that the target tag feature value satisfies the predetermined condition, it is determined that the sample under test has a homologous recombination repair defect. For example, for sample 5 in table 3 below, both tag feature value 3 (i.e., mark 3) and tag feature value 5 (i.e., mark 5) are greater than or equal to the predetermined threshold value of 0.7, and thus, the computing device 110 determines that sample 5 has a homologous recombination repair defect. For example, the test sample may be sensitive to platinum drugs and PARP inhibitors.
At step 408, if the computing device 110 determines that the target tag feature value does not satisfy the predetermined condition, it is determined that the sample under test does not have a homologous recombination repair defect. For example, for sample 1 in table 3 below, neither signature value 3 nor signature value 5 is greater than or equal to the predetermined threshold of 0.7, and thus, the computing device 110 determines that sample 1 is free of homologous recombination repair defects.
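The decision in steps 404 to 408 can be sketched as a threshold check over the target tag feature values. The function name and the sample values below are hypothetical (Table 3 is not reproduced in this text); only the target marks 3 and 5 and the cutoff 0.7 come from the example in the description.

```python
def has_hrd(tag_values, target_tags, thresholds):
    """Return True when every target tag feature value reaches its threshold.

    `thresholds` is either one cutoff shared by all target tags or a
    dict mapping each target tag number to its own cutoff.
    """
    for tag in target_tags:
        cutoff = thresholds[tag] if isinstance(thresholds, dict) else thresholds
        if tag_values[tag] < cutoff:
            return False
    return True


# Hypothetical tag feature values keyed by mark number.
sample_a = {3: 0.82, 5: 0.75}
sample_b = {3: 0.41, 5: 0.38}
print(has_hrd(sample_a, target_tags=[3, 5], thresholds=0.7))
print(has_hrd(sample_b, target_tags=[3, 5], thresholds=0.7))
```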
For example, test samples from 13 ovarian cancer patients are sequenced and a tag matrix (e.g., an ori-hrd-score matrix) is obtained, and a homologous recombination repair defect (HRD) prediction is performed based on the tag matrix. Table 3 below shows, for example, the tag feature values of the test samples of the above 13 ovarian cancer patients and the prediction results regarding homologous recombination repair defects (i.e., the HRD prediction results).
TABLE 3
Figure BDA0003539279340000191
The medication efficacy information in Table 3 above indicates whether the medication was effective or ineffective for the patient. The efficacy of a PARP inhibitor (PARPi) was followed up for one year or more in the ovarian cancer patients, all of whom used olaparib. The medication is considered effective when the patient shows complete or partial remission for more than three consecutive months, and ineffective otherwise. The study data show that, for the 13 samples to be tested, the homologous recombination repair defect state predicted by the present method is consistent with the state indicated by the medication efficacy information. The present disclosure can therefore significantly improve the accuracy of predicting homologous recombination repair defects.
A method 500 for determining a target screening feature percentage according to an embodiment of the present disclosure will be described below in conjunction with fig. 5. Fig. 5 shows a flow diagram of a method 500 for determining a target screening feature percentage in accordance with an embodiment of the present disclosure. It should be understood that the method 500 may be performed, for example, at the electronic device 600 depicted in fig. 6, and may also be performed at the computing device 110 depicted in fig. 1. It should be understood that the method 500 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.
At step 502, the computing device 110 filters against the input data matrix based on the filtered feature percentage to obtain candidate tag feature values.
For example, the input data matrix shown in Table 2 is screened according to a screening feature percentage to obtain candidate feature values (e.g., the screening feature percentage is stepped at a fixed interval, i.e., in increments of 2% between 1 and 100).
At step 504, the computing device 110 fits the candidate tag eigenvalues, the input data matrix as a test set, and classification data about homologous recombination repair defects, resulting in a fitted model.
At step 506, the computing device 110 calculates an accuracy of the fitted model.
At step 508, the computing device 110 determines a target fitting model and a target screening feature percentage based on the accuracy of the fitting model and the screening feature percentage.
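Steps 502 to 508 amount to a scan over screening feature percentages that keeps the best-scoring fit. In the sketch below, the grid 1, 3, ..., 99 is one reading of "increments of 2% between 1 and 100" that yields 50 models, and `fit_and_score` is a hypothetical callback standing in for the screening, fitting, and accuracy steps, whose details the disclosure does not specify.

```python
def screening_percentages(start=1, stop=100, step=2):
    """Candidate screening feature percentages: 1, 3, ..., 99."""
    return list(range(start, stop, step))


def select_target_percentage(percentages, fit_and_score):
    """Return (percentage, model, accuracy) for the most accurate fit.

    `fit_and_score(percentage)` must return a (model, accuracy) pair.
    """
    best = None
    for percentage in percentages:
        model, acc = fit_and_score(percentage)
        if best is None or acc > best[2]:
            best = (percentage, model, acc)
    return best


def toy_fit_and_score(percentage):
    # Stand-in for real screening and fitting: accuracy peaks at 5 %.
    return f"model@{percentage}%", 0.88 - 0.005 * abs(percentage - 5)


best_percentage, best_model, best_accuracy = select_target_percentage(
    screening_percentages(), toy_fit_and_score
)
print(best_percentage, best_model, best_accuracy)
```

Passing the fitting step in as a callback keeps the scan independent of whichever classifier is actually used.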
For example, using the screened candidate tag feature values, the input data matrix serving as the test set, and the classification data on homologous recombination repair defects, the sensitivity and specificity for homologous recombination repair defects are calculated for each combination of screened candidate tag feature values. An ROC curve is then plotted, and based on the plotted ROC curve, the combination of feature values corresponding to the highest sensitivity and the best specificity (the lowest false positive rate) is taken as the target input feature threshold. For example, the feature value combination is: tag feature value 3 (e.g., mark 3 is greater than or equal to its corresponding predetermined threshold of 0.75), tag feature value 10 (e.g., mark 10 is greater than or equal to its corresponding predetermined threshold of 0.65), tag feature value 13 (e.g., mark 13 is greater than or equal to its corresponding predetermined threshold of 0.87), tag feature value 15 (e.g., mark 15 is greater than or equal to its corresponding predetermined threshold of 0.78), and tag feature value 26 (e.g., mark 26 is greater than or equal to its corresponding predetermined threshold of 0.79). For this combination of feature values, the sensitivity is 0.94, the specificity is 0.87, and the maximum AUC of 0.95 is achieved.
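Sensitivity and specificity for one threshold combination can be computed as sketched below. The thresholds are those quoted in the example above; the samples, the function names, and the rule that every mark must reach its threshold are assumptions made for illustration.

```python
# Per-mark thresholds quoted in the example above.
THRESHOLDS = {3: 0.75, 10: 0.65, 13: 0.87, 15: 0.78, 26: 0.79}


def call_hrd(tag_values, thresholds=THRESHOLDS):
    """Assumed rule: HRD is called when every mark reaches its threshold."""
    return all(tag_values[mark] >= cutoff for mark, cutoff in thresholds.items())


def sensitivity_specificity(predicted, actual):
    """Return (sensitivity, specificity) for boolean predictions and labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    tn = sum((not p) and (not a) for p, a in zip(predicted, actual))
    fp = sum(p and (not a) for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


# Hypothetical samples: (tag feature values, true HRD status).
samples = [
    ({3: 0.90, 10: 0.80, 13: 0.90, 15: 0.85, 26: 0.88}, True),
    ({3: 0.80, 10: 0.70, 13: 0.88, 15: 0.80, 26: 0.79}, True),
    ({3: 0.76, 10: 0.60, 13: 0.90, 15: 0.80, 26: 0.85}, True),
    ({3: 0.30, 10: 0.20, 13: 0.40, 15: 0.35, 26: 0.25}, False),
    ({3: 0.78, 10: 0.66, 13: 0.89, 15: 0.79, 26: 0.80}, False),
]
predicted = [call_hrd(values) for values, _ in samples]
actual = [label for _, label in samples]
print(sensitivity_specificity(predicted, actual))
```

Repeating this for each candidate threshold combination gives the points from which an ROC curve can be drawn.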
By adopting the above means, the present disclosure can select the features most favorable to the accuracy of the prediction result on homologous recombination repair defects as the input features of the prediction model, thereby improving the accuracy of the prediction result.
FIG. 6 schematically illustrates a block diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure. The device 600 may be a device for implementing the methods 200 to 500 shown in fig. 2 to 5. As shown, the device 600 includes a Central Processing Unit (CPU) 601 that can perform various appropriate actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 602 or loaded from a storage unit 608 into a Random Access Memory (RAM) 603. The RAM 603 can also store various programs and data required for the operation of the device 600. The CPU 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606, an output unit 607, a storage unit 608, and a communication unit 609. The processing unit 601 performs the respective methods and processes described above, e.g., the methods 200 to 500. For example, in some embodiments, the methods 200 to 500 may be implemented as a computer software program stored on a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the CPU 601, one or more of the operations of the methods 200 to 500 described above may be performed. Alternatively, in other embodiments, the CPU 601 may be configured by any other suitable means (e.g., by way of firmware) to perform one or more acts of the methods 200 to 500.
It should be further appreciated that the present disclosure may be embodied as methods, apparatus, systems, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for performing various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the disclosure are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of the computer-readable program instructions, such that the electronic circuit can execute the computer-readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor in a voice interaction device, a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
The above are only alternative embodiments of the present disclosure and are not intended to limit the present disclosure, which may be modified and varied by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (13)

1. A method for predicting homologous recombination repair defects, comprising:
generating alignment result data for a sample to be tested by aligning sequencing data of the sample to be tested against a human reference genome sequence;
determining variant sites based on the alignment result data;
acquiring, based on the variant sites, base sequences of sites within predetermined ranges upstream and downstream of the mutated base at each variant site on the human reference genome, so as to generate background map information;
generating label data based on the background map information, the label data comprising a plurality of labels;
determining similarity between the label and tumor mutation label data of a predetermined database so as to generate a label matrix; and
predicting, based on the generated label matrix, homologous recombination repair defects for the sample to be tested.
2. The method of claim 1, wherein determining a similarity between the label and the tumor mutation label data of the predetermined database to generate a label matrix comprises:
calculating a similarity value between each of a plurality of labels included in the label data and each tumor mutation label data of the predetermined database;
comparing the calculated similarity values to determine a maximum similarity value associated with each tumor mutation label data; and
generating a label matrix based on the maximum similarity value associated with each tumor mutation label data and an identifier of the sample to be tested, wherein the label matrix indicates the identifier of the sample to be tested and a plurality of label characteristic values associated with that identifier, each of the plurality of label characteristic values indicating the maximum similarity value associated with the corresponding tumor mutation label data.
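The steps of claim 2 can be illustrated with a minimal sketch. The claim does not name a similarity measure, so cosine similarity is used here purely as an assumption; the function names, the toy 4-channel vectors, and the sample identifier "S1" are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length non-negative vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def label_matrix_row(sample_id, sample_labels, reference_labels):
    """Return (sample_id, features): one feature per reference label,
    equal to the maximum similarity over all labels of the sample."""
    features = []
    for ref in reference_labels:
        best = max(cosine_similarity(lab, ref) for lab in sample_labels)
        features.append(best)
    return sample_id, features

# Toy data: two labels extracted from the sample, three reference labels.
sample_labels = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0]]
reference_labels = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
sample_id, row = label_matrix_row("S1", sample_labels, reference_labels)
```

Stacking one such row per sample would give a matrix with one column per reference label, which is one plausible reading of the label matrix described in the claim.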
3. The method of claim 1, wherein predicting, based on the generated label matrix, homologous recombination repair defects for the sample to be tested comprises:
generating an input data matrix based on the label matrix and the number of samples to be tested;
determining a target screening feature percentage;
generating target input features based on the input data matrix and the target screening feature percentage; and
extracting features of the target input features via a trained prediction model to generate a prediction result regarding homologous recombination repair defects of the sample to be tested.
4. The method of claim 3, wherein determining a target screening feature percentage comprises:
screening the input data matrix based on a screening feature percentage to obtain candidate feature values;
fitting the candidate feature values, the input data matrix serving as the test set, and classification data on homologous recombination repair defects to obtain a fitting model;
calculating the accuracy of the fitting model; and
determining a target fitting model and the target screening feature percentage based on the accuracy of the fitting model and the screening feature percentage.
5. The method of claim 4, wherein determining a target screening feature percentage comprises:
determining, as the target screening feature percentage, the screening feature percentage corresponding to the maximum accuracy of the fitting model.
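The percentage search of claims 4 and 5 can be sketched as a loop over candidate percentages. The claims do not specify how features are ranked or which model is fitted, so the ranking score (absolute difference of class means) and the nearest-centroid model below are assumptions chosen only to keep the example self-contained; all names and the toy data are hypothetical.

```python
def class_means(rows, labels, cls):
    """Per-feature mean over the rows belonging to class `cls`."""
    members = [r for r, y in zip(rows, labels) if y == cls]
    return [sum(col) / len(members) for col in zip(*members)]

def rank_features(rows, labels):
    """Feature indices sorted by |mean(class 1) - mean(class 0)|, largest first."""
    pos = class_means(rows, labels, 1)
    neg = class_means(rows, labels, 0)
    scores = [abs(p - n) for p, n in zip(pos, neg)]
    return sorted(range(len(scores)), key=lambda j: -scores[j])

def fit_and_score(rows, labels, keep):
    """Fit a nearest-centroid model on the kept features and return its
    accuracy on the same input matrix (used here as the test set)."""
    sub = [[r[j] for j in keep] for r in rows]
    centroids = {c: class_means(sub, labels, c) for c in (0, 1)}

    def predict(x):
        dist = {c: sum((a - b) ** 2 for a, b in zip(x, m))
                for c, m in centroids.items()}
        return min(dist, key=dist.get)

    hits = sum(predict(x) == y for x, y in zip(sub, labels))
    return hits / len(labels)

def select_percentage(rows, labels, percentages):
    """Return the first percentage that reaches the maximum accuracy."""
    ranking = rank_features(rows, labels)
    best_pct, best_acc = None, -1.0
    for pct in percentages:
        k = max(1, round(len(ranking) * pct / 100))
        acc = fit_and_score(rows, labels, ranking[:k])
        if acc > best_acc:
            best_pct, best_acc = pct, acc
    return best_pct, best_acc

# Toy input data matrix: 4 samples x 4 label features; 1 = defect, 0 = no defect.
rows = [
    [0.9, 0.1, 0.5, 0.5],
    [0.8, 0.9, 0.4, 0.6],
    [0.2, 0.8, 0.5, 0.4],
    [0.1, 0.2, 0.6, 0.5],
]
labels = [1, 1, 0, 0]
best_pct, best_acc = select_percentage(rows, labels, [25, 50, 100])
```

When several percentages tie on accuracy, this sketch keeps the smallest one tried first; the claims leave the tie-breaking rule open.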
6. The method of claim 1, wherein predicting, based on the generated label matrix, homologous recombination repair defects for the sample to be tested comprises:
determining a target label characteristic value in the label data based on the tumor mutation label data of the predetermined database and the cancer type associated with the sample to be tested;
determining whether the target label characteristic value meets a predetermined condition;
in response to determining that the target label characteristic value meets the predetermined condition, determining that the sample to be tested has the homologous recombination repair defect; and
in response to determining that the target label characteristic value does not meet the predetermined condition, determining that the sample to be tested does not have the homologous recombination repair defect.
7. The method of claim 1, wherein determining variant sites based on the alignment result data comprises:
determining candidate variant sites based on the alignment result data; and
filtering the determined candidate variant sites based on the number of supporting sequences, the quality value of the mutated base, the mutation frequency, and the positive-negative ratio, so as to determine the variant sites.
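The filtering step of claim 7 can be sketched as a conjunction of four threshold checks. The claim gives no cut-off values, so every constant below is hypothetical. The positive-negative ratio is read here as the balance between forward-strand and reverse-strand supporting reads, which is an interpretation rather than something the claim states.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    chrom: str
    pos: int
    support_reads: int   # number of sequences supporting the variant
    base_quality: float  # quality value of the mutated base
    frequency: float     # mutation frequency, between 0 and 1
    forward_reads: int   # supporting reads on the forward (positive) strand
    reverse_reads: int   # supporting reads on the reverse (negative) strand

# Hypothetical cut-offs, for illustration only.
MIN_SUPPORT_READS = 5
MIN_BASE_QUALITY = 20.0
MIN_FREQUENCY = 0.02
MIN_STRAND_BALANCE = 0.1

def strand_balance(c):
    """Minor-strand count divided by major-strand count.
    Returns 0.0 when all support is on one strand or there is none."""
    major = max(c.forward_reads, c.reverse_reads)
    minor = min(c.forward_reads, c.reverse_reads)
    return minor / major if major else 0.0

def passes_filters(c):
    return (c.support_reads >= MIN_SUPPORT_READS
            and c.base_quality >= MIN_BASE_QUALITY
            and c.frequency >= MIN_FREQUENCY
            and strand_balance(c) >= MIN_STRAND_BALANCE)

def filter_candidates(candidates):
    """Keep only the candidates that pass all four checks."""
    return [c for c in candidates if passes_filters(c)]

candidates = [
    Candidate("chr1", 100, 12, 35.0, 0.10, 7, 5),   # passes every check
    Candidate("chr1", 200, 3, 35.0, 0.10, 2, 1),    # too few supporting reads
    Candidate("chr2", 300, 20, 35.0, 0.30, 20, 0),  # support on one strand only
]
variant_sites = filter_candidates(candidates)
```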
8. The method of claim 1, wherein obtaining base sequences of sites within predetermined ranges upstream and downstream of a corresponding mutant base on a human reference genome to generate background map information comprises:
locating the determined variant site on the human reference genome sequence so as to obtain the base sequence of the sites 1 bp upstream and 1 bp downstream of the mutated base at the variant site; and
generating background map information based on the base sequence.
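The context lookup of claim 8 can be sketched on a plain string standing in for the reference sequence. Reporting purine-reference substitutions on the complementary strand, so that the mutated base is always C or T, is a common convention for trinucleotide contexts and is assumed here; the claim itself does not state it. The function names and the toy reference string are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def trinucleotide_context(reference, pos, alt):
    """Return the substitution with 1 bp of flanking sequence on each side,
    e.g. 'A[C>T]G'. `pos` is a 0-based index into `reference`."""
    if pos < 1 or pos > len(reference) - 2:
        raise ValueError("position needs one flanking base on each side")
    context = reference[pos - 1:pos + 2].upper()
    alt = alt.upper()
    if context[1] in "AG":
        # Purine reference base: describe the change on the opposite strand.
        context = reverse_complement(context)
        alt = COMPLEMENT[alt]
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"

reference = "TTACGGA"
first = trinucleotide_context(reference, 3, "T")   # C>T flanked by A and G
second = trinucleotide_context(reference, 4, "A")  # G>A, reported as C>T
```

Counting these strings over all variant sites of a sample would give a mutation spectrum of the kind that the background map information is described as indicating.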
9. The method of claim 1, wherein generating label data based on the background map information comprises:
generating label data via non-negative matrix factorization based on background map information indicating a mutation spectrum of 96 base sequences, the label data including a plurality of label characteristic values.
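Non-negative matrix factorization, as named in claim 9, can be sketched with the standard multiplicative update rules for the squared-error objective. The claim does not say which NMF algorithm, rank, or initialization is used, so those choices are assumptions. For brevity the toy spectrum has 6 channels instead of 96; the names `spectrum`, `signatures`, and `exposures` are hypothetical.

```python
import random

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Sum of squared differences between V and the product W H."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))

def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor V (channels x samples) into non-negative
    W (channels x rank) and H (rank x samples)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h

# Toy spectrum: 6 channels x 4 samples, built from two underlying patterns.
spectrum = [
    [4, 0, 4, 8],
    [0, 3, 3, 3],
    [2, 0, 2, 4],
    [0, 1, 1, 1],
    [1, 0, 1, 2],
    [0, 2, 2, 2],
]
signatures, exposures = nmf(spectrum, rank=2)
```

In this reading, each column of `signatures` plays the role of one extracted label and `exposures` holds the per-sample weights; a production pipeline would more likely rely on an established NMF implementation than on this hand-written loop.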
10. The method of claim 1, wherein the predetermined database comprises 30 tumor mutation signature features.
11. The method of claim 6, wherein determining that the target label characteristic value meets the predetermined condition comprises:
determining that the target label characteristic value meets the predetermined condition in response to determining either of the following:
each target label characteristic value in the label data is greater than or equal to a preset threshold; or
each target label characteristic value in the label data is greater than or equal to its respective preset threshold.
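The two alternative conditions of claim 11, together with the decision of claim 6, can be sketched as a single check that accepts either one shared threshold or one threshold per label. The label names and numeric values below are hypothetical; the claims do not disclose them.

```python
def meets_condition(target_values, threshold):
    """`target_values` maps label name -> characteristic value.
    `threshold` is either one number applied to every label, or a dict
    giving each label its own threshold."""
    if isinstance(threshold, dict):
        return all(value >= threshold[name]
                   for name, value in target_values.items())
    return all(value >= threshold for value in target_values.values())

def predict_defect(target_values, threshold):
    """True if the sample is called as having the repair defect."""
    return meets_condition(target_values, threshold)

target_values = {"label_a": 0.8, "label_b": 0.6}
shared = predict_defect(target_values, 0.5)
per_label = predict_defect(target_values, {"label_a": 0.7, "label_b": 0.7})
```

With the toy values, the shared threshold of 0.5 is met by both labels, while the per-label thresholds are not, because `label_b` falls below 0.7.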
12. A computing device, comprising:
at least one processing unit;
at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the computing device to perform the steps of the method of any of claims 1 to 11.
13. A computer-readable storage medium, having stored thereon a computer program which, when executed by a machine, implements the method of any of claims 1-11.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210226275.1A CN114694752B (en) 2022-03-09 2022-03-09 Method, computing device and medium for predicting homologous recombination repair defects

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210226275.1A CN114694752B (en) 2022-03-09 2022-03-09 Method, computing device and medium for predicting homologous recombination repair defects

Publications (2)

Publication Number Publication Date
CN114694752A CN114694752A (en) 2022-07-01
CN114694752B true CN114694752B (en) 2023-03-10

Family

ID=82137253

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210226275.1A Active CN114694752B (en) 2022-03-09 2022-03-09 Method, computing device and medium for predicting homologous recombination repair defects

Country Status (1)

Country Link
CN (1) CN114694752B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115831219A (en) * 2022-12-22 2023-03-21 郑州思昆生物工程有限公司 Quality prediction method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109219666A (en) * 2016-05-01 2019-01-15 基因组研究有限公司 The mutation label of cancer
CN110010195A (en) * 2018-12-04 2019-07-12 志诺维思(北京)基因科技有限公司 A kind of method and device detecting single nucleotide mutation
CN111292802A (en) * 2020-02-03 2020-06-16 至本医疗科技(上海)有限公司 Method, electronic device, and computer storage medium for detecting sudden change
WO2020137076A1 (en) * 2018-12-28 2020-07-02 国立大学法人 東京大学 Method for predicting susceptibility of cancer to parp inhibitors, and method for detecting cancer having homologous recombination repair deficiency
CN111429968A (en) * 2020-03-11 2020-07-17 至本医疗科技(上海)有限公司 Method, electronic device, and computer storage medium for predicting tumor type
WO2020229694A1 (en) * 2019-05-16 2020-11-19 Fundació Centre De Regulació Genómica Somatic mutation-based classification of cancers

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7368483B2 (en) * 2019-02-12 2023-10-24 テンパス ラブズ,インコーポレイテッド An integrated machine learning framework for estimating homologous recombination defects
US20210047694A1 (en) * 2019-08-16 2021-02-18 The Broad Institute, Inc. Methods for predicting outcomes and treating colorectal cancer using a cell atlas
EP4150113A1 (en) * 2020-05-14 2023-03-22 Guardant Health, Inc. Homologous recombination repair deficiency detection

Also Published As

Publication number Publication date
CN114694752A (en) 2022-07-01

Similar Documents

Publication Publication Date Title
Sedlazeck et al. Accurate detection of complex structural variations using single-molecule sequencing
Jiang et al. PRISM: pair-read informed split-read mapping for base-pair level detection of insertion, deletion and structural variants
US10354747B1 (en) Deep learning analysis pipeline for next generation sequencing
Alkodsi et al. Comparative analysis of methods for identifying somatic copy number alterations from deep sequencing data
US11043283B1 (en) Systems and methods for automating RNA expression calls in a cancer prediction pipeline
US20130184999A1 (en) Systems and methods for cancer-specific drug targets and biomarkers discovery
KR101828052B1 (en) Method and apparatus for analyzing copy-number variation (cnv) of gene
CN111292802A (en) Method, electronic device, and computer storage medium for detecting sudden change
US20230287487A1 (en) Systems and methods for genetic identification and analysis
US20180196924A1 (en) Computer-implemented method and system for diagnosis of biological conditions of a patient
CN114694752B (en) Method, computing device and medium for predicting homologous recombination repair defects
CN113278706B (en) Method for distinguishing somatic mutation from germline mutation
CN111863132A (en) Method and system for screening pathogenic variation
US11335438B1 (en) Detecting false positive variant calls in next-generation sequencing
CN111402955A (en) Biological information measuring method, system, storage medium and terminal
AU2022218581B2 (en) Sequencing data-based itd mutation ratio detecting apparatus and method
CN114708908B (en) Method, computing device and storage medium for detecting micro residual focus of solid tumor
KR20170000743A (en) Method and apparatus for analyzing gene
Chen et al. PSSV: a novel pattern-based probabilistic approach for somatic structural variation identification
CN111028885B (en) Method and device for detecting yak RNA editing site
Balan et al. MICon Contamination Detection Workflow for Next-Generation Sequencing Laboratories Using Microhaplotype Loci and Supervised Learning
US20230298690A1 (en) Genetic information processing system with unbounded-sample analysis mechanism and method of operation thereof
US20230260598A1 (en) Approaches to normalizing genetic information derived by different types of extraction kits to be used for screening, diagnosing, and stratifying patients and systems for implementing the same
US20230335279A1 (en) Approaches to reducing dimensionality of genetic information used for machine learning and systems for implementing the same
US20230282353A1 (en) Multitier classification scheme for comprehensive determination of cancer presence and type based on analysis of genetic information and systems for implementing the same

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant