CN116312781B - Genome instability assessment method and system based on machine learning - Google Patents
- Publication number
- CN116312781B (application number CN202310558775.XA)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/50—Mutagenesis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a machine learning-based genome instability assessment method and system. The method comprises the following steps: collecting and receiving a biological sample, and processing the biological sample to obtain a genome sample; dividing the genome sample into a training set and a verification set, and modeling based on the training set and the verification set to obtain a genome instability assessment model; training the genome instability assessment model based on a modeling standard formed by a gene set of a plurality of HRR genes; and assessing genome instability based on a plurality of genome instability indexes. The invention uses a more complex and accurate machine learning model in place of the original direct addition algorithm. The modeling standard includes BRCA1/2 and other HRR genes that can be included because they perform well in terms of mutation rate, correlation with genome instability and correlation with drug efficacy. A more accurate machine learning modeling method thereby gives better analysis and assessment of genome instability.
Description
Technical Field
The invention relates to the technical field of medical treatment, in particular to a genome instability assessment method and system based on machine learning.
Background
Homologous recombination deficiency (HRD) status is a key indicator for treatment selection and prognosis in many tumors, and clinical studies have shown that HRD status is highly correlated with sensitivity to platinum-based chemotherapy drugs and PARP inhibitors. HRD testing has been approved by the FDA as a companion diagnostic for ovarian cancer patients treated with olaparib and niraparib. Olaparib was the first PARP inhibitor marketed both worldwide and in China, and it has been approved for patients with ovarian cancer, breast cancer, prostate cancer, pancreatic cancer and other tumors. HRD is reported to be present in about one of every two ovarian cancer patients. Compared with BRCA mutation testing, HRD testing identifies a larger population that is sensitive to PARP inhibitors.
Normal cells have complex DNA repair systems, including PARP (poly(ADP-ribose) polymerase), which repairs DNA single-strand breaks, and the homologous recombination repair (HRR) pathway, in which proteins such as BRCA1, BRCA2 and PALB2 repair DNA double-strand breaks. HRR is an important mechanism of DNA double-strand break repair (DSBR). BRCA1 and BRCA2 are two key genes of the HRR pathway: if a mutation in BRCA1 or BRCA2 causes loss of protein function, the result is HRD, i.e. a defect in homologous recombination repair function. HRD caused by mutation of these genes or by methylation of the BRCA1 gene promoter leads to genomic instability, which manifests as "genomic scars", including LOH (loss of heterozygosity), TAI (telomeric allelic imbalance) and LST (large-scale state transitions).
PARP (poly(ADP-ribose) polymerase) is a key enzyme in DNA single-strand break repair and is responsible for repairing single-strand DNA damage. If a PARP inhibitor blocks the single-strand repair function, cells carrying single-strand breaks develop DNA double-strand breaks during replication and proliferation; if the cells also have homologous recombination deficiency (HRD), the large number of double-strand breaks cannot be repaired and the cells die. This mechanism of action of PARP inhibitors is known as the "synthetic lethal" effect.
HRD leads to genomic instability, which manifests as "genomic scars", and HRD score testing is the currently accepted method of assessing HRD status. The HRD score combines the three indexes LOH, LST and TAI to score genomic instability; the values are obtained by detecting and analyzing single nucleotide polymorphism (SNP) sites in cells. LOH, LST and TAI are each independent predictors of genomic instability. The current common practice is to obtain the HRD score by simply adding these three indexes and to set the HRD score threshold at the value that gives 95% sensitivity for identifying BRCA1/2 bi-allelic inactivation.
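The conventional scoring just described can be sketched as follows. This is a minimal illustration, not the procedure of any particular assay: the function names are hypothetical, and the tie-handling rule in the threshold search (a sample is called positive when its score is greater than or equal to the cutoff) is an assumption.

```python
import math

def hrd_score(loh: int, tai: int, lst: int) -> int:
    """Conventional HRD score: the unweighted sum of the three scar indexes."""
    return loh + tai + lst

def threshold_at_sensitivity(positive_scores, sensitivity=0.95):
    """Return the largest cutoff t such that at least `sensitivity` of the
    reference-positive samples (e.g. BRCA1/2 bi-allelic inactivation) have
    a score >= t. Assumes a sample is called positive when score >= t."""
    ranked = sorted(positive_scores, reverse=True)
    needed = math.ceil(sensitivity * len(ranked))  # positives that must be called
    return ranked[needed - 1]
```

Every index contributes with the same weight here, which is the limitation that a fitted model is meant to address.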
However, the current common practice has the following technical drawbacks:
(1) The three genome instability indexes LOH, LST and TAI were developed many years ago; in light of years of project and scientific experience, the genomic scar indexes leave room for improvement in both number and definition;
(2) Directly adding the genome instability indexes to obtain the HRD score is simple and direct, but does not give the most accurate analysis;
(3) Models are trained using BRCA1/2 bi-allelic inactivation as the standard, without considering the contribution of other genes in the HRR pathway to loss of homologous recombination function. As a result, genomic instability caused by mutation or promoter methylation of other HRR-related genes falls outside the scope of the assessment, and the assessment result is not accurate enough.
Disclosure of Invention
To solve the problems in the prior art, the invention provides a machine learning-based genome instability assessment method and system. Drawing on project and scientific experience and on the latest scientific results in the field of genome instability, the invention redesigns the indexes for assessing genome instability so that they are more comprehensive and detailed; uses a more complex and accurate machine learning model in place of the original direct addition algorithm; and, in selecting the modeling standard, identifies other important HRR genes that can be included besides BRCA1/2 because they perform well in terms of mutation rate, correlation with genome instability and correlation with drug efficacy. A more accurate machine learning modeling method thereby gives better analysis and assessment of genome instability, and the method is particularly suitable for patients whose BRCA1/2 test result is negative and whose HRD status needs further assessment.
The first aspect of the present invention provides a machine learning-based genome instability assessment method, comprising:
S1, collecting and receiving a biological sample, and processing the biological sample to obtain a genome sample;
S2, dividing the genome sample into a training set and a verification set, and modeling based on the training set and the verification set to obtain a genome instability assessment model;
S3, training the genome instability assessment model based on a modeling standard formed by a gene set formed by a plurality of HRR genes;
S4, evaluating the genome instability based on a plurality of genome instability indexes.
Preferably, the genomic sample comprises a fresh blood sample, a paraffin section sample and/or a fresh tissue sample; the processing comprises: performing tumor content assessment, DNA extraction, quality inspection, library construction, capture and on-machine sequencing on the biological sample.
Preferably, the training set and the verification set in S2 are independent of each other, and the sample sizes of the training set and the verification set are between 450 and 500.
Preferably, the modeling in S2 to obtain a genome instability assessment model based on the training set and the validation set includes modeling using ridge regression.
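For illustration, ridge regression on a table of index values can be sketched in plain Python as below. This is a sketch under stated assumptions, not the patented model: the source does not give the penalty strength, the label encoding or any preprocessing, so `alpha`, the 0/1 labels implied by the usage, and the choice to leave the intercept unpenalized are all assumptions, and the function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ridge(X, y, alpha=1.0):
    """Return (weights, intercept) minimizing
    sum((y - X w - b)^2) + alpha * sum(w^2).
    Features and targets are centered so the intercept is not penalized."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Normal equations: (Xc^T Xc + alpha * I) w = Xc^T yc
    gram = [[sum(r[i] * r[j] for r in Xc) + (alpha if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    rhs = [sum(r[i] * t for r, t in zip(Xc, yc)) for i in range(p)]
    w = solve_linear(gram, rhs)
    b = y_mean - sum(wi * mi for wi, mi in zip(w, x_mean))
    return w, b

def predict(X, w, b):
    """Model score for each row of X."""
    return [sum(wi * xi for wi, xi in zip(w, row)) + b for row in X]
```

In use, the model would be fitted on the training set only and the resulting scores checked on the independent verification set; with `alpha > 0` the linear system is always solvable, even when some indexes are strongly correlated.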
Preferably, the plurality of HRR genes in S3 include: the HRR3 gene set, consisting of the three genes BRCA1, BRCA2 and RAD51D.
Preferably, the modeling standard in S3 includes: a sample that satisfies any one of a first set of three conditions is defined as a true positive sample in the model, the first set of three conditions comprising:
(1) BRCA1 bi-allelic inactivation;
(2) BRCA2 bi-allelic inactivation;
(3) RAD51D bi-allelic inactivation;
wherein a gene within the genome is defined as having bi-allelic inactivation if it satisfies any one of a second set of three conditions, comprising:
(1) One allele carries a class 4/5 mutation, and the other allele is LOH;
(2) Two class 4/5 mutations occur in the same gene;
(3) One allele carries a class 4/5 mutation, and the other allele is hypermethylated.
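The two sets of conditions above amount to a simple labeling rule, sketched below. The `GeneStatus` record and its field names are hypothetical, and treating "two mutations in the same gene" as "two or more" is an assumption.

```python
from dataclasses import dataclass

HRR3_GENES = ("BRCA1", "BRCA2", "RAD51D")

@dataclass
class GeneStatus:
    """Hypothetical per-gene summary for one sample."""
    class45_mutations: int = 0     # number of class 4/5 mutations in the gene
    loh: bool = False              # the other allele is lost (LOH)
    hypermethylated: bool = False  # the other allele is hypermethylated

def is_biallelic_inactivation(g: GeneStatus) -> bool:
    """Second set of conditions: two hits affecting the same gene."""
    if g.class45_mutations >= 2:   # condition (2), read as "two or more"
        return True
    # conditions (1) and (3): one mutation plus LOH or hypermethylation
    return g.class45_mutations == 1 and (g.loh or g.hypermethylated)

def is_true_positive(sample: dict) -> bool:
    """First set of conditions: any HRR3 gene is bi-allelically inactivated.
    `sample` maps gene symbol -> GeneStatus; missing genes count as wild type."""
    return any(is_biallelic_inactivation(sample.get(gene, GeneStatus()))
               for gene in HRR3_GENES)
```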
Preferably, the plurality of genome instability indexes in S4 include:
Allelic states are divided into three classes: allele-balanced but amplified, non-LOH but allele-imbalanced, and LOH; fragments of the three classes are further divided into five length intervals by absolute length: 0-5M, 5-10M, 10-15M, 15-20M and >20M; wherein an allele is one of a pair of genes that control contrasting traits at the same position on a pair of homologous chromosomes;
the plurality of genome instability indexes comprise 19 genome instability indexes, which are respectively:
(1) b_0-5M: the number of allele-balanced but amplified fragments with a length of 0-5M (inclusive);
(2) b_5-10M: the number of allele-balanced but amplified fragments with a length of 5 (exclusive) to 10M (inclusive);
(3) b_10-15M: the number of allele-balanced but amplified fragments with a length of 10 (exclusive) to 15M (inclusive);
(4) b_15-20M: the number of allele-balanced but amplified fragments with a length of 15 (exclusive) to 20M (inclusive);
(5) b_>20M: the number of allele-balanced but amplified fragments with a length greater than 20M;
(6) imb_0-5M: the number of non-LOH but allele-imbalanced fragments with a length of 0-5M (inclusive);
(7) imb_5-10M: the number of non-LOH but allele-imbalanced fragments with a length of 5 (exclusive) to 10M (inclusive);
(8) imb_10-15M: the number of non-LOH but allele-imbalanced fragments with a length of 10 (exclusive) to 15M (inclusive);
(9) imb_15-20M: the number of non-LOH but allele-imbalanced fragments with a length of 15 (exclusive) to 20M (inclusive);
(10) imb_>20M: the number of non-LOH but allele-imbalanced fragments with a length greater than 20M;
(11) loh_0-5M: the number of LOH fragments with a length of 0-5M (inclusive);
(12) loh_5-10M: the number of LOH fragments with a length of 5 (exclusive) to 10M (inclusive);
(13) loh_10-15M: the number of LOH fragments with a length of 10 (exclusive) to 15M (inclusive);
(14) loh_15-20M: the number of LOH fragments with a length of 15 (exclusive) to 20M (inclusive);
(15) loh_>20M: the number of LOH fragments with a length greater than 20M;
(16) purity: tumor cell fraction;
(17) ploidy: tumor cell genome ploidy;
(18) si: measures the heterogeneity of aberrant copy number (aberrant CN) states. It is obtained by: taking all fragments (segments) whose allelic ratio is not 1:1; weighting the allelic state of each fragment by its length; and calculating a diversity index of the allelic states with copy number variation (CNV) across the whole sample;
(19) hlamp: obtained by: identifying the fragments located in the high amplification regions (including 1q21.1-24.1, 1q42.2-44, 8q11.21-24.3 and 10p15.3-14), and calculating the proportion of the fragments in these regions whose allelic state has a copy number variation (CNV) ≥ 5.
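Indexes (1) to (15) are counts over a 3 × 5 grid of allelic class and length interval, and index (18) is a length-weighted diversity measure. The sketch below illustrates both under stated assumptions: lengths are taken to be in megabases, the input tuples are a hypothetical representation of a segmentation result, and the Shannon index is assumed for si because the text does not say which diversity index is used.

```python
import math
from collections import Counter

CLASSES = ("b", "imb", "loh")
# Upper bounds of the right-closed length intervals; longer fragments go to ">20M".
BIN_EDGES = (5, 10, 15, 20)
BIN_LABELS = ("0-5M", "5-10M", "10-15M", "15-20M", ">20M")

def length_bin(length_mb: float) -> str:
    """Map a fragment length to its interval label (upper bound inclusive)."""
    for edge, label in zip(BIN_EDGES, BIN_LABELS):
        if length_mb <= edge:
            return label
    return BIN_LABELS[-1]

def count_features(segments):
    """segments: iterable of (allelic_class, length_mb) pairs,
    with allelic_class one of "b", "imb" or "loh".
    Returns the 15 counts keyed like "loh_5-10M"."""
    counts = {f"{c}_{label}": 0 for c in CLASSES for label in BIN_LABELS}
    for allelic_class, length_mb in segments:
        counts[f"{allelic_class}_{length_bin(length_mb)}"] += 1
    return counts

def length_weighted_diversity(states):
    """states: iterable of (allelic_state, length_mb) for fragments whose
    allelic ratio is not 1:1. Returns a Shannon index over allelic states,
    each state weighted by total fragment length (assumed definition of si)."""
    weight = Counter()
    for state, length_mb in states:
        weight[state] += length_mb
    total = sum(weight.values())
    if total == 0:
        return 0.0
    return -sum((w / total) * math.log(w / total)
                for w in weight.values() if w > 0)
```

A fragment of exactly 5M falls in the 0-5M interval and one of 5.1M in the 5-10M interval, matching the inclusive upper bounds in the list.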
In a second aspect of the present invention, there is provided a machine learning-based genome instability assessment system comprising:
a sample collection module, used for collecting and receiving a biological sample and processing the biological sample to obtain a genome sample;
a modeling module, used for dividing the genome sample into a training set and a verification set and modeling based on the training set and the verification set to obtain a genome instability assessment model;
a model training module, used for training the genome instability assessment model based on a modeling standard formed by a gene set formed by a plurality of HRR genes;
and an instability evaluation module, used for evaluating the genome instability based on a plurality of genome instability indexes.
A third aspect of the invention provides an electronic device comprising a processor and a memory, the memory storing a plurality of instructions, the processor being for reading the instructions and performing the method according to the first aspect.
A fourth aspect of the invention provides a computer readable storage medium storing a plurality of instructions readable by a processor and for performing the method of the first aspect.
The genome instability assessment method, system and electronic equipment based on machine learning provided by the invention have the following beneficial effects:
(1) The number of genome instability indexes is expanded from the original 3 to 19, so genome instability is assessed more comprehensively, in more detail and more accurately.
(2) Ridge regression is adopted for modeling, which is more accurate than a simple index addition algorithm. The training set and the verification set used for modeling are independent of each other, and the sample size is between 450 and 500, which improves the accuracy of the model.
(3) In the modeling standard, RAD51D is added to the classical BRCA1/2 genes to form the HRR3 gene set of 3 genes; gene sets containing more genes can also be formed as required. Analysis shows that RAD51D is an HRR gene worth considering in terms of mutation rate, association with genome instability and association with drug efficacy. Analytical performance validation shows that the method reaches a sensitivity of about 92% and a specificity of about 40% (meaning that, in the population without HRR3 bi-allelic mutation, the method can identify about 60% of patients who may benefit more from PARPi maintenance treatment).
(4) Clinical performance shows that the genome instability assessment method distinguishes the effect of first-line PARPi maintenance treatment slightly better than the prior art.
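The sensitivity and specificity quoted in (3) are the usual confusion-matrix ratios, with the HRR3 bi-allelic inactivation label as the reference. A minimal sketch, with a hypothetical function name and boolean inputs assumed:

```python
def sensitivity_specificity(truth, predicted):
    """truth: reference labels (True = HRR3 bi-allelic inactivation).
    predicted: model calls (True = genome instability positive).
    Returns (sensitivity, specificity)."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one reference-positive and one reference-negative sample")
    return tp / (tp + fn), tn / (tn + fp)
```

Since specificity is the fraction of reference-negative samples called negative, a specificity of about 40% corresponds to about 60% of reference-negative samples being called positive, which is the population described in the parenthesis in (3).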
Drawings
FIG. 1 is a schematic flow chart of a genome instability assessment method based on machine learning according to the present invention.
FIG. 2 is a schematic block diagram of a genome instability evaluation system based on machine learning according to the present invention.
FIG. 3 is a schematic structural diagram of an embodiment of an electronic device according to the present invention.
Detailed Description
In order to better understand the above technical solutions, the following detailed description will be given with reference to the accompanying drawings and specific embodiments.
The method provided by the invention can be implemented in a terminal environment, and the terminal can comprise one or more of the following components: processor, memory and display screen. Wherein the memory stores at least one instruction that is loaded and executed by the processor to implement the method described in the embodiments below.
The processor may include one or more processing cores. The processor connects the various parts of the terminal through various interfaces and lines, and performs the functions of the terminal and processes data by running or executing instructions, programs, code sets or instruction sets stored in the memory and by invoking data stored in the memory.
The memory may include random access memory (RAM) or read-only memory (ROM). The memory may be used to store instructions, programs, code, code sets or instruction sets.
The display screen is used for displaying a user interface of each application program.
In addition, it will be appreciated by those skilled in the art that the structure of the terminal described above is not limiting and that the terminal may include more or fewer components, or may combine certain components, or a different arrangement of components. For example, the terminal further includes components such as a radio frequency circuit, an input unit, a sensor, an audio circuit, a power supply, and the like, which are not described herein.
Example 1
As shown in fig. 1, the present embodiment provides a genome instability assessment method based on machine learning, which includes the following steps.
S1, collecting and receiving a biological sample, and processing the biological sample to obtain a genome sample. As a preferred embodiment, the genomic sample comprises a fresh blood sample, a paraffin section sample and/or a fresh tissue sample.
In this embodiment, collecting and receiving a fresh blood sample includes: collecting and receiving, according to the testing standard, 2 ml of blood anticoagulated with EDTA. Collecting and receiving paraffin section samples includes: collecting and receiving, according to the testing standard, paraffin sections 5 μm thick and larger than 1 cm² in area, at least 6 sections for surgical tissue and at least 10 sections for needle biopsy tissue. Collecting and receiving a fresh tissue sample includes: placing fresh tissue samples that meet the testing standard into a 10% formalin preservation tube and an RNAlater preservation tube respectively and submitting them for testing, with 1 piece of tissue of at least 1 cm per tube.
As a preferred embodiment, the process comprises: and (3) performing tumor content assessment, DNA extraction, quality inspection, library construction, capturing and on-machine sequencing on the biological sample.
In this embodiment, assessing the tumor content of the biological sample includes: tumor tissue must undergo tumor content assessment, and the qualifying standard is a tumor content of at least 20%. If the tumor content is at least 10% but less than 20%, detection may proceed at risk; if the tumor content is less than 10%, detection is stopped.
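The three tumor content bands above can be written as a small decision function. This is a minimal sketch: the function name, the 0-1 fraction input, and the returned labels are illustrative assumptions, and only the 20% and 10% thresholds come from the text.

```python
def triage_tumor_content(tumor_fraction: float) -> str:
    """Map a tumor content estimate (as a fraction, 0.0-1.0) to a handling decision.

    Thresholds follow the embodiment: >= 20% qualifies, 10% to < 20% may
    proceed at risk, < 10% stops detection. The labels are hypothetical.
    """
    if not 0.0 <= tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must be between 0 and 1")
    if tumor_fraction >= 0.20:
        return "qualified"
    if tumor_fraction >= 0.10:
        return "at_risk"
    return "stop"
```

Both boundaries are inclusive on the lower side, so a sample at exactly 20% is qualified and a sample at exactly 10% is handled as at-risk.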
DNA extraction and quality inspection include: (1) for a fresh blood sample or a fresh tissue sample, the total amount of DNA measured by Qubit must be at least 200 ng, and the main DNA band detected by gel electrophoresis must be at least 10 Kbp; (2) for FFPE (formalin-fixed, paraffin-embedded) samples or wax blocks formed from paraffin section samples, the total amount of DNA must meet the minimum starting amount for a single library construction. The separate requirement in (2) reflects the nature of FFPE samples: they allow tissue to be preserved at room temperature for a long time and are used to prepare the tissue specimens required for examination (to preserve the integrity of cellular structures, the tissue is first fixed in formalin and then embedded in paraffin, which makes it convenient to section), but they tend to yield a lower amount of nucleic acid and therefore cannot always be used to their full value.
Library construction and capture include: constructing and capturing a library, and ensuring that the library concentration is at least 0.5 ng/μl. On-machine sequencing includes: sequencing on the NovaSeq 6000 platform, and ensuring that the off-machine sequencing amount is 6G for tumor tissue and 2G for the control.
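The numeric quality thresholds for fresh blood or fresh tissue samples can be checked mechanically. The following is a sketch under stated assumptions: the function and field names are hypothetical, and the FFPE criterion is left out because the text gives no numeric value for the minimum library starting amount.

```python
def failed_qc_checks(dna_total_ng: float,
                     main_band_kbp: float,
                     library_conc_ng_per_ul: float) -> list:
    """Return the names of the QC criteria that a fresh blood/tissue sample fails.

    Thresholds are taken from the embodiment; the parameter names are
    illustrative and not part of the described method.
    """
    failures = []
    if dna_total_ng < 200:            # Qubit total DNA must be >= 200 ng
        failures.append("dna_total_ng")
    if main_band_kbp < 10:            # gel electrophoresis main band must be >= 10 Kbp
        failures.append("main_band_kbp")
    if library_conc_ng_per_ul < 0.5:  # library concentration must be >= 0.5 ng/ul
        failures.append("library_conc_ng_per_ul")
    return failures
```

An empty list means the sample meets all three numeric criteria.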
S2, dividing the genome sample into a training set and a verification set, and modeling based on the training set and the verification set to obtain a genome instability assessment model. In this embodiment, the training set and the verification set are independent of each other, and the sample sizes of the training set and the verification set are between 450 and 500. In this embodiment, the modeling includes modeling by ridge regression, which is more accurate than a simple index-addition algorithm. Ridge regression is essentially an improved ordinary least squares method: it gives up the unbiasedness of least squares and, at the cost of losing part of the information and some precision, obtains regression coefficients that better reflect the actual data. Its fit on ill-conditioned data sets is markedly better than that of a linear regression model fitted by least squares.
Of course, those skilled in the art will appreciate that the modeling may instead use another machine learning classifier in place of ridge regression, selected from a linear discriminant analysis (LDA) classifier, a quadratic discriminant analysis (QDA) classifier, a support vector machine (SVM) classifier, a random forest (RF) classifier, a linear-kernel support vector machine classifier, a first- or second-order polynomial-kernel support vector machine classifier, an elastic net classifier, a sequential minimal optimization classifier, a naive Bayes classifier, and an NMF predictor classifier.
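To make the trade-off concrete, ridge regression adds a penalty alpha on the squared coefficients, which shrinks them relative to ordinary least squares. The sketch below solves the closed-form normal equations in plain Python. It is an illustration of the general technique, not the model of this embodiment: the function names, the unpenalized intercept, and the choice of alpha are assumptions.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting.

    Raises ZeroDivisionError if the matrix is singular.
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_fit(X, y, alpha):
    """Return (intercept, coefficients) minimizing
    sum((y - b0 - X w)^2) + alpha * sum(w^2); the intercept is not penalized."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered features
    yc = [v - y_mean for v in y]
    # Normal equations: (Xc^T Xc + alpha * I) w = Xc^T yc
    gram = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) + (alpha if a == b else 0.0)
             for b in range(p)] for a in range(p)]
    rhs = [sum(Xc[i][a] * yc[i] for i in range(n)) for a in range(p)]
    w = solve_linear_system(gram, rhs)
    intercept = y_mean - sum(w[j] * x_mean[j] for j in range(p))
    return intercept, w
```

With alpha = 0 this reduces to ordinary least squares; a positive alpha keeps the system solvable even when features are strongly correlated, which is the situation in which least squares becomes unstable.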
S3, training the genome instability assessment model based on a modeling standard formed from a gene set composed of a plurality of homologous recombination repair (HRR) genes. In this embodiment, as a preferred embodiment, the plurality of HRR genes includes the HRR3 gene set, consisting of the three genes BRCA1, BRCA2 and RAD51D. Internal analysis results show that mutation of RAD51D or methylation of the BRCA1 gene promoter causes homologous recombination deficiency (HRD), resulting in genome instability, and that RAD51D is an HRR gene worth considering in terms of mutation rate, association with genome instability, and association with drug efficacy.
Of course, one skilled in the art could also select one or more of PALB2, CDK12, RAD51C, CHEK2 and ATM and combine them with BRCA1, BRCA2 and RAD51D to construct a new gene set.
As a preferred embodiment, the modeling standard includes: a sample that satisfies any one of a first set of three conditions is defined as a true positive sample in the model, the first set of three conditions comprising: (1) BRCA1 biallelic inactivation; (2) BRCA2 biallelic inactivation; (3) RAD51D biallelic inactivation.
Wherein a gene in the gene set that satisfies any one of a second set of three conditions is defined as having biallelic inactivation, the second set of three conditions comprising: (1) one allele carries a class 4/5 mutation and the other allele shows LOH; (2) two class 4/5 mutations have occurred in the same gene; (3) one allele carries a class 4/5 mutation and the other allele is hypermethylated.
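The two sets of conditions above amount to a labeling rule for training samples. The sketch below encodes that rule; the `GeneStatus` record, its field names, and the treatment of more than two class 4/5 mutations as satisfying condition (2) are illustrative assumptions.

```python
from dataclasses import dataclass

HRR3_GENES = ("BRCA1", "BRCA2", "RAD51D")


@dataclass
class GeneStatus:
    """Hypothetical per-gene summary used only for this illustration."""
    class45_mutations: int = 0     # number of class 4/5 mutations found in the gene
    loh: bool = False              # the other allele shows LOH
    hypermethylated: bool = False  # the other allele is hypermethylated


def is_biallelic_inactivation(status: GeneStatus) -> bool:
    """Second set of conditions: mutation + LOH, two mutations, or mutation + hypermethylation."""
    if status.class45_mutations >= 2:
        return True
    return status.class45_mutations == 1 and (status.loh or status.hypermethylated)


def is_true_positive(sample: dict) -> bool:
    """First set of conditions: biallelic inactivation of BRCA1, BRCA2 or RAD51D."""
    return any(is_biallelic_inactivation(sample.get(gene, GeneStatus()))
               for gene in HRR3_GENES)
```

Genes outside the HRR3 gene set are ignored here, so extending the gene set as described above would only require changing `HRR3_GENES`.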
S4, evaluating the genome instability based on a plurality of genome instability indexes. As a preferred embodiment, the plurality of genome instability indexes are built as follows: the allelic states of fragments are divided into three classes: allele-balanced but amplified, non-LOH but allele-imbalanced, and LOH; at the same time, the three classes are divided into five length intervals according to absolute length: 0-5M, 5-10M, 10-15M, 15-20M and >20M; wherein alleles are a pair of genes that control contrasting traits at the same position on a pair of homologous chromosomes.
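Crossing the three classes with the five length intervals gives 15 fragment-count features. The sketch below shows one way to compute them, assuming that "M" means megabases, that each fragment arrives already labeled with its class, and that each interval excludes its lower bound and includes its upper bound, as in the indicator list of this embodiment. The input format and function names are hypothetical.

```python
from collections import Counter

MB = 1_000_000  # assumption: "M" in the interval names means megabases
SEGMENT_CLASSES = ("b", "imb", "loh")  # balanced-but-amplified, imbalanced non-LOH, LOH
BIN_UPPER_BOUNDS_MB = ((5, "0-5M"), (10, "5-10M"), (15, "10-15M"), (20, "15-20M"))
BIN_LABELS = tuple(label for _, label in BIN_UPPER_BOUNDS_MB) + (">20M",)


def length_bin(length_bp: int) -> str:
    """Return the interval label for a fragment length; upper bounds are inclusive."""
    for upper_mb, label in BIN_UPPER_BOUNDS_MB:
        if length_bp <= upper_mb * MB:
            return label
    return ">20M"


def count_segment_features(segments) -> dict:
    """Count fragments per (class, length interval).

    `segments` is an iterable of (segment_class, length_bp) pairs.
    Returns all 15 features, including those with a count of zero.
    """
    counts = Counter()
    for seg_class, length_bp in segments:
        if seg_class not in SEGMENT_CLASSES:
            raise ValueError(f"unknown segment class: {seg_class!r}")
        counts[f"{seg_class}_{length_bin(length_bp)}"] += 1
    return {f"{c}_{label}": counts[f"{c}_{label}"]
            for c in SEGMENT_CLASSES for label in BIN_LABELS}
```

Returning every key, even when its count is zero, keeps the feature vector the same length for every sample.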
In this embodiment, the plurality of genome instability indexes includes 19 genome instability indexes, which are respectively:
(1) b_0-5M: the number of allele-balanced but amplified fragments with a length of 0-5M (5M inclusive);
(2) b_5-10M: the number of allele-balanced but amplified fragments with a length of 5-10M (5M exclusive, 10M inclusive);
(3) b_10-15M: the number of allele-balanced but amplified fragments with a length of 10-15M (10M exclusive, 15M inclusive);
(4) b_15-20M: the number of allele-balanced but amplified fragments with a length of 15-20M (15M exclusive, 20M inclusive);
(5) b_>20M: the number of allele-balanced but amplified fragments with a length greater than 20M;
(6) imb_0-5M: the number of non-LOH but allele-imbalanced fragments with a length of 0-5M (5M inclusive);
(7) imb_5-10M: the number of non-LOH but allele-imbalanced fragments with a length of 5-10M (5M exclusive, 10M inclusive);
(8) imb_10-15M: the number of non-LOH but allele-imbalanced fragments with a length of 10-15M (10M exclusive, 15M inclusive);
(9) imb_15-20M: the number of non-LOH but allele-imbalanced fragments with a length of 15-20M (15M exclusive, 20M inclusive);
(10) imb_>20M: the number of non-LOH but allele-imbalanced fragments with a length greater than 20M;
(11) loh_0-5M: the number of LOH fragments with a length of 0-5M (5M inclusive);
(12) loh_5-10M: the number of LOH fragments with a length of 5-10M (5M exclusive, 10M inclusive);
(13) loh_10-15M: the number of LOH fragments with a length of 10-15M (10M exclusive, 15M inclusive);
(14) loh_15-20M: the number of LOH fragments with a length of 15-20M (15M exclusive, 20M inclusive);
(15) loh_>20M: the number of LOH fragments with a length greater than 20M;
(16) purity: tumor cell fraction;
(17) ploidy: tumor cell genome ploidy;
(18) si: a measure of the heterogeneity of aberrant copy number (aberrant CN) states, obtained as follows: counting the fragments (segments) whose alleles are not 1:1; weighting the allelic state of each segment by the length of the segment; and calculating a diversity index of the allelic states in which copy number variation (CNV) occurs across the whole sample;
(19) hlamp: obtained as follows: for the fragments (segments) located in the high amplification regions (including 1q21.1-24.1, 1q42.2-44, 8q11.21-24.3 and 10p15.3-14), calculating the proportion of allelic states with copy number variation (CNV) ≥ 5 among the fragments in those regions.
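The si indicator can be illustrated with a length-weighted diversity calculation. The text does not say which diversity index is used, so the sketch below assumes the Shannon index. The (major copy number, minor copy number, length) input format and the reading of "not 1:1" as "major and minor copy numbers differ" are also assumptions.

```python
import math
from collections import defaultdict


def allelic_state_diversity(segments) -> float:
    """Length-weighted Shannon diversity of aberrant allelic states.

    `segments` is an iterable of (major_cn, minor_cn, length_bp) tuples.
    Segments whose alleles are 1:1 (major_cn == minor_cn) are skipped.
    The Shannon index is an assumed choice; the source only says
    "diversity index".
    """
    weight_by_state = defaultdict(float)
    for major_cn, minor_cn, length_bp in segments:
        if major_cn == minor_cn:
            continue  # balanced segment, not counted
        weight_by_state[(major_cn, minor_cn)] += length_bp
    total = sum(weight_by_state.values())
    if total == 0:
        return 0.0  # no aberrant segments
    return -sum((w / total) * math.log(w / total)
                for w in weight_by_state.values())
```

Under this reading, a sample whose aberrant segments all share one allelic state scores 0, and the score rises as the aberrant genome is spread more evenly over more distinct states.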
The experimental operation part of this example is as follows:
1. Sample collection and receipt.
(1) Fresh blood sample: the detection standard is 2 ml of EDTA-anticoagulated blood.
(2) Paraffin section samples: the detection standard is sections 5 μm thick and larger than 1 cm², at least 6 sections for surgical tissue and at least 10 sections for puncture (needle biopsy) tissue.
(3) Fresh tissue samples: the tissue is placed into a 10% formalin preservation tube and an RNAlater preservation tube, respectively, for submission, with 1 piece of tissue of at least 1 cm per tube.
2. Tumor content assessment.
Tumor tissue must undergo tumor content assessment, and the qualifying standard is a tumor content of at least 20%. If the tumor content is at least 10% but less than 20%, detection may proceed at risk; if the tumor content is less than 10%, detection is stopped.
3. Sample DNA extraction and quality inspection.
(1) For blood or fresh tissue, the total amount of DNA measured by Qubit must be at least 200 ng, and the main DNA band detected by gel electrophoresis must be at least 10 Kbp.
(2) For FFPE samples or wax blocks, the total amount of DNA must meet the minimum starting amount for a single library construction.
4. Library construction and capture: the concentration of the finished library must be at least 0.5 ng/μl.
5. On-machine sequencing: the sequencing platform is the NovaSeq 6000; the off-machine sequencing amount is 6G for tumor tissue and 2G for the control.
The bioinformatics analysis part of this embodiment is as follows:
1. The bioinformatics analyst starts data analysis after receiving the data manager's notification that the data are off the machine.
2. According to the data docking table provided by the data manager, the project numbers, the corresponding subject screening numbers and the corresponding data paths are extracted and written into the standard input format required by the automated analysis pipeline.
3. The data analysis pipeline is started. The analysis generally completes within 5-8 hours. After completion, the bioinformatics analyst checks the analysis quality control results and judges them against the quality control standard: if quality control is passed, the report analysis record is filled in; if the analysis is interrupted or quality control is not passed, the case is handled as an exception.
4. Exception handling: if the analysis is interrupted, the cause of the interruption is confirmed first. If the interruption is due to external factors, such as a power failure in the machine room or a hardware fault, then after the fault is removed the intermediate folder is renamed and the analysis is rerun according to the analysis pipeline. The original folder is renamed to "original folder name-number of analyses". If quality control is not passed, the data manager is contacted, supplementary sequencing or re-sequencing is performed, and the analysis is restarted.
5. A report is generated.
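The renaming rule in the exception handling step can be sketched with the standard library. The function name is hypothetical, and interpreting the number in the new folder name as the count of the analysis run being archived is an assumption.

```python
from pathlib import Path


def archive_intermediate_folder(folder, analysis_count: int) -> Path:
    """Rename an intermediate folder to "<original name>-<analysis count>".

    Refuses to overwrite an existing folder so that earlier runs are kept.
    """
    folder = Path(folder)
    target = folder.with_name(f"{folder.name}-{analysis_count}")
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    folder.rename(target)
    return target
```

For example, archiving a folder named `run01` after its first interrupted analysis would leave `run01-1` in the same parent directory, freeing the original name for the rerun.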
Example 2
Referring to fig. 2, the present embodiment provides a machine learning-based genome instability assessment system, comprising: a sample collection module 101, configured to collect and receive a biological sample, and process the biological sample to obtain a genome sample; a model modeling module 102, configured to divide the genome sample into a training set and a verification set, and perform modeling based on the training set and the verification set to obtain a genome instability assessment model; a model training module 103, configured to train the genome instability assessment model based on a modeling standard formed from a gene set composed of a plurality of HRR genes; and an instability assessment module 104, configured to assess genome instability based on a plurality of genome instability indexes.
The system can implement the assessment method provided in the first embodiment; for the specific assessment method, refer to the description of the first embodiment, which is not repeated here.
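The four modules can be read as stages of a pipeline in which each stage consumes the output of the previous one. The sketch below shows only that wiring: the class name, the callable signatures, and the data passed between stages are all hypothetical, and each module's internals are left to the caller.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class GenomeInstabilitySystem:
    """Illustrative wiring of modules 101-104; the interfaces are assumptions."""
    collect: Callable[[Any], Any]       # module 101: biological samples -> genome samples
    build_model: Callable[[Any], Any]   # module 102: split the data and build the model
    train: Callable[[Any, Any], Any]    # module 103: train against the modeling standard
    assess: Callable[[Any, Any], Any]   # module 104: score instability from the indexes

    def run(self, biological_samples: Any) -> Any:
        genome_samples = self.collect(biological_samples)
        model = self.build_model(genome_samples)
        model = self.train(model, genome_samples)
        return self.assess(model, genome_samples)
```

Keeping each module behind a plain callable lets any one of them be replaced, for example to swap the ridge regression model for another classifier, without touching the others.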
The invention also provides a memory storing a plurality of instructions for implementing the method according to the first embodiment.
As shown in fig. 3, the present invention further provides an electronic device, including a processor 301 and a memory 302 connected to the processor 301, where the memory 302 stores a plurality of instructions, and the instructions may be loaded and executed by the processor, so that the processor can execute the method according to the first embodiment.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Claims (9)
1. A machine learning-based method for assessing genomic instability, comprising:
S1, collecting and receiving a biological sample, and processing the biological sample to obtain a genome sample;
S2, dividing the genome sample into a training set and a verification set, and modeling based on the training set and the verification set to obtain a genome instability assessment model;
S3, training the genome instability assessment model based on a modeling standard formed from a gene set composed of a plurality of HRR genes; the modeling standard in S3 includes: a sample that satisfies any one of a first set of three conditions is defined as a true positive sample in the model, the first set of three conditions comprising:
(1) BRCA1 bi-allelic inactivation;
(2) BRCA2 bi-allelic inactivation;
(3) RAD51D bi-allelic inactivation;
wherein a gene in the gene set that satisfies any one of a second set of three conditions is defined as having the biallelic inactivation, the second set of three conditions comprising:
(1) One allele carries a class 4/5 mutation, and the other allele shows LOH;
(2) Two class 4/5 mutations have occurred in the same gene;
(3) One allele carries a class 4/5 mutation, and the other allele is hypermethylated;
S4, evaluating the genome instability based on a plurality of genome instability indexes.
2. The machine learning based genome instability assessment method of claim 1, wherein the genome samples comprise fresh blood samples, paraffin section samples and/or fresh tissue samples; the processing comprises: performing tumor content assessment, DNA extraction and quality inspection, library construction and capture, and on-machine sequencing on the biological sample.
3. The machine learning based genome instability assessment method of claim 2, wherein in S2 the training set and the verification set are independent of each other, and the sample sizes of the training set and the verification set are between 450 and 500.
4. A machine learning based genome instability assessment method according to claim 3, wherein the modeling based on the training set and the validation set in S2 to obtain a genome instability assessment model comprises modeling using ridge regression.
5. The machine learning based genome instability assessment method of claim 4, wherein the plurality of HRR genes in S3 comprises: the HRR3 gene set, consisting of the three genes BRCA1, BRCA2 and RAD51D.
6. The machine learning based genome instability assessment method of claim 5, wherein the plurality of genome instability indices in S4 comprise:
the allelic states of fragments are divided into three classes: allele-balanced but amplified, non-LOH but allele-imbalanced, and LOH; at the same time, the three classes are divided into five length intervals according to absolute length: 0-5M, 5-10M, 10-15M, 15-20M and >20M; wherein alleles are a pair of genes that control contrasting traits at the same position on a pair of homologous chromosomes;
the plurality of genome instability indexes comprise 19 genome instability indexes, which are respectively:
(1) b_0-5M: the number of allele-balanced but amplified fragments with a length of 0-5M, wherein the length includes 5M;
(2) b_5-10M: the number of allele-balanced but amplified fragments with a length of 5-10M, wherein the length excludes 5M but includes 10M;
(3) b_10-15M: the number of allele-balanced but amplified fragments with a length of 10-15M, wherein the length excludes 10M but includes 15M;
(4) b_15-20M: the number of allele-balanced but amplified fragments with a length of 15-20M, wherein the length excludes 15M but includes 20M;
(5) b_>20M: the number of allele-balanced but amplified fragments with a length greater than 20M;
(6) imb_0-5M: the number of non-LOH but allele-imbalanced fragments with a length of 0-5M, wherein the length includes 5M;
(7) imb_5-10M: the number of non-LOH but allele-imbalanced fragments with a length of 5-10M, wherein the length excludes 5M but includes 10M;
(8) imb_10-15M: the number of non-LOH but allele-imbalanced fragments with a length of 10-15M, wherein the length excludes 10M but includes 15M;
(9) imb_15-20M: the number of non-LOH but allele-imbalanced fragments with a length of 15-20M, wherein the length excludes 15M but includes 20M;
(10) imb_>20M: the number of non-LOH but allele-imbalanced fragments with a length greater than 20M;
(11) loh_0-5M: the number of LOH fragments with a length of 0-5M, wherein the length includes 5M;
(12) loh_5-10M: the number of LOH fragments with a length of 5-10M, wherein the length excludes 5M but includes 10M;
(13) loh_10-15M: the number of LOH fragments with a length of 10-15M, wherein the length excludes 10M but includes 15M;
(14) loh_15-20M: the number of LOH fragments with a length of 15-20M, wherein the length excludes 15M but includes 20M;
(15) loh_>20M: the number of LOH fragments with a length greater than 20M;
(16) purity: tumor cell fraction;
(17) ploidy: tumor cell genome ploidy;
(18) si: a measure of the heterogeneity of aberrant CN states, obtained as follows: counting the segments whose alleles are not 1:1; weighting the allelic state of each segment by its length; and calculating a diversity index of the allelic states in which copy number variation (CNV) occurs across the whole sample;
(19) hlamp: obtained as follows: for the segments located in the high amplification regions, calculating the proportion of allelic states with copy number variation (CNV) greater than or equal to 5 among the segments in those regions; wherein the high amplification regions comprise 1q21.1-24.1, 1q42.2-44, 8q11.21-24.3 and 10p15.3-14.
7. A machine learning based genome instability assessment system for implementing the assessment method according to any of claims 1 to 6, comprising:
a sample collection module (101) for collecting and receiving a biological sample, and processing the biological sample to obtain a genome sample;
a model modeling module (102) for dividing the genome sample into a training set and a verification set, and modeling based on the training set and the verification set to obtain a genome instability assessment model;
a model training module (103) for training the genome instability assessment model based on a set of genes formed by a plurality of HRR genes to form a modeling standard;
and an instability assessment module (104) for assessing genome instability based on the plurality of genome instability indexes.
8. An electronic device comprising a processor and a memory, the memory storing a plurality of instructions, the processor being configured to read the instructions and perform the evaluation method of any one of claims 1-6.
9. A computer readable storage medium storing a plurality of instructions readable by a processor and for performing the evaluation method according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310558775.XA CN116312781B (en) | 2023-05-17 | 2023-05-17 | Genome instability assessment method and system based on machine learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116312781A (en) | 2023-06-23 |
CN116312781B (en) | 2023-08-18 |
Family
ID=86817137
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310558775.XA Active CN116312781B (en) | 2023-05-17 | 2023-05-17 | Genome instability assessment method and system based on machine learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116312781B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112164420A (en) * | 2020-09-07 | 2021-01-01 | 厦门艾德生物医药科技股份有限公司 | Method for establishing genome scar model |
CN112164422A (en) * | 2020-10-12 | 2021-01-01 | 郑州大学第一附属医院 | Grading method for quantifying TIME infiltration mode |
CN112930569A (en) * | 2018-08-31 | 2021-06-08 | 夸登特健康公司 | Microsatellite instability detection in cell-free DNA |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP4360094A1 (en) * | 2021-06-25 | 2024-05-01 | Foundation Medicine, Inc. | System and method of classifying homologous repair deficiency |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||