CN116312781B - Genome instability assessment method and system based on machine learning - Google Patents


Info

Publication number
CN116312781B
Authority
CN
China
Prior art keywords
genome
loh
length
fragments
instability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310558775.XA
Other languages
Chinese (zh)
Other versions
CN116312781A (en
Inventor
季序我
孙天齐
李哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Pukang Ruiren Medical Laboratory Co ltd
Predatum Biomedicine Suzhou Co ltd
Precision Scientific Technology Beijing Co ltd
Original Assignee
Beijing Pukang Ruiren Medical Laboratory Co ltd
Predatum Biomedicine Suzhou Co ltd
Precision Scientific Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Pukang Ruiren Medical Laboratory Co ltd, Predatum Biomedicine Suzhou Co ltd, Precision Scientific Technology Beijing Co ltd filed Critical Beijing Pukang Ruiren Medical Laboratory Co ltd
Priority to CN202310558775.XA priority Critical patent/CN116312781B/en
Publication of CN116312781A publication Critical patent/CN116312781A/en
Application granted granted Critical
Publication of CN116312781B publication Critical patent/CN116312781B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50Mutagenesis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20Supervised data analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Chemical & Material Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a machine learning-based genome instability assessment method and system. The method comprises the following steps: collecting and receiving a biological sample, and processing the biological sample to obtain a genome sample; dividing the genome sample into a training set and a validation set, and modeling based on the training set and the validation set to obtain a genome instability assessment model; training the genome instability assessment model based on a modeling standard formed by a gene set of a plurality of HRR genes; and assessing genome instability based on a plurality of genome instability indexes. The invention replaces the original direct addition algorithm with a more complex and more accurate machine learning algorithm. The modeling standard includes BRCA1/2 and other HRR genes that perform well in terms of mutation rate, correlation with genome instability and correlation with drug efficacy and can therefore be incorporated. A more accurate machine learning modeling method thus gives better analysis and assessment of genome instability.

Description

Genome instability assessment method and system based on machine learning
Technical Field
The invention relates to the technical field of medical treatment, in particular to a genome instability assessment method and system based on machine learning.
Background
Homologous recombination deficiency (HRD) status is a key indicator for treatment selection and prognosis in many tumors, and clinical studies have shown that HRD status is strongly associated with sensitivity to platinum-based chemotherapy and PARP inhibitors. HRD testing has been approved by the FDA as a companion diagnostic for ovarian cancer patients treated with olaparib and niraparib. Olaparib was the first PARP inhibitor marketed both worldwide and domestically, and it is approved for patients with ovarian cancer, breast cancer, prostate cancer, pancreatic cancer and other cancers. HRD is reported to be present in about one in every two ovarian cancer patients. Compared with testing for BRCA mutations alone, HRD testing can identify a larger population that is sensitive to PARP inhibitors. Normal cells have complex DNA repair systems, including PARP (poly(ADP-ribose) polymerase), which repairs DNA single-strand breaks, and the homologous recombination repair (HRR) pathway, in which proteins such as BRCA1, BRCA2 and PALB2 repair DNA double-strand breaks. HRR is an important mechanism of DNA double-strand break repair (DSBR), and BRCA1 and BRCA2 are two key genes of this pathway. If a mutation in BRCA1 or BRCA2 causes loss of protein function, the result is HRD, that is, a defect in homologous recombination repair function. HRD caused by mutation of these genes or by methylation of the BRCA1 gene promoter in turn causes genomic instability, which manifests as "genomic scars", including LOH (loss of heterozygosity), TAI (telomeric allelic imbalance) and LST (large-scale state transitions).
PARP (poly(adenosine diphosphate-ribose) polymerase) is a key enzyme in the repair of DNA single-strand breaks. If a PARP inhibitor blocks single-strand break repair, the unrepaired single-strand breaks are converted into double-strand breaks when the cell replicates its DNA and proliferates. If the cell also has a homologous recombination repair deficiency (HRD), the large number of double-strand breaks cannot be repaired and the cell dies. This mechanism of action of PARP inhibitors is known as the "synthetic lethal" effect.
HRD causes genomic instability, which manifests as "genomic scars", and HRD scoring is currently an accepted method for assessing HRD status. The HRD score combines three indexes, LOH, LST and TAI, to score genomic instability; the values are obtained by detecting single nucleotide polymorphism (SNP) sites in cells and computing the indexes from them. LOH, LST and TAI are each independent predictors of genomic instability. The current common practice for reflecting the state of genomic instability is to obtain the HRD score by simply adding these three indexes and to set the HRD score threshold so that BRCA1/2 biallelic inactivation is recognized with 95% sensitivity.
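The conventional scoring described above can be sketched as follows. This is a minimal illustration, not the scoring code of any existing assay: the function names are hypothetical, and the rule used to turn "95% sensitivity" into a cutoff (the largest cutoff that still calls at least 95% of the reference-positive samples positive) is an assumption, since the text does not spell it out.

```python
import math


def hrd_score(loh: int, tai: int, lst: int) -> int:
    """Conventional HRD score: the unweighted sum of the three scar indexes."""
    return loh + tai + lst


def threshold_for_sensitivity(positive_scores, sensitivity=0.95):
    """Largest cutoff such that at least `sensitivity` of the reference-positive
    scores are >= cutoff (a sample is called positive when score >= cutoff).

    `positive_scores` are the HRD scores of samples known to carry BRCA1/2
    biallelic inactivation. The selection rule itself is an assumption.
    """
    if not positive_scores:
        raise ValueError("need at least one reference-positive score")
    ranked = sorted(positive_scores, reverse=True)
    # Small epsilon guards against floating-point noise in sensitivity * n.
    needed = math.ceil(sensitivity * len(ranked) - 1e-9)
    return ranked[needed - 1]


if __name__ == "__main__":
    # Synthetic scores for illustration only.
    reference_positive = list(range(1, 21))
    print(hrd_score(10, 12, 20))
    print(threshold_for_sensitivity(reference_positive))
```

Because every index enters the sum with weight 1, the only quantity tuned on data in this scheme is the cutoff.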
However, the current common practice has the following technical drawbacks:
(1) The three genome instability indexes LOH, LST and TAI and the methods for calculating them were developed many years ago; based on years of project and research experience, the genomic scar indexes can be improved in both number and definition;
(2) Obtaining the HRD score by directly adding the genome instability indexes is simple and direct, but it does not give the best achievable analysis performance;
(3) Models are trained with BRCA1/2 biallelic inactivation as the standard, without considering the contribution of other HRR pathway genes to the loss of homologous recombination function. As a result, genomic instability that arises when other HRR-related genes are mutated or their promoters are methylated falls outside the scope of the assessment, and the assessment result is not accurate enough.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a machine learning-based genome instability assessment method and system. Drawing on project and research experience and on the latest scientific findings in the field of genome instability, the invention redesigns the indexes used to assess genome instability so that they are more comprehensive and detailed; replaces the original direct addition algorithm with a more complex and more accurate machine learning algorithm; and, for the modeling standard, identifies important HRR genes other than BRCA1/2 that can be included because they perform well in terms of mutation rate, correlation with genome instability and correlation with drug efficacy. A more accurate machine learning modeling method thus gives better analysis and assessment of genome instability. The method is particularly suitable for patients whose BRCA1/2 test result is negative and whose HRD status needs further assessment.
The first aspect of the present invention provides a machine learning-based genome instability assessment method, comprising:
S1, collecting and receiving a biological sample, and processing the biological sample to obtain a genome sample;
S2, dividing the genome sample into a training set and a validation set, and modeling based on the training set and the validation set to obtain a genome instability assessment model;
S3, training the genome instability assessment model based on a modeling standard formed by a gene set formed by a plurality of HRR genes;
S4, evaluating the genome instability based on a plurality of genome instability indexes.
Preferably, the genome sample comprises a fresh blood sample, a paraffin section sample and/or a fresh tissue sample; the processing comprises: performing tumor content assessment, DNA extraction, quality inspection, library construction, capture and on-machine sequencing on the biological sample.
Preferably, the training set and the validation set in S2 are independent of each other, and the sample sizes of the training set and the validation set are each between 450 and 500.
Preferably, the modeling in S2 to obtain a genome instability assessment model based on the training set and the validation set includes modeling using ridge regression.
Preferably, the plurality of HRR genes in S3 include: the HRR3 gene set, consisting of the three genes BRCA1, BRCA2 and RAD51D.
Preferably, the modeling standard in S3 is as follows: a sample that satisfies any one of a first set of three conditions is defined as a true positive sample in the model, the first set of three conditions comprising:
(1) BRCA1 biallelic inactivation;
(2) BRCA2 biallelic inactivation;
(3) RAD51D biallelic inactivation;
wherein a gene is defined as having biallelic inactivation if it satisfies any one of a second set of three conditions, comprising:
(1) One allele carries a class 4/5 mutation, and the other allele shows LOH;
(2) Two class 4/5 mutations occur in the same gene;
(3) One allele carries a class 4/5 mutation, and the other allele is hypermethylated.
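The labeling rule above can be sketched as a small predicate. This is only an illustration under stated assumptions: the `GeneStatus` record and its field names are hypothetical, and "two class 4/5 mutations" is read here as "two or more".

```python
from dataclasses import dataclass

# The HRR3 gene set named in the text.
HRR3_GENES = ("BRCA1", "BRCA2", "RAD51D")


@dataclass
class GeneStatus:
    """Hypothetical per-gene summary for one sample."""
    class45_mutations: int = 0      # number of class 4/5 mutations found in the gene
    loh: bool = False               # the other allele shows LOH
    hypermethylated: bool = False   # the other allele is hypermethylated


def is_biallelic_inactivated(gene: GeneStatus) -> bool:
    """Second set of conditions: mutation + LOH, two mutations, or mutation + methylation."""
    if gene.class45_mutations >= 2:  # assumption: "two" is read as "two or more"
        return True
    return gene.class45_mutations >= 1 and (gene.loh or gene.hypermethylated)


def is_true_positive(sample: dict) -> bool:
    """First set of conditions: biallelic inactivation of any HRR3 gene."""
    return any(
        is_biallelic_inactivated(sample.get(name, GeneStatus()))
        for name in HRR3_GENES
    )


if __name__ == "__main__":
    sample = {"RAD51D": GeneStatus(class45_mutations=1, loh=True)}
    print(is_true_positive(sample))
```

Extending the standard to a larger gene set only requires changing `HRR3_GENES`; the per-gene rule stays the same.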
Preferably, the plurality of genome instability indexes in S4 include:
allelic states are divided into three classes: allele-balanced but amplified, non-LOH but allele-imbalanced, and LOH; the fragments in each of the three classes are further divided into five length intervals according to absolute length: 0-5M, 5-10M, 10-15M, 15-20M and >20M; wherein alleles are a pair of genes that are located at the same position on a pair of homologous chromosomes and control contrasting traits;
the plurality of genome instability indexes comprise 19 genome instability indexes, which are respectively:
(1) b_0-5M: the number of allele-balanced but amplified fragments with a length of 0-5M (inclusive);
(2) b_5-10M: the number of allele-balanced but amplified fragments with a length of 5M (exclusive) to 10M (inclusive);
(3) b_10-15M: the number of allele-balanced but amplified fragments with a length of 10M (exclusive) to 15M (inclusive);
(4) b_15-20M: the number of allele-balanced but amplified fragments with a length of 15M (exclusive) to 20M (inclusive);
(5) b_>20M: the number of allele-balanced but amplified fragments with a length greater than 20M;
(6) imb_0-5M: the number of non-LOH but allele-imbalanced fragments with a length of 0-5M (inclusive);
(7) imb_5-10M: the number of non-LOH but allele-imbalanced fragments with a length of 5M (exclusive) to 10M (inclusive);
(8) imb_10-15M: the number of non-LOH but allele-imbalanced fragments with a length of 10M (exclusive) to 15M (inclusive);
(9) imb_15-20M: the number of non-LOH but allele-imbalanced fragments with a length of 15M (exclusive) to 20M (inclusive);
(10) imb_>20M: the number of non-LOH but allele-imbalanced fragments with a length greater than 20M;
(11) loh_0-5M: the number of LOH fragments with a length of 0-5M (inclusive);
(12) loh_5-10M: the number of LOH fragments with a length of 5M (exclusive) to 10M (inclusive);
(13) loh_10-15M: the number of LOH fragments with a length of 10M (exclusive) to 15M (inclusive);
(14) loh_15-20M: the number of LOH fragments with a length of 15M (exclusive) to 20M (inclusive);
(15) loh_>20M: the number of LOH fragments with a length greater than 20M;
(16) purity: tumor cell fraction;
(17) ploidy: tumor cell genome ploidy;
(18) si: measures the heterogeneity of aberrant copy number (aberrant CN) states, and is obtained by: taking all fragments (segments) whose allele ratio is not 1:1; weighting the allelic state of each fragment by the length of the fragment; and calculating a diversity index of the allelic states in which copy number variation (CNV) occurs across the whole sample;
(19) hlamp: obtained by: taking the fragments (segments) located in the high amplification regions (including 1q21.1-24.1, 1q42.2-44, 8q11.21-24.3 and 10p15.3-14), and calculating the proportion of those fragments whose allelic state has a copy number ≥ 5.
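The fifteen count-based indexes (1) to (15) amount to a two-way binning of fragments by allelic state and by length. The sketch below is illustrative only: it assumes each fragment has already been classified upstream as `b`, `imb` or `loh`, that lengths are given in the same "M" unit used in the text, and that every interval excludes its lower bound and includes its upper bound, as stated for the `b_` and `imb_` indexes. The function names are hypothetical.

```python
from collections import Counter

STATES = ("b", "imb", "loh")
BIN_EDGES = (5, 10, 15, 20)  # upper bounds (inclusive) of the first four intervals
BIN_LABELS = ("0-5M", "5-10M", "10-15M", "15-20M", ">20M")


def length_bin(length_m: float) -> str:
    """Map a fragment length to its interval label (lower bound exclusive)."""
    if length_m <= 0:
        raise ValueError("fragment length must be positive")
    for edge, label in zip(BIN_EDGES, BIN_LABELS):
        if length_m <= edge:
            return label
    return BIN_LABELS[-1]


def count_indexes(fragments):
    """Count fragments per (state, length interval).

    `fragments` is an iterable of (state, length) pairs. The result always
    contains all 15 keys, with 0 for empty combinations.
    """
    counts = Counter()
    for state, length_m in fragments:
        if state not in STATES:
            raise ValueError(f"unknown allelic state: {state}")
        counts[f"{state}_{length_bin(length_m)}"] += 1
    return {f"{s}_{label}": counts[f"{s}_{label}"] for s in STATES for label in BIN_LABELS}


if __name__ == "__main__":
    # Synthetic fragments for illustration only.
    example = [("loh", 5.0), ("loh", 5.1), ("b", 25.0), ("imb", 0.3)]
    for name, value in count_indexes(example).items():
        print(name, value)
```

Returning a fixed set of keys keeps the feature vector the same length for every sample, which is what a regression model downstream needs.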
In a second aspect of the present invention, there is provided a machine learning-based genome instability assessment system comprising:
the sample collection module is used for collecting and receiving a biological sample, and processing the biological sample to obtain a genome sample;
the model modeling module is used for dividing the genome sample into a training set and a validation set, and modeling is carried out based on the training set and the validation set to obtain a genome instability assessment model;
the model training module is used for training the genome instability assessment model based on a modeling standard formed by a gene set formed by a plurality of HRR genes;
and the instability evaluation module is used for evaluating the genome instability based on a plurality of genome instability indexes.
A third aspect of the invention provides an electronic device comprising a processor and a memory, the memory storing a plurality of instructions, the processor being for reading the instructions and performing the method according to the first aspect.
A fourth aspect of the invention provides a computer readable storage medium storing a plurality of instructions readable by a processor and for performing the method of the first aspect.
The genome instability assessment method, system and electronic equipment based on machine learning provided by the invention have the following beneficial effects:
(1) The number of genome instability indexes is increased from the original 3 to 19, so that genome instability is assessed more comprehensively, in more detail and more accurately.
(2) Ridge regression is used for modeling, which is more accurate than a simple index addition algorithm. The training set and the validation set used for modeling are independent of each other, and each has a sample size between 450 and 500, which improves the accuracy of the model.
(3) For the modeling standard, RAD51D is added to the classical BRCA1/2 genes to form the HRR3 gene set of 3 genes; gene sets formed by more genes can also be formed as needed. The analysis results show that RAD51D is an HRR gene worth considering in terms of mutation rate, association with genome instability and association with drug efficacy. Analytical performance validation shows that the genome instability assessment method reaches a sensitivity of about 92% and a specificity of about 40% (meaning that, in the population without HRR3 biallelic mutations, the method can screen out about 60% of patients as potentially obtaining better efficacy from PARPi maintenance treatment).
(4) Clinical performance shows that the genome instability assessment method distinguishes the effect of first-line PARPi maintenance treatment slightly better than the prior art.
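For reference, the sensitivity and specificity quoted in item (3) are the usual confusion-matrix ratios. The sketch below uses synthetic labels purely for illustration and a hypothetical function name; it does not reproduce the validation data.

```python
def sensitivity_specificity(truth, calls):
    """Return (sensitivity, specificity).

    `truth[i]` is True when sample i is a reference positive (for example, an
    HRR3 biallelic inactivation carrier); `calls[i]` is True when the model
    calls sample i positive.
    """
    pairs = list(zip(truth, calls))
    tp = sum(1 for t, c in pairs if t and c)
    fn = sum(1 for t, c in pairs if t and not c)
    tn = sum(1 for t, c in pairs if not t and not c)
    fp = sum(1 for t, c in pairs if not t and c)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative reference sample")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Synthetic example: 4 reference positives, 5 reference negatives.
    truth = [True] * 4 + [False] * 5
    calls = [True, True, True, False, True, True, True, False, False]
    sens, spec = sensitivity_specificity(truth, calls)
    print(sens, spec, 1 - spec)
```

The quantity `1 - specificity` is the fraction of reference-negative samples that the model still calls positive, which is how a specificity of about 40% corresponds to about 60% of the non-HRR3 population in the text.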
Drawings
Fig. 1 is a schematic flow chart of a genome instability assessment method based on machine learning according to the present invention.
Fig. 2 is a schematic block diagram of a genome instability evaluation system based on machine learning according to the present invention.
Fig. 3 is a schematic structural diagram of an embodiment of an electronic device according to the present invention.
Detailed Description
In order to better understand the above technical solutions, the following detailed description will be given with reference to the accompanying drawings and specific embodiments.
The method provided by the invention can be implemented in a terminal environment, and the terminal can comprise one or more of the following components: processor, memory and display screen. Wherein the memory stores at least one instruction that is loaded and executed by the processor to implement the method described in the embodiments below.
The processor may include one or more processing cores. The processor connects the various parts of the terminal using various interfaces and lines, and performs the functions of the terminal and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory and by invoking data stored in the memory.
The memory may include random access memory (RAM) or read-only memory (ROM). The memory may be used to store instructions, programs, code, code sets, or instruction sets.
The display screen is used for displaying a user interface of each application program.
In addition, it will be appreciated by those skilled in the art that the structure of the terminal described above is not limiting and that the terminal may include more or fewer components, or may combine certain components, or a different arrangement of components. For example, the terminal further includes components such as a radio frequency circuit, an input unit, a sensor, an audio circuit, a power supply, and the like, which are not described herein.
Example 1
As shown in fig. 1, the present embodiment provides a genome instability assessment method based on machine learning, which includes the following steps.
S1, collecting and receiving a biological sample, and processing the biological sample to obtain a genome sample. As a preferred embodiment, the genomic sample comprises a fresh blood sample, a paraffin section sample and/or a fresh tissue sample.
In this embodiment, collecting and receiving a fresh blood sample includes: collecting and receiving, according to the detection standard, 2 ml of EDTA-anticoagulated blood; collecting and receiving paraffin section samples includes: collecting and receiving, according to the detection standard, sections that are 5 μm thick and larger than 1 cm², at least 6 sections for surgical tissue and at least 10 sections for puncture (needle biopsy) tissue; collecting and receiving a fresh tissue sample includes: placing the fresh tissue samples that meet the detection standard into a 10% formalin preservation tube and an RNAlater preservation tube respectively for submission, with 1 piece of tissue of ≥ 1 cm per tube.
As a preferred embodiment, the processing comprises: performing tumor content assessment, DNA extraction, quality inspection, library construction, capture and on-machine sequencing on the biological sample.
In this embodiment, assessing the tumor content of the biological sample includes: tumor tissue must undergo tumor content assessment, and the qualifying standard for tumor content is ≥ 20%. If the tumor content is ≥ 10% and < 20%, the sample may be tested at risk; if the tumor content is < 10%, testing is stopped.
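The tumor content rule above is a three-way decision, sketched here with a hypothetical function name and hypothetical return labels; only the 20% and 10% cutoffs come from the text.

```python
def tumor_content_decision(tumor_fraction: float) -> str:
    """Triage a sample by tumor content, given as a fraction between 0 and 1.

    Returns "qualified" (>= 20%), "at_risk" (>= 10% and < 20%, testing may
    proceed at risk) or "stop" (< 10%).
    """
    if not 0.0 <= tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must be between 0 and 1")
    if tumor_fraction >= 0.20:
        return "qualified"
    if tumor_fraction >= 0.10:
        return "at_risk"
    return "stop"


if __name__ == "__main__":
    for fraction in (0.35, 0.20, 0.15, 0.05):
        print(fraction, tumor_content_decision(fraction))
```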
DNA extraction and quality inspection include: (1) for a fresh blood sample or a fresh tissue sample, the total amount of DNA measured by Qubit is ≥ 200 ng, and the main band of the DNA on gel electrophoresis is ≥ 10 kbp; (2) for FFPE (formalin-fixed, paraffin-embedded) samples or paraffin blocks prepared for paraffin section samples, the total amount of DNA meets the minimum input required for one library construction. The second requirement is set because FFPE samples, although they allow tissue to be stored at room temperature for a long time and are used to prepare the tissue specimens needed for examination (the tissue sample is first fixed with formalin to preserve the integrity of the cellular structure and then embedded in paraffin so that it can be sectioned conveniently), tend to yield a low amount of nucleic acid, so their value cannot be exploited to the greatest extent.
Library construction and capture include: constructing and capturing the library, and ensuring that the library concentration is ≥ 0.5 ng/μl. On-machine sequencing includes: sequencing on the NovaSeq 6000 platform, with a sequencing output of 6G for tumor tissue and 2G for the control.
S2, dividing the genome sample into a training set and a validation set, and modeling based on the training set and the validation set to obtain a genome instability assessment model. In this embodiment, the training set and the validation set are independent of each other, and the sample sizes of the training set and the validation set are each between 450 and 500. In this embodiment, the modeling based on the training set and the validation set includes modeling by ridge regression, which is more accurate than a simple index addition algorithm. Ridge regression is essentially an improved least squares method: it gives up the unbiasedness of ordinary least squares and accepts the loss of some information and some precision in exchange for regression coefficients that are more reliable and better match the actual data. On ill-conditioned data sets, its fit is clearly better than that of a linear regression model using ordinary least squares.
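As an illustration of why ridge regression copes with ill-conditioned data, the sketch below fits the closed-form ridge solution, which minimises ||y - b0 - Xb||² + λ||b||², using only the standard library. It is a minimal sketch under stated assumptions, not the model of this embodiment: the function names, the unpenalised intercept and the value of λ are all illustrative choices, and the toy data are synthetic.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ridge_fit(X, y, lam):
    """Return (intercept, coefficients) of a ridge fit with penalty `lam`.

    Features and target are centred so that the intercept is not penalised.
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Normal equations with the ridge term: (Xc'Xc + lam * I) b = Xc'yc
    gram = [[sum(r[i] * r[j] for r in Xc) + (lam if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    rhs = [sum(r[i] * t for r, t in zip(Xc, yc)) for i in range(p)]
    coef = solve_linear_system(gram, rhs)
    intercept = y_mean - sum(c * mu for c, mu in zip(coef, x_mean))
    return intercept, coef


def ridge_predict(intercept, coef, row):
    """Score one sample with a fitted model."""
    return intercept + sum(c * v for c, v in zip(coef, row))


if __name__ == "__main__":
    # Two perfectly collinear features: ordinary least squares (lam = 0) has
    # no unique solution, while the ridge penalty makes the system solvable.
    X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    y = [2.0, 4.0, 6.0]
    intercept, coef = ridge_fit(X, y, lam=1.0)
    print(intercept, coef, ridge_predict(intercept, coef, [2.0, 2.0]))
```

In a setting like the one described, the rows of `X` would be the genome instability indexes of the training samples and `y` the true positive labels defined by the modeling standard, with the penalty chosen on data held out from training; those choices are assumptions here.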
Of course, those skilled in the art will appreciate that the modeling includes a classification component configured as a machine learning classifier, and that, instead of ridge regression, the classifier may be selected from a linear discriminant analysis (LDA) classifier, a quadratic discriminant analysis (QDA) classifier, a support vector machine (SVM) classifier, a random forest (RF) classifier, a linear-kernel support vector machine classifier, a first- or second-order polynomial-kernel support vector machine classifier, an elastic net classifier, a sequential minimal optimization classifier, a naive Bayes classifier and an NMF predictor classifier.
S3, training the genome instability assessment model based on a modeling standard formed by a gene set formed by a plurality of HRR genes. In this example, as a preferred embodiment, the plurality of genes includes: the HRR3 gene set, consisting of the three genes BRCA1, BRCA2 and RAD51D. Internal analysis results show that mutation of RAD51D or methylation of the BRCA1 gene promoter causes HRD and thus genome instability, and that RAD51D is an HRR gene worth considering in terms of mutation rate, association with genome instability and association with drug efficacy.
Of course, one skilled in the art could also select one or more of PALB2, CDK12, RAD51C, CHEK2 and ATM, together with BRCA1, BRCA2 and RAD51D, to construct a new gene set.
As a preferred embodiment, the modeling standard is as follows: a sample that satisfies any one of a first set of three conditions is defined as a true positive sample in the model, the first set of three conditions comprising: (1) BRCA1 biallelic inactivation; (2) BRCA2 biallelic inactivation; (3) RAD51D biallelic inactivation.
Here, a gene is defined as having biallelic inactivation if it satisfies any one of a second set of three conditions, comprising: (1) one allele carries a class 4/5 mutation, and the other allele shows LOH; (2) two class 4/5 mutations occur in the same gene; (3) one allele carries a class 4/5 mutation, and the other allele is hypermethylated.
S4, evaluating the genome instability based on a plurality of genome instability indexes. As a preferred embodiment, the plurality of genome instability indexes are defined as follows: allelic states are divided into three classes: allele-balanced but amplified, non-LOH but allele-imbalanced, and LOH; the fragments in each of the three classes are further divided into five length intervals according to absolute length: 0-5M, 5-10M, 10-15M, 15-20M and >20M; wherein alleles are a pair of genes that are located at the same position on a pair of homologous chromosomes and control contrasting traits.
In this embodiment, the plurality of genome instability indexes includes 19 genome instability indexes, which are respectively:
(1) b_0-5M: the number of allele-balanced but amplified fragments with a length of 0-5M (inclusive);
(2) b_5-10M: the number of allele-balanced but amplified fragments with a length of 5M (exclusive) to 10M (inclusive);
(3) b_10-15M: the number of allele-balanced but amplified fragments with a length of 10M (exclusive) to 15M (inclusive);
(4) b_15-20M: the number of allele-balanced but amplified fragments with a length of 15M (exclusive) to 20M (inclusive);
(5) b_>20M: the number of allele-balanced but amplified fragments with a length greater than 20M;
(6) imb_0-5M: the number of non-LOH but allele-imbalanced fragments with a length of 0-5M (inclusive);
(7) imb_5-10M: the number of non-LOH but allele-imbalanced fragments with a length of 5M (exclusive) to 10M (inclusive);
(8) imb_10-15M: the number of non-LOH but allele-imbalanced fragments with a length of 10M (exclusive) to 15M (inclusive);
(9) imb_15-20M: the number of non-LOH but allele-imbalanced fragments with a length of 15M (exclusive) to 20M (inclusive);
(10) imb_>20M: the number of non-LOH but allele-imbalanced fragments with a length greater than 20M;
(11) loh_0-5M: the number of LOH fragments with a length of 0-5M (inclusive);
(12) loh_5-10M: the number of LOH fragments with a length of 5M (exclusive) to 10M (inclusive);
(13) loh_10-15M: the number of LOH fragments with a length of 10M (exclusive) to 15M (inclusive);
(14) loh_15-20M: the number of LOH fragments with a length of 15M (exclusive) to 20M (inclusive);
(15) loh_>20M: the number of LOH fragments with a length greater than 20M;
(16) purity: tumor cell fraction;
(17) ploidy: tumor cell genome ploidy;
(18) si: measures the heterogeneity of aberrant copy number (aberrant CN) states, and is obtained by: taking all fragments (segments) whose allele ratio is not 1:1; weighting the allelic state of each fragment by the length of the fragment; and calculating a diversity index of the allelic states in which copy number variation (CNV) occurs across the whole sample;
(19) hlamp: obtained by: taking the fragments (segments) located in the high amplification regions (including 1q21.1-24.1, 1q42.2-44, 8q11.21-24.3 and 10p15.3-14), and calculating the proportion of those fragments whose allelic state has a copy number ≥ 5.
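The si index can be sketched as a length-weighted diversity calculation. The text does not say which diversity index is used, so the Shannon index below is an assumption, as are the function name and the representation of an allelic state as a (major copy number, minor copy number) pair.

```python
import math
from collections import defaultdict


def diversity_index(segments):
    """Length-weighted Shannon diversity of aberrant allelic states.

    `segments` is an iterable of (major_cn, minor_cn, length) tuples.
    Segments in the normal 1:1 state are skipped; every other state is
    weighted by the total length of the segments that carry it.
    """
    weight = defaultdict(float)
    for major_cn, minor_cn, length in segments:
        if (major_cn, minor_cn) == (1, 1):
            continue
        weight[(major_cn, minor_cn)] += length
    total = sum(weight.values())
    if total == 0:
        return 0.0  # no aberrant segments
    return -sum((w / total) * math.log(w / total) for w in weight.values())


if __name__ == "__main__":
    # Synthetic segments for illustration only.
    example = [(1, 1, 50.0), (2, 0, 10.0), (2, 1, 10.0)]
    print(diversity_index(example))
```

With this choice the index is 0 when all aberrant segments share one state and grows as the aberrant length is spread over more states.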
The experimental operation part of this example is as follows:
1. Sample collection and receipt.
(1) Fresh blood samples: the submission standard is 2 ml of EDTA-anticoagulated blood.
(2) Paraffin section samples: the submission standard is sections 5 μm thick and larger than 1 cm², with at least 6 sections for surgical tissue and at least 10 sections for needle-biopsy tissue.
(3) Fresh tissue samples: the tissue is submitted in a 10% formalin preservation tube and an RNAlater preservation tube, with 1 piece of tissue per tube, each piece ≥ 1 cm.
2. Tumor content assessment.
Tumor tissue must undergo tumor content assessment, and the qualifying standard is a tumor content ≥ 20%. If the tumor content is ≥ 10% and < 20%, testing may proceed at risk; if the tumor content is < 10%, testing is stopped.
3. Sample DNA extraction and quality inspection.
(1) For blood or fresh tissue, the total amount of DNA measured by Qubit must be ≥ 200 ng, and the main DNA band on gel electrophoresis must be ≥ 10 kbp.
(2) For FFPE/paraffin block samples, the total amount of DNA must meet the minimum input amount for one library construction.
4. Library construction and capture: the concentration of the output library must be ≥ 0.5 ng/μl.
5. On-machine sequencing: the sequencing platform is NovaSeq 6000; the sequencing output is 6 G for tumor tissue and 2 G for the control.
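The tumor content acceptance rule in step 2 above can be written as a small decision function. This is a minimal sketch: the function name `triage_tumor_content` and the returned labels are hypothetical, and the tumor content is assumed to be given as a fraction between 0 and 1.

```python
def triage_tumor_content(tumor_fraction: float) -> str:
    """Map a tumor content fraction (0.0-1.0) to a handling decision.

    Thresholds follow the text: >= 20% qualifies, >= 10% and < 20% may be
    tested at risk, and < 10% stops testing. The labels are illustrative.
    """
    if not 0.0 <= tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must be between 0 and 1")
    if tumor_fraction >= 0.20:
        return "qualified"
    if tumor_fraction >= 0.10:
        return "at_risk"
    return "stop"
```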
The bioinformatics analysis part of this embodiment is as follows:
1. Data analysis starts after the data analyst receives the data off-machine notification from data management.
2. According to the data docking table provided by the data manager, the project numbers, the corresponding subject screening numbers and the corresponding data paths are extracted and written in the standard input format required by the automated analysis pipeline.
3. The data analysis pipeline is started; the analysis generally completes in 5-8 hours. After completion, the bioinformatics analyst checks the analysis quality control results and judges them against the quality control standard: if quality control is passed, the analysis record is filled in; if the analysis is interrupted or quality control is not passed, the case is handled according to the exception handling procedure.
4. Exception handling: if the analysis is interrupted, the cause of the interruption is confirmed first. If the interruption is due to external factors, such as a power failure in the machine room or a hardware fault, then after the fault is cleared the intermediate file folder is renamed and the analysis is run again according to the analysis pipeline; the original folder is renamed "original folder name-number of analysis". If quality control is not passed, the data manager is contacted, supplementary sequencing or re-sequencing is carried out, and the analysis is restarted.
5. A report is generated.
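The folder renaming in the exception handling step can be sketched as follows. The helper name `archive_intermediate_dir` and the demo folder name are hypothetical; the sketch only applies the naming rule from the text, "original folder name-number of analysis", so that a rerun can reuse the original folder name.

```python
import tempfile
from pathlib import Path


def archive_intermediate_dir(work_dir: Path, attempt: int) -> Path:
    """Rename `work_dir` to '<original name>-<attempt>' and return the new path."""
    target = work_dir.with_name(f"{work_dir.name}-{attempt}")
    if target.exists():
        # Refuse to overwrite the archive of an earlier attempt.
        raise FileExistsError(target)
    work_dir.rename(target)
    return target


if __name__ == "__main__":
    # Demo in a throwaway directory; "sample_run" is an illustrative name.
    with tempfile.TemporaryDirectory() as tmp:
        run_dir = Path(tmp) / "sample_run"
        run_dir.mkdir()
        print(archive_intermediate_dir(run_dir, 1).name)
```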
Example two
Referring to fig. 2, the present embodiment provides a machine learning-based genome instability assessment system, comprising: a sample collection module 101, configured to collect and receive a biological sample, and process the biological sample to obtain a genomic sample; a model modeling module 102, configured to divide the genome sample into a training set and a verification set, and perform modeling based on the training set and the verification set to obtain a genome instability assessment model; a model training module 103, configured to train the genome instability assessment model based on a modeling standard formed from a gene set consisting of a plurality of HRR genes; and an instability assessment module 104, configured to assess genomic instability based on a plurality of genomic instability indexes.
The system can implement the evaluation method provided in the first embodiment; for the specific evaluation method, refer to the description in the first embodiment, which is not repeated here.
The invention also provides a memory storing a plurality of instructions for implementing the method according to the first embodiment.
As shown in fig. 3, the present invention further provides an electronic device, including a processor 301 and a memory 302 connected to the processor 301, where the memory 302 stores a plurality of instructions, and the instructions may be loaded and executed by the processor, so that the processor can execute the method according to the first embodiment.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (9)

1. A machine learning-based method for assessing genomic instability, comprising:
S1, collecting and receiving a biological sample, and processing the biological sample to obtain a genome sample;
S2, dividing the genome sample into a training set and a verification set, and modeling based on the training set and the verification set to obtain a genome instability assessment model;
S3, training the genome instability assessment model based on a modeling standard formed from a gene set consisting of a plurality of HRR genes; the modeling standard in S3 comprises: a sample that satisfies any one of a first set of three conditions is defined as a true positive sample in the model, the first set of three conditions comprising:
(1) BRCA1 bi-allelic inactivation;
(2) BRCA2 bi-allelic inactivation;
(3) RAD51D bi-allelic inactivation;
wherein a gene in the genome that satisfies any one of a second set of three conditions is defined as having the bi-allelic inactivation, the second set of three conditions comprising:
(1) one allele carries a class 4/5 mutation, and the other allele is LOH;
(2) two class 4/5 mutations occur in the same gene;
(3) one allele carries a class 4/5 mutation, and the other allele is in a hypermethylated state;
S4, evaluating the genome instability based on a plurality of genome instability indexes.
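The labelling rule in S3 can be illustrated with a short sketch. The record type `GeneStatus` and both function names are hypothetical, and the per-gene summary (a count of class 4/5 mutations plus LOH and hypermethylation flags) is an assumed input format. The sketch also assumes that two or more class 4/5 mutations in one gene satisfy the second condition.

```python
from dataclasses import dataclass
from typing import Mapping

HRR3_GENES = ("BRCA1", "BRCA2", "RAD51D")


@dataclass
class GeneStatus:
    """Hypothetical per-gene summary for one sample."""
    class45_mutations: int = 0      # number of class 4/5 mutations in the gene
    loh: bool = False               # the other allele is lost through LOH
    hypermethylated: bool = False   # the other allele is hypermethylated


def is_biallelic_inactivated(status: GeneStatus) -> bool:
    """Apply the second set of three conditions to one gene."""
    if status.class45_mutations >= 2:   # assumption: "two" read as two or more
        return True
    return status.class45_mutations >= 1 and (status.loh or status.hypermethylated)


def is_true_positive(sample: Mapping[str, GeneStatus]) -> bool:
    """Apply the first set of conditions: any of BRCA1, BRCA2, RAD51D inactivated."""
    return any(gene in sample and is_biallelic_inactivated(sample[gene])
               for gene in HRR3_GENES)
```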
2. The machine learning based genomic instability assessment method of claim 1, wherein the genomic samples comprise fresh blood samples, paraffin section samples and/or fresh tissue samples; the processing comprises: performing tumor content assessment, DNA extraction and quality inspection, library construction and capture, and on-machine sequencing on the biological sample.
3. The machine learning based genome instability assessment method of claim 2, wherein the training set and the validation set in S2 are independent of each other, and the sample sizes of the training set and the validation set are between 450 and 500.
4. A machine learning based genome instability assessment method according to claim 3, wherein the modeling based on the training set and the validation set in S2 to obtain a genome instability assessment model comprises modeling using ridge regression.
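As an illustration of ridge regression in general, and not of the model of this patent, the following sketch fits a closed-form ridge estimator with NumPy. The penalty value `alpha`, the unpenalized intercept and the function names `fit_ridge` and `predict` are assumptions; the source does not state the regularization strength or any preprocessing.

```python
import numpy as np


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept.

    Solves (Xc^T Xc + alpha * I) w = Xc^T yc on mean-centered data.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(gram, Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


def predict(X, coef, intercept):
    """Return the model score for each row of X."""
    return np.asarray(X, dtype=float) @ coef + intercept
```

In such a sketch the rows of `X` would hold the genome instability indexes of the training samples and `y` the true positive labels from the modeling standard; the independent validation set would then be scored with `predict`.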
5. The machine learning based genome instability assessment method of claim 4, wherein the plurality of HRR genes in S3 comprises: an HRR3 gene set consisting of three genes, BRCA1, BRCA2 and RAD51D.
6. The machine learning based genome instability assessment method of claim 5, wherein the plurality of genome instability indices in S4 comprise:
alleles are divided into three categories: allele-balanced but amplified, non-LOH but allele-imbalanced, and LOH; the three categories are each further divided into five length intervals according to absolute length: 0-5M, 5-10M, 10-15M, 15-20M and >20M; wherein alleles are a pair of genes that control contrasting traits at the same position on a pair of homologous chromosomes;
the plurality of genome instability indexes comprise 19 genome instability indexes, which are respectively:
(1) b_0-5M: the number of allele-balanced but amplified fragments with a length of 0-5M; wherein the interval includes 5M;
(2) b_5-10M: the number of allele-balanced but amplified fragments with a length of 5-10M; wherein the interval excludes 5M but includes 10M;
(3) b_10-15M: the number of allele-balanced but amplified fragments with a length of 10-15M; wherein the interval excludes 10M but includes 15M;
(4) b_15-20M: the number of allele-balanced but amplified fragments with a length of 15-20M; wherein the interval excludes 15M but includes 20M;
(5) b_>20M: the number of allele-balanced but amplified fragments with a length greater than 20M;
(6) imb_0-5M: the number of non-LOH but allele-imbalanced fragments with a length of 0-5M; wherein the interval includes 5M;
(7) imb_5-10M: the number of non-LOH but allele-imbalanced fragments with a length of 5-10M; wherein the interval excludes 5M but includes 10M;
(8) imb_10-15M: the number of non-LOH but allele-imbalanced fragments with a length of 10-15M; wherein the interval excludes 10M but includes 15M;
(9) imb_15-20M: the number of non-LOH but allele-imbalanced fragments with a length of 15-20M; wherein the interval excludes 15M but includes 20M;
(10) imb_>20M: the number of non-LOH but allele-imbalanced fragments with a length greater than 20M;
(11) loh_0-5M: the number of LOH fragments with a length of 0-5M; wherein the interval includes 5M;
(12) loh_5-10M: the number of LOH fragments with a length of 5-10M; wherein the interval excludes 5M but includes 10M;
(13) loh_10-15M: the number of LOH fragments with a length of 10-15M; wherein the interval excludes 10M but includes 15M;
(14) loh_15-20M: the number of LOH fragments with a length of 15-20M; wherein the interval excludes 15M but includes 20M;
(15) loh_>20M: the number of LOH fragments with a length greater than 20M;
(16) purity: tumor cell fraction;
(17) ploidy: tumor cell genome ploidy;
(18) si: used for measuring the heterogeneity of aberrant CN states, the acquisition method comprising: counting the segments whose alleles are not 1:1; weighting the allelic state of each segment by its length; and calculating a diversity index of the allelic states with copy number variation (CNV) in the whole sample;
(19) hlamp: the acquisition method comprising: taking the segments located in a high amplification region and calculating, over the segments in the region, the proportion of allelic states with copy number variation (CNV) greater than or equal to 5; wherein the high amplification region comprises 1q21.1-24.1, 1q42.2-44, 8q11.21-24.3 and 10p15.3-14.
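The si and hlamp indexes can be sketched under explicit assumptions, since the source does not name the diversity index or the exact denominator. Here si is taken to be a length-weighted Shannon index over (major, minor) copy number states other than 1:1, and hlamp is taken to be the fraction of segments inside the high amplification regions whose total copy number is at least 5. The function names and input tuples are hypothetical, and mapping segments to the listed cytobands is left to the caller.

```python
import math
from collections import defaultdict


def shannon_si(segments):
    """Length-weighted Shannon index of aberrant allelic states.

    `segments` is an iterable of (major_cn, minor_cn, length) tuples.
    Shannon entropy is an assumed choice of diversity index.
    """
    weight = defaultdict(float)
    for major, minor, length in segments:
        if (major, minor) == (1, 1):
            continue  # 1:1 segments are treated as non-aberrant and skipped
        weight[(major, minor)] += length
    total = sum(weight.values())
    if total == 0:
        return 0.0
    return -sum((w / total) * math.log(w / total) for w in weight.values())


def hlamp(segments):
    """Fraction of high-amplification-region segments with copy number >= 5.

    `segments` is an iterable of (total_cn, in_high_amp_region) pairs.
    """
    region = [cn for cn, in_region in segments if in_region]
    if not region:
        return 0.0
    return sum(cn >= 5 for cn in region) / len(region)
```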
7. A machine learning based genome instability assessment system for implementing the assessment method according to any of claims 1 to 6, comprising:
a sample collection module (101) for collecting and receiving a biological sample, and processing the biological sample to obtain a genome sample;
a model modeling module (102) for dividing the genome sample into a training set and a verification set, and modeling based on the training set and the verification set to obtain a genome instability assessment model;
a model training module (103) for training the genome instability assessment model based on a set of genes formed by a plurality of HRR genes to form a modeling standard;
and an instability assessment module (104) for assessing genomic instability based on the plurality of genomic instability indicators.
8. An electronic device comprising a processor and a memory, the memory storing a plurality of instructions, the processor being configured to read the instructions and perform the evaluation method of any one of claims 1-6.
9. A computer readable storage medium storing a plurality of instructions readable by a processor and for performing the evaluation method according to any one of claims 1-6.
CN202310558775.XA 2023-05-17 2023-05-17 Genome instability assessment method and system based on machine learning Active CN116312781B (en)

Publications (2)

Publication Number Publication Date
CN116312781A (en) 2023-06-23
CN116312781B (en) 2023-08-18

Family

ID=86817137

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112164420A (en) * 2020-09-07 2021-01-01 厦门艾德生物医药科技股份有限公司 Method for establishing genome scar model
CN112164422A (en) * 2020-10-12 2021-01-01 郑州大学第一附属医院 Grading method for quantifying TIME infiltration mode
CN112930569A (en) * 2018-08-31 2021-06-08 夸登特健康公司 Microsatellite instability detection in cell-free DNA

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4360094A1 (en) * 2021-06-25 2024-05-01 Foundation Medicine, Inc. System and method of classifying homologous repair deficiency

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant