CN111640508A - Method for constructing pan-tumor targeted drug susceptibility state evaluation model based on high-throughput sequencing data and clinical phenotype and application - Google Patents

Publication number
CN111640508A
Authority
CN
China
Prior art keywords
gene
pan
tumor
targeted drug
regulation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010469448.3A
Other languages
Chinese (zh)
Other versions
CN111640508B (en
Inventor
李园园
戴文韬
刘继翔
刘伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Institute Of Biomedical Technology
Original Assignee
SHANGHAI CENTER FOR BIOINFORMATION TECHNOLOGY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI CENTER FOR BIOINFORMATION TECHNOLOGY
Priority to CN202010469448.3A
Publication of CN111640508A
Application granted
Publication of CN111640508B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/50 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Public Health (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Biotechnology (AREA)
  • Pathology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Databases & Information Systems (AREA)
  • Physiology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to the field of gene detection and bioinformatics. It discloses the application of a state evaluation model, constructed from high-throughput sequencing data and clinical phenotypes, to evaluating the susceptibility state of pan-tumor targeted drugs, and a method for mining pan-tumor targeted drug markers based on transcriptome data, exome/genome data and clinical phenotypes. A set of computational methods is designed for constructing a pan-tumor targeted drug evaluation model by integrating high-throughput sequencing data and clinical phenotypes; biomarkers related to the targeted drug susceptibility of tumor patients are screened out, and the pan-tumor targeted drug susceptibility state evaluation model is formed. Markers constructed by the method combine accuracy with mechanistic interpretability and can be used for pan-tumor treatment effect prediction, auxiliary decision-making on treatment schemes, and the like.

Description

Method for constructing pan-tumor targeted drug susceptibility state evaluation model based on high-throughput sequencing data and clinical phenotype and application
Technical Field
The invention relates to the technical field of gene detection and bioinformatics, and in particular to a method for establishing a pan-tumor targeted drug susceptibility state evaluation based on high-throughput sequencing data and clinical phenotypes, together with the related detection panel design and application examples.
Background
First-generation sequencing obtains the base at each position of a sequence by the dideoxy chain-termination method or the chemical cleavage method, and reads the nucleic acid sequence by electrophoresis and visualization. Gene chip technology determines nucleic acid sequences by hybridization with a set of nucleic acid probes of known sequence and thereby achieves high-throughput parallelization; its drawbacks are that reproducibility and sensitivity need improvement and that the range of analysis is limited. Second-generation sequencing, also called next-generation sequencing (NGS), differs from first-generation sequencing in that it achieves high-throughput parallel sequencing through in vitro fragment amplification and sequencing by synthesis; its main drawback is short read length. Third-generation sequencing, also called single-molecule sequencing, reads the template sequence directly by detecting its fluorescent or electrical signal without amplification, and is not limited by read length. High-throughput sequencing data (generated by second- or third-generation sequencing) allow high-throughput detection of mutations at the DNA level, including point mutations, insertion/deletion mutations, gene fusions and copy number variations, and, at the RNA level, high-throughput detection of quantitative gene expression, alternative splicing, gene fusion and the like, and thus play an important role in advancing precision medicine.
Complex diseases, represented by tumors, cardiovascular and cerebrovascular diseases and metabolic diseases, are a major threat to human health. Research on their pathogenic mechanisms has benefited from the rapid development of biotechnology and has made great progress. Based on high-throughput sequencing data from complex disease samples, the occurrence, progression, outcome, treatment and prognosis of a complex disease can be explained at the molecular level, the tumor state can be effectively assessed, and guidance can be provided for formulating an accurate and effective treatment scheme. Tumors are typical complex diseases. When detectable mutated or abnormally expressed genes caused by a tumor are closely related to the clinical phenotype of that tumor, they can serve as molecular tumor markers for diagnosis, risk assessment, prognosis, treatment guidance, progression assessment, safety assessment and the like.
Technologies for discovering complex disease markers from high-throughput sequencing data, and the related marker detection and evaluation schemes, have advanced greatly but still have the following shortcomings. 1) Marker mining methods are relatively simple, and accuracy and interpretability urgently need improvement. For complex diseases involving multiple genes, a single-gene marker can hardly achieve high accuracy, and the mechanistic interpretability of markers receives far less attention than accuracy. This is inconsistent with the concept of evidence-based medicine, hinders understanding of the key principles behind a marker, and makes it harder to reach a theoretically optimal marker combination and thereby improve the robustness and reproducibility of the marker. 2) Detection and evaluation content is relatively limited in scope and function. At present, because of limits on gene collection and screening capacity and on sequencing cost, a single marker detection scheme covers relatively few genes, and in practice single-site or small-fragment mutations are the main evaluation indicators; recently, schemes that use gene expression levels and the overall mutation level of all genes in the detection panel as markers have attracted increasing attention. In terms of function, such schemes mainly predict the effect of targeted drugs related to a site or gene, and offer limited guidance for surgery, chemotherapy, radiotherapy, immunotherapy and the like. 3) Marker design and the accompanying data analysis tools make insufficient use of multi-source information. Most current designs draw only on drug guidelines, drug labels and a limited literature collection, and their technical routes focus on a single omics level; comprehensive analysis based on large-scale sequencing results, public databases and text mining is rare, and integrated analysis of multi-source data covering multiple molecular omics and clinical phenotype information is seriously lacking.
Disclosure of Invention
To solve these problems, the invention provides a method for mining pan-tumor targeted drug susceptibility markers based on transcriptome data, exome/genome data and clinical phenotypes; designs a set of computational methods for constructing a pan-tumor targeted drug susceptibility state evaluation model by integrating high-throughput sequencing data and clinical phenotypes; and applies them to pan-tumor data to screen biomarkers related to the targeted drug susceptibility of tumor patients and form the pan-tumor targeted drug susceptibility state evaluation model.
The pan-tumor targeted drug susceptibility markers mined by the method combine accuracy with mechanistic interpretability. The method for constructing the pan-tumor targeted drug susceptibility state evaluation model makes full use of multi-source information, offers rich evaluation indicators and a comprehensive and practical functional system, and covers screening, mining, modeling, scoring, detection panel design and the like. These technical innovations are embodied in the mining of pan-tumor targeted drug susceptibility markers and in the construction of the state evaluation model.
The invention provides a method for mining complex disease markers based on transcriptome data, exome data and clinical phenotype, which comprises the following steps:
step 1) classifying and organizing the complex disease case information:
step 1.1) dividing the complex disease case information into transcriptome data, exome/genome data and clinical information;
step 1.2) classifying the complex disease case information by disease state and pairing and organizing it; this classification determines which of the three modes of step 2 is selected.
step 2) constructing a complex disease marker combination, with combinatorial optimization and screening by successive incremental iteration based on a greedy algorithm and/or evolutionary iteration based on a genetic algorithm:
if the complex disease case information contains only transcriptome data and clinical information, executing step 2.1): marker mining based on the transcriptome data and the clinical information, to construct an abnormal gene regulation relationship marker combination related to the complex disease;
if the complex disease case information contains only exome/genome data and clinical information, executing step 2.2): marker mining based on the exome/genome data and the clinical information, to construct a gene variation marker combination related to the complex disease;
and if the complex disease case information contains transcriptome data, exome/genome data and clinical information together, executing step 2.3): marker mining based on the transcriptome data, the exome/genome data and the clinical information, to construct a combination of abnormal gene regulation relationship and gene variation markers related to the complex disease.
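The branch selection in step 2 can be sketched as a small dispatch function. This is only an illustration under stated assumptions: the names `MiningMode` and `select_mining_mode` are hypothetical, and treating missing clinical information as an error is an assumption drawn from the fact that every branch above uses clinical information.

```python
from enum import Enum


class MiningMode(Enum):
    """Hypothetical labels for the three branches of step 2."""
    TRANSCRIPTOME = "2.1"  # abnormal gene regulation relationship markers
    VARIATION = "2.2"      # gene variation markers
    INTEGRATED = "2.3"     # both marker types combined


def select_mining_mode(has_transcriptome: bool,
                       has_exome_or_genome: bool,
                       has_clinical: bool) -> MiningMode:
    """Pick the step-2 branch from the data types present in the case information."""
    if not has_clinical:
        # Assumption: all three branches rely on clinical information.
        raise ValueError("clinical information is required by every branch")
    if has_transcriptome and has_exome_or_genome:
        return MiningMode.INTEGRATED
    if has_transcriptome:
        return MiningMode.TRANSCRIPTOME
    if has_exome_or_genome:
        return MiningMode.VARIATION
    raise ValueError("no sequencing data available")
```

For example, a cohort with RNA-seq and clinical follow-up but no exome data would be routed to step 2.1.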
In particular, said step 2.1) comprises the following sub-steps:
step 2.1.1) constructing a reference gene regulatory network: based on transcriptional regulation information obtained from public data resources and the promoter sequences of human coding genes, identifying pairs of potential transcription factors (TF) and target genes (target) and constructing a reference Gene Regulatory Network (rGRN).
Step 2.1.2) constructing a condition-specific Gene Regulatory Network (cGRN) for a specific disease state, based on transcriptome expression data under that disease state and the TF-target relationships in the rGRN. In step 2.1.2), machine-learning-based feature selection algorithms are used, including Boruta, [Figure BDA0002513819210000031] Bayes, NMF and univariate linear regression, with acceleration by heterogeneous computing or parallelization; the TFs that contribute significantly to the TF-target relationships under the disease state are screened to form the condition-specific gene regulatory network, i.e., the gene regulatory network of the specific disease state.
Step 2.1.3) quantifying the gene regulation strength in each condition-specific gene regulatory network and the difference in regulation strength between networks: the gene regulation strength in a condition-specific gene regulatory network is quantified with a multiple linear regression model.
Regression is performed with the de-biased LASSO method, which yields the regulation strength and confidence interval of each gene regulation relationship; whether a regulation difference is significant is judged by whether the confidence intervals of the same regulation relationship in different condition-specific gene regulatory networks overlap. Alternatively, the change in mean strength of the same regulation relationship across different condition-specific gene regulatory networks is compared, which quantifies the regulation difference directly without computing confidence intervals.
Step 2.1.4) screening abnormal gene regulation relationships between the condition-specific gene regulatory networks of different disease states:
three factors related to gene regulation are combined for the screening: the gene regulation strength changes significantly; the expression level of the regulated target gene changes significantly; and the direction of change in the TF-to-target regulation strength is consistent with the direction of change in the target's expression level. The screened abnormal gene regulation relationships are then ranked by the degree of difference in regulation strength between disease states.
Step 2.1.5) constructing, from the abnormal gene regulation relationships, an abnormal gene regulation relationship marker combination related to the complex disease state (such as disease progression stage, prognosis, or sensitivity to a treatment scheme); the marker combination can be used for disease progression assessment, prognosis assessment and auxiliary decision-making on treatment schemes.
Starting from the abnormal gene regulation relationships, step 2.1.5) uses a Cox regression model to screen marker combinations related to the disease state. A Cox model is built for each abnormally regulated gene pair and their C-indexes are compared; abnormally regulated gene pairs are then added by successive incremental iteration based on a greedy algorithm, or evolved by iteration based on a genetic algorithm.
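The confidence-interval comparison of step 2.1.3 and the three-factor screening and ranking of step 2.1.4 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Regulation` record and its field names are hypothetical, the strengths and intervals are assumed to have been estimated already (for example by de-biased LASSO), and "consistent direction" is read here as the strength change and the expression change having the same sign.

```python
from dataclasses import dataclass


@dataclass
class Regulation:
    """Hypothetical record for one TF-target relationship in two disease states."""
    tf: str
    target: str
    strength_a: float              # regulation strength in state A
    ci_a: tuple                    # (low, high) confidence interval in state A
    strength_b: float              # regulation strength in state B
    ci_b: tuple                    # (low, high) confidence interval in state B
    target_expr_change: float      # target expression change, state B minus state A
    target_expr_significant: bool  # result of a separate differential-expression test


def intervals_overlap(ci_1, ci_2):
    """True when two closed intervals share at least one point."""
    return ci_1[0] <= ci_2[1] and ci_2[0] <= ci_1[1]


def is_abnormal(r):
    """Apply the three screening factors of step 2.1.4."""
    strength_changed = not intervals_overlap(r.ci_a, r.ci_b)
    same_direction = (r.strength_b - r.strength_a) * r.target_expr_change > 0
    return strength_changed and r.target_expr_significant and same_direction


def screen_and_rank(regulations):
    """Keep abnormal relationships, largest strength difference first."""
    kept = [r for r in regulations if is_abnormal(r)]
    return sorted(kept, key=lambda r: abs(r.strength_b - r.strength_a), reverse=True)
```

Relationships whose intervals overlap, or whose strength and expression move in opposite directions, are dropped before ranking.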
In particular, said step 2.2) comprises the following sub-steps:
step 2.2) marker mining based on exome/genome data and clinical information;
step 2.2.1) identifying gene variations associated with the complex disease, wherein the disease-state-related DNA variations include gene copy number changes and somatic mutations, including but not limited to variations detectable by high-throughput sequencing such as single nucleotide polymorphisms (SNP), insertions and deletions (Indel), copy number variations (CNV), gene fusions and gene rearrangements;
step 2.2.2) quantitatively screening important gene variations related to the complex disease state, driven by data and/or prior knowledge, wherein data-driven filtering involves calculating and ranking somatic gene variation frequencies and identifying high-frequency variant genes, and genes with a variation frequency of at least 5% are passed on to the prior-knowledge filter; prior-knowledge filtering uses the complex-disease-related genes listed in application standards, clinical treatment guidelines, drug labels, general knowledge bases and literature reports;
step 2.2.3) constructing, from the important gene variations obtained in step 2.2.2), a DNA variation marker combination related to the complex disease state (such as disease progression stage, prognosis, or sensitivity to a treatment scheme); the marker combination can be used for disease progression assessment, prognosis assessment and auxiliary decision-making on treatment schemes. A Cox regression model is used to screen DNA variation marker combinations related to the disease state: a Cox model is built for each variation and the C-indexes are compared, and the gene variation marker combination related to the complex disease is then constructed by successive incremental iteration over the important variations based on a greedy algorithm and/or evolutionary iteration based on a genetic algorithm. For a marker combination, the C-index measures its predictive performance for disease prognosis, and the AUC measures its predictive performance for treatment benefit.
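The two-stage filter of step 2.2.2 (variation frequency of at least 5%, then prior knowledge) can be sketched as below. The function name, the input layout and the gene symbols in the usage note are illustrative assumptions; the sketch applies both filters in sequence, which is only one reading of "data and/or prior knowledge".

```python
def screen_variant_genes(variant_counts, n_samples, prior_genes, min_freq=0.05):
    """Return genes passing the frequency filter and the prior-knowledge filter.

    variant_counts: gene -> number of samples carrying a somatic variation (assumed input)
    n_samples:      total number of samples in the cohort
    prior_genes:    genes supported by guidelines, drug labels, knowledge bases or literature
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    frequency = {gene: count / n_samples for gene, count in variant_counts.items()}
    # Data-driven filter: keep high-frequency variant genes.
    high_frequency = {g: f for g, f in frequency.items() if f >= min_freq}
    # Prior-knowledge filter: keep genes that also appear in the curated set.
    kept = [g for g in high_frequency if g in prior_genes]
    # Rank by descending frequency, then by name for a stable order.
    return sorted(kept, key=lambda g: (-high_frequency[g], g))
```

With hypothetical counts such as `{"TP53": 30, "EGFR": 12, "KRAS": 4, "TTN": 20}` in 100 samples and a prior set of `{"TP53", "EGFR", "KRAS"}`, KRAS falls below the frequency threshold and TTN is absent from the prior set, so only TP53 and EGFR remain.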
In particular, said step 2.3) comprises the following sub-steps:
step 2.3.1) for a complex disease data set that has both transcriptome data and exome/genome data, screening the abnormal gene regulation relationships related to the disease state with steps 2.1.1 to 2.1.4 and mining the important gene variations related to the disease state with steps 2.2.1 to 2.2.2, so as to obtain, respectively, the abnormal gene regulation relationships and the important gene variations related to the complex disease;
step 2.3.2) then applying steps 2.1.5 and 2.2.3, integrating the RNA and DNA information by successive incremental iteration based on a greedy algorithm or evolutionary iteration based on a genetic algorithm, to construct a combination of abnormal gene regulation relationship and gene variation markers related to the complex disease.
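The successive incremental iteration based on a greedy algorithm can be sketched as forward selection over a pooled list of RNA-level and DNA-level candidates. This is a sketch under assumptions: `score` is a hypothetical callback standing in for whatever metric is used (in the text above, the C-index of a Cox model fitted on the chosen markers), and the stopping threshold `min_gain` is not specified in the source.

```python
def greedy_forward_selection(candidates, score, min_gain=1e-9):
    """Add one marker at a time while the score keeps improving.

    candidates: iterable of marker identifiers (regulation pairs, variations, ...)
    score:      callable taking a list of markers and returning a number,
                higher is better (assumed to wrap model fitting and evaluation)
    """
    selected = []
    best = float("-inf")
    remaining = list(candidates)
    while remaining:
        # Evaluate every one-marker extension of the current combination.
        top_score, top = max(
            ((score(selected + [c]), c) for c in remaining),
            key=lambda pair: pair[0],
        )
        if top_score - best <= min_gain:
            break  # no extension improves the combination
        selected.append(top)
        remaining.remove(top)
        best = top_score
    return selected, best
```

Because each round refits the model once per remaining candidate, the cost grows roughly quadratically with the number of candidates, which is one reason the text also offers a genetic algorithm as an alternative.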
Based on the complex disease marker obtained by the method, the invention provides a complex disease comprehensive state scoring method, which comprises the following steps:
step 3.1) screening, from known prior knowledge, the clinical information and the test and pathological indicators related to the complex disease state (such as disease progression stage, prognosis, or sensitivity to a treatment scheme);
step 3.2) screening, from the case information in a complex disease cohort, the clinical information and the test and pathological indicators related to the complex disease state;
step 3.3) combining the abnormal gene regulation relationship and/or gene variation markers related to the complex disease obtained by the method of the invention with the clinical information and the test and pathological indicators screened in steps 3.1 and 3.2, optimizing them into a complex disease multi-marker combination, and constructing a complex disease comprehensive state scoring model, which is used to calculate the complex disease comprehensive state score. Specifically, successive incremental iteration based on a greedy algorithm and/or evolutionary iteration based on a genetic algorithm is used to integrate, optimize and simplify the features into a complex disease multi-marker combination comprising the abnormal gene regulation relationships, gene variations, clinical information, and test and pathological indicators related to the complex disease; a comprehensive state scoring model is then constructed with statistical regression and machine learning algorithms for prognosis assessment, treatment effect prediction and auxiliary decision-making on treatment schemes.
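The evolutionary iteration based on a genetic algorithm can be sketched as a search over boolean masks, one bit per candidate feature. Everything specific here is an assumption for illustration: truncation selection, one-point crossover, the population size, the mutation rate and the function name are not given in the source, and `fitness` is a hypothetical callback that would score a feature subset (for example by C-index or AUC).

```python
import random


def genetic_feature_selection(n_features, fitness, pop_size=20, generations=30,
                              mutation_rate=0.1, seed=0):
    """Evolve boolean feature masks; returns the fittest mask in the final population.

    Requires n_features >= 2 and pop_size >= 4 in this sketch.
    """
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]  # truncation selection; the best mask survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit independently with probability mutation_rate.
            child = [(not bit) if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)
```

Because the top half of each generation is carried over unchanged, the best fitness never decreases from one generation to the next; a fitness that penalizes the number of selected features would push the search toward the small combinations the text asks for.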
Specifically, step 3.1 draws on the latest domestic and international clinical guidelines, expert consensus and recommendations, clinical drug application guidelines, and clinical practice guidelines from the Chinese Society of Clinical Oncology (CSCO), the National Comprehensive Cancer Network (NCCN), the American Society of Clinical Oncology (ASCO), the European Society for Medical Oncology (ESMO) and the Japanese society of oncology (JSC), together with the test indicators related to complex diseases in general knowledge bases; combined with complex-disease-related ontologies and published authoritative literature, it systematically searches for and mines test indicators highly related to the progression, treatment sensitivity and prognosis of the complex disease, which, after redundancy removal, are incorporated into subsequent model and tool development.
Specifically, step 3.2, based on the available complex disease cohort data, builds models from the test indicators relevant to assessing the complex disease state together with the clinical information and, using predictive evaluation metrics (e.g., C-index, AUC) and machine learning feature selection strategies such as Boruta, Abira, [Figure BDA0002513819210000061] Bayes, NMF and univariate linear regression, screens out the test indicators that rank highest in their effect on predicting the clinical information.
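Ranking test indicators by a predictive metric can be sketched with a plain univariate AUC. This is a deliberately simple stand-in, not one of the feature selection strategies named above: each indicator is scored on its own against a binary clinical label, and the function names and input layout are hypothetical.

```python
def auc(scores, labels):
    """Probability that a random positive case scores higher than a random negative one.

    Ties count as one half; labels are 1 (positive) or 0 (negative).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are needed to compute AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def rank_indicators(indicator_values, labels):
    """Sort indicators by how far their univariate AUC is from chance (0.5)."""
    scored = {name: auc(values, labels) for name, values in indicator_values.items()}
    # An AUC well below 0.5 is as informative as one well above it, with reversed direction.
    return sorted(scored.items(), key=lambda item: abs(item[1] - 0.5), reverse=True)
```

The top entries of the returned list correspond to the indicators that "rank highest" in the sense of the paragraph above, under this simplified univariate reading.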
Specifically, step 3.3 uses statistical modeling or machine learning to train the complex disease state evaluation model from the sequencing omics markers, the clinical test indicators and the indicators screened from the disease cohort information, combined with the clinical information of the cases. To predict the prognosis of patients with a complex disease and their benefit from a treatment scheme more accurately and reliably, multiple indicators (such as survival curves, C-index and AUC) are used together, the feature combination is simplified (the target is a combination with few features that is accurate, reliable and mechanistically interpretable), and the state evaluation model is iteratively optimized.
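The C-index used throughout as a prognosis metric can be computed directly from follow-up times, event flags and model risk scores. The sketch below follows the common Harrell definition and is an assumption about which variant is meant; the source does not specify one, and the function name is illustrative.

```python
def concordance_index(times, events, risk_scores):
    """Harrell-style C-index for right-censored survival data.

    times:       follow-up time per patient
    events:      1 if the event was observed, 0 if censored
    risk_scores: model output, higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1.0 means the predicted risk ordering matches the observed event ordering on every comparable pair, and 0.5 corresponds to random ordering.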
The invention provides a complex disease comprehensive state scoring computing system: using the above scoring method, the complex disease comprehensive state scoring model is developed and packaged into an easy-to-use computing system (for example, as software or an online server). The system comprises practical and convenient input and output modules and the scoring model; its output includes at least the classification and risk score of the complex disease and the corresponding treatment benefit prediction.
The invention provides a design method of gene detection panel, which comprises the following steps:
step 4.1) taking the abnormal gene regulation relationship and/or gene variation marker combination related to the complex disease obtained by the above method, i.e., the gene set finally incorporated into the complex disease comprehensive state scoring method, collating the gene-related information in the gene set, removing redundancy, and determining standard gene names;
step 4.2) for the genes collated in step 4.1), selecting the target regions of the target genes for complex disease detection, to be used for probe or primer design;
step 4.3) designing the corresponding probe and/or primer sequences for the target regions of step 4.2), and recording important annotations;
step 4.4) for the target regions of step 4.2), referring to a data set of probes and/or primers that can be designed in the human genome, and optimizing the design so that the probes and/or primers capture and cover the target regions uniformly;
step 4.5) comparing the probe and/or primer design regions from steps 4.3 and 4.4 to obtain the probe and/or primer design scheme with optimal coverage of the target regions;
step 4.6) producing, from the probes and/or primers designed in step 4.5, a gene detection panel for comprehensive assessment of the complex disease state.
Specifically, when the target regions for probe design are selected in step 4.2, a principle of precision first and gradual expansion is followed: the variant site region is preferred, the exon containing the variant site is the second choice, and finally all alternatively spliced regions of the variant gene may be used. The target regions for probe and/or primer design are selected as follows: for a variant site with specific site information and no other variant site within 100 bp upstream or downstream of it, the defined coverage region of that site is used as the target region; for gene regions where variant sites are concentrated or dense, i.e., two adjacent variant sites are no more than 100 bp apart, the exon containing that group of variant sites is selected as the target region; for important genes identified in step 4.1) whose information is highly diverse, where neither of the first two designs is applicable, all regions of the gene covered by its alternatively spliced forms are selected as the target region.
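The first two selection rules (isolated site versus clustered sites within 100 bp) can be sketched as a classification over variant positions on one chromosome. The function name and the `"site"` / `"exon"` labels are hypothetical, and the third rule (falling back to all alternatively spliced regions of a gene) depends on gene-level judgment that is not modeled here.

```python
def classify_sites(positions, window=100):
    """Label each variant position on one chromosome as 'site' or 'exon'.

    'site': no other variant within `window` bp, so the site's own region is targeted.
    'exon': a neighboring variant lies within `window` bp, so the containing exon is targeted.
    """
    ordered = sorted(set(positions))
    labels = {}
    for idx, pos in enumerate(ordered):
        near_previous = idx > 0 and pos - ordered[idx - 1] <= window
        near_next = idx + 1 < len(ordered) and ordered[idx + 1] - pos <= window
        labels[pos] = "exon" if (near_previous or near_next) else "site"
    return labels
```

For positions 1000, 1050, 5000 and 5101, the first two are 50 bp apart and are assigned to exon-level targeting, while the last two are 101 bp apart and are each treated as isolated sites.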
Specifically, the design in step 4.3) means extending both ends of the target regions of step 4.2), merging all extended target regions and removing redundancy, and recording the important information of the target regions for probe and/or primer design in a suitable file format, including the chromosome number, start position and end position of each target region, the variant site information, and custom information such as the 3' end information required for primer design.
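Extending, merging and recording target regions can be sketched with plain interval arithmetic. The flank length is a hypothetical parameter (the source does not give one), and the tab-separated chromosome/start/end layout is only one possible "suitable file format"; annotation columns such as variant site information are left out of this sketch.

```python
def extend_and_merge(regions, flank):
    """Extend each (chrom, start, end) region by `flank` bp and merge overlaps.

    Start positions are clamped at 0; duplicate or overlapping regions collapse into one.
    """
    extended = sorted((chrom, max(0, start - flank), end + flank)
                      for chrom, start, end in regions)
    merged = []
    for chrom, start, end in extended:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            previous = merged[-1]
            merged[-1] = (chrom, previous[1], max(previous[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def to_region_lines(regions):
    """Format regions as tab-separated chromosome, start and end columns."""
    return [f"{chrom}\t{start}\t{end}" for chrom, start, end in regions]
```

Sorting before merging means each region only needs to be compared with the last merged one, so the whole pass is dominated by the sort.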
Specifically, in step 4.4), with reference to a data set of probes and/or primers that can be designed in the human genome, the coverage depth of the probes and/or primers designed for the target regions is weighted; after the coverage depth of the probes and/or primers is predicted from human whole-genome sequencing data, the overall probe and/or primer set is adjusted so that the probes and/or primers capture and cover the target regions uniformly.
Specifically, in step 4.5, the probe design regions generated in steps 4.3 and 4.4 are compared comprehensively, and the coverage of the probes over the important variant sites and over all target regions is evaluated at the same time, to obtain the probe design scheme with optimal coverage. Optimal coverage in step 4.5) is assessed by calculating the coverage of the probes and/or primers over the important gene variant sites of step 4.1) and over all target regions of the target genes, using the formula: coverage = number of aligned reads / number of target-sequencing reads. After optimization around the target regions, the coverage of the finally designed probes and/or primers over all target regions is at least 90%, and their coverage over the important gene variant sites of step 4.1) is at least 97%.
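The coverage formula and the two acceptance thresholds can be checked with a few lines of code. The function names are illustrative; the 90% and 97% defaults are the thresholds stated above.

```python
def coverage(aligned_reads, target_reads):
    """Coverage = number of aligned reads / number of target-sequencing reads."""
    if target_reads <= 0:
        raise ValueError("target_reads must be positive")
    return aligned_reads / target_reads


def design_passes(all_region_coverage, key_site_coverage,
                  region_min=0.90, site_min=0.97):
    """Check a probe/primer design against both coverage thresholds."""
    return all_region_coverage >= region_min and key_site_coverage >= site_min
```

A design with 92% coverage over all target regions but only 95% over the important variant sites would fail the second threshold and need further optimization.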
In the invention, steps 4.1 to 4.6 together form an integrated workflow that can be adapted to the detection platform used in a specific assay, such as PCR, NGS, third-generation sequencing or NanoString, and can be adjusted and optimized for different fields and technical specifications.
The invention provides a method for constructing complex disease state evaluation based on high-throughput sequencing data and clinical phenotype, which is used for evaluation based on the combination of complex disease state evaluation gene detection panel and a comprehensive state scoring computing system and comprises the following steps:
step 5.1) based on the gene detection panel designed by the method of the invention, obtaining the quantitative values of the gene abnormal regulation relationships and/or gene variation marker combination related to the complex disease, and inputting them into the complex disease comprehensive state scoring computing system of the invention;
step 5.2) inputting the obtained clinical information related to the complex disease state and the quantitative values of the inspection and pathological indexes into the complex disease comprehensive state scoring computing system;
step 5.3) combining the hardware, software and/or online tools involved in steps 5.1) and 5.2) into one matched combined workflow, so that a user can complete detection, information input, calculation and evaluation, and result retrieval as required, and obtain useful output such as the evaluated state and advisory prompts.
In the invention, step 5.1 uses a form suited to the specific application, such as a detection device or a kit, to flexibly obtain multi-omics information at the DNA and RNA levels, including but not limited to copy number, gene variation and gene expression; the goal is to obtain quantitative values that can be input into the comprehensive state scoring computing system, and to fix a standard input format.
In the invention, step 5.2 uses a hardware or software module suited to the application scenario and matched to the gene detection panel of step 5.1 to acquire, automatically or manually, case test indexes and clinical information from medical information systems such as HIS or EMR in a form that can be input into the comprehensive state scoring computing system, and fixes a standard input format.
In the invention, the combination of the gene detection panel and the comprehensive scoring system constructed in step 5.3 is designed around the application requirements and can take flexible and varied forms, including but not limited to kit plus software, detection device plus data-processing all-in-one machine, and kit plus detection device plus online data analysis platform. Following the instruction document, the user inputs the necessary information for an individual case in the most convenient, friendly and efficient form available; this information includes the gene abnormal regulation relationships and/or gene variation marker combination related to the complex disease, and the clinical information and test and pathological indexes related to the complex disease. After data summarization, statistics and preprocessing are performed automatically or semi-automatically, calculation and evaluation are completed, and information such as the classification and risk score of the individual case and the corresponding treatment benefit prediction is output. In this way, functions such as evaluating the state of an individual complex disease case and assisting clinical decisions are realized.
The method of the invention for constructing a complex disease state evaluation model based on high-throughput sequencing data and clinical phenotype has applications including: screening marker combinations for comprehensive state evaluation of complex diseases; screening marker combinations for comprehensive state evaluation of tumors; and prognosis evaluation, treatment effect prediction and treatment decision support for complex diseases.
The invention provides an application of a method for constructing a complex disease state evaluation model based on high-throughput sequencing data and clinical phenotypes in colorectal tumor state evaluation (comprising a colorectal tumor state evaluation model construction method, a colorectal tumor state evaluation panel design method, a colorectal tumor state evaluation method and the like), which comprises the following steps:
step 14.1) acquiring colorectal tumor case information, including high-throughput sequencing data and clinical information, classifying according to colorectal tumor case states, performing pairing arrangement, and determining a mining mode;
step 14.2) constructing a colorectal tumor related gene abnormal regulation relation and a gene variation marker combination;
step 14.3) screening clinical information and inspection and pathological indexes related to colorectal tumors; referring to the gene abnormal regulation and control relation related to the colorectal tumor and the gene variation marker combination obtained in the step 14.2, integrating and optimizing the gene abnormal regulation and control relation into a colorectal tumor multi-marker combination, constructing a colorectal tumor comprehensive state scoring model, and developing and packaging the colorectal tumor comprehensive state scoring model into a colorectal tumor comprehensive state scoring computing system;
step 14.4) designing a target gene target region related probe and/or primer for colorectal tumor comprehensive state evaluation based on the colorectal tumor related gene abnormal regulation relation and the gene variation marker combination obtained in the step 14.2, and using the probe and/or primer as a colorectal tumor comprehensive state evaluation gene detection panel;
and step 14.5) constructing a combined flow of the colorectal tumor comprehensive state evaluation gene detection panel and the comprehensive state scoring computing system, so that a user can complete detection, information input, calculation evaluation and result acquisition according to the flow.
Specifically, in step 14.1, the colorectal tumor case information is sorted:
step 14.1.1) dividing the colorectal tumor case information into transcriptome data, exome/genomic data and clinical information;
step 14.1.2) the colorectal tumor case information is classified according to disease states and matched.
Specifically, in step 14.2, a colorectal tumor marker combination is constructed, and combination optimization screening is performed by using successive iteration based on a greedy algorithm and/or evolutionary iteration based on a genetic algorithm:
if the colorectal tumor case information only relates to transcriptome data and clinical information, executing step 14.2.1) to perform marker mining based on the transcriptome data and the clinical information to construct a colorectal tumor-related gene abnormality regulation and control relationship marker combination;
if the colorectal tumor case information only relates to the exome/genomic data and the clinical information, executing step 14.2.2) performing marker mining based on the exome/genomic data and the clinical information to construct a colorectal tumor-related genetic variation marker combination;
if the colorectal tumor case information includes transcriptome data, exome/genome data and clinical information at the same time, execute step 14.2.3) to perform marker mining based on the transcriptome data, exome/genome data and clinical information, and construct a colorectal tumor-related genetic abnormality regulation relationship and genetic variation marker combination.
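The branching among steps 14.2.1) to 14.2.3) depends only on which data types are available. A minimal sketch of that dispatch, assuming clinical information is always present and using a hypothetical function name:

```python
def choose_mining_step(has_transcriptome, has_exome_or_genome):
    """Return the sub-step of step 14.2 that applies to the available data types.

    Clinical information is assumed to be available in every case, as in the text.
    """
    if has_transcriptome and has_exome_or_genome:
        return "14.2.3"   # abnormal regulation relationships plus gene variation markers
    if has_transcriptome:
        return "14.2.1"   # abnormal regulation relationship markers only
    if has_exome_or_genome:
        return "14.2.2"   # gene variation markers only
    raise ValueError("at least one sequencing data type is required")

step = choose_mining_step(has_transcriptome=True, has_exome_or_genome=True)
```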
In particular, said step 14.2.1) comprises in particular the following sub-steps:
step 14.2.1.1) constructing a reference gene regulation network;
step 14.2.1.2) constructing a condition-specific gene regulation network based on the transcriptome data of the colorectal tumor in a specific disease state and the TF-target relationship of the reference gene regulation network;
step 14.2.1.3) quantifying the gene regulation intensity in the condition-specific gene regulation network and the difference in regulation intensity between networks;
step 14.2.1.4) screening gene abnormal regulation and control relations among condition-specific gene regulation and control networks under different colorectal tumor disease states;
step 14.2.1.5) constructing a colorectal tumor-related gene abnormality regulation relationship marker combination based on the gene abnormality regulation relationship obtained in step 14.2.1.4).
Specifically, in step 14.2.1.2), machine-learning-based feature selection algorithms, including Boruta, Naïve Bayes, NMF and univariate linear regression, are used, with acceleration by heterogeneous computing or parallelization, to screen the TFs that contribute significantly to the TF-target relationships in a given disease state, thereby forming the condition-specific gene regulation network for that colorectal tumor disease state.
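Of the feature selection options named for step 14.2.1.2), univariate linear regression is the simplest to illustrate. The sketch below keeps a reference TF-target edge when the TF alone explains enough of the target's expression variance in one disease state. The R² cutoff of 0.5 is an assumption standing in for a proper significance test, and the gene names and expression values are placeholders.

```python
def r_squared(x, y):
    """Coefficient of determination of the univariate linear regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0                      # constant profile: no explanatory power
    return sxy * sxy / (sxx * syy)

def condition_specific_network(reference_edges, expression, min_r2=0.5):
    """Keep reference (TF, target) edges supported by expression data of one condition.

    `min_r2` is a hypothetical threshold, not a value given in the source.
    """
    return [(tf, target) for tf, target in reference_edges
            if r_squared(expression[tf], expression[target]) >= min_r2]

# Toy expression profiles across five samples of one disease state.
expression = {
    "TF_A": [1, 2, 3, 4, 5],
    "TF_B": [5, 1, 4, 2, 3],
    "GENE_X": [2.1, 3.9, 6.2, 8.0, 9.8],
}
reference_edges = [("TF_A", "GENE_X"), ("TF_B", "GENE_X")]
network = condition_specific_network(reference_edges, expression)
```

Each edge is tested independently, which is why this step parallelizes well, as the text notes.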
Specifically, in step 14.2.1.3), a multivariate linear regression model is used to quantify the gene regulation intensity in each condition-specific gene regulation network.
The regression is performed by the De-biased LASSO method, which yields the regulation strength and confidence interval of each gene regulation relationship; whether a regulation difference is significant is judged by whether the confidence intervals of the same regulation relationship in different condition-specific gene regulation networks overlap. Alternatively, the change in mean strength of the same regulation relationship across the condition-specific gene regulation networks is compared, which quantifies the regulation difference directly without calculating confidence intervals.
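The confidence-interval comparison can be sketched independently of the regression itself. The example below assumes the regulation strengths and confidence intervals have already been estimated (for instance by De-biased LASSO, which is not implemented here); the data layout and all numbers are hypothetical.

```python
def intervals_overlap(ci_a, ci_b):
    """True if two closed intervals (low, high) share at least one point."""
    low_a, high_a = ci_a
    low_b, high_b = ci_b
    return low_a <= high_b and low_b <= high_a

def differential_edges(network_a, network_b):
    """Return edges whose confidence intervals do not overlap between two networks.

    Each network maps (TF, target) -> (strength, (ci_low, ci_high)).
    The returned value is the strength change from network_a to network_b.
    """
    changed = {}
    for edge in network_a:
        if edge not in network_b:
            continue
        strength_a, ci_a = network_a[edge]
        strength_b, ci_b = network_b[edge]
        if not intervals_overlap(ci_a, ci_b):
            changed[edge] = strength_b - strength_a
    return changed

# Toy estimates for two disease states.
network_a = {("TF1", "G1"): (0.8, (0.6, 1.0)), ("TF2", "G2"): (0.3, (0.1, 0.5))}
network_b = {("TF1", "G1"): (0.2, (0.0, 0.4)), ("TF2", "G2"): (0.4, (0.2, 0.6))}
diff = differential_edges(network_a, network_b)
```

Only the TF1-G1 edge is reported here, because the TF2-G2 intervals overlap and the difference is therefore not treated as significant.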
Specifically, in step 14.2.1.4), three factors related to gene regulation are integrated to screen the gene abnormal regulation relationships among the condition-specific gene regulation networks of colorectal tumors in different disease states: the gene regulation intensity changes significantly, the expression level of the regulated target gene changes significantly, and the direction of change in the TF-to-target regulation intensity is consistent with the direction of change in target expression level. The screened gene abnormal regulation relationships are then ranked by the degree of difference in regulation intensity between disease states.
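The three-factor screen and the ranking can be expressed compactly. In this sketch the significance calls are taken as given inputs, direction consistency is interpreted as the two changes having the same sign, and the record type and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class EdgeEvidence:
    tf: str
    target: str
    strength_change: float        # regulation intensity in state B minus state A
    strength_significant: bool    # e.g. non-overlapping confidence intervals
    expression_change: float      # target expression in state B minus state A
    expression_significant: bool  # e.g. differential expression test result

def screen_dysregulations(edges):
    """Keep edges meeting all three criteria, ranked by size of the intensity difference."""
    kept = [e for e in edges
            if e.strength_significant
            and e.expression_significant
            and e.strength_change * e.expression_change > 0]   # same direction of change
    return sorted(kept, key=lambda e: abs(e.strength_change), reverse=True)

# Toy evidence records, for illustration only.
edges = [
    EdgeEvidence("TF1", "G1", 0.5, True, 1.2, True),
    EdgeEvidence("TF2", "G2", -0.9, True, -0.4, True),
    EdgeEvidence("TF3", "G3", 0.7, True, -0.3, True),    # directions disagree
    EdgeEvidence("TF4", "G4", 0.8, False, 1.0, True),    # intensity change not significant
]
ranked = screen_dysregulations(edges)
```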
Specifically, in step 14.2.1.5), the colorectal tumor-related gene abnormal regulation relationship marker combination is constructed by successive incremental iteration based on a greedy algorithm and/or evolutionary iteration based on a genetic algorithm; for each marker combination, C-index is used to measure its predictive performance for disease prognosis status, or AUC is used to measure its predictive performance for treatment benefit status.
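Successive incremental iteration based on a greedy algorithm can be sketched as forward selection: at each round the marker that most improves the evaluation index is added, and the search stops when no marker helps. The sketch uses AUC as the index; for prognosis, C-index would take its place. Scoring a combination by an unweighted sum of marker values is an assumption made to keep the example small, and the marker names and data are placeholders.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def greedy_select(features, labels, min_gain=1e-9):
    """Forward selection: add the marker giving the largest AUC gain until none improves."""
    selected, best = [], 0.5            # 0.5 is the AUC of a random predictor
    remaining = sorted(features)
    n = len(labels)
    while remaining:
        candidates = []
        for marker in remaining:
            combo = selected + [marker]
            # Equal-weight sum of marker values: a simplifying assumption.
            scores = [sum(features[m][i] for m in combo) for i in range(n)]
            candidates.append((auc(scores, labels), marker))
        top_auc, top_marker = max(candidates, key=lambda c: c[0])
        if top_auc - best <= min_gain:
            break
        selected.append(top_marker)
        remaining.remove(top_marker)
        best = top_auc
    return selected, best

# Toy data: 1 = benefits from treatment, 0 = does not.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "m1": [1, 1, 0, 0, 0, 0],
    "m2": [0, 0, 1, 0, 0, 0],
    "m3": [0, 1, 0, 1, 0, 1],
}
selected, best_auc = greedy_select(features, labels)
```

On this toy data the search adds m1, then m2, and stops because adding m3 would lower the AUC.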
In particular, said step 14.2.2) comprises in particular the following sub-steps:
step 14.2.2.1) identifying a genetic variation associated with the colorectal tumor;
step 14.2.2.2) quantitatively screening important genetic variations related to colorectal tumor status using a data-driven and/or prior-knowledge-driven approach;
step 14.2.2.3) constructing a colorectal tumor-associated genetic variation marker combination based on the colorectal tumor state-associated significant genetic variation obtained in step 14.2.2.2).
Specifically, in step 14.2.2.2), data-driven quantitative filtering involves calculating and ranking somatic gene variation frequencies and identifying high-frequency variant genes; genes with a variation frequency of at least 5% proceed to prior-knowledge filtering. Prior-knowledge filtering retains colorectal tumor-related genes found in applicable standards, clinical treatment guidelines, drug labels, general knowledge bases and literature reports.
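The two-stage filter can be illustrated as follows. The 5% cutoff comes from the text; the function names, the placeholder gene names and the toy cohort are assumptions, and the prior-knowledge set stands in for genes collected from standards, guidelines, drug labels, knowledge bases and literature.

```python
def high_frequency_genes(mutated_genes_per_case, min_freq=0.05):
    """Rank genes by the fraction of cases carrying a somatic variant; keep those >= min_freq."""
    n_cases = len(mutated_genes_per_case)
    counts = {}
    for genes in mutated_genes_per_case:
        for gene in set(genes):               # count each gene once per case
            counts[gene] = counts.get(gene, 0) + 1
    freqs = {gene: count / n_cases for gene, count in counts.items()}
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(gene, freq) for gene, freq in ranked if freq >= min_freq]

def prior_knowledge_filter(ranked, known_genes):
    """Keep only genes that also appear in the prior-knowledge gene set."""
    return [(gene, freq) for gene, freq in ranked if gene in known_genes]

# Toy cohort of 40 cases with placeholder gene names.
cases = ([["GENE_A", "GENE_B"]] * 10 + [["GENE_A"]] * 6 + [["GENE_C"]] * 4
         + [["GENE_D"]] + [[]] * 19)
frequent = high_frequency_genes(cases)
important = prior_knowledge_filter(frequent, known_genes={"GENE_A", "GENE_C", "GENE_E"})
```

GENE_D is dropped by the frequency cutoff (1 of 40 cases), and GENE_B is dropped by the prior-knowledge stage despite being frequent.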
Specifically, in step 14.2.2.3), the colorectal tumor-related genetic variation marker combination is constructed by successive incremental iteration based on a greedy algorithm and/or evolutionary iteration based on a genetic algorithm; for each marker combination, C-index is used to measure its predictive performance for disease prognosis status, or AUC is used to measure its predictive performance for treatment benefit status.
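Evolutionary iteration based on a genetic algorithm can be sketched with marker combinations encoded as bit masks. The population size, number of generations, mutation rate, truncation selection, single-point crossover and the small penalty on combination size are all illustrative assumptions; the text specifies only that a genetic algorithm is used and that AUC or C-index is the evaluation index.

```python
import random

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fitness(mask, names, features, labels):
    """AUC of the equal-weight marker sum, minus a small size penalty (an assumption)."""
    combo = [name for name, bit in zip(names, mask) if bit]
    if not combo:
        return 0.0
    scores = [sum(features[name][i] for name in combo) for i in range(len(labels))]
    return auc(scores, labels) - 0.01 * len(combo)

def evolve(features, labels, pop_size=20, generations=30, mutation_rate=0.1, seed=0):
    """Evolve bit masks over the markers; requires at least two markers."""
    rng = random.Random(seed)                 # seeded for reproducible runs
    names = sorted(features)
    score = lambda mask: fitness(mask, names, features, labels)
    population = [[rng.randint(0, 1) for _ in names] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]             # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, len(names) - 1)          # single-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - bit if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        population = parents + children
    best = max(population, key=score)
    return [name for name, bit in zip(names, best) if bit], score(best)

# Toy data: 1 = benefits from treatment, 0 = does not.
labels = [1, 1, 1, 0, 0, 0]
features = {"v1": [1, 1, 0, 0, 0, 0], "v2": [0, 0, 1, 0, 0, 0], "v3": [0, 1, 0, 1, 0, 1]}
selected, best_fitness = evolve(features, labels)
```

Because the fitter half of each generation is carried over unchanged, the best fitness never decreases from one generation to the next.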
In particular, said step 14.2.3) comprises in particular the following sub-steps:
step 14.2.3.1) for colorectal tumor data sets having both transcriptome data and exome/genome data, screening gene abnormal regulation and control relations related to disease states by using steps 14.2.1.1-14.2.1.4, and mining important gene variations related to disease states by using steps 14.2.2.1-14.2.2.2 to obtain the gene abnormal regulation and control relations and the important gene variations related to colorectal tumors respectively;
step 14.2.3.2) then adopting step 14.2.1.5 and step 14.2.2.3, integrating RNA and DNA information based on successive incremental iteration of a greedy algorithm or evolution iteration based on a genetic algorithm, and constructing a colorectal tumor-related gene abnormal regulation and control relationship and gene variation marker combination.
Specifically, in step 14.3, the screening of the colorectal tumor-related clinical information and the examination and pathological indexes comprises the following steps:
step 14.3.1) screening for clinical information and test and pathological indicators related to colorectal tumor status against known prior knowledge;
step 14.3.2) screening clinical information and test and pathological indexes related to the colorectal tumor state from the case information in the colorectal tumor cohort.
Specifically, in step 14.3, the colorectal tumor multi-marker combination is obtained by the following method:
combining the obtained colorectal tumor-related gene abnormal regulation relationships and/or gene variation markers, and at the same time integrating the clinical information and test and pathological indexes related to colorectal tumor state screened in steps 14.3.1 and 14.3.2, so as to optimize them into a colorectal tumor multi-marker combination.
Specifically, in the step 14.4, the design of the gene detection panel comprises the following steps:
step 14.4.1) based on the colorectal tumor-related gene abnormal regulation relationships and/or gene variation marker combination obtained by screening, determining the gene set finally incorporated into the colorectal tumor comprehensive state scoring method, collating the gene-related information in the gene set, removing redundancy and determining standard gene names;
step 14.4.2) selecting a target gene target region for colorectal tumor detection design for the gene combed in step 14.4.1), which can be used for probe design or primer design;
step 14.4.3) designing corresponding probe and/or primer sequences based on the target gene target region in step 14.4.2), and recording important annotations;
step 14.4.4) for the target gene target regions in step 14.4.2), referring to the data set of probes and/or primers that can be designed in the human genome, and optimizing the design so that the probes and/or primers capture and cover the target regions uniformly;
step 14.4.5) comparing the target gene target region related probes and/or primer design regions in steps 14.4.3 and 14.4.4 to obtain target gene target region related probes and/or primer design schemes with optimal coverage;
step 14.4.6) based on the target gene target region-related probes and/or primers designed in step 14.4.5, producing a gene detection panel for comprehensive assessment of colorectal tumor status.
Specifically, in step 14.5, the combined process includes the following steps:
step 14.5.1) based on the gene detection panel designed by the method of the invention, obtaining the quantitative value of the colorectal tumor related gene abnormal regulation relation and/or gene variation marker combination, and inputting the quantitative value into a colorectal tumor comprehensive state scoring computing system;
step 14.5.2) inputting the obtained clinical information related to the colorectal tumor state and the quantitative values of the inspection and pathological indexes into a colorectal tumor comprehensive state scoring computing system;
step 14.5.3) combines the hardware, software and/or online tools involved in steps 14.5.1) and 14.5.2) into a set of matching process, so that the user can complete detection, information input, calculation evaluation and result acquisition according to the requirements.
Specifically, in step 14.2), the colorectal tumor-related gene abnormal regulation relationships and gene variation marker combination correspond to a gene set comprising any one of the following 53 genes, or a combination thereof: RUNX3, GPR15, P2RY8, SNAI3, TLR7, ATOH1, SIGLEC1, KRAS, NRAS, BRAF, HER2, KIT, PDGFRA, SDHA, SDHB, SDHC, SDHD, NF1, PD1, PDL1, PDL2, CTLA4, TIGIT, TIM3, LAG3, IFNG, CCL2, GZMA, PRF1, CXCL8, CXCL9, CXCL10, TGFB1, SOX10, SERPINB9, CD8A, CD8B, GZMA, GZMB, PRF1, CCL5, CD27, CD274, CMKLR1, CXCR6, NKG7, IDO1, PSMB10, STAT1, STK11, HLA-DQA1, HLA-DRB1, HLA-E. Specifically, all 53 genes in combination are used for survival prognosis evaluation; RUNX3, GPR15, P2RY8, SNAI3, TLR7, ATOH1 and SIGLEC1 are used for predicting the effect of chemotherapy regimens; KRAS, NRAS, BRAF, HER2, KIT, PDGFRA, SDHA, SDHB, SDHC, SDHD and NF1 are used for predicting the effect of targeted treatment regimens; and PD1, PDL1, PDL2, CTLA4, TIGIT, TIM3, LAG3, IFNG, CCL2, GZMA, PRF1, CXCL8, CXCL9, CXCL10, TGFB1, SOX10, SERPINB9, CD8A, CD8B, GZMA, GZMB, PRF1, CCL5, CD27, CD274, CMKLR1, CXCR6, NKG7, IDO1, PSMB10, STAT1, STK11, HLA-DQA1, HLA-DRB1 and HLA-E are used for evaluating the immune infiltration and immune cytotoxicity states of colorectal tumors and for predicting the effect of immune checkpoint inhibitors.
In step 14.3, the colorectal tumor-related clinical information and test and pathological indexes, together with the 53 genes of the colorectal tumor-related gene abnormal regulation relationships and gene variation marker combination, form the colorectal tumor multi-marker combination, which is used for predicting prognosis and the effects of chemotherapy, targeted therapy and immunotherapy, and for assisting clinical decisions. Specifically, all 53 genes are used for survival prognosis evaluation, where a low score indicates a good prognosis for the case. RUNX3, GPR15, P2RY8, SNAI3, TLR7, ATOH1 and SIGLEC1 are used for predicting the effect of chemotherapy regimens (especially in the postoperative setting), including 5-FU and combined ADJC (including FOLFIRI, FOLFOX and FUFOL); this supplements the semi-quantitative, pathological-staging-based selection of chemotherapy regimens with a quantitative score, and cases in the low-score group are more likely to benefit from chemotherapy. KRAS, NRAS, BRAF, HER2, KIT, PDGFRA, SDHA, SDHB, SDHC, SDHD and NF1 are used for predicting the effect of targeted treatment regimens; the corresponding gene expression or variation scores are closely related to the benefit from targeted drugs, for example, cases with a high HER2 score are more likely to benefit from HER2 monoclonal antibody treatment. PD1, PDL1, PDL2, CTLA4, TIGIT, TIM3, LAG3, IFNG, CCL2, GZMA, PRF1, CXCL8, CXCL9, CXCL10, TGFB1, SOX10, SERPINB9, CD8A, CD8B, GZMA, GZMB, PRF1, CCL5, CD27, CD274, CMKLR1, CXCR6, NKG7, IDO1, PSMB10, STAT1, STK11, HLA-DQA1, HLA-DRB1 and HLA-E are used for evaluating the immune infiltration and immune cytotoxicity states of colorectal tumors; the immune low-risk subtype defined by the scores of these genes shows a high degree of immune cell infiltration, strong immune cytotoxicity and a high degree of immune checkpoint activation, and is more likely to benefit from immune checkpoint inhibitor treatment.
Specifically, the probes and/or primers related to the 53 target gene target regions designed in step 14.4 for colorectal tumor comprehensive state evaluation cover at least 95% of the target regions and at least 97% of the important gene mutation sites within them. According to specific use, the 53 target gene target regions can be divided into 3 detection panels: a chemotherapy status evaluation panel (genes including RUNX3, GPR15, P2RY8, SNAI3, TLR7, ATOH1 and SIGLEC1), a targeted therapy status evaluation panel (genes including KRAS, NRAS, BRAF, HER2, KIT, PDGFRA, SDHA, SDHB, SDHC, SDHD and NF1) and an immunotherapy status evaluation panel (genes including PD1, PDL1, PDL2, CTLA4, TIGIT, TIM3, LAG3, IFNG, CCL2, GZMA, PRF1, CXCL8, CXCL9, CXCL10, TGFB1, SOX10, SERPINB9, CD8A, CD8B, GZMA, GZMB, PRF1, CCL5, CD27, CD274, CMKLR1, CXCR6, NKG7, IDO1, PSMB10, STAT1, STK11, HLA-DQA1, HLA-DRB1 and HLA-E).
The data acquisition and arrangement in step 14.1 of the invention fully covers the published colorectal tumor data sets including but not limited to TCGA, GEO, ICGC and the like, incorporates information such as survival, medication effect and the like, and realizes systematic mining of transcriptome and exome markers related to the information.
The method of step 14.2 integrates three factors related to gene regulation to screen the gene abnormal regulation relationships between colorectal tumor condition-specific gene regulation networks (cGRNs): the TF-target regulation intensity changes significantly, the target expression level changes significantly, and the direction of change in the TF-to-target regulation intensity is consistent with the direction of change in target expression level. The screened gene abnormal regulation relationships can also be ranked by the degree of difference in regulation intensity. Based on the ability to predict prognostic survival and treatment effect for a case, the method uses successive incremental iteration based on a greedy algorithm to mine transcriptome-related markers, and the resulting marker combination is accurate, reliable and mechanistically interpretable.
The method of step 14.2 of the invention comprehensively adopts a quantitative screening strategy driven by data and priori knowledge, and uses an evolutionary iterative method based on genetic algorithm to screen the high-frequency DNA variation marker combination related to colorectal tumor states such as progression stage, prognosis survival and treatment scheme sensitivity, and the marker combination has the characteristics of accuracy, reliability and strong mechanism interpretability.
The gene set and model system of step 14.3 of the invention enable comprehensive state scoring of colorectal cancer patients, and the score correlates strongly with colorectal tumor prognostic survival and with the effect of treatment modalities (including but not limited to chemotherapy, targeted therapy and immune checkpoint inhibitors). Specifically, all input features contribute to survival prognosis, but they carry different weights for predicting the effect of treatment modalities: the contributions of RUNX3, GPR15, P2RY8, SNAI3, TLR7, ATOH1 and SIGLEC1 are weighted toward predicting the effect of chemotherapy regimens, including 5-FU and combined ADJC (including FOLFIRI, FOLFOX and FUFOL), providing quantitative scoring support for the semi-quantitative, pathological-staging-based selection of chemotherapy regimens; KRAS, NRAS, BRAF, HER2, KIT, PDGFRA, SDHA, SDHB, SDHC, SDHD and NF1 are weighted toward predicting the effect of targeted treatment regimens; and PD1, PDL1, PDL2, CTLA4, TIGIT, TIM3, LAG3, IFNG, CCL2, GZMA, PRF1, CXCL8, CXCL9, CXCL10, TGFB1, SOX10, SERPINB9, CD8A, CD8B, GZMA, GZMB, PRF1, CCL5, CD27, CD274, CMKLR1, CXCR6, NKG7, IDO1, PSMB10, STAT1, STK11, HLA-DQA1, HLA-DRB1 and HLA-E are weighted toward evaluating immune infiltration and immune cytotoxicity states and predicting the effect of immune checkpoint inhibitors. Information on surgical status (performed or not), pathological grade (I-IV) and microsatellite instability (MSI) contributes to both prognosis and treatment effect prediction.
The combined workflow of panel design and evaluation system in steps 14.4 and 14.5 of the invention achieves high probe capture efficiency and target region coverage, and the panel and scoring modules can be flexibly adjusted as required. The workflow is used for comprehensive state evaluation of colorectal tumor patients and for assisting clinical decisions, including but not limited to selection of surgical, chemotherapy and targeted therapy regimens, reference for immunotherapy, and evaluation of prognosis status. An example of flexible adjustment of the panel and scoring modules: the marker combination covering only the 7 genes of the abnormal regulation set 4-DysReg (RUNX3, GPR15, P2RY8, SNAI3, TLR7, ATOH1 and SIGLEC1) can serve as a small panel; with the corresponding state scoring model retained, this forms a state evaluation workflow dedicated to adjuvant chemotherapy regimens for colorectal cancer. The same approach applies to extracting separate state evaluation workflows for targeted therapy and immune checkpoint inhibitor regimens, which shrinks the panel and reduces detection cost.
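Extracting a smaller panel for a single purpose can be sketched as a lookup over the three gene groups listed in the text. The dictionary keys and the function name are hypothetical. Note that the immunotherapy list as written in the text repeats GZMA and PRF1, so the deduplicated union of the three groups has fewer than 53 distinct symbols.

```python
# Gene groups as listed in the text (the immunotherapy list repeats GZMA and PRF1).
PANELS = {
    "chemotherapy": ["RUNX3", "GPR15", "P2RY8", "SNAI3", "TLR7", "ATOH1", "SIGLEC1"],
    "targeted": ["KRAS", "NRAS", "BRAF", "HER2", "KIT", "PDGFRA",
                 "SDHA", "SDHB", "SDHC", "SDHD", "NF1"],
    "immunotherapy": ["PD1", "PDL1", "PDL2", "CTLA4", "TIGIT", "TIM3", "LAG3", "IFNG",
                      "CCL2", "GZMA", "PRF1", "CXCL8", "CXCL9", "CXCL10", "TGFB1",
                      "SOX10", "SERPINB9", "CD8A", "CD8B", "GZMA", "GZMB", "PRF1",
                      "CCL5", "CD27", "CD274", "CMKLR1", "CXCR6", "NKG7", "IDO1",
                      "PSMB10", "STAT1", "STK11", "HLA-DQA1", "HLA-DRB1", "HLA-E"],
}

def build_panel(purposes):
    """Return the deduplicated gene list for the requested purposes, in listed order."""
    genes = []
    for purpose in purposes:
        for gene in PANELS[purpose]:
            if gene not in genes:
                genes.append(gene)
    return genes

small_panel = build_panel(["chemotherapy"])                       # the 7-gene example
full_panel = build_panel(["chemotherapy", "targeted", "immunotherapy"])
```

Keeping the scoring model keyed by the same purpose names would let a reduced panel be paired with its matching scoring module.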
The invention provides an application of a method for constructing complex disease state assessment based on high-throughput sequencing data and clinical phenotype in pancreatic ductal carcinoma state assessment, which comprises the following steps:
step 15.1) obtaining pancreatic ductal carcinoma disease case information, including high-throughput sequencing data and clinical information, classifying and carrying out pairing and sorting according to the pancreatic ductal carcinoma disease case states;
step 15.2) constructing a pancreatic ductal carcinoma-related gene abnormal regulation relation and a gene variation marker combination;
step 15.3) screening clinical information and test and pathological indexes related to pancreatic ductal carcinoma; with reference to the pancreatic ductal carcinoma-related gene abnormal regulation relationships and gene variation marker combination obtained in step 15.2, integrating and optimizing them into a pancreatic ductal carcinoma multi-marker combination, constructing a pancreatic ductal carcinoma comprehensive state scoring model, and developing and packaging it into a pancreatic ductal carcinoma comprehensive state scoring computing system;
step 15.4) designing a target gene target region related probe and/or primer for pancreatic ductal carcinoma comprehensive state evaluation based on the pancreatic ductal carcinoma related gene abnormal regulation relation and the gene variation marker combination obtained in the step 15.2, and using the probe and/or primer as a pancreatic ductal carcinoma comprehensive state evaluation gene detection panel;
and step 15.5) constructing a combined flow of the pancreatic ductal carcinoma comprehensive state evaluation gene detection panel and a comprehensive state scoring computing system, so that a user can complete detection, information input, calculation evaluation and result acquisition according to the flow.
Specifically, in step 15.1, pancreatic ductal carcinoma case information is sorted:
step 15.1.1) dividing the pancreatic ductal carcinoma case information into transcriptome data, exome/genomic data, and clinical information;
step 15.1.2) classifying the pancreatic ductal carcinoma disease case information according to disease states and carrying out pairing and sorting.
Specifically, in step 15.2, a pancreatic ductal carcinoma marker combination is constructed, and combination optimization screening is performed by using successive iteration based on a greedy algorithm and/or evolutionary iteration based on a genetic algorithm:
if the pancreatic ductal carcinoma disease case information only relates to the transcriptome data and the clinical information, executing a step 15.2.1) to perform marker mining based on the transcriptome data and the clinical information to construct a pancreatic ductal carcinoma-related gene abnormality regulation relationship marker combination;
if the pancreatic ductal carcinoma disease case information only relates to the exome/genomic data and the clinical information, performing step 15.2.2) performing marker mining based on the exome/genomic data and the clinical information to construct a pancreatic ductal carcinoma-associated genetic variation marker combination;
if the pancreatic ductal carcinoma disease case information simultaneously comprises transcriptome data, exome/genome data and clinical information, executing step 15.2.3) performing marker mining based on the transcriptome data, exome/genome data and clinical information to construct a pancreatic ductal carcinoma-related gene abnormal regulation and control relationship and a gene variation marker combination.
In particular, said step 15.2.1) comprises in particular the following sub-steps:
step 15.2.1.1) constructing a reference gene regulation network;
step 15.2.1.2) constructing a condition-specific gene regulation network based on transcriptome data of pancreatic ductal carcinoma specific disease states and the TF-target relationship of the reference gene regulation network;
step 15.2.1.3) quantifying the gene regulation intensity in the condition-specific gene regulation network and the difference in regulation intensity between networks;
step 15.2.1.4) screening gene abnormal regulation and control relations among condition-specific gene regulation and control networks of pancreatic ductal carcinoma under different disease states;
step 15.2.1.5) constructing a marker combination of the gene abnormal regulation relationship related to pancreatic ductal carcinoma based on the gene abnormal regulation relationship obtained in step 15.2.1.4).
Specifically, in step 15.2.1.2), machine-learning-based feature selection algorithms, including Boruta, Naïve Bayes, NMF and univariate linear regression, are used, with acceleration by heterogeneous computing or parallelization, to screen the TFs that contribute significantly to the TF-target relationships in a given disease state, thereby forming the condition-specific gene regulation network for that pancreatic ductal carcinoma disease state.
Specifically, in step 15.2.1.3), a multivariate linear regression model is used to quantify the gene regulation intensity in each condition-specific gene regulation network.
The regression is performed by the De-biased LASSO method, which yields the regulation strength and confidence interval of each gene regulation relationship; whether a regulation difference is significant is judged by whether the confidence intervals of the same regulation relationship in different condition-specific gene regulation networks overlap. Alternatively, the change in mean strength of the same regulation relationship across the condition-specific gene regulation networks is compared, which quantifies the regulation difference directly without calculating confidence intervals.
Specifically, in step 15.2.1.4), three factors related to gene regulation are integrated to screen the gene abnormal regulation relationships among the condition-specific gene regulation networks of pancreatic ductal carcinoma in different disease states, namely: the gene regulation intensity changes significantly, the expression level of the regulated target gene changes significantly, and the direction of change of the TF-to-target regulation intensity is consistent with the direction of change of the target expression level; meanwhile, the screened gene abnormal regulation relationships are ranked according to the degree of difference in regulation intensity among the different disease states.
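The three-factor screen and the subsequent ranking can be sketched as below. This is an illustrative sketch under assumptions: the significance flags are taken as precomputed inputs, and the `Edge` record, its field names, and the example values are hypothetical rather than taken from the invention.

```python
from dataclasses import dataclass

@dataclass
class Edge:
    tf: str
    target: str
    strength_change: float      # regulation intensity in state B minus state A
    strength_significant: bool  # factor 1: intensity change is significant
    expr_change: float          # target expression in state B minus state A
    expr_significant: bool      # factor 2: expression change is significant

def is_abnormal(e: Edge) -> bool:
    # Factor 3: both changes point in the same direction (same sign).
    same_direction = e.strength_change * e.expr_change > 0
    return e.strength_significant and e.expr_significant and same_direction

def screen_and_rank(edges):
    """Keep edges meeting all three factors, ranked by size of intensity change."""
    kept = [e for e in edges if is_abnormal(e)]
    return sorted(kept, key=lambda e: abs(e.strength_change), reverse=True)

# Made-up candidate relationships (assumption, for illustration only).
edges = [
    Edge("TF_A", "GENE_1", 0.7, True, 1.2, True),
    Edge("TF_B", "GENE_2", -0.9, True, 0.8, True),    # opposite directions
    Edge("TF_C", "GENE_3", -1.1, True, -2.0, True),
    Edge("TF_D", "GENE_4", 0.5, False, 1.0, True),    # intensity not significant
]
ranked = screen_and_rank(edges)
```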
Specifically, in step 15.2.1.5), the pancreatic ductal carcinoma-related gene abnormal regulation relationship marker combination is constructed by successive incremental iteration based on a greedy algorithm and/or evolutionary iteration based on a genetic algorithm; for the marker combination, the C-index is used to measure its predictive performance for disease prognosis states, or the AUC is used to measure its predictive performance for treatment scheme benefit states.
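A minimal sketch of successive incremental (greedy) iteration with AUC as the selection criterion is given below. It assumes, purely for illustration, that a marker combination is scored per case by summing the quantitative values of its markers; the scoring rule, the function names, and the toy data are assumptions and not the scoring model of the invention.

```python
def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def combo_auc(markers, data, labels):
    # Hypothetical combination score: sum of the selected marker values per case.
    scores = [sum(data[m][i] for m in markers) for i in range(len(labels))]
    return auc(scores, labels)

def greedy_select(data, labels):
    """Add one marker per round while the combination AUC keeps improving."""
    selected, best = [], 0.0
    remaining = sorted(data)
    while remaining:
        cand_score, cand = max(
            (combo_auc(selected + [m], data, labels), m) for m in remaining
        )
        if cand_score <= best:
            break  # no candidate improves the current combination
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected, best

# Toy data: 6 cases (1 = benefit, 0 = no benefit) and 3 candidate markers.
labels = [1, 1, 1, 0, 0, 0]
data = {
    "m1": [3, 1, 2, 1, 0, 2],
    "m2": [0, 2, 1, 0, 1, 0],
    "m3": [1, 0, 1, 1, 0, 1],
}
selected, best = greedy_select(data, labels)
```

On this toy input the procedure picks m1 first, then m2, and stops because adding m3 lowers the AUC. For prognosis endpoints the same loop could use the C-index in place of the AUC.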
In particular, said step 15.2.2) comprises in particular the following sub-steps:
step 15.2.2.1) identifying gene variations associated with pancreatic ductal carcinoma;
step 15.2.2.2) quantitatively screening the important gene variations related to the pancreatic ductal carcinoma state using data-driven and/or prior-knowledge-driven approaches;
step 15.2.2.3) constructing a pancreatic ductal carcinoma-associated genetic variation marker combination based on the significant pancreatic ductal carcinoma status-associated genetic variation obtained in step 15.2.2.2).
Specifically, in step 15.2.2.2), the data-driven quantitative filtering and screening involves calculating and ranking somatic gene variation frequencies and identifying high-frequency variant genes, wherein genes with a variation frequency of 5% or more are passed on to the prior-knowledge filtering; the prior-knowledge filtering and screening covers the pancreatic ductal carcinoma-related genes found in application standards, clinical treatment guidelines, drug labels, general knowledge bases, and literature reports.
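The two-stage filter can be sketched as follows: a frequency cut-off of 5% followed by a prior-knowledge whitelist. The cohort, the placeholder gene labels GENE_X and GENE_Y, the whitelist contents, and the function names are illustrative assumptions.

```python
from collections import Counter

def high_frequency_genes(cohort, min_freq=0.05):
    """Rank genes by the fraction of cases carrying a somatic variation and
    keep those at or above min_freq. `cohort` maps case id -> set of genes."""
    n = len(cohort)
    counts = Counter(g for genes in cohort.values() for g in set(genes))
    freqs = {g: c / n for g, c in counts.items()}
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(g, f) for g, f in ranked if f >= min_freq]

def filter_by_prior_knowledge(ranked, known_genes):
    """Keep only genes that also appear in the curated prior-knowledge set."""
    return [(g, f) for g, f in ranked if g in known_genes]

# Made-up cohort of 40 cases (assumption, not real frequencies).
cohort = {f"S{i:02d}": set() for i in range(40)}
for i, case in enumerate(sorted(cohort)):
    if i < 36:
        cohort[case].add("KRAS")    # 90% of cases
    if i % 2 == 0:
        cohort[case].add("TP53")    # 50% of cases
    if i < 3:
        cohort[case].add("GENE_X")  # 7.5% of cases
    if i == 0:
        cohort[case].add("GENE_Y")  # 2.5% of cases, below the cut-off

# Hypothetical whitelist standing in for guidelines, drug labels, and literature.
prior_knowledge = {"KRAS", "TP53", "CDKN2A"}

ranked = high_frequency_genes(cohort)
important = filter_by_prior_knowledge(ranked, prior_knowledge)
```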
Specifically, in step 15.2.2.3), constructing a pancreatic ductal carcinoma-associated genetic variation marker combination in successive incremental iterations based on a greedy algorithm and/or evolutionary iterations based on a genetic algorithm; for the marker combination, C-index is used as an index to measure the prediction effect of the marker combination on disease prognosis states, or AUC is used as an index to measure the prediction effect of the marker combination on treatment scheme benefit states.
In particular, said step 15.2.3) comprises in particular the following sub-steps:
step 15.2.3.1) for pancreatic ductal carcinoma datasets having both transcriptome data and exome/genome data, screening gene abnormal regulation and control relationships related to disease states by using steps 15.2.1.1-15.2.1.4, and mining important gene variations related to disease states by using steps 15.2.2.1-15.2.2.2 to obtain the gene abnormal regulation and control relationships and the important gene variations related to pancreatic ductal carcinoma, respectively;
step 15.2.3.2) then adopting step 15.2.1.5 and step 15.2.2.3, integrating RNA and DNA information based on successive incremental iteration of a greedy algorithm or evolutionary iteration based on a genetic algorithm, and constructing a pancreatic ductal carcinoma-related gene abnormal regulation relationship and gene variation marker combination.
Specifically, in the step 15.3, the screening of the pancreatic ductal carcinoma-related clinical information and the test and pathological indexes comprises the following steps:
step 15.3.1) screening the pancreatic ductal carcinoma status-related clinical information and test and pathological indicators against known prior knowledge;
step 15.3.2) screening the pancreatic ductal carcinoma state-related clinical information and examination and pathological indexes based on the case information in the pancreatic ductal carcinoma cohort.
Specifically, in the step 15.3, the pancreatic ductal carcinoma multi-marker combination is obtained by the following method:
the obtained pancreatic ductal carcinoma-related gene abnormal regulation relationship and/or gene variation marker combination is integrated with the clinical information and the test and pathological indexes related to the pancreatic ductal carcinoma state screened in steps 15.3.1 and 15.3.2, and is thereby optimized into a pancreatic ductal carcinoma multi-marker combination.
Specifically, in the step 15.4, the design of the gene detection panel comprises the following steps:
step 15.4.1) obtaining abnormal regulation relation and/or gene variation marker combination of pancreatic ductal carcinoma related genes based on screening, finally incorporating the abnormal regulation relation and/or gene variation marker combination into a gene set of a pancreatic ductal carcinoma comprehensive state scoring method, combing gene related information in the gene set, removing redundancy, and determining a standard gene name;
step 15.4.2) selecting a target gene target region for pancreatic ductal carcinoma detection design against the gene combed in step 15.4.1), which can be used for probe design or primer design;
step 15.4.3) designing corresponding probe and/or primer sequences based on the target gene target region in step 15.4.2), and recording important annotations;
step 15.4.4) aiming at the target gene target region in the step 15.4.2), referring to a data set of probes and/or primers which can be designed in the human genome, and carrying out optimization design on the target gene target region so that the probes and/or the primers can be uniformly captured and cover the target region;
step 15.4.5) comparing the target gene target region related probes and/or primer design regions in steps 15.4.3 and 15.4.4 to obtain target gene target region related probes and/or primer design schemes with optimal coverage;
step 15.4.6) based on the target gene target region-related probes and/or primers designed in step 15.4.5, preparing a gene detection panel capable of fully performing the pancreatic ductal carcinoma status assessment.
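The redundancy removal and standard-name determination of step 15.4.1 can be sketched as below. The alias table is a small illustrative excerpt written for this example (HER2 is a common alias of ERBB2, PD1 of PDCD1, and PDL1 of CD274); a real workflow would draw on a complete nomenclature resource, and the function name is hypothetical.

```python
# Illustrative alias -> standard symbol excerpt (assumption: not a full table).
ALIASES = {"HER2": "ERBB2", "PD1": "PDCD1", "PDL1": "CD274"}

def standardize(genes, aliases=ALIASES):
    """Map aliases to standard symbols and drop duplicates, keeping first-seen order."""
    seen, out = set(), []
    for gene in genes:
        symbol = aliases.get(gene.strip(), gene.strip())
        if symbol not in seen:
            seen.add(symbol)
            out.append(symbol)
    return out

raw = ["ERBB2", "BRAF", "KIT", "HER2", "BRAF", "KIT ", "PDL1", "CD274"]
clean = standardize(raw)
```

Here HER2 collapses into ERBB2 and PDL1 into CD274, and the repeated BRAF and KIT entries are removed, leaving one target per gene for region selection.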
Specifically, in step 15.5, the combined process includes the following steps:
step 15.5.1) based on the gene detection panel designed by the method of the invention, obtaining the quantitative value of the pancreatic ductal carcinoma related gene abnormal regulation relation and/or gene variation marker combination, and inputting the quantitative value into a pancreatic ductal carcinoma comprehensive state scoring computing system;
step 15.5.2) inputting the obtained clinical information related to the pancreatic ductal carcinoma state and the quantitative values of the inspection and pathological indexes into a pancreatic ductal carcinoma comprehensive state scoring computing system;
step 15.5.3) combines the hardware, software and/or online tools involved in steps 15.5.1) and 15.5.2) into a set of matching process, so that the user can complete detection, information input, calculation evaluation and result acquisition according to the requirements.
Specifically, in step 15.2, the pancreatic ductal carcinoma-related gene abnormal regulation relationship and gene variation marker combination includes the following 86 genes: AKT1, BRCA2, ERBB2, IDH1, MAP2K2, MTOR, PMS1, APC, CDKN2A, FBXW7, JAK2, MET, NRAS, PMS2, AR, CFTR, FGFR1, FGFR2, FGFR3, KIT, MLH1, NTRK1, PTEN, BRAF, CTNNB1, KRAS, MSH2, MSH6, PIK3CA, PIK3R1, RET, ROS1, BRCA1, EGFR, MAP2K1, SMARCA4, TP53, TSC1, TSC2, SMARCB1, SMAD4, BRAF, HER2, KIT, PDGFRA, SDHA, SDHB, SDHC, SDHD, NF1, STK11; PD1, PDL1, PDL2, CTLA4, TIGIT, TIM3, LAG3, IFNG, CCL2, GZMA, PRF1, CXCL8, CXCL9, CXCL10, TGFB1, SOX10, SERPINB9, CD8A, CD8B, GZMA, GZMB, PRF1, CCL5, CD27, CD274, CMKLR1, CXCR6, NKG7, IDO1, PSMB10, STAT1, STK11, HLA-DQA1, HLA-DRB1, HLA-E; any one of these genes or a combination thereof may be used. Specifically, the combination of all 86 genes can be used for survival prognosis evaluation; KRAS/TP53/CDKN2A and all gene copy number variations are used for surgical protocol effect prediction; all gene copy number variations are used for chemotherapy regimen effect prediction; PD1, PDL1, PDL2, CTLA4, TIGIT, TIM3, LAG3, IFNG, CCL2, GZMA, PRF1, CXCL8, CXCL9, CXCL10, TGFB1, SOX10, SERPINB9, CD8A, CD8B, GZMA, GZMB, PRF1, CCL5, CD27, CD274, CMKLR1, CXCR6, NKG7, IDO1, PSMB10, STAT1, STK11, HLA-DQA1, HLA-DRB1, HLA-E are used for assessment of the immune infiltration and immune cytotoxicity status of pancreatic ductal carcinoma patients and for immune checkpoint inhibitor therapy effect prediction; AKT1, BRCA2, ERBB2, IDH1, MAP2K2, MTOR, PMS1, APC, CDKN2A, FBXW7, JAK2, MET, NRAS, PMS2, AR, CFTR, FGFR1, FGFR2, FGFR3, KIT, MLH1, NTRK1, PTEN, BRAF, CTNNB1, KRAS, MSH2, MSH6, PIK3CA, PIK3R1, RET, ROS1, BRCA1, EGFR, MAP2K1, SMARCA4, TP53, TSC1, TSC2, SMARCB1, SMAD4, BRAF, HER2, KIT, PDGFRA, SDHA, SDHB, SDHC, SDHD, NF1, STK11 are used for prediction of potential targeted therapy effect.
Specifically, the clinical information and the test and pathological indexes related to pancreatic ductal carcinoma in step 15.3 mainly include the patient's age, sex, blood biochemistry and immunoassay indexes, surgical status (yes/no), pathological grade (I-IV), and patient-derived xenograft (PDX) animal model establishment status (fast/slow/none). Together with the 86 genes of the pancreatic ductal carcinoma-related gene abnormal regulation relationship and gene variation marker combination of the present invention, they form a pancreatic ductal carcinoma multi-marker combination, which is used to predict prognosis and the effects of chemotherapy, immunotherapy, and potential targeted therapy, and assists clinical decision making. Specifically, all 86 genes can be used for survival prognosis evaluation, where a low gene score indicates a good prognosis for the case; KRAS/TP53/CDKN2A and all gene copy number variations are used for surgical protocol effect prediction, where cases classified as low risk are more likely to benefit from R0 surgical treatment; the copy number variations of all 86 genes are used for chemotherapy regimen effect prediction, where cases with a higher copy variation score are more likely to benefit from gemcitabine treatment and cases with a lower copy variation score are more likely to benefit from irinotecan treatment; PD1, PDL1, PDL2, CTLA4, TIGIT, TIM3, LAG3, IFNG, CCL2, GZMA, PRF1, CXCL8, CXCL9, CXCL10, TGFB1, SOX10, SERPINB9, CD8A, CD8B, GZMA, GZMB, PRF1, CCL5, CD27, CD274, CMKLR1, CXCR6, NKG7, IDO1, PSMB10, STAT1, STK11, HLA-DQA1, HLA-DRB1, HLA-E are used for assessment of the immune infiltration and immune cytotoxicity status of pancreatic ductal carcinoma patients and for prediction of immune checkpoint inhibitor therapeutic effect, where cases scoring high on these genes show high immune cell infiltration, strong immune cytotoxicity, high immune checkpoint levels, and a high degree of immune activation, and are more likely to benefit from immune checkpoint inhibitors; AKT1, BRCA2, ERBB2, IDH1, MAP2K2, MTOR, PMS1, APC, CDKN2A, FBXW7, JAK2, MET, NRAS, PMS2, AR, CFTR, FGFR1, FGFR2, FGFR3, KIT, MLH1, NTRK1, PTEN, BRAF, CTNNB1, KRAS, MSH2, MSH6, PIK3CA, PIK3R1, RET, ROS1, BRCA1, EGFR, MAP2K1, SMARCA4, TP53, TSC1, TSC2, SMARCB1, SMAD4, BRAF, HER2, KIT, PDGFRA, SDHA, SDHB, SDHC, SDHD, NF1, STK11 are used for prediction of potential targeted therapy effect, where a patient carrying a mutation in one of these genes may benefit from the corresponding targeted therapy. The PDX model establishment status can be used to predict the effect of a surgical plan, and cases in which the model could not be established are more likely to benefit from surgery.
Specifically, the probes and/or primers designed in step 15.4 for the 86 target gene target regions used in the comprehensive pancreatic ductal carcinoma state evaluation cover not less than 95% of the target regions of the target genes and not less than 97% of the important gene mutation sites therein; the 86 target gene target regions can be used as a whole detection panel (for prognosis status evaluation and chemotherapy status prediction), and can also be divided into 3 detection panels according to specific use: a surgical status evaluation detection panel (KRAS/TP53/CDKN2A and all gene copy number variations), an immunotherapy status evaluation detection panel (PD1, PDL1, PDL2, CTLA4, TIGIT, TIM3, LAG3, IFNG, CCL2, GZMA, PRF1, CXCL8, CXCL9, CXCL10, TGFB1, SOX10, SERPINB9, CD8A, CD8B, GZMA, GZMB, PRF1, CCL5, CD27, CD274, CMKLR1, CXCR6, NKG7, IDO1, PSMB10, STAT1, STK11, HLA-DQA1, HLA-DRB1, HLA-E), and a targeted therapy status evaluation detection panel (AKT1, BRCA2, ERBB2, IDH1, MAP2K2, MTOR, PMS1, APC, CDKN2A, FBXW7, JAK2, MET, NRAS, PMS2, AR, CFTR, FGFR1, FGFR2, FGFR3, KIT, MLH1, NTRK1, PTEN, BRAF, CTNNB1, KRAS, MSH2, MSH6, PIK3CA, PIK3R1, RET, ROS1, BRCA1, EGFR, MAP2K1, SMARCA4, TP53, TSC1, TSC2, SMARCB1, SMAD4, BRAF, HER2, KIT, PDGFRA, SDHA, SDHB, SDHC, SDHD, NF1, STK11).
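The coverage requirement stated above (not less than 95% of a target region) can be checked with a short interval calculation. The sketch below assumes half-open genomic coordinates on a single chromosome; the coordinates, probe layout, and function names are made up for illustration.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching [start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def coverage_fraction(target, probes):
    """Fraction of the target [start, end) covered by at least one probe."""
    t_start, t_end = target
    covered = 0
    for start, end in merge_intervals(probes):
        covered += max(0, min(end, t_end) - max(start, t_start))
    return covered / (t_end - t_start)

# Hypothetical 1000 bp target region and four probes.
target = (1000, 2000)
probes = [(950, 1120), (1100, 1220), (1200, 1500), (1520, 2050)]
fraction = coverage_fraction(target, probes)
meets_requirement = fraction >= 0.95
```

In this example the probes leave a 20 bp gap (1500 to 1520), giving 98% coverage, which satisfies the 95% threshold. The same function applied to a list of important mutation sites treated as 1 bp targets would give the site-level coverage.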
In the invention, the age, sex, pathological grade, blood biochemical and immune indexes (such as CA199 serum concentration and the like), operation conditions R0-R2 and PDX modeling conditions of the pancreatic ductal carcinoma patient are used as supplementary clinical information, and can also be included in the input range of a scoring model.
In the invention, step 15.2 combines data-driven and prior-knowledge-driven quantitative screening strategies to screen a high-frequency DNA variation marker combination related to pancreatic ductal carcinoma states such as progression stage, prognosis survival, and treatment scheme sensitivity; the marker combination can include gene variation, clinical pathology, PDX modeling data, and similar information, and is accurate, reliable, and mechanistically interpretable. In the marker combination optimization stage, successive incremental iteration based on a greedy algorithm or evolutionary iteration based on a genetic algorithm is adopted flexibly as needed to improve the effect.
In the present invention, the gene set and model system described in step 15.3 can produce a comprehensive status score for patients with pancreatic ductal carcinoma, and the score is highly correlated with the prognosis survival and the treatment effect (including but not limited to surgical paradigm, chemotherapy, targeted therapy, and immune checkpoint inhibitors) of these patients. All input features contribute to the survival prognosis, but they carry different weights in predicting the efficacy of treatment regimens: the contribution of KRAS/TP53/CDKN2A and all gene copy number variations is focused on surgical protocol efficacy prediction; the contribution of all gene copy number variations is focused on efficacy prediction for chemotherapy regimens, in particular gemcitabine and irinotecan; PD1, PDL1, PDL2, CTLA4, TIGIT, TIM3, LAG3, IFNG, CCL2, GZMA, PRF1, CXCL8, CXCL9, CXCL10, TGFB1, SOX10, SERPINB9, CD8A, CD8B, GZMA, GZMB, PRF1, CCL5, CD27, CD274, CMKLR1, CXCR6, NKG7, IDO1, PSMB10, STAT1, STK11, HLA-DQA1, HLA-DRB1, HLA-E focus on the assessment of the immune infiltration and immune cytotoxicity status of pancreatic ductal carcinoma patients and contribute more to predicting the effect of immune checkpoint inhibitor treatment regimens; in addition, for targeted drugs that may be used in pancreatic ductal carcinoma therapy, some of which are in clinical trials, mutations in AKT1, BRCA2, ERBB2, IDH1, MAP2K2, MTOR, PMS1, APC, CDKN2A, FBXW7, JAK2, MET, NRAS, PMS2, AR, CFTR, FGFR1, FGFR2, FGFR3, KIT, MLH1, NTRK1, PTEN, BRAF, CTNNB1, KRAS, MSH2, MSH6, PIK3CA, PIK3R1, RET, ROS1, BRCA1, EGFR, MAP2K1, SMARCA4, TP53, TSC1, TSC2, SMARCB1, SMAD4, BRAF, HER2, KIT, PDGFRA, SDHA, SDHB, SDHC, SDHD, NF1, STK11 may provide a valuable reference.
Not only clinical information such as the age, sex, pathological grade, blood biochemical and immune indexes (such as CA199 serum concentration and the like) and operation conditions R0-R2 of pancreatic ductal carcinoma patients, but also the PDX modeling conditions of cases contribute to the prognosis effect prediction of the cases.
In the invention, the combined flow of the panel design and evaluation system in the steps 15.4 and 15.5 can realize high capture efficiency of probe design and high coverage of a target region, and the panel and the scoring module can be flexibly adjusted according to requirements, so that the panel and the scoring module can be used for evaluating the comprehensive state of a pancreatic ductal carcinoma patient and assisting clinical decisions including but not limited to surgical schemes, auxiliary chemotherapy schemes and targeted therapy scheme selection, immunotherapy reference, prognosis state evaluation and the like. An example of flexible adjustment of the Panel and scoring module is as follows, 43 genes were selected, including AKT1, BRCA2, ERBB2, IDH1, MAP2K2, MTOR, PMS1, APC, CDKN2A, FBXW7, JAK2, MET, NRAS, PMS2, AR, CFTR, FGFR2, KIT, MLH 2, NTRK 2, PTEN, BRAF, CTNNB 2, KRAS, MSH2, PIK3R 2, RET, ROS 2, BRCA2, EGFR, MAP2K2, SMARCA 2, STK 2, TSC2, smarcr 2, SMARCA 2, etc. to form a small surgical status scoring model and a relevant cancer-assisted surgical procedure. The above ideas are also suitable for independent extraction and construction of the state evaluation processes such as pancreatic ductal carcinoma prognosis and immunosuppressant treatment schemes, so that the panel is reduced, and the detection cost is reduced.
The invention provides an application of the method for constructing a complex disease state evaluation model based on high-throughput sequencing data and clinical phenotype in pan-tumor targeted drug susceptibility state evaluation; the application uses the DNA mutation and RNA expression information of the corresponding genes, is suitable for pan-tumor targeted drug susceptibility evaluation, in particular for evaluating the treatment status of targeted drugs acting on the three pathways TGFbeta, MAPK, and PI3K, and comprises the following steps:
step 16.1) acquiring pan-tumor cancer case information including high-throughput sequencing data and clinical information, classifying according to the pan-tumor case states, performing pairing and sorting, and determining a mining mode;
step 16.2) constructing a gene abnormality regulation relation marker combination related to pan-tumor targeted drug sensitivity;
step 16.3) screening clinical information and test and pathological indexes related to pan-tumor targeted drug susceptibility; integrating them with the pan-tumor targeted drug susceptibility-related gene abnormal regulation relationship marker combination obtained in step 16.2 and optimizing the result into a pan-tumor targeted drug susceptibility multi-marker combination, which is used to construct a pan-tumor targeted drug susceptibility comprehensive state scoring model that is developed and packaged into a pan-tumor targeted drug susceptibility comprehensive state scoring computing system;
step 16.4) designing target gene target region-related probes and/or primers for pan-tumor targeted drug susceptibility comprehensive state evaluation based on the pan-tumor targeted drug susceptibility-related gene abnormal regulation relationship marker combination obtained in step 16.2, and using them as the pan-tumor targeted drug susceptibility comprehensive state evaluation gene detection panel;
step 16.5) constructing a combined workflow of the pan-tumor targeted drug susceptibility comprehensive state evaluation gene detection panel and the comprehensive state scoring computing system, so that a user can complete detection, information input, calculation and evaluation, and result acquisition by following the workflow.
Specifically, in step 16.1, classifying and sorting the pan-tumor targeted drug-sensitive case information:
step 16.1.1) dividing the pan-tumor targeted drug-sensitive case information into transcriptome data, exome/genomic data, and clinical information;
step 16.1.2) classifying the pan-tumor targeted drug-sensitive case information according to disease states and carrying out pairing and sorting.
Specifically, in step 16.2, a pan-tumor targeted drug susceptibility marker combination is constructed, and combination optimization screening is performed by using successive iteration based on a greedy algorithm and/or evolutionary iteration based on a genetic algorithm:
if the pan-tumor targeted drug-sensitive case information only relates to transcriptome data and clinical information, executing step 16.2.1) to perform marker mining based on the transcriptome data and the clinical information to construct a pan-tumor targeted drug-sensitive related gene abnormal regulation relation marker combination;
if the pan-tumor targeted drug susceptibility case information only relates to the exome/genomic data and the clinical information, executing step 16.2.2) to perform marker mining based on the exome/genomic data and the clinical information to construct a pan-tumor targeted drug susceptibility related gene variation marker combination;
if the information of the pan-tumor targeted drug-sensitive case contains transcriptome data, exome/genome data and clinical information at the same time, executing the step 16.2.3) to perform marker mining based on the transcriptome data, the exome/genome data and the clinical information to construct a pan-tumor targeted drug-sensitive related gene abnormal regulation relationship and a gene variation marker combination.
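The three-way branching on available data types can be expressed compactly as below. The function name and the returned step labels are illustrative, and the sketch assumes that clinical information is always present, as in all three branches of the text.

```python
def choose_mining_step(has_transcriptome: bool, has_exome_or_genome: bool) -> str:
    """Pick the marker-mining sub-step of step 16.2 from the available data types.
    Clinical information is assumed to accompany every case."""
    if has_transcriptome and has_exome_or_genome:
        return "16.2.3"  # abnormal regulation relationships plus gene variations
    if has_transcriptome:
        return "16.2.1"  # abnormal regulation relationship markers only
    if has_exome_or_genome:
        return "16.2.2"  # gene variation markers only
    raise ValueError("at least one sequencing data type is required")

step_for_rna_only = choose_mining_step(True, False)
step_for_both = choose_mining_step(True, True)
```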
In particular, said step 16.2.1) comprises in particular the following sub-steps:
step 16.2.1.1) constructing a reference gene regulation network;
step 16.2.1.2) constructing a condition-specific gene regulation network based on the transcriptome data under the specific disease state and the TF-target relationship of the reference gene regulation network;
step 16.2.1.3) quantifying the gene regulation intensity in the condition-specific gene regulation network and the difference in regulation intensity between networks;
step 16.2.1.4) screening gene abnormal regulation and control relations among condition-specific gene regulation and control networks under different disease states;
step 16.2.1.5) constructing a pan-tumor targeted drug susceptibility-related gene abnormal regulation relationship marker combination based on the gene abnormal regulation relationships obtained in step 16.2.1.4).
Specifically, in step 16.2.1.2), a machine-learning-based feature selection algorithm is adopted, including Boruta, Naïve Bayes, NMF, and univariate linear regression, with acceleration achieved through heterogeneous computing or parallelization, to screen the TFs that contribute significantly to the TF-target relationships in a given disease state and thereby form the condition-specific gene regulation network, that is, the gene regulation network of a specific disease state.
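Of the listed feature selection options, univariate linear regression is the simplest to sketch. The example below regresses a target gene's expression on each candidate TF separately and keeps the TFs whose fit exceeds a threshold. Using the coefficient of determination with a fixed cut-off of 0.5 in place of a significance test is an assumption made for brevity, and the names and expression values are hypothetical.

```python
def r_squared(x, y):
    """Coefficient of determination of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no regulatory signal
    return (sxy * sxy) / (sxx * syy)

def screen_tfs(candidate_tfs, target_expr, min_r2=0.5):
    """Keep the TFs from the reference network whose expression explains the
    target's expression in this condition (hypothetical threshold)."""
    return sorted(
        tf for tf, expr in candidate_tfs.items()
        if r_squared(expr, target_expr) >= min_r2
    )

# Made-up expression values across five samples of one disease state.
target_expr = [1.0, 2.0, 3.0, 4.0, 5.0]
candidate_tfs = {
    "TF_A": [2.0, 4.0, 6.0, 8.0, 10.0],  # tracks the target closely
    "TF_B": [5.0, 1.0, 4.0, 2.0, 3.0],   # weakly related
}
kept = screen_tfs(candidate_tfs, target_expr)
```

Because each TF-target pair is scored independently, the loop is straightforward to parallelize, which is in line with the acceleration mentioned in the text.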
Specifically, in step 16.2.1.3), a multivariate linear regression model is used to quantify the gene regulation intensity in the condition-specific gene regulation network;
performing regression by a De-biased LASSO method, solving to obtain the regulation and control strength and the confidence interval of each gene regulation and control relationship, and judging whether the regulation and control difference is obvious or not by comparing whether the confidence intervals of the same regulation and control relationship in different condition specific gene regulation and control networks are overlapped or not; or the intensity mean value change of the same regulation relation in the specific gene regulation and control network under different conditions is compared, the confidence interval does not need to be calculated, and the regulation and control difference is directly quantified.
Specifically, in step 16.2.1.4), three factors related to gene regulation are integrated to screen the gene abnormal regulation relationships among the condition-specific gene regulation networks in different disease states, namely: the gene regulation intensity changes significantly, the expression level of the regulated target gene changes significantly, and the direction of change of the TF-to-target regulation intensity is consistent with the direction of change of the target expression level; meanwhile, the screened gene abnormal regulation relationships are ranked according to the degree of difference in regulation intensity among the different disease states.
Specifically, in step 16.2.1.5), constructing a pan-tumor targeted drug sensitivity related gene abnormal regulation relation marker combination by successive incremental iteration based on a greedy algorithm and/or evolutionary iteration based on a genetic algorithm; for the marker combination, C-index is used as an index to measure the prediction effect of the marker combination on disease prognosis states, or AUC is used as an index to measure the prediction effect of the marker combination on treatment scheme benefit states.
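The C-index used to judge prognosis prediction can be computed as in the following sketch, which implements the usual pairwise definition for right-censored survival data. The function name and the toy follow-up data are illustrative assumptions.

```python
def concordance_index(times, events, risk_scores):
    """Pairwise C-index for right-censored data.
    A pair (i, j) is comparable when case i has an observed event
    (events[i] == 1) before time j; it is concordant when the case
    with the earlier event has the higher risk score."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy cohort: follow-up time in months, event flag (0 = censored), model risk.
times = [5, 8, 12, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.6, 0.1]
c_index = concordance_index(times, events, risk)
```

With these values there are five comparable pairs, four of which are ordered correctly by the risk score, so the C-index is 0.8. A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering.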
In particular, said step 16.2.2) comprises in particular the following sub-steps:
step 16.2.2.1) identifying genetic variations associated with pan-tumor targeted drug susceptibility;
step 16.2.2.2) quantitatively screening the important gene variations related to the pan-tumor targeted drug susceptibility state using data-driven and/or prior-knowledge-driven approaches;
step 16.2.2.3) constructing a pan-tumor targeted drug-sensitive related gene variation marker combination based on the important gene variation related to the pan-tumor targeted drug-sensitive state obtained in step 16.2.2.2).
Specifically, in step 16.2.2.2), the data-driven quantitative filtering and screening involves calculating and ranking somatic gene variation frequencies and identifying high-frequency variant genes, wherein genes with a variation frequency of 5% or more are passed on to the prior-knowledge filtering; the prior-knowledge filtering and screening covers the pan-tumor targeted drug susceptibility-related genes found in application standards, clinical treatment guidelines, drug labels, general knowledge bases, and literature reports.
Specifically, in step 16.2.2.3), constructing a pan-tumor targeted drug-sensitivity-related gene variation marker combination by successive incremental iteration based on a greedy algorithm and/or evolutionary iteration based on a genetic algorithm; for the marker combination, C-index is used as an index to measure the prediction effect of the marker combination on disease prognosis states, or AUC is used as an index to measure the prediction effect of the marker combination on treatment scheme benefit states.
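Evolutionary iteration based on a genetic algorithm can be sketched as below. The sketch encodes a marker combination as a 0/1 mask, uses the AUC of a simple summed score as fitness, and applies elitism, one-point crossover, and bit-flip mutation. All of these design choices, the parameter values, and the toy mutation data are assumptions for illustration and are not prescribed by the text.

```python
import random

def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fitness(mask, data, labels):
    # Hypothetical score per case: number of selected markers that are mutated.
    chosen = [row for bit, row in zip(mask, data) if bit]
    if not chosen:
        return 0.0
    return auc([sum(col) for col in zip(*chosen)], labels)

def evolve(data, labels, pop_size=20, generations=30, mutation_rate=0.1, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(data)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda m: fitness(m, data, labels), reverse=True)
        elite = pop[: pop_size // 2]          # keep the better half unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)         # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = elite + children
    best = max(pop, key=lambda m: fitness(m, data, labels))
    return best, fitness(best, data, labels)

# Toy data: 4 candidate variation markers (rows) across 6 cases (columns).
labels = [1, 1, 1, 0, 0, 0]
data = [
    [1, 1, 0, 0, 0, 1],
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
best_mask, best_fitness = evolve(data, labels)
```

Because the better half of each generation is carried over unchanged, the best fitness never decreases from one generation to the next. For a small candidate set the greedy procedure is usually sufficient; the evolutionary search is more relevant when markers interact.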
In particular, said step 16.2.3) comprises in particular the following sub-steps:
step 16.2.3.1), screening gene abnormal regulation and control relation related to disease state by using the steps 16.2.1.1-16.2.1.4 and mining important gene variation related to disease state by using the steps 16.2.2.1-16.2.2.2 for pan-tumor targeted drug susceptibility data set simultaneously having transcriptome data and exome/genome data to respectively obtain the gene abnormal regulation and control relation and the important gene variation related to pan-tumor targeted drug susceptibility;
step 16.2.3.2) then adopting step 16.2.1.5 and step 16.2.2.3, integrating RNA and DNA information based on successive incremental iteration of a greedy algorithm or evolutionary iteration based on a genetic algorithm, and constructing a pan-tumor targeted drug sensitivity-related gene abnormality regulation relation and gene variation marker combination.
Specifically, in the step 16.3, the screening of clinical information and examination and pathological indexes related to pan-tumor targeted drug sensitivity comprises the following steps:
step 16.3.1) aiming at the known prior knowledge, screening clinical information and inspection and pathological indexes related to the pan-tumor targeted drug susceptibility state;
step 16.3.2) screening clinical information and inspection and pathological indexes related to the pan-tumor targeted drug susceptibility state based on the case information in the pan-tumor targeted drug susceptibility queue.
Specifically, in step 16.3, the pan-tumor targeted drug susceptibility multi-marker combination is obtained by the following method:
the obtained pan-tumor targeted drug susceptibility-related gene abnormal regulation relationship and/or gene variation marker combination is integrated with the clinical information and the test and pathological indexes related to the pan-tumor targeted drug susceptibility state screened in steps 16.3.1 and 16.3.2, and is thereby optimized into a pan-tumor targeted drug susceptibility multi-marker combination.
Specifically, in the step 16.4, the design of the gene detection panel comprises the following steps:
step 16.4.1) obtaining abnormal regulation relation and/or gene variation marker combination of pan-tumor targeted drug sensitivity related genes based on screening, finally incorporating the abnormal regulation relation and/or gene variation marker combination into a gene set of a pan-tumor targeted drug sensitivity comprehensive state scoring method, combing gene related information in the gene set, removing redundancy, and determining a standard gene name;
step 16.4.2) aiming at the gene combed in the step 16.4.1), selecting a target gene target region for pan-tumor targeted drug sensitivity detection design, and using the target gene target region for probe design or primer design;
step 16.4.3) designing corresponding probe and/or primer sequences based on the target gene target region in step 16.4.2), and recording important annotations;
step 16.4.4) aiming at the target gene target region in the step 16.4.2), referring to a data set of probes and/or primers which can be designed in the human genome, and carrying out optimization design on the target gene target region so that the probes and/or the primers can be uniformly captured and cover the target region;
step 16.4.5) comparing the target gene target region related probes and/or primer design regions in steps 16.4.3 and 16.4.4 to obtain target gene target region related probes and/or primer design schemes with optimal coverage;
step 16.4.6) based on the target gene target region-related probes and/or primers designed in step 16.4.5, preparing a gene detection panel capable of fully performing the pan-tumor targeted drug susceptibility status evaluation.
Specifically, in step 16.5, the combined process includes the following steps:
step 16.5.1) based on the gene detection panel designed by the method of the invention, obtaining the quantitative value of the abnormal regulation relation of the gene related to the pan-tumor targeted drug susceptibility and/or the gene variation marker combination, and inputting the quantitative value into a pan-tumor targeted drug susceptibility comprehensive state scoring computing system;
step 16.5.2), inputting the obtained clinical information related to the pan-tumor targeted drug susceptibility state and the quantitative values of the inspection and pathological indexes into a pan-tumor targeted drug susceptibility comprehensive state scoring computing system;
step 16.5.3) combines the hardware, software and/or online tools involved in steps 16.5.1) and 16.5.2) into a set of matching process, so that the user can complete detection, information input, calculation evaluation and result acquisition according to the requirements.
Specifically, the pan-tumor targeted drug susceptibility related gene abnormal regulation relationship marker combination in step 16.2 is particularly suitable for 11 targeted drug treatment regimens related to the TGFbeta pathway, the MAPK pathway and the PI3K pathway, including binimetinib, BKM120, BYL719 + cetuximab + encorafenib, BYL719 + encorafenib, BYL719 + LJM716, cetuximab + encorafenib, CLR457 and encorafenib, and the specific gene set includes the following 24 genes: AXIN1, JUNB, MYC, SMAD5, SMAD4, TGIF2, UBB, ATF3, BMPR2, JUND, KLF10, NR2C2, PPP1CB, SKIL, SMURF1, SP1, TP53, PITX2, TFDP2, E2F4, SMAD1, KLF6, SMAD3, KLF11. Meanwhile, for TGFbeta pathway related targeted drugs for gastrointestinal tumors, the four genes BMPR2, MYC, TFDP2 and TGIF2 can be used as a gene abnormal regulation relationship marker combination.
Specifically, in step 16.3, the method for constructing the pan-tumor targeted drug susceptibility multi-marker combination optimizes the combination using successive incremental iteration based on a greedy algorithm or evolutionary iteration based on a genetic algorithm; a pan-tumor targeted drug susceptibility comprehensive state scoring model is constructed using a machine learning classification algorithm such as a decision tree, a random forest or an SVM, and is developed and packaged into a pan-tumor targeted drug susceptibility comprehensive state scoring computing system for predicting the effect of targeted medication in pan-tumor cases.
Specifically, the clinical information and the examination and pathological indexes related to the targeted medication of pan-tumor patients in step 16.3 mainly comprise the patient's age, sex, blood biochemistry and immunodetection indexes, surgery status (yes/no), pathological grade (degree of differentiation/TNM stage), metastasis, treatment and similar clinical information. Together with the 24 genes of the pan-tumor targeted drug susceptibility related gene abnormal regulation relationship marker combination, they form the pan-tumor targeted drug susceptibility multi-marker combination, which is used for predicting the effect of pan-tumor targeted medication, in particular the treatment effect of TGFbeta-MAPK-PI3K three-pathway targeted medication, and for assisting clinical decisions. Specifically, a pan-tumor targeted drug susceptibility comprehensive state scoring system can be constructed and developed based on the multi-marker combination, and is used for predicting the treatment benefit of the 6 single-drug regimens (including binimetinib, BKM120, BYL719, cetuximab, CLR457 and encorafenib) and 5 combined regimens (including BYL719 + cetuximab, BYL719 + cetuximab + encorafenib, BYL719 + LJM716 and cetuximab + encorafenib) related to TGFbeta-MAPK-PI3K in pan-tumor cases, and for assisting clinical decision making.
Specifically, the probes and/or primers related to the target regions of the 24 target genes for pan-tumor targeted drug susceptibility state evaluation designed in step 16.4 cover the target regions of the target genes by not less than 95%, and cover the important gene mutation sites therein by not less than 97%.
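The two coverage thresholds above (not less than 95% of the target region bases and not less than 97% of the important mutation sites) can be checked with simple interval arithmetic. The following is a minimal sketch under stated assumptions: intervals are half-open coordinates on a single chromosome, and the function names `covered_fraction`, `site_fraction` and `panel_passes` are hypothetical helpers, not part of the patented system.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome (assumption)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent intervals so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(targets: List[Interval], probes: List[Interval]) -> float:
    """Fraction of target bases overlapped by at least one probe."""
    targets_m = merge_intervals(targets)
    probes_m = merge_intervals(probes)
    total = sum(end - start for start, end in targets_m)
    covered = 0
    for t_start, t_end in targets_m:
        for p_start, p_end in probes_m:
            covered += max(0, min(t_end, p_end) - max(t_start, p_start))
    return covered / total if total else 0.0


def site_fraction(sites: List[int], probes: List[Interval]) -> float:
    """Fraction of important mutation sites lying inside some probe."""
    probes_m = merge_intervals(probes)
    hits = sum(any(s <= pos < e for s, e in probes_m) for pos in sites)
    return hits / len(sites) if sites else 0.0


def panel_passes(targets: List[Interval], probes: List[Interval], sites: List[int],
                 min_region: float = 0.95, min_site: float = 0.97) -> bool:
    """Apply the >=95% region and >=97% mutation-site coverage thresholds."""
    return (covered_fraction(targets, probes) >= min_region
            and site_fraction(sites, probes) >= min_site)
```

A real design would run this per chromosome and per gene; the sketch only illustrates how the quality-control thresholds translate into a pass/fail decision.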
The pan-tumor targeted medication regimen state evaluation method and its application have the advantage that the data collection and arrangement in step 16.1 fully covers the published pan-tumor medication data sets and makes full use of patient cohorts and animal experimental data, including but not limited to TCGA, GEO and NIBR PDXE.
In the pan-tumor targeted medication regimen state evaluation method and its application, the method in step 16.2 integrates three factors related to gene regulation to screen gene abnormal regulation relationships between condition-specific GRNs for pan-tumor adjuvant medication: the TF-target regulation strength changes significantly, the target expression level changes significantly, and the direction of change of the TF's regulation strength is consistent with the direction of change of the target expression level. Meanwhile, the screened gene abnormal regulation relationships can be ranked by the degree of difference in regulation strength; and transcriptome related markers and combinations are mined based on their ability to predict the effect of all collected medication regimens (including but not limited to targeted drugs used alone and targeted drugs used in combination), the marker combinations being accurate, reliable and mechanistically interpretable. Meanwhile, a quantitative screening strategy driven by both data and prior knowledge is adopted, and in the marker combination optimization stage, successive incremental iteration based on a greedy algorithm or evolutionary iteration based on a genetic algorithm is adopted flexibly as needed, which improves the effect.
The method can construct, based on biological pathways, a gene set for evaluating the effect of pan-tumor targeted drug treatment regimens, and can produce a comprehensive state score for the adjuvant drug treatment of pan-tumor patients, the score being closely related to the treatment effect of the pan-tumor targeted drugs. Here, the 11 targeted drug regimens enriched in the TGFbeta, MAPK and PI3K pathways include binimetinib, BKM120, BYL719 + cetuximab + encorafenib, BYL719 + LJM716, cetuximab + encorafenib, CLR457 and encorafenib, and the gene set used in the evaluation model includes 24 genes, namely AXIN1, JUNB, MYC, SMAD5, SMAD4, TGIF2, UBB, ATF3, BMPR2, JUND, KLF10, NR2C2, PPP1CB, SKIL, SMURF1, SP1, TP53, PITX2, TFDP2, E2F4, SMAD1, KLF6, SMAD3 and KLF11.
In the pan-tumor targeted medication regimen state evaluation method and its application, the combined flow of panel design and evaluation system in steps 16.4 and 16.5 can achieve higher probe capture efficiency and higher target region coverage, and the panel and the scoring module can be adjusted flexibly as required, so that the comprehensive state scoring of the adjuvant medication and treatment of pan-tumor patients is realized, clinical decisions are effectively assisted, and the treatment effect is improved. An example of flexible adjustment of the panel and scoring module is a small panel composed of the 4 genes BMPR2, MYC, TFDP2 and TGIF2, whose expression levels can be detected by PCR and combined with a corresponding scoring model to evaluate the Cetuximab treatment status of gastrointestinal tract related tumors. The same idea also applies, for other tumor types and medication regimens, to the customized extraction of characteristic genes and clinical information, to reducing the size of the panel and to reducing detection cost.
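The evolutionary iteration based on a genetic algorithm mentioned above can be illustrated as follows. This is a minimal sketch, not the patented procedure: the bit-mask encoding, one-point crossover, keep-the-best-half elitism, the default parameters, and the caller-supplied `fitness` function (for example a cross-validated AUC of the marker subset) are all assumptions made for illustration.

```python
import random
from typing import Callable, FrozenSet, List, Sequence


def genetic_select(
    candidates: Sequence[str],
    fitness: Callable[[FrozenSet[str]], float],
    pop_size: int = 20,
    generations: int = 30,
    mutation_rate: float = 0.1,
    seed: int = 0,
) -> FrozenSet[str]:
    """Search for a high-fitness marker subset (needs >= 2 candidates, pop_size >= 4)."""
    rng = random.Random(seed)
    n = len(candidates)

    def repair(mask: List[bool]) -> List[bool]:
        # An empty marker combination is meaningless, so force one marker in.
        if not any(mask):
            mask[rng.randrange(n)] = True
        return mask

    def decode(mask: List[bool]) -> FrozenSet[str]:
        return frozenset(g for g, keep in zip(candidates, mask) if keep)

    def score(mask: List[bool]) -> float:
        return fitness(decode(mask))

    population = [repair([rng.random() < 0.5 for _ in range(n)])
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        elite = population[: pop_size // 2]      # best half survives unchanged
        children: List[List[bool]] = []
        while len(elite) + len(children) < pop_size:
            parent_a, parent_b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)            # one-point crossover
            child = parent_a[:cut] + parent_b[cut:]
            child = [(not bit) if rng.random() < mutation_rate else bit
                     for bit in child]           # bit-flip mutation
            children.append(repair(child))
        population = elite + children
    return decode(max(population, key=score))
```

Because the best half of each generation is carried over unchanged, the best fitness in the population never decreases from one generation to the next.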
The invention has the advantages that a gene regulation and control network with specific conditions is constructed based on transcriptome expression data, so that the abnormal regulation and control relationship of genes can be identified; and contains more than one identification strategy; a marker can be constructed by the gene abnormal regulation and control relation; the construction process comprises two screening strategies, namely successive increase iteration based on a greedy algorithm and evolution iteration based on a genetic algorithm, and finally the marker with both accuracy and mechanism explanatory property can be constructed and used for prognosis evaluation of complex diseases, prediction of treatment effects, auxiliary decision of treatment schemes and the like.
The beneficial effects of the invention also include identification of important gene variation related to complex diseases; and has different identification strategies, such as data-driven quantitative screening and knowledge base filtering screening and the combination thereof; markers can be constructed by important variant genes on the DNA layer related to the complex diseases; the construction process comprises two screening strategies, namely successive increase iteration based on a greedy algorithm and evolution iteration based on a genetic algorithm, and the finally constructed marker can be used for complex disease prognosis evaluation, treatment effect prediction, treatment scheme auxiliary decision and the like; and can realize the integration and utilization of RNA data and DNA data, the method is flexible and various, the marker combination system has both accuracy and mechanism interpretability.
The method has the advantages that a rich set of technical means can be used to fully integrate high-throughput sequencing data, clinical information and multivariate information from knowledge bases to construct a comprehensive scoring system; this includes systematic mining and retrieval of clinical and pharmaceutical guidelines and published literature, effective use of clinical information, and construction of the comprehensive scoring computing system; meanwhile, a gene detection panel design scheme matched with the comprehensive scoring computing system is provided, comprising the design of the gene probe target regions, the design of probe coverage, and coverage-based quality control; and a combined process of the gene detection panel and the comprehensive scoring system is provided, comprising the comprehensive state evaluation model function, the input and output functions, and the possible forms of combination.
The beneficial effects of the invention also include providing the marker excavation and evaluation model construction and the panel design scheme of the pan tumor adjuvant drugs; and can be used for TGFbeta pathway, MAPK pathway and PI3K pathway targeted medication regimen status assessment, including 11 treatment regimens, including single drug and combination therapy; and can be used for auxiliary decision of TGFbeta-MAPK-PI3K pathway related targeted treatment schemes of various tumors including colorectal tumors, liver cancer, lung cancer and the like.
Drawings
FIG. 1 shows the prediction ability of gene abnormality regulation relationship on pan-tumor drug sensitivity results.
FIG. 2 shows the drug sensitivity prediction results of TGFbeta pathway genes such as BMPR2/MYC/TFDP2/TGIF2 on Cetuximab in CRC PDX and GSE5851 data sets.
FIG. 3 is an application schematic diagram of a comprehensive state evaluation process of pan-tumor targeted drug sensitivity.
The attached table 1 shows information of 18 therapeutic drugs and identification conditions of abnormal regulation and control relationships thereof.
Detailed Description
The invention is further illustrated below with reference to examples and figures. It should be understood that these examples are only for illustrating the present invention, and are not to be construed as limiting the scope of the present invention. Variations and advantages that may occur to those skilled in the art may be incorporated into the invention without departing from the spirit and scope of the inventive concept, which is set forth in the following claims and their equivalents.
The embodiment of the invention is applied to construction of a pan-tumor adjuvant drug marker mining and evaluation model and design of panel, and the invention is further described in detail by combining specific embodiments, which are only used for illustrating the invention and are not used for limiting the scope of the invention. The method comprises the following specific steps:
s4.1 pan-tumor sequencing and clinical pharmacodynamic phenotype data set Collection
S4.1.1 RNA-seq data and drug response data of the CRC PDX models were obtained from the Novartis NIBR PDXE data set, finally yielding 51 samples that have both RNA-seq data and drug response data and cover 21 different drug treatments. The RNA-seq data quantify gene expression levels as FPKM, and FPKM values less than 0.1 are treated as missing values. A gene is deleted when its missing values exceed 20% of the total sample size, and the remaining missing data are filled using the kNN method. Finally, the RNA-seq data are log2(FPKM + 1) transformed for subsequent marker mining.
S4.1.2 The CRC data set GSE5851, which contains the effect of the EGFR inhibitor Cetuximab, was downloaded from GEO (https://www.ncbi.nlm.nih.gov/geo/). Where one probe maps to multiple genes, the probe is removed from the data set; where multiple probes map to one gene, the maximum value of those probes in each sample is used as the expression value of the gene in that sample. Expression values smaller than 1 are treated as missing values, a gene is deleted when its missing values exceed 20% of the total sample size, and the remaining missing data are filled using the kNN method; the samples are then normalized by the quantile method and log2 transformed, and the data are used to verify the markers screened from the CRC PDX models. The Cetuximab drug sensitivity levels are recorded as "complete response", "partial response" and "stable disease" (these three are merged into the response group), "progressive disease" (recorded as the non-response group) and "non to be determined" (such samples are removed).
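The preprocessing in S4.1.1 can be sketched in a few lines. This is an illustrative sketch only: the text does not specify the kNN variant, so gene-wise nearest neighbours with a root-mean-square distance over jointly observed samples, the default `k = 2`, the fallback to the gene's own mean, and all function names are assumptions.

```python
import math
from typing import Dict, List, Optional

Row = List[Optional[float]]  # one gene across samples; None marks a missing value


def mask_low(fpkm: Dict[str, List[float]], floor: float = 0.1) -> Dict[str, Row]:
    """Treat FPKM values below the floor as missing."""
    return {g: [v if v >= floor else None for v in vals] for g, vals in fpkm.items()}


def drop_sparse(expr: Dict[str, Row], max_missing: float = 0.2) -> Dict[str, Row]:
    """Delete genes whose missing fraction exceeds max_missing."""
    return {g: vals for g, vals in expr.items()
            if sum(v is None for v in vals) / len(vals) <= max_missing}


def _distance(a: Row, b: Row) -> float:
    """Root-mean-square difference over samples observed in both genes."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))


def knn_impute(expr: Dict[str, Row], k: int = 2) -> Dict[str, List[float]]:
    """Fill each missing value with the mean of the k nearest genes in that sample."""
    filled: Dict[str, List[float]] = {}
    for gene, vals in expr.items():
        neighbours = sorted((g for g in expr if g != gene),
                            key=lambda g: _distance(vals, expr[g]))
        observed = [v for v in vals if v is not None]
        row: List[float] = []
        for j, v in enumerate(vals):
            if v is None:
                donors = [expr[g][j] for g in neighbours if expr[g][j] is not None][:k]
                # Fall back to the gene's own mean if no neighbour has a value here.
                v = sum(donors) / len(donors) if donors else sum(observed) / len(observed)
            row.append(v)
        filled[gene] = row
    return filled


def preprocess(fpkm: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """mask -> filter -> impute -> log2(FPKM + 1)."""
    imputed = knn_impute(drop_sparse(mask_low(fpkm)))
    return {g: [math.log2(v + 1) for v in vals] for g, vals in imputed.items()}
```

A gene with exactly 20% missing values is kept, since the text deletes genes only when the missing fraction is greater than 20%.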
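The probe handling and quantile normalization in S4.1.2 can be illustrated as below. This is a sketch under assumptions: the probe-to-gene annotation is passed in as a plain dictionary, probes with no annotated gene are skipped along with multi-gene probes, ties in quantile normalization are broken by input order, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def collapse_probes(probe_expr: Dict[str, List[float]],
                    probe_genes: Dict[str, List[str]]) -> Dict[str, List[float]]:
    """Drop probes that do not map to exactly one gene, then take the
    per-sample maximum over all probes of the same gene."""
    per_gene: Dict[str, List[List[float]]] = defaultdict(list)
    for probe, values in probe_expr.items():
        genes = probe_genes.get(probe, [])
        if len(genes) != 1:
            continue
        per_gene[genes[0]].append(values)
    return {g: [max(col) for col in zip(*rows)] for g, rows in per_gene.items()}


def quantile_normalize(expr: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Give every sample the same distribution: the value of rank r in each
    sample is replaced by the mean of the rank-r values across samples."""
    genes = list(expr)
    n_samples = len(next(iter(expr.values())))
    columns = [[expr[g][j] for g in genes] for j in range(n_samples)]
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(col[r] for col in sorted_cols) / n_samples
                  for r in range(len(genes))]
    out = {g: [0.0] * n_samples for g in genes}
    for j, col in enumerate(columns):
        order = sorted(range(len(genes)), key=lambda i: col[i])
        for rank, i in enumerate(order):
            out[genes[i]][j] = rank_means[rank]
    return out
```

The missing-value filtering, kNN filling and log2 transformation described in the same step are omitted here to keep the sketch short.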
S4.2 mining pan-tumor adjuvant drug assessment biomarkers based on gene abnormal regulation and control relationship
S4.2.1 The drug effect in the Novartis NIBR PDXE data set is divided into four grades: complete remission (CR), partial remission (PR), stable disease (SD) and progressive disease (PD); CR, PR and SD form the response group and PD forms the non-response group. Drugs with more than 10 samples in both the response group and the non-response group were selected, finally giving 18 treatment regimens.
S4.2.2 Referring to step 2.1, the transcriptome data and clinical information are mined: GRNs for the response condition and the non-response condition are constructed separately using the Boruta algorithm, the regulation strength and confidence interval of each regulatory relationship are quantified using the de-biased LASSO method, and gene abnormal regulation relationships are identified by integrating three features, namely significant change in regulation strength, differential expression of the target, and the TF being a key regulatory factor of the target. The information on the 18 therapeutic drugs and the identification of their abnormal regulation relationships can be seen in attached Table 1.
S4.2.3 A prognosis state evaluation marker combination is constructed by the successive incremental iteration based on a greedy algorithm in step 2.1.5 and cross-validated, and the accuracy of drug sensitivity prediction by the abnormal regulation relationships identified according to the scheme of the invention is compared with the accuracy of drug sensitivity prediction by two genes randomly drawn in different ways, to test whether the former is significantly higher. Of the 18 treatments, 13 treatments had identified abnormal regulation relationships that were significantly more accurate in predicting drug efficacy than the four controls, including binimetinib, BKM120 + LJC049, BYL719 + cetuximab + encorafenib, BYL719 + LJM716, cetuximab, CGM097, CLR457, encorafenib, HDM201 and LKA136. The results can be seen in FIG. 1, the prediction ability of gene abnormal regulation relationships for pan-tumor drug sensitivity results.
S4.2.4 Through the successive incremental iteration based on a greedy algorithm in step 2, combined with pathway enrichment analysis of the abnormal regulation relationship pairs, 11 medication regimens were found to be enriched in the TGFbeta pathway, the MAPK pathway and the PI3K pathway, which greatly improves the interpretability and the evidence-based medical reliability of the markers. The 11 medication regimens specifically include binimetinib, BKM120, BYL719 + cetuximab + encorafenib, BYL719 + LJM716, cetuximab + encorafenib, CLR457 and encorafenib. The tumor medication state evaluation marker after combination optimization is obtained by calculating the AUC (area under the curve) from the ROC curve and consists of the following genes: AXIN1, JUNB, MYC, SMAD5, SMAD4, TGIF2, UBB, ATF3, BMPR2, JUND, KLF10, NR2C2, PPP1CB, SKIL, SMURF1, SP1, TP53, PITX2, TFDP2, E2F4, SMAD1, KLF6, SMAD3 and KLF11, each of which predicts the effect of at least one medication regimen with an AUC above 0.7.
S4.2.5 The ability of the above markers to predict the Cetuximab treatment effect was verified in the GSE5851 data set. It was found that the four genes BMPR2, MYC, TFDP2 and TGIF2 not only perform well on the CRC PDX models in NIBR PDXE, but also perform excellently in predicting Cetuximab efficacy in the GSE5851 data set. These results are shown in FIG. 2, the drug sensitivity prediction results of the TGFbeta pathway genes BMPR2/MYC/TFDP2/TGIF2 for Cetuximab in the CRC PDX and GSE5851 data sets.
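The grouping and selection rule in S4.2.1 is simple enough to state directly in code. The sketch below assumes the input is a list of (treatment, response grade) pairs; the function name `select_regimens` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

RESPONSE_GRADES = {"CR", "PR", "SD"}   # merged into the response group
NON_RESPONSE_GRADES = {"PD"}           # non-response group


def select_regimens(records: Iterable[Tuple[str, str]],
                    min_per_group: int = 10) -> List[str]:
    """Keep treatments with more than min_per_group samples in BOTH groups."""
    responders: Counter = Counter()
    non_responders: Counter = Counter()
    for treatment, grade in records:
        if grade in RESPONSE_GRADES:
            responders[treatment] += 1
        elif grade in NON_RESPONSE_GRADES:
            non_responders[treatment] += 1
    return sorted(t for t in responders
                  if responders[t] > min_per_group
                  and non_responders[t] > min_per_group)
```

Note the strict inequality: a treatment with exactly 10 samples in one group is excluded, matching "more than 10 samples".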
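The screening of abnormal regulation relationships can be sketched using the three criteria given in the summary of the invention: non-overlapping confidence intervals of regulation strength, significant change of the target expression level, and consistent direction of the two changes. The sketch assumes the confidence intervals (for example from de-biased LASSO) and the differential expression statistics have already been computed; the `Edge` record, the use of interval midpoints as point estimates, and the 0.05 significance level are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Edge:
    tf: str
    target: str
    ci_response: Tuple[float, float]      # CI of regulation strength, response GRN
    ci_nonresponse: Tuple[float, float]   # CI of regulation strength, non-response GRN
    target_log2fc: float                  # target expression, non-response vs response
    target_pvalue: float                  # differential expression p-value


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]


def strength_change(edge: Edge) -> float:
    """Difference of CI midpoints (non-response minus response)."""
    mid_non = sum(edge.ci_nonresponse) / 2
    mid_res = sum(edge.ci_response) / 2
    return mid_non - mid_res


def is_abnormal(edge: Edge, alpha: float = 0.05) -> bool:
    if intervals_overlap(edge.ci_response, edge.ci_nonresponse):
        return False                      # regulation strength not significantly changed
    if edge.target_pvalue >= alpha:
        return False                      # target not differentially expressed
    # Direction of strength change must match direction of expression change.
    return strength_change(edge) * edge.target_log2fc > 0


def rank_abnormal(edges: List[Edge]) -> List[Edge]:
    """Keep abnormal edges, ordered by degree of regulation-strength difference."""
    kept = [e for e in edges if is_abnormal(e)]
    return sorted(kept, key=lambda e: abs(strength_change(e)), reverse=True)
```

The final sort corresponds to ranking the screened relationships by the degree of difference in regulation strength.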
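The successive incremental iteration based on a greedy algorithm, evaluated by cross-validation, can be sketched as follows. The sketch is illustrative: the nearest-centroid classifier and leave-one-out scheme stand in for whatever classifier and cross-validation the actual implementation uses, and both function names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def loo_accuracy(features: Sequence[str], data: Dict[str, List[float]],
                 labels: List[int]) -> float:
    """Leave-one-out accuracy of a nearest-centroid classifier (labels 0/1)."""
    n = len(labels)
    correct = 0
    for held in range(n):
        dist = {}
        for cls in (0, 1):
            idx = [i for i in range(n) if i != held and labels[i] == cls]
            centroid = [sum(data[f][i] for i in idx) / len(idx) for f in features]
            point = [data[f][held] for f in features]
            dist[cls] = sum((p - c) ** 2 for p, c in zip(point, centroid))
        correct += int(min(dist, key=dist.get) == labels[held])
    return correct / n


def greedy_select(candidates: Sequence[str],
                  score: Callable[[List[str]], float],
                  max_size: int = 5) -> Tuple[List[str], float]:
    """Add, one at a time, the marker that most improves the score;
    stop when no candidate improves it or max_size is reached."""
    selected: List[str] = []
    best = float("-inf")
    while len(selected) < max_size:
        trials = [(score(selected + [c]), c) for c in candidates if c not in selected]
        if not trials:
            break
        top_score, top = max(trials, key=lambda t: t[0])
        if top_score <= best:
            break
        selected.append(top)
        best = top_score
    return selected, best
```

A typical call would be `greedy_select(markers, lambda fs: loo_accuracy(fs, data, labels))`, where each marker is one quantified abnormal regulation relationship.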
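The AUC criterion above can be computed without plotting the ROC curve, because the area under the ROC curve equals the probability that a randomly chosen responder scores higher than a randomly chosen non-responder (ties counted as one half). The sketch below uses that identity; the direction-agnostic step `max(a, 1 - a)` and the function names are assumptions.

```python
from typing import Dict, List


def auc(scores: List[float], labels: List[int]) -> float:
    """Area under the ROC curve via pairwise comparison (labels: 1 = response)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def informative_markers(expr: Dict[str, List[float]], labels: List[int],
                        threshold: float = 0.7) -> Dict[str, float]:
    """Keep markers whose AUC for one regimen exceeds the threshold."""
    kept: Dict[str, float] = {}
    for marker, values in expr.items():
        a = auc(values, labels)
        a = max(a, 1 - a)   # assumption: a marker may be predictive in either direction
        if a > threshold:
            kept[marker] = a
    return kept
```

Applying `informative_markers` once per medication regimen and taking the union of the kept markers mirrors the statement that each gene reaches an AUC above 0.7 for at least one regimen.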
S4.3 TGFbeta pathway, MAPK pathway and PI3K pathway targeted medication scheme state evaluation gene set panel design and comprehensive scoring system development
S4.3.1 The gene information of the 24 genes screened in S4.2.4 is combed, and the standard gene names are determined according to the NCBI Official Symbol or the HGNC Approved Symbol. The specific gene set includes AXIN1, JUNB, MYC, SMAD5, SMAD4, TGIF2, UBB, ATF3, BMPR2, JUND, KLF10, NR2C2, PPP1CB, SKIL, SMURF1, SP1, TP53, PITX2, TFDP2, E2F4, SMAD1, KLF6, SMAD3 and KLF11.
S4.3.2 Referring to the gene detection panel design method in step 4, the design of the TGFbeta-MAPK-PI3K three-pathway gene detection panel is completed and optimized for the PCR or high-throughput sequencing platform. For example, for a small panel composed of the 4 genes BMPR2, MYC, TFDP2 and TGIF2, the expression levels can be detected by PCR; all 24 genes can be detected by using the panel design to capture the relevant sequences and applying high-throughput sequencing. The capture efficiency is generally between 30% and 60%, and the probe design is judged to be qualified when the coverage of all gene target regions is not less than 95%.
S4.3.3 According to the input mode of the panel detection values and the clinical information of a case, the TGFbeta-MAPK-PI3K three-pathway targeted medication comprehensive state scoring system is developed in the Python language using an SVM. Two models are trained, on the 4 genes and on the 24 genes respectively, and are packaged in a software system with preset judgment parameters so that users can use them with the matching panel. The software system uses the evaluation model to complete the calculation and outputs, for the individual case to be evaluated, the TGFbeta-MAPK-PI3K three-pathway targeted medication comprehensive state score and corresponding information such as a treatment benefit prediction prompt, thereby assisting clinical decisions and improving the treatment effect.
Attached Table 1. Information on the 18 therapeutic drugs and identification of their abnormal regulation relationships
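Determining standard gene names and removing redundancy amounts to mapping every reported name to its approved symbol and keeping each symbol once. The sketch below assumes a small hand-written alias table for illustration only; a real workflow would take the mapping from the HGNC or NCBI Gene records, and the function name `standardize_symbols` is hypothetical.

```python
from typing import Dict, Iterable, List

# Illustrative excerpt only (assumption): alias or previous name -> approved symbol.
ALIASES: Dict[str, str] = {
    "C-MYC": "MYC",
    "P53": "TP53",
    "SNON": "SKIL",
    "MADH4": "SMAD4",
}


def standardize_symbols(names: Iterable[str], aliases: Dict[str, str]) -> List[str]:
    """Map names to approved symbols and drop duplicates, keeping first-seen order."""
    seen = set()
    result: List[str] = []
    for raw in names:
        key = raw.strip().upper()
        symbol = aliases.get(key, key)
        if symbol not in seen:
            seen.add(symbol)
            result.append(symbol)
    return result
```

Names that are not in the alias table are assumed to already be approved symbols, which is only safe when the table is complete for the input at hand.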
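To illustrate how an SVM-based score could be produced from panel values, the following is a minimal sketch of a linear SVM trained by batch subgradient descent on the regularized hinge loss. It is a teaching stand-in written in plain Python, not the packaged scoring system: the hyperparameters, the function names, and the use of the signed decision value as the "state score" are assumptions, and a production system would more likely use an established SVM library with standardized features.

```python
from typing import List, Sequence, Tuple

Model = Tuple[List[float], float]  # (weights, bias)


def train_linear_svm(X: Sequence[Sequence[float]], y: Sequence[int],
                     lam: float = 0.01, lr: float = 0.1, epochs: int = 200) -> Model:
    """Minimize lam/2*|w|^2 + mean(hinge loss); labels must be +1 (benefit) or -1."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]   # gradient of the L2 regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:                # hinge loss is active for this case
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def state_score(model: Model, x: Sequence[float]) -> float:
    """Signed distance-like score; larger means more likely to benefit."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def predict_benefit(model: Model, x: Sequence[float]) -> bool:
    return state_score(model, x) > 0
```

In the setting described above, each input vector would hold the 4 or 24 gene values (optionally followed by encoded clinical variables), and one model would be trained per feature set.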
Figure BDA0002513819210000331
Figure BDA0002513819210000341

Claims (12)

1. The application of a state evaluation model constructed based on high-throughput sequencing data and clinical phenotype in pan-tumor targeted drug susceptibility state evaluation is characterized by comprising the following steps of:
step 16.1) acquiring pan-tumor cancer case information, including high-throughput sequencing data and clinical information, classifying according to the pan-tumor case state, and performing pairing and sorting;
step 16.2) constructing a gene abnormality regulation relation marker combination related to pan-tumor targeted drug sensitivity; wherein the specific gene set of the marker combination comprises the following 24 genes: AXIN1, JUNB, MYC, SMAD5, SMAD4, TGIF2, UBB, ATF3, BMPR2, JUND, KLF10, NR2C2, PPP1CB, SKIL, SMURF1, SP1, TP53, PITX2, TFDP2, E2F4, SMAD1, KLF6, SMAD3, KLF11;
step 16.3) screening clinical information and inspection and pathological indexes related to pan-tumor targeted drug sensitivity; integrating and optimizing the gene abnormal regulation relation related to the pan-tumor targeted drug susceptibility and the gene abnormal regulation relation marker combination related to the pan-tumor targeted drug susceptibility obtained in the step 16.2 into a multi-marker combination related to the pan-tumor targeted drug susceptibility, constructing a pan-tumor targeted drug susceptibility comprehensive state scoring model, and developing and packaging the pan-tumor targeted drug susceptibility comprehensive state scoring model into a pan-tumor targeted drug susceptibility comprehensive state scoring computing system;
step 16.4) designing a target gene target region related probe and/or primer for the evaluation of the comprehensive state of the pan-tumor targeted drug susceptibility based on the marker combination of the abnormal regulation and control relation of the pan-tumor targeted drug susceptibility related gene obtained in the step 16.2, and using the probe and/or primer as a pan-tumor targeted drug susceptibility comprehensive state evaluation gene detection panel;
and step 16.5) constructing a combined flow of the pan-tumor targeted drug sensitivity comprehensive state evaluation gene detection panel and the comprehensive state scoring computing system, so that a user can complete detection, information input, calculation evaluation and result acquisition according to the flow.
2. The use according to claim 1, wherein in step 16.2, a pan-tumor targeted drug susceptibility marker combination is constructed, and combinatorial optimization screening is performed using greedy algorithm-based successive iterations and/or genetic algorithm-based evolutionary iterations:
if the pan-tumor targeted drug susceptibility case information only relates to transcriptome data and clinical information, executing a step 16.2.1) to perform marker mining based on the transcriptome data and the clinical information to construct a pan-tumor targeted drug susceptibility related gene abnormality regulation relation marker combination;
if the pan-tumor targeted drug susceptibility case information only relates to the exome/genome data and the clinical information, executing step 16.2.2) performing marker mining based on the exome/genome data and the clinical information to construct a pan-tumor targeted drug susceptibility related gene variation marker combination;
if the pan-tumor targeted drug susceptibility case information simultaneously comprises transcriptome data, exome/genome data and clinical information, executing step 16.2.3) to perform marker mining based on the transcriptome data, exome/genome data and clinical information to construct a pan-tumor targeted drug susceptibility related gene abnormality regulation relationship and a gene variation marker combination.
3. The use according to claim 2, wherein said step 16.2.1) comprises in particular the sub-steps of:
step 16.2.1.1) constructing a reference gene regulation network;
step 16.2.1.2) constructing a condition-specific gene regulation network based on the transcriptome data under the specific disease state and the TF-target relationship of the reference gene regulation network;
step 16.2.1.3) quantifying the gene regulation intensity in the condition-specific gene regulation network and the difference in regulation intensity between networks;
step 16.2.1.4) screening gene abnormal regulation and control relations among condition-specific gene regulation and control networks under different disease states;
step 16.2.1.5) constructing a pan-tumor targeted drug sensitivity-related gene abnormality regulation relationship marker combination based on the gene abnormality regulation relationship obtained in step 16.2.1.4).
4. The use of claim 2, wherein in step 16.2.1.2), a machine learning based feature selection algorithm is used, including Boruta, naive Bayes, NMF and univariate linear regression, with acceleration realized by heterogeneous computing or parallelization, to screen the TFs which significantly contribute to the TF-target relationship under a disease state and form a condition-specific gene regulation network, i.e., one specific to that disease state; and/or,
in step 16.2.1.3), a multivariate linear regression model is used to quantify the gene regulation intensity in the condition-specific gene regulation network;
performing regression by a De-biased LASSO method, solving to obtain the regulation and control strength and the confidence interval of each gene regulation and control relationship, and judging whether the regulation and control difference is significant by comparing whether the confidence intervals of the same regulation and control relationship in different condition-specific gene regulation and control networks overlap; or comparing the change in mean intensity of the same regulation relation in different condition-specific gene regulation and control networks, without calculating the confidence interval, to quantify the regulation and control difference directly; and/or,
in step 16.2.1.4), integrating three factors related to gene regulation, and screening gene abnormal regulation and control relations among condition-specific gene regulation and control networks under different disease states, wherein the three factors are: the gene regulation intensity is significantly changed, the expression level of the regulated target gene is significantly changed, and the direction of change of the regulation intensity of the TF on the target is consistent with the direction of change of the target expression level; meanwhile, sorting the screened gene abnormal regulation and control relations according to the degree of difference of the regulation and control intensity among different disease states; and/or,
step 16.2.1.5), constructing a pan-tumor targeted drug sensitivity related gene abnormal regulation relation marker combination by successive incremental iteration based on a greedy algorithm and/or evolutionary iteration based on a genetic algorithm; for the marker combination, C-index is used as an index to measure the prediction effect of the marker combination on disease prognosis states, or AUC is used as an index to measure the prediction effect of the marker combination on treatment scheme benefit states.
5. The use according to claim 2, wherein said step 16.2.2) comprises in particular the sub-steps of:
step 16.2.2.1) identifying genetic variations associated with pan-tumor targeted drug susceptibility;
step 16.2.2.2) adopting data-driven and/or priori knowledge-driven quantitative screening of important gene variation related to pan-tumor targeted drug susceptibility state;
step 16.2.2.3) constructing a pan-tumor targeted drug susceptibility related gene variation marker combination based on the important gene variation related to the pan-tumor targeted drug susceptibility state obtained in step 16.2.2.2).
6. The use of claim 5, wherein in step 16.2.2.2), the data-driven quantitative screening involves somatic gene variation frequency calculation, sorting, and high-frequency variant gene identification, wherein genes with a gene variation frequency of 5% or more are further used for prior knowledge filtering; the prior knowledge filtering and screening covers application standards, clinical treatment guidelines, drug labels, general knowledge bases and pan-tumor targeted drug sensitivity related genes in literature reports; and/or,
step 16.2.2.3), constructing a pan-tumor targeted drug sensitivity related gene variation marker combination by successive incremental iteration based on a greedy algorithm and/or evolutionary iteration based on a genetic algorithm; for the marker combination, C-index is used as an index to measure the prediction effect of the marker combination on disease prognosis states, or AUC is used as an index to measure the prediction effect of the marker combination on treatment scheme benefit states.
7. The use of claim 1, wherein the step 16.3) of screening clinical information and test and pathological indicators related to pan-tumor targeted drug susceptibility comprises the steps of:
step 16.3.1) aiming at the known prior knowledge, screening clinical information and inspection and pathological indexes related to the pan-tumor targeted drug susceptibility state;
step 16.3.2) screening clinical information and inspection and pathological indexes related to the pan-tumor targeted drug susceptibility state from the case information in the pan-tumor targeted drug susceptibility cohort; and/or,
in the step 16.3), the abnormal regulation and control relation of the pan-tumor targeted drug sensitivity related gene is obtained by the following method:
combining the obtained gene abnormality regulation relation and/or gene variation marker related to the pan-tumor targeted drug susceptibility, and integrating clinical information and inspection and pathological indexes related to the pan-tumor targeted drug susceptibility state obtained by screening 16.3.1) and 16.3.2) to optimize the combination into a pan-tumor targeted drug susceptibility multi-marker combination.
8. The use according to claim 1, wherein in step 16.5) the combined procedure comprises the following steps:
step 16.5.1) based on the designed gene detection panel, obtaining the quantitative value of the gene abnormal regulation relation and/or gene variation marker combination related to the pan-tumor targeted medication, and inputting the quantitative value into the pan-tumor comprehensive state scoring computing system of the targeted medication;
step 16.5.2), inputting the obtained clinical information related to the target drug-using pan-tumor state and the quantitative values of the inspection and pathological indexes into a target drug-using pan-tumor comprehensive state scoring computing system;
step 16.5.3) combines the hardware, software and/or online tools involved in steps 16.5.1) and 16.5.2) into a set of matching process, so that the user can complete detection, information input, calculation evaluation and result acquisition according to the requirements.
9. The use according to claim 1, wherein in step 16.2, the targeted pan-tumor associated gene abnormality regulatory relationship marker combination is suitable for use in 11 targeted drug treatment regimens associated with the TGFbeta pathway, the MAPK pathway and the PI3K pathway, including binimetinib, BKM120, BYL719+ cetuximab + encrafenib, BYL719+ LJM716, cetuximab + encrafenib, CLR457, encrafenib; for TGFbeta pathway related targeting drug of gastrointestinal tumor, BMPR2, MYC, TFDP2 and TGIF2 are used as gene abnormality regulatory relation marker combinations.
10. The use of claim 1, wherein in step 16.3), the clinical information and the examination and pathological indexes related to targeted drug treatment of pan-tumor patients, including age, sex, blood biochemistry and immunoassay indexes, surgery status, pathological grade, metastasis and treatment-related clinical information of tumor patients, together with the 24 genes of the targeted drug pan-tumor related abnormal gene regulatory relationship marker combination, form a targeted drug pan-tumor multi-marker combination, which is used for predicting the therapeutic effect of targeted drugs across tumor types, including prediction of the therapeutic effect of targeted drugs for the three pathways TGFbeta, MAPK and PI3K, and for assisting clinical decision making; based on the targeted drug pan-tumor multi-marker combination, a targeted drug pan-tumor comprehensive state scoring system is constructed and developed for predicting the treatment benefit of 6 single-drug treatment regimens (including binimetinib, BKM120, BYL719, cetuximab, CLR457 and encorafenib) and 5 combination treatment regimens (including BYL719 + cetuximab, BYL719 + cetuximab + encorafenib, BYL719 + LJM716 and cetuximab + encorafenib) related to the TGFbeta, MAPK and PI3K pathways in pan-tumor cases, and for assisting clinical decision making.
11. The use of claim 1, wherein in step 16.3), the method for constructing the targeted drug pan-tumor multi-marker combination optimizes the multi-marker combination by greedy-algorithm-based stepwise incremental iteration or genetic-algorithm-based evolutionary iteration, constructs a targeted drug pan-tumor comprehensive state scoring model with a machine learning classification algorithm using any one or a combination of decision trees, random forests and SVM, and develops and packages the model into a targeted drug pan-tumor comprehensive state scoring calculation system for predicting the effect of targeted drugs in pan-tumor cases.
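The greedy stepwise incremental iteration named in claim 11 can be illustrated with a minimal sketch. This is not the patented implementation: a pure-Python nearest-centroid classifier scored by leave-one-out accuracy stands in for the decision tree, random forest or SVM of the claim, and the function names, marker names (`marker_a`, `marker_b`, `age`), labels and data values are all invented for the example.

```python
from statistics import mean


def loo_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a nearest-centroid classifier on `features`."""
    correct = 0
    for i in range(len(rows)):
        centroids = {}
        for cls in sorted(set(labels)):
            members = [rows[j] for j in range(len(rows))
                       if j != i and labels[j] == cls]
            if members:
                centroids[cls] = [mean(m[f] for m in members) for f in features]
        point = [rows[i][f] for f in features]
        # Predict the class whose centroid is closest (squared Euclidean distance).
        predicted = min(
            centroids,
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
        )
        correct += predicted == labels[i]
    return correct / len(rows)


def greedy_select(rows, labels, candidates, min_gain=1e-9):
    """Add one marker per round; stop when no candidate improves the score."""
    selected, best = [], 0.0
    remaining = list(candidates)
    while remaining:
        scored = [(loo_accuracy(rows, labels, selected + [f]), f)
                  for f in remaining]
        score, feature = max(scored, key=lambda t: t[0])
        if score - best < min_gain:
            break
        selected.append(feature)
        remaining.remove(feature)
        best = score
    return selected, best


# Invented toy data: only marker_a separates the two response groups.
rows = [
    {"marker_a": 0.10, "marker_b": 0.50, "age": 60},
    {"marker_a": 0.20, "marker_b": 0.40, "age": 45},
    {"marker_a": 0.15, "marker_b": 0.60, "age": 70},
    {"marker_a": 0.90, "marker_b": 0.50, "age": 62},
    {"marker_a": 0.80, "marker_b": 0.45, "age": 48},
    {"marker_a": 0.85, "marker_b": 0.55, "age": 66},
]
labels = ["resistant"] * 3 + ["sensitive"] * 3

selected, best = greedy_select(rows, labels, ["marker_a", "marker_b", "age"])
print(selected, best)
```

The stopping rule (no candidate raises the score by at least `min_gain`) is one common choice; a fixed maximum number of markers would be an equally reasonable assumption.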
12. The use of claim 1, wherein in step 16.4), the probes and/or primers designed for the target regions of the 24 target genes used in targeted drug pan-tumor state evaluation cover not less than 95% of the target regions of the target genes and not less than 97% of the important gene variation sites therein.
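The two coverage thresholds of claim 12 can be checked with simple interval arithmetic. The sketch below is an assumption-laden illustration: coordinates are treated as half-open intervals `(start, end)`, and the function names, the single target region, the probe positions and the variant sites are all invented for the example rather than taken from the patent.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def region_coverage(target, probes):
    """Fraction of the target region's bases covered by at least one probe."""
    t_start, t_end = target
    covered = 0
    for start, end in merge(probes):
        overlap = min(end, t_end) - max(start, t_start)
        if overlap > 0:
            covered += overlap
    return covered / (t_end - t_start)


def site_coverage(sites, probes):
    """Fraction of variant sites that fall inside at least one probe."""
    merged = merge(probes)
    hit = sum(any(s <= pos < e for s, e in merged) for pos in sites)
    return hit / len(sites)


# Invented coordinates for one target region and three probes.
target = (1000, 2000)
probes = [(990, 1300), (1280, 1700), (1720, 2010)]
sites = [1050, 1500, 1710, 1900]  # 1710 lies in the gap between probes

region_fraction = region_coverage(target, probes)
site_fraction = site_coverage(sites, probes)

# Thresholds taken from claim 12: at least 95% of the region, 97% of the sites.
region_ok = region_fraction >= 0.95
sites_ok = site_fraction >= 0.97
print(region_fraction, region_ok, site_fraction, sites_ok)
```

In this toy design the region threshold is met while the site threshold is not, which shows why the claim states the two requirements separately.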
CN202010469448.3A 2020-05-28 2020-05-28 Method and application of pan-tumor targeted drug sensitivity state assessment model constructed based on high-throughput sequencing data and clinical phenotypes Active CN111640508B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010469448.3A CN111640508B (en) 2020-05-28 2020-05-28 Method and application of pan-tumor targeted drug sensitivity state assessment model constructed based on high-throughput sequencing data and clinical phenotypes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010469448.3A CN111640508B (en) 2020-05-28 2020-05-28 Method and application of pan-tumor targeted drug sensitivity state assessment model constructed based on high-throughput sequencing data and clinical phenotypes

Publications (2)

Publication Number Publication Date
CN111640508A (en) 2020-09-08
CN111640508B (en) 2023-08-01

Family

ID=72332975

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010469448.3A Active CN111640508B (en) 2020-05-28 2020-05-28 Method and application of pan-tumor targeted drug sensitivity state assessment model constructed based on high-throughput sequencing data and clinical phenotypes

Country Status (1)

Country Link
CN (1) CN111640508B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112280863A (en) * 2020-11-06 2021-01-29 南京普恩瑞生物科技有限公司 Method and kit for effectiveness of targeted drug apatinib
CN113555070A (en) * 2021-05-31 2021-10-26 宋洋 Machine learning algorithm for constructing drug sensitivity related gene classifier of acute myeloid leukemia
CN113628684A (en) * 2021-08-06 2021-11-09 苏州鸿晓生物科技有限公司 Sample bacterial species detection methods and systems
CN113707223A (en) * 2021-04-21 2021-11-26 吴安华 Gene set system and method for predicting activity state and treatment sensitivity of tumor inflammasome
CN115472216A (en) * 2022-11-14 2022-12-13 神州医疗科技股份有限公司 Data integration-based cross-adaptive tumor drug combination recommendation method and system
CN116597902A (en) * 2023-04-24 2023-08-15 浙江大学 Method and device for screening multiple groups of chemical biomarkers based on drug sensitivity data
CN116863998A (en) * 2023-06-21 2023-10-10 扬州大学 Genetic algorithm-based whole genome prediction method and application thereof

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20110136347A (en) * 2010-06-15 2011-12-21 재단법인 아산사회복지재단 Snp for predicting sensitivity to an anti-cancer targeted agent
CN102424840A (en) * 2011-12-20 2012-04-25 上海市肿瘤研究所 Urine-based method and kit for diagnosing relapse risk of bladder cancer patient
CN104450948A (en) * 2014-12-31 2015-03-25 北京圣谷同创科技发展有限公司 Cancer detecting method, kit and application thereof
CA2927752A1 (en) * 2013-10-18 2015-04-23 The Regents Of The University Of Michigan Systems and methods for determining a treatment course of action
KR20160144318A (en) * 2015-06-08 2016-12-16 한국과학기술원 Apparatus and method for companion diagnosis
CN108034726A (en) * 2018-01-18 2018-05-15 四川大学华西医院 Detect purposes of the reagent of MLH1 expressions in cancer target Drug Sensitivity detection kit is prepared
CN109609647A (en) * 2019-01-25 2019-04-12 臻悦生物科技江苏有限公司 Detection Panel, detection kit and its application for the targeting of general cancer kind, chemotherapy and immune medication based on the sequencing of two generations
CA3061736A1 (en) * 2017-12-01 2019-06-06 Illumina, Inc. Systems and methods for assessing drug efficacy
CN110904235A (en) * 2019-12-20 2020-03-24 深圳市新合生物医疗科技有限公司 Gene panel for detecting tumor targeted drug related gene mutation, method, application and kit

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陆荫英; 赵海涛; 程家敏; 姬峻芳: "Expert consensus on the clinical application of molecular diagnostics for hepatobiliary tumors" *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112280863A (en) * 2020-11-06 2021-01-29 南京普恩瑞生物科技有限公司 Method and kit for effectiveness of targeted drug apatinib
CN112280863B (en) * 2020-11-06 2024-01-12 南京普恩瑞生物科技有限公司 Method and kit for targeting drug apatinib effectiveness
CN113707223A (en) * 2021-04-21 2021-11-26 吴安华 Gene set system and method for predicting activity state and treatment sensitivity of tumor inflammasome
CN113555070A (en) * 2021-05-31 2021-10-26 宋洋 Machine learning algorithm for constructing drug sensitivity related gene classifier of acute myeloid leukemia
CN113628684A (en) * 2021-08-06 2021-11-09 苏州鸿晓生物科技有限公司 Sample bacterial species detection methods and systems
CN115472216A (en) * 2022-11-14 2022-12-13 神州医疗科技股份有限公司 Data integration-based cross-adaptive tumor drug combination recommendation method and system
CN116597902A (en) * 2023-04-24 2023-08-15 浙江大学 Method and device for screening multiple groups of chemical biomarkers based on drug sensitivity data
CN116597902B (en) * 2023-04-24 2023-12-01 浙江大学 Method and device for screening multiple groups of chemical biomarkers based on drug sensitivity data
CN116863998A (en) * 2023-06-21 2023-10-10 扬州大学 Genetic algorithm-based whole genome prediction method and application thereof
CN116863998B (en) * 2023-06-21 2024-04-05 扬州大学 Genetic algorithm-based whole genome prediction method and application thereof

Also Published As

Publication number Publication date
CN111640508B (en) 2023-08-01

Similar Documents

Publication Publication Date Title
CN111640508B (en) Method and application of pan-tumor targeted drug sensitivity state assessment model constructed based on high-throughput sequencing data and clinical phenotypes
Zhang et al. An immune-related signature predicts survival in patients with lung adenocarcinoma
CN111863137B (en) Complex disease state evaluation method based on high-throughput sequencing data and clinical phenotype construction and application
JP7368483B2 (en) An integrated machine learning framework for estimating homologous recombination defects
CN111863126B (en) Method for constructing colorectal tumor state evaluation model and application
US11335463B2 (en) Cancer evolution detection and diagnostic
TWI814753B (en) Models for targeted sequencing
CN111816315B (en) Pancreatic duct cancer state assessment model construction method and application
Pass et al. Biomarkers and molecular testing for early detection, diagnosis, and therapeutic prediction of lung cancer
Barefoot et al. Detection of cell types contributing to cancer from circulating, cell-free methylated DNA
Abraham et al. Machine learning analysis using 77,044 genomic and transcriptomic profiles to accurately predict tumor type
EP4118653A1 (en) Methods for classifying genetic mutations detected in cell-free nucleic acids as tumor or non-tumor origin
Wu et al. Identification and validation of an immune-related RNA signature to predict survival of patients with head and neck squamous cell carcinoma
Rathi et al. A transcriptome-based classifier to determine molecular subtypes in medulloblastoma
Hobbs et al. Biostatistics and bioinformatics in clinical trials
US20220389513A1 (en) A Method of Estimating a Circulating Tumor DNA Burden and Related Kits and Methods
Ahmad et al. A review of genetic variant databases and machine learning tools for predicting the pathogenicity of breast cancer
De Groot et al. Multigene sets for clinical application in glioma
Sato et al. Biostatistic tools in pharmacogenomics-advances, challenges, potential
Shroff et al. Gene co-expression analysis predicts genetic variants associated with drug responsiveness in lung cancer
Wang et al. Terminal modifications independent cell-free RNA sequencing enables sensitive early cancer detection and classification
Jones Genomics and bioinformatics in biological discovery and pharmaceutical development
CN113257354B (en) Method for mining key RNA function based on high-throughput experimental data mining
EP4318493A1 (en) Artificial-intelligence-based method for detecting tumor-derived mutation of cell-free dna, and method for early diagnosis of cancer, using same
Hao et al. Establishing a Prognostic Model in Prostate Adenocarcinoma through Comprehensive scRNA-Seq and Bulk RNA-Seq Analysis and Validation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220128

Address after: 200032 Shanghai Xuhui District Xietu Road No. 2140

Applicant after: Shanghai Institute of biomedical technology

Address before: 201203 floor 2, No. 1278, Keyuan Road, Pudong New Area, Shanghai

Applicant before: SHANGHAI CENTER FOR BIOINFORMATION TECHNOLOGY

GR01 Patent grant