CN107103207B - Accurate medical knowledge search system based on case multigroup variation characteristics and implementation method - Google Patents

Info

Publication number
CN107103207B
Authority
CN
China
Prior art keywords
case
variation
model
knowledge base
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710218630.XA
Other languages
Chinese (zh)
Other versions
CN107103207A (en)
Inventor
陈新
张嘉宁
王纬韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201710218630.XA
Publication of CN107103207A
Application granted
Publication of CN107103207B
Legal status: Active
Anticipated expiration

Classifications

    • G06F19/325
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B45/00: ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H15/00: ICT specially adapted for medical reports, e.g. generation or transmission thereof
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/50: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders

Abstract

The invention discloses a precision medicine knowledge search system based on case multi-omics variation features and an implementation method thereof. The method is implemented by the following steps: establishing a precision medicine knowledge base based on multi-omics variation-intervention response association models; extracting the multi-omics variation features of a new case; establishing a matching algorithm between the new case and the models; generating an analysis report from the case matching system; and updating the knowledge base data and self-evolving the matching algorithm. The invention systematically integrates known associations between omics variation and intervention response: by defining a generalized framework of multi-omics variation-intervention response association model types, it integrates intervention response and omics variation information from different levels and sources into one knowledge base.

Description

Precision medicine knowledge search system based on case multi-omics variation features and implementation method
Technical Field
The invention belongs to the medical and health industry and relates to a precision medicine knowledge search system and its implementation method, in particular to a precision medicine knowledge search system based on case multi-omics variation features and its implementation method.
Background
Precision medicine relies on biomarker-based classification of disease risk, prognosis, and treatment response. The rapid development of omics technologies has greatly increased the number of molecular-level biomarkers, providing a more comprehensive and detailed basis for diagnosing disease, determining disease stage, and evaluating the safety and efficacy of new therapies in target populations.
Currently, association information between molecular-level markers or pathological variation features and intervention response (including drug response) comes mainly from a few channels: companion diagnostics, high-throughput drug screening experiments at the cell line level, and precision medicine clinical trials. The association information provided by companion diagnostics is derived from population-level observation of large cohorts and is direct and easy to obtain. By contrast, the association information from cell line drug screening experiments and precision medicine trials requires processing of the raw data: the association between molecular-level variation and intervention response can only be established after multi-omics variation features have been extracted from the omics data. The mixing of association information from different sources and of different types makes it harder for clinicians to interpret the physiological significance of pathological variation features and to exploit their clinical value.
In addition, the integration and clinical translation of omics data must take into account data stability, the experimental platform (e.g., different laboratories or institutions), the observation scale (e.g., cell line level, tissue level, individual level), the observation modality (e.g., transcriptome, proteome, or genome level), and the observation technique (e.g., single nucleotide polymorphism arrays, next-generation sequencing), all of which can make the observed behavior of the same biomarker unstable. How to integrate this association information as fully as possible and put it to the best use therefore remains an urgent problem.
Disclosure of Invention
The invention aims to use the observable multi-omics variation features of an individual to quickly search the knowledge base for multi-omics variation-intervention response association models that match a new case, and to present to the user, in a readable and compact integrated form, the intervention strategies of all matched models together with records of whether a successful response was achieved. The invention is realized by the following technical scheme:
The invention discloses a precision medicine knowledge search system based on case multi-omics variation features, comprising:
a precision medicine knowledge base, which collects multi-omics variation-intervention response association models and thereby collects and integrates omics variation feature-intervention response information from different levels;
an optimizable matching algorithm, which determines whether a case matches a model in the knowledge base and to what degree;
an evaluation algorithm for the matching algorithm, which compares the clustering of the knowledge base models produced by the matching algorithm with the classification of the models by their intervention response labels, so that the quality of the matching algorithm can be assessed and the algorithm continuously optimized;
a report, generated directly by the search system, that contains the omics analysis data of the case and the search results of the system, and that gives physicians a reference for the physiological meaning of the omics data and assists in drawing up treatment plans.
As a further improvement, the different levels include the population level, the individual level, the tissue level, and the cell line level.
The invention also discloses an implementation method of the precision medicine knowledge search system based on case multi-omics variation features, implemented by the following steps:
1) establishing a precision medicine knowledge base based on multi-omics variation-intervention response association models;
2) when a new case appears, extracting the multi-omics variation features of the new case;
3) establishing a matching algorithm between the new case and the models (known multi-omics variation feature-intervention response associations);
4) generating an analysis report from the case matching system;
5) updating the knowledge base data and self-evolving the matching algorithm.
As a further improvement, in step 1) the multi-omics variation information includes single-base mutations in transcriptionally active genomic regions (single nucleotide polymorphisms and base insertions and deletions), chromosomal variations (e.g., gene fusions), and the reference gene expression levels used to determine whether a gene is abnormally expressed.
As a further improvement, in step 1) a multi-omics variation-intervention response association model is one of the following: a "companion diagnostic association model" carrying companion diagnostic drug response annotations and multi-omics variation features; a "cell line association model" containing drug response information from drug screening experiments and multi-omics variation features; a "case association model" containing clinically observed intervention response results and multi-omics variation features; or an "individualized disease model association model" containing drug screening results and multi-omics variation features. Individualized disease models include, but are not limited to, PDX mouse models and PDO organoid models.
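As a rough illustration of how the four association model types could share one record format, the following Python sketch defines a single record type tagged with its model type. The class and field names (`ModelType`, `AssociationModel`, `responded`, and so on) are assumptions made for this example and do not come from the patent, which does not specify a storage schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class ModelType(Enum):
    """The four association model types named in the text."""
    COMPANION_DIAGNOSTIC = "companion diagnostic association model"
    CELL_LINE = "cell line association model"
    CASE = "case association model"
    INDIVIDUALIZED_DISEASE = "individualized disease model association model"


@dataclass(frozen=True)
class AssociationModel:
    """One knowledge base record: omics variation features plus an intervention response.

    All field names are hypothetical; the source does not define a schema.
    """
    model_id: str
    model_type: ModelType
    intervention: str   # e.g. a drug name
    responded: bool     # whether a successful response was recorded
    mutations: frozenset = field(default_factory=frozenset)       # single-base mutations
    fusions: frozenset = field(default_factory=frozenset)         # chromosomal variations such as gene fusions
    abnormal_genes: frozenset = field(default_factory=frozenset)  # abnormally expressed genes


def models_of_type(knowledge_base, model_type):
    """Return all records of one model type, preserving insertion order."""
    return [m for m in knowledge_base if m.model_type is model_type]


# Placeholder records; gene and drug names are invented for illustration.
kb = [
    AssociationModel("m1", ModelType.CELL_LINE, "drug_A", True,
                     mutations=frozenset({"GENE1:mutX"})),
    AssociationModel("m2", ModelType.CASE, "drug_B", False,
                     fusions=frozenset({"GENE2-GENE3"})),
]
cell_line_ids = [m.model_id for m in models_of_type(kb, ModelType.CELL_LINE)]
print(cell_line_ids)
```

Keeping the model type as an explicit tag lets a matching strategy be chosen per type later, which is in line with the text's statement that each scale of model gets its own matching strategy.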
As a further improvement, in step 2) the multi-omics variation features include single-base mutations in transcriptionally active genomic regions, chromosomal structural variations, and abnormal gene expression information.
As a further improvement, in step 2) a standardized omics data analysis pipeline is established to extract the multi-omics variations, with quality control and quality assurance applied throughout the whole process, from sample collection, sequencing, and data analysis to knowledge base matching.
As a further improvement, in step 3) the search system provides an initial matching algorithm and an evaluation method for matching algorithms. The evaluation method compares how well different matching algorithms cluster the association models in the knowledge base, assesses whether a new algorithm is better than the existing one, and thereby determines whether the algorithm should be upgraded.
As a further improvement, in step 4) the report is divided into two parts. The first part presents statistics on the physiologically relevant multi-omics variation features of the case, giving the omics variation information of the lesion tissue in terms of single-base mutations, chromosomal variations, differentially expressed genes, and so on. The second part, produced once the knowledge base search is complete, lists the matching evidence and medication information of the matched models in descending order of their similarity to the case.
As a further improvement, in step 5), after the omics features of a case have been extracted in step 2), the therapeutic effect of the medication given to the case is tracked and the case data are added to the precision medicine knowledge base as a case model, which expands the coverage of the knowledge base and improves its matching precision. When no matching association model is found in the knowledge base, the patient can be treated directly according to the physician's experience, and an individualized disease model can also be developed from the case; the treatment outcome of the case and the test results of the individualized disease model are tracked, and the corresponding case association model or individualized disease model association model is built and added to the precision medicine knowledge base.
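The update rule in step 5) can be sketched as a small function. This is only an illustration under assumed names: the record layout (plain dictionaries with `type`, `features`, `intervention`, `responded`) and the function `update_knowledge_base` are hypothetical and are not defined by the patent.

```python
def update_knowledge_base(knowledge_base, case_features, matched, outcome,
                          disease_model_result=None):
    """Append new association models derived from a tracked case.

    knowledge_base       : list of dict records (hypothetical storage format)
    case_features        : omics variation features extracted in step 2)
    matched              : True if the search found a matching association model
    outcome              : dict with 'intervention' and 'responded' from follow-up
    disease_model_result : optional dict from drug testing on an individualized
                           disease model (e.g. PDX or PDO) built for the case
    """
    added = []
    # Whether or not a match was found, the tracked treatment outcome
    # is stored as a case association model.
    added.append({"type": "case", "features": case_features, **outcome})
    # Without a match, an individualized disease model may also have been
    # developed; its drug testing result becomes a second association model.
    if not matched and disease_model_result is not None:
        added.append({"type": "individualized_disease_model",
                      "features": case_features, **disease_model_result})
    knowledge_base.extend(added)
    return added


kb = []
added = update_knowledge_base(
    kb,
    case_features={"mutations": ["GENE1:mutX"]},   # placeholder feature
    matched=False,
    outcome={"intervention": "drug_A", "responded": True},
    disease_model_result={"intervention": "drug_B", "responded": False},
)
print([record["type"] for record in kb])
```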
The advantages of the invention are:
1) The invention has a wide search range and can search association models at different observation scales. It systematically integrates known associations between omics variation and intervention response, and by defining a generalized framework of multi-omics variation-intervention response association model types, it integrates intervention response and omics variation information from different levels and sources into one knowledge base.
2) The invention offers rich matching features and matching strategies. On the one hand, cooperatively matching multi-omics variation features across single-base variation, chromosomal variation, differentially expressed genes, and other aspects safeguards the reliability of the matching result and reduces the noise that arises when a single variation type is correlated with a physiological phenotype. On the other hand, the invention provides a specific, optimizable matching strategy for each scale of intervention response model in the knowledge base, and through the association models provides multi-angle evidence for the relationship between a case and an intervention response.
3) The invention is self-evolving, in two respects. First, the number of models in the precision medicine knowledge base grows continuously as the search system operates. After a new case enters, the system records its multi-omics variation features and, combining the subsequent treatment plan and intervention response of the case, or the drug testing results from an individualized disease model of the case, generates an association model for the case and adds it to the knowledge base. Second, the matching algorithm of the system can be continuously optimized. The invention establishes an evaluation method for the matching algorithm: whenever a new matching algorithm is proposed, it is used to re-cluster the models in the knowledge base, the clustering is compared with the classification based on intervention response labels, and the system is updated only if the new algorithm proves superior to the existing one.
4) The invention fills the gap between extracting omics variation information and guiding clinical medication, and helps clinical staff systematically interpret the physiological significance of omics variation and exploit its clinical value.
Drawings
Fig. 1 is a schematic flow chart of the implementation of the technical scheme of the invention.
Detailed Description
The invention establishes a precision medicine knowledge search system based on cooperative matching of the multi-omics variations of individual cases. The system comprises three parts. First, a precision medicine knowledge base. By collecting multi-omics variation-intervention response association models, the knowledge base collects and integrates omics variation feature-intervention response information from different levels (population level, individual level, tissue level, cell line level, and so on). Individual cases that enter the system can serve as new models to expand the knowledge base. Second, an optimizable matching algorithm. The initial matching algorithm provided by the system may not fully exploit the richness of the omics variations, but the invention provides an evaluation method for matching algorithms: by comparing the clustering of the knowledge base models produced by a matching algorithm with the classification of the models by their intervention response labels, the quality of the matching algorithm can be assessed and the algorithm continuously optimized. Third, the search system directly generates a readable report containing the omics analysis data of the case and the search results of the system, which gives physicians a reference for the physiological meaning of the omics data and assists in drawing up treatment plans.
The basic procedure of the invention is as follows. First, a precision medicine knowledge base based on multi-omics variation-intervention response association models is established. The multi-omics variation information covers three aspects: single-base mutations (single nucleotide polymorphisms and base insertions and deletions), chromosomal variations (such as gene fusions), and the reference gene expression levels used to determine whether a gene is abnormally expressed. A multi-omics variation-intervention response association model may be a "companion diagnostic association model" carrying companion diagnostic drug response annotations and multi-omics variation features; a "cell line association model" containing drug response information from drug screening experiments and multi-omics variation features; a clinically observed "case association model" containing intervention response results and multi-omics variation features; or an "individualized disease model association model" (including but not limited to PDX mouse and PDO organoid models) containing drug screening results and multi-omics variation features. Second, when a new case appears, its multi-omics variation features (including but not limited to single-base mutations, chromosomal structural variations, and gene expression profile information) are extracted. A standardized omics data analysis pipeline is established to extract the multi-omics variations, with quality control and quality assurance applied throughout the whole process, from sample collection, sequencing, and data analysis to knowledge base matching.
Third, a matching algorithm between the new case and the association models is established. The search system provides an initial matching algorithm and an evaluation method for matching algorithms; the evaluation method compares how well different matching algorithms cluster the association models in the knowledge base, assesses whether a new algorithm is superior to the existing one, and decides whether the algorithm should be upgraded. Fourth, a personalized report for the case is generated. The report is divided into two parts: the first part presents statistics on the physiologically relevant multi-omics variation features of the case, giving the omics variation information of the lesion tissue in terms of single-base mutations, chromosomal variations, differentially expressed genes, and so on; the second part, produced once the knowledge base search is complete, lists the matching evidence and medication information of the matched models in descending order of their similarity to the case. Fifth, if the case matches no existing model, medication is chosen directly according to the physician's experience; at the same time, an individualized disease model based on the case can be developed for drug screening, and a case association model and an individualized disease model association model are built for the case from the feedback results and added to the knowledge base.
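The third and fourth steps (score a case against each model, then list matches from most to least similar) can be sketched as follows. The patent does not disclose its scoring formula, so everything specific here is an assumption for illustration: Jaccard similarity as the per-feature score, the weights in `WEIGHTS`, the match threshold, and the dictionary record layout.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical weights per variation type (the source gives no formula).
WEIGHTS = {"mutations": 0.5, "fusions": 0.3, "abnormal_genes": 0.2}


def total_score(case, model, weights=WEIGHTS):
    """Weighted sum of per-feature similarity scores between a case and a model."""
    return sum(w * jaccard(case.get(k, ()), model.get(k, ()))
               for k, w in weights.items())


def search(case, knowledge_base, threshold=0.3):
    """Return (score, model) pairs at or above the threshold, most similar first."""
    scored = [(total_score(case, m), m) for m in knowledge_base]
    hits = [(s, m) for s, m in scored if s >= threshold]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)


# Placeholder case and models; all identifiers are invented.
case = {"mutations": {"G1:a", "G2:b"}, "fusions": {"G3-G4"}, "abnormal_genes": {"G5"}}
kb = [
    {"id": "m2", "mutations": {"G1:a"}, "fusions": set(), "abnormal_genes": {"G5", "G6"}},
    {"id": "m3", "mutations": {"G9:z"}},
    {"id": "m1", "mutations": {"G1:a", "G2:b"}, "fusions": {"G3-G4"}, "abnormal_genes": {"G5"}},
]
hits = search(case, kb)
for score, model in hits:
    print(model["id"], round(score, 2))
```

In this toy data, `m1` shares every feature with the case, `m2` shares about half of the mutation and expression features, and `m3` shares nothing and falls below the threshold. A real system would likely use a different score per variation type, as the text indicates.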
Fig. 1 is a schematic flow chart of the implementation of the technical scheme of the invention. The specific implementation steps are as follows:
1) Constructing the precision medicine knowledge base based on multi-omics variation-intervention response association models: intervention response models at different scales (including but not limited to the population level, individual level, tissue level, and cell line level) are established, and multi-omics variation features with the corresponding intervention and intervention response information are collected in the forms "population omics variation features-intervention response", "individual case omics variation features-intervention response", "individualized disease model (such as PDX mouse or PDO model) omics variation features-intervention response", and "cell line omics variation features-intervention response". The data in the knowledge base are obtained by web crawling, downloads from public databases, local data import (cases and individualized disease models), and similar means. Core keywords and data are extracted from the obtained data using word segmentation, semantic analysis, regular expression matching, and related techniques; the data are then converted in format, the original information is mapped to a standardized information interface that has reference value for clinical intervention design, and the result is added to the database after manual correction. Data for association models of the same type share a uniform storage format in the database;
2) Constructing the pipeline for extracting the multi-omics variation features of a case: a bioinformatics analysis pipeline based on next-generation sequencing is built to extract from the omics data the single-base mutations, genomic structural variations, and genes with abnormal transcript-level expression that are closely related to physiological change, and these are used as the multi-omics variation features of the case for matching against the models in the multi-omics variation feature database. Strict quality control is applied during the analysis of case data; when normal control samples are available, the normal samples and known disease-related omics variation information are used to screen the multi-omics variations of the case, which improves the reliability of the association between the multi-omics variation features of the case and the physiological phenotype;
3) Implementing the case-model multi-omics variation cooperative matching algorithm: the precision medicine knowledge base integrates the variation feature information of association models from multiple data sources and multiple omics perspectives. Once a case has completed multi-omics variation feature extraction and enters the case matching system, it is matched against the models according to the type of each model in the knowledge base. When a case is matched against a specific association model, a separate method is used for each type of omics variation feature to match and score the variation features extracted from the case against those of the model; the scores for the different variation features are then combined by a formula into a total matching score for the case and the drug response model, and the total score determines whether the case matches the model;
4) Generating the analysis report of the case matching system: the report has two layers. The first layer is the omics information report of the individual case, including but not limited to the sequencing quality of the raw data, an introduction to the data analysis pipeline, and statistics on the multi-omics variation features. The second layer reports the matching of the case against the models in the precision medicine knowledge base: according to the search results, the intervention strategy, response result, matching evidence, and other information of each model are displayed in descending order of the similarity between the model and the case. The second layer provides readable information of the form "individual case omics variation features-model omics variation features-intervention response" and offers potential intervention response information for the case, helping physicians interpret the physiological significance of the omics variation features and exploit the clinical value of the omics data;
5) Updating the search system: the update is divided into two parts, the data update of the knowledge base and the self-evolution of the matching algorithm.
First, updating the knowledge base: when a case matches a model in the knowledge base, the therapeutic effect of the medication given to the case is tracked and the case data are added to the precision medicine knowledge base as a case model, which expands the coverage of the knowledge base and improves its matching accuracy. When no matching association model is found in the knowledge base, the patient can be treated directly according to the physician's experience, and an individualized disease model (such as a PDX mouse or PDO organoid model) can also be developed from the case; the intervention response of the case and the drug testing results of the individualized disease model are tracked, and the corresponding case association model or individualized disease model association model is built and added to the precision medicine knowledge base.
Second, self-evolution of the matching algorithm: the system establishes an evaluation method that compares a new matching algorithm with the old one in order to optimize the system's matching algorithm. When the system is first put into operation, it is given an initial matching algorithm that remains to be optimized. As new cases accumulate, the models in the precision medicine knowledge base keep increasing, which provides the resources for optimizing the matching algorithm. M association models can be randomly selected, stratified by how the models in the knowledge base respond to intervention; the new and the old matching algorithm are each used to score every pair of selected models, yielding two similarity score matrices. Clustering each matrix gives the model classification produced by the new and by the old matching algorithm; these classifications are compared with the true classification according to the drug response information to judge whether the new algorithm is superior and should replace the system's current algorithm.
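The evaluation loop described above can be sketched in plain Python. The patent does not name a clustering method or an agreement measure, so this sketch substitutes two simple choices as assumptions: threshold-based single-linkage clustering and the Rand index. The function names, the `responded` label, and the example scoring functions are likewise hypothetical.

```python
import random
from itertools import combinations


def similarity_matrix(models, score):
    """Pairwise similarity matrix produced by one matching algorithm."""
    return [[score(a, b) for b in models] for a in models]


def cluster(matrix, threshold):
    """Single-linkage clustering: two models are joined when similarity >= threshold."""
    labels = list(range(len(matrix)))  # each model starts in its own cluster
    for i, j in combinations(range(len(matrix)), 2):
        if matrix[i][j] >= threshold and labels[i] != labels[j]:
            old, new = labels[j], labels[i]
            labels = [new if lab == old else lab for lab in labels]
    return labels


def rand_index(pred, truth):
    """Fraction of model pairs on which two groupings agree (needs >= 2 models)."""
    pairs = list(combinations(range(len(pred)), 2))
    agree = sum((pred[i] == pred[j]) == (truth[i] == truth[j]) for i, j in pairs)
    return agree / len(pairs)


def compare_algorithms(models, old_score, new_score, m, threshold=0.5, seed=0):
    """Score m randomly selected models with both algorithms and compare to response labels."""
    sample = random.Random(seed).sample(models, m)
    truth = [mod["responded"] for mod in sample]
    old_ri = rand_index(cluster(similarity_matrix(sample, old_score), threshold), truth)
    new_ri = rand_index(cluster(similarity_matrix(sample, new_score), threshold), truth)
    return new_ri > old_ri, old_ri, new_ri


def jaccard_mutations(a, b):
    """Example 'new' algorithm: Jaccard similarity of the mutation sets."""
    union = a["mutations"] | b["mutations"]
    return len(a["mutations"] & b["mutations"]) / len(union) if union else 0.0


# Placeholder models: responders share one mutation profile, non-responders another.
models = [
    {"mutations": {"a", "b"}, "responded": True},
    {"mutations": {"a", "b"}, "responded": True},
    {"mutations": {"c"}, "responded": False},
    {"mutations": {"c"}, "responded": False},
]
# Example 'old' algorithm: treats every pair as identical, so it cannot separate groups.
better, old_ri, new_ri = compare_algorithms(models, lambda a, b: 1.0, jaccard_mutations, m=4)
print(better, round(old_ri, 3), new_ri)
```

Here the old algorithm lumps all four models into one cluster and agrees with the response labels on only two of six pairs, while the new one recovers the two response groups exactly, so the comparison favors replacing the old algorithm.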
The technical solution of the invention is further illustrated by the following specific example:
Example 1: Rapid cancer case matching system based on case transcriptome variation features
This embodiment consists of five major steps:
1) Constructing the precision medicine knowledge base: the knowledge base stores association models and collects multi-omics variation features linked to drug response information from three data sources: the list of companion diagnostic drugs approved by the Food and Drug Administration (FDA), the precision cancer medicine information provided by My Cancer Genome, and the GDSC database of the Sanger Institute. The companion diagnostic drug list and My Cancer Genome provide population-level omics variation feature-drug response information, and the GDSC database provides cell line-level omics variation feature-drug response information. Data in different formats are managed uniformly using the nomenclature of international standard databases. In this example, single-base mutations from different sources are mapped to the corresponding names in the COSMIC database, whose nomenclature serves as the standard output. Similarly, gene names are normalized to NCBI Entrez IDs and disease names to OMIM IDs.
2) Extracting the multi-omics variation features of the case: a bioinformatics analysis pipeline based on transcriptome sequencing (RNA-Seq) data is set up to extract from the transcriptome data the single-base mutations, chromosomal structural variations, and genes with abnormal transcript-level expression that are closely related to physiological change; these are used as the multi-omics variation features of the case for matching against the models in the multi-omics variation feature database.
In this example, the variation extraction pipeline is divided into the following parts: RNA-Seq data preprocessing, single-base mutation detection (single nucleotide polymorphisms and small insertions and deletions), chromosomal structural variation detection (gene fusions), gene expression quantification and detection of abnormally expressed genes, and visualization of the results.
Firstly, RNA-Seq data preprocessing:
The quality of the raw data is checked with a quality control tool. Adapter-trimming software is then used to remove adapter sequences and low-quality bases at the head and tail of the reads. The cleaned reads are used for the subsequent sequence alignment. Here, the example uses fast short-read alignment software to align the reads against the human genome as the reference genome.
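The removal of low-quality bases at the head and tail of a read can be illustrated with a short sketch. The description does not name the trimming tool or its parameters, so the per-base rule, the Phred cutoff of 20, and the function name are assumptions; production trimmers typically use more elaborate sliding-window criteria.

```python
def trim_low_quality_ends(sequence, qualities, min_quality=20):
    """Strip bases with Phred quality below min_quality from both ends of a read.

    Interior low-quality bases are left untouched.
    """
    start = 0
    end = len(sequence)
    # Advance past low-quality bases at the head of the read.
    while start < end and qualities[start] < min_quality:
        start += 1
    # Retreat past low-quality bases at the tail of the read.
    while end > start and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[start:end], qualities[start:end]


trimmed_seq, trimmed_quals = trim_low_quality_ends(
    "ACGTACGT", [5, 10, 30, 35, 32, 31, 12, 8]
)
```

A read whose bases all fall below the cutoff comes back empty and would normally be discarded before alignment.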
Secondly, detecting single-base mutations of the case:
This step follows the best-practice workflow for RNA-seq variant calling provided by GATK (http://gatkforums.broadinstitute.org/gatk/discussion/3892/the-gatk-best-practices-for-variant-calling-on-rnaseq-in-full-detail). Duplicate reads are removed from the alignment file obtained in step 1. The reads are then trimmed at their ends, split into exon segments, and subjected to base quality recalibration. Single nucleotide polymorphisms and insertions and deletions are detected. Finally, the detected single-base variants are annotated and filtered with variant annotation software using human genome variation database resources.
Thirdly, detecting chromosomal variation of the case:
The structural variation detectable from transcriptome sequencing data is mainly gene fusion. Here, gene fusion detection software is applied to the alignment results from step 1 to detect the gene fusion events observable in the transcriptome.
Fourthly, detecting gene expression levels:
This step also uses the alignment file from step 1 as the input to read-assembly software, which assembles the transcripts and calculates expression levels. In this example we only consider the situation where no para-cancerous (adjacent normal) tissue is provided and none is available in public cancer transcriptome databases.
Fifthly, visualization of the multi-omics results:
The global multi-omics variation profile of an individual case is shown as a circular plot. The plot consists of four rings from the inside out: the innermost shows the positions of gene fusion events, the next shows the positions of single-base mutation events, the next shows gene expression across the whole transcriptome, and the outermost shows annotated chromosome position information.
The various statistical charts generated during the analysis, such as scatter plots, histograms, and pie charts, are rendered with the statistical software R.
3) Implementing the case-model multi-omics variation collaborative matching algorithm: the multi-omics variation feature database integrates the variation feature information of association models from multiple data sources and from multiple omics perspectives. Once the multi-omics variation features of a case have been extracted and the case enters the case matching system, a case-model matching algorithm must be provided according to the type of model in the database.
In this example, the knowledge base provides three types of models: 1. companion diagnostic association models; 2. cell line association models; 3. case association models.
The intervention results given by a population-level association model typically describe the effect of one or more specific omics variation features on drug response in a large population sample. The example therefore adopts the following strategy for this type of model: the case is compared with the population model, and if the case carries the identical omics variation features, the case is reported as successfully matching the population model; otherwise the match fails.
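A minimal sketch of this exact-match strategy follows. It assumes that "identical" means the case carries every variation feature the population model requires, and that features have already been normalized to comparable identifiers. The tuple representation and the function name are hypothetical.

```python
def matches_population_model(case_features, model_features):
    """Return True when the case carries every feature required by the model.

    Assumption: extra features in the case do not prevent a match.
    """
    return set(model_features) <= set(case_features)


# Hypothetical features expressed as (gene, variant) pairs.
case = {("BRAF", "V600E"), ("TP53", "R175H")}
single_feature_model = {("BRAF", "V600E")}
two_feature_model = {("BRAF", "V600E"), ("EGFR", "L858R")}

match_single = matches_population_model(case, single_feature_model)
match_double = matches_population_model(case, two_feature_model)
```

Under this reading the first model matches and the second does not, because the case lacks one of the two required features.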
Both the cell-line-level and the individual-level association models provide complete single-base mutation, chromosomal structural variation, and gene expression profile information. The example therefore uses a similarity scoring method that combines these three kinds of information to measure the similarity between a case and a model. The only difference between matching against a cell-line-level model and an individual-level model is the threshold parameter that finally decides whether the match succeeds. The implementation steps are as follows:
Firstly, for single-base mutations: the example uses the DANN method to measure the functional importance of single-base mutations in the case and the model. For each gene, the DANN values of the sites carrying significant functional single-base mutations are summed, separately for the case and for the model, to measure the physiological impact of functional single-base mutations on that gene. The similarity score for functional mutation of a gene between the case and the model is V1 = 1 - |C_snv - M_snv| / Max{C_snv, M_snv}, where C_snv is the functional mutation impact value of the gene in the case and M_snv is the functional mutation impact value of the same gene in the model. This score, V1, serves as the index of gene-function similarity between the case and the model.
Secondly, for chromosomal structural variation: there is currently no method that directly measures the degree of physiological impact of a gene fusion. Considering that structural variation usually has a very severe effect on the physiological function of a gene, the example uses a custom binary index V2 (0 or 1) to measure the similarity between the case and the model with respect to gene fusion events. If a gene is fused in both the case and the model, or in neither, V2 is 1; otherwise V2 is 0.
Thirdly, for abnormally expressed genes: the example defines an index V3 for abnormal gene expression, V3 = 1 - |C_exp - M_exp| / Max{C_exp, M_exp}, where C_exp and M_exp are the expression levels of a given gene in the case and in the model, respectively, after the expression profiles have been normalized.
In this example, abnormal gene expression reflects variation at the transcription level, whereas single-base mutation and chromosomal structural variation reflect variation in the genome, so the effects of both must be taken into account when integrating these indices. The final similarity score between the case and the model for a given gene is defined as V = Min{V3 × V1, V3 × V2}, where V1, V2, and V3 are the three similarity indices described above. If the similarity score of a gene is higher than 0.5, the gene is considered to behave consistently in the case and the model. The match between a case and a model is considered successful when more than half of the genes in the case behave consistently with the model; otherwise the match is considered to have failed.
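The per-gene scoring and the final decision can be sketched as below. The data layout and all names are hypothetical. Two points are assumptions because the description does not settle them: a ratio with both values equal to zero is scored as 1.0, and a gene present in the case but absent from the model counts as inconsistent. Input values are assumed to be non-negative.

```python
from dataclasses import dataclass


@dataclass
class GeneFeatures:
    snv_impact: float   # summed DANN values of functional single-base mutations
    fused: bool         # whether a gene fusion event involves this gene
    expression: float   # normalized expression level


def ratio_similarity(c, m):
    """1 - |c - m| / max(c, m); scored as 1.0 when both are zero (assumption)."""
    top = max(c, m)
    if top == 0:
        return 1.0
    return 1.0 - abs(c - m) / top


def gene_similarity(case_gene, model_gene):
    v1 = ratio_similarity(case_gene.snv_impact, model_gene.snv_impact)
    v2 = 1.0 if case_gene.fused == model_gene.fused else 0.0
    v3 = ratio_similarity(case_gene.expression, model_gene.expression)
    return min(v3 * v1, v3 * v2)


def case_matches_model(case, model, gene_threshold=0.5, gene_fraction=0.5):
    """Match succeeds when more than gene_fraction of the case's genes
    score above gene_threshold against the model."""
    if not case:
        return False
    consistent = sum(
        1
        for name, features in case.items()
        if name in model and gene_similarity(features, model[name]) > gene_threshold
    )
    return consistent > gene_fraction * len(case)


# Hypothetical three-gene case and model.
case = {
    "A": GeneFeatures(1.8, False, 10.0),
    "B": GeneFeatures(0.0, True, 4.0),
    "C": GeneFeatures(0.9, False, 2.0),
}
model = {
    "A": GeneFeatures(2.0, False, 8.0),
    "B": GeneFeatures(0.0, False, 4.0),
    "C": GeneFeatures(0.9, False, 2.0),
}
score_a = gene_similarity(case["A"], model["A"])
matched = case_matches_model(case, model)
```

For gene A, V1 = 0.9, V2 = 1, and V3 = 0.8, giving V = Min{0.72, 0.8} = 0.72. Gene B scores 0 because the fusion status differs, and gene C scores 1, so two of three genes are consistent and the match succeeds. The `gene_fraction` parameter stands in for the threshold that the description says differs between cell-line-level and individual-level models.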
4) Generating an analysis report according to the matching result of the case:
The analysis report consists of two main parts: the individual case information and the knowledge base search results.
The individual case information in this example comprises:
1. basic information on the sequenced sample (sample name, sample submission time, sequencing time, sequencer model, sample label, and data saturation evaluation parameters);
2. an overall omics data overview plot and statistics of the transcriptome sequencing data (number of raw reads, number of reads after cleaning, number of reads aligned to the reference genome, and number of uniquely aligned reads);
3. a histogram of the expression distribution of the detected expressed genes and a plot of the differentially expressed genes;
4. counts of single-base variants and structural variants in the genome, and a description of how to read the variant file format;
5. the locations of the raw-data QC report, the gene and transcript expression files, the differentially expressed gene file, the single-base variant information file, and the gene fusion information file.
The knowledge base search results comprise:
1. basic information on each matched model (model type, original data source, model name, disease name, etc.);
2. evidence supporting the match between the case and the model (type of each matched index in the model and the case, index name, index value, etc.);
3. clinical medication reference information for the matched model (drug name, whether the model responds to the drug, etc.).
5) Self-evolution of the search system:
Firstly, updating the precision medicine knowledge base: cases that have entered the knowledge base for analysis are tracked. A case omics variation feature-intervention response association model is built from the effect of the treatment the case received as prescribed and from its long-term outcome, and is added to the knowledge base. For a case for which no matching model is found when it first enters the knowledge base, establishing an individualized disease model (a PDX mouse model or a PDO organoid model) is considered. An individualized disease model omics variation feature-drug response association model is built from the response of the in vitro individualized disease model to different drugs, and is added to the knowledge base.
Secondly, self-evolution of the matching algorithm: once the number of models of a given type in the knowledge base has accumulated to a certain value, M models of that type can be randomly selected and classified according to their response to drugs, for use in evaluating the matching algorithm for that type of model. When a new case is matched against the models, the evaluation results of the new and old matching algorithms can be compared. If the new algorithm's classification agrees more closely with the classification by drug response, the new algorithm performs better under real conditions and the matching algorithm is updated; otherwise the original algorithm performs better and the new algorithm is discarded.
The foregoing merely illustrates specific embodiments of the invention. Obviously, the invention is not limited to the above embodiments, and many variations are possible. All variations that a person skilled in the art can derive directly from, or be led to by, the disclosure of the present invention shall be considered to fall within the scope of protection of the invention.

Claims (9)

1. A precision medicine knowledge search system based on case multi-omics variation features, characterized in that the system comprises:
a precision medicine knowledge base for collecting multi-omics variation-intervention response association models, which collects and integrates omics variation feature-intervention response information at different levels;
an optimizable matching algorithm for determining whether a case matches a model in the knowledge base and the degree of matching;
an evaluation algorithm for the matching algorithm, which evaluates the clustering of the knowledge base models produced by the matching algorithm and compares it with the classification of the models by their intervention response labels, so that the quality of the matching algorithm can be assessed and the algorithm continuously optimized;
a report generated directly by the search system, containing the omics analysis data of the case and the search results of the system, which provides doctors with a reference on the physiological meaning of the omics data and assists in drawing up treatment plans.
2. The precision medicine knowledge search system based on case multi-omics variation features according to claim 1, wherein the different levels include the population level, individual level, tissue level, and cell line level.
3. A method for implementing the precision medicine knowledge search system based on case multi-omics variation features according to claim 1 or 2, characterized in that the method comprises the following steps:
1) establishing a precision medicine knowledge base based on multi-omics variation-intervention response association models;
2) when a new case appears, extracting the multi-omics variation features of the new case;
3) establishing a matching algorithm between the new case and the multi-omics variation-intervention response association models;
4) generating an analysis report from the case matching system;
5) updating the data of the knowledge base and self-evolving the matching algorithm.
4. The method for implementing the precision medicine knowledge search system based on case multi-omics variation features according to claim 3, wherein the multi-omics variation information in step 1) comprises single-base mutation, chromosomal variation, and reference gene expression levels used to determine whether a gene is abnormally expressed, in transcriptionally active genomic regions.
5. The method for implementing the precision medicine knowledge search system based on case multi-omics variation features according to claim 3, wherein in step 1) a multi-omics variation-intervention response association model is one of: a "companion diagnostic association model" comprising companion diagnostic drug response annotations and multi-omics variation features; a "cell line association model" comprising drug response information from drug screening experiments and multi-omics variation features; a "case association model" comprising clinically observed intervention response results and multi-omics variation features; or an "individualized disease model association model" comprising drug screening result information and multi-omics variation features.
6. The method for implementing the precision medicine knowledge search system based on case multi-omics variation features according to claim 4 or 5, wherein in step 2) a standardized omics data analysis pipeline is established to extract multi-omics variation, with quality control and quality assurance applied throughout the process from sample collection, sequencing, and data analysis to knowledge base matching.
7. The method for implementing the precision medicine knowledge search system based on case multi-omics variation features according to claim 6, wherein in step 3) the search system provides an initial matching algorithm and an evaluation method for matching algorithms; the evaluation method assesses whether the existing algorithm or the new algorithm is better according to how well each matching algorithm clusters the association models in the knowledge base, in order to decide whether the algorithm needs to be upgraded and optimized.
8. The method for implementing the precision medicine knowledge search system based on case multi-omics variation features according to claim 4, 5 or 7, wherein in step 4) the report is divided into two parts: the first part presents statistics on the physiologically relevant multi-omics variation features of the case, giving the omics variation information of the lesion tissue in terms of single-base mutations, chromosomal variations, differentially expressed genes, and the like; the second part, produced after the knowledge base search is completed, presents the matching evidence and medication information of the models in descending order of similarity between model and case.
9. The method for implementing the precision medicine knowledge search system based on case multi-omics variation features according to claim 3, wherein in step 5), when a case matches a model in the knowledge base, the medication effect of the case is tracked, and the multi-omics variation features and the actual medication effect of the case are added to the precision medicine knowledge base as a case association model, expanding the coverage of the knowledge base and improving its matching precision; when no matchable association model is found in the knowledge base, the patient may be treated directly according to the doctor's experience, or an individualized disease model may be established from the case; the intervention response of the case and the drug test results of the individualized disease model are tracked, and the corresponding case association model or individualized disease model association model is established and added to the precision medicine knowledge base.
CN201710218630.XA 2017-04-05 2017-04-05 Accurate medical knowledge search system based on case multigroup variation characteristics and implementation method Active CN107103207B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710218630.XA CN107103207B (en) 2017-04-05 2017-04-05 Accurate medical knowledge search system based on case multigroup variation characteristics and implementation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710218630.XA CN107103207B (en) 2017-04-05 2017-04-05 Accurate medical knowledge search system based on case multigroup variation characteristics and implementation method

Publications (2)

Publication Number Publication Date
CN107103207A CN107103207A (en) 2017-08-29
CN107103207B CN107103207B (en) 2020-07-03

Family

ID=59675265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710218630.XA Active CN107103207B (en) 2017-04-05 2017-04-05 Accurate medical knowledge search system based on case multigroup variation characteristics and implementation method

Country Status (1)

Country Link
CN (1) CN107103207B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108320797B (en) * 2018-01-18 2022-03-08 中山大学 Nasopharyngeal carcinoma database and comprehensive diagnosis and treatment decision method based on database
CN108335748A (en) * 2018-01-18 2018-07-27 中山大学 A kind of nasopharyngeal carcinoma artificial intelligence assisting in diagnosis and treatment policy server cluster
CN108509771B (en) * 2018-03-27 2020-12-22 华南理工大学 Multi-group chemical data association relation discovery method based on sparse matching
CN109599157B (en) * 2018-11-29 2020-10-02 同济大学 Accurate intelligent diagnosis and treatment big data system
CN110656172A (en) * 2019-01-14 2020-01-07 南方医科大学珠江医院 Molecular marker and kit for predicting sensitivity of small cell lung cancer to EP chemotherapy scheme
CN110379460B (en) * 2019-06-14 2023-06-20 西安电子科技大学 Cancer typing information processing method based on multiple sets of chemical data
CN110660055B (en) * 2019-09-25 2022-11-29 北京青燕祥云科技有限公司 Disease data prediction method and device, readable storage medium and electronic equipment
CN112070731B (en) * 2020-08-27 2021-05-11 佛山读图科技有限公司 Method for guiding registration of human body model atlas and case CT image by artificial intelligence
CN112053783A (en) * 2020-08-27 2020-12-08 北京颢云信息科技股份有限公司 Disease intelligent prediction modeling method based on multiple groups of mathematical data

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1547721A (en) * 2001-08-28 2004-11-17 System, method, and apparatus for storing, retrieving, and integrating clinical, diagnostic, genomic, and therapeutic data
CN102637245A (en) * 2001-05-25 2012-08-15 株式会社日立制作所 Information processing system using nucleotide sequence-related information
CN103955608A (en) * 2014-04-24 2014-07-30 上海星华生物医药科技有限公司 Intelligent medical information remote processing system and processing method
CN104067278A (en) * 2011-11-18 2014-09-24 加利福尼亚大学董事会 Bambam: parallel comparative analysis of high-throughput sequencing data
CN105229651A (en) * 2013-05-23 2016-01-06 皇家飞利浦有限公司 DNA sequence dna fast and the retrieval of safety
CN105229649A (en) * 2013-03-15 2016-01-06 百世嘉(上海)医疗技术有限公司 For the human genome analysis of variance of disease association and the system and method for report
CN105701342A (en) * 2016-01-12 2016-06-22 西北工业大学 Agent-based construction method and device of intuitionistic fuzzy theory medical diagnosis model
CN105760705A (en) * 2016-05-20 2016-07-13 陕西科技大学 Medical diagnosis system based on big data
CN106202936A (en) * 2016-07-13 2016-12-07 为朔医学数据科技(北京)有限公司 A kind of disease risks Forecasting Methodology and system
CN106227992A (en) * 2016-07-13 2016-12-14 为朔医学数据科技(北京)有限公司 A kind of recommendation method and system of therapeutic scheme

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020064792A1 (en) * 1997-11-13 2002-05-30 Lincoln Stephen E. Database for storage and analysis of full-length sequences
US20110256545A1 (en) * 2010-04-14 2011-10-20 Nancy Lan Guo mRNA expression-based prognostic gene signature for non-small cell lung cancer

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
《Gene–disease relationship discovery based on model-driven data integration and database view definition》;S. Yilmaz等;《BIOINFORMATICS》;20090228;第25卷(第2期);第230-236页 *
《医疗大数据临床应用的探索与实践》;汪鹏等;《中国数字医学》;20160930;第11卷(第9期);第8-14页 *

Also Published As

Publication number Publication date
CN107103207A (en) 2017-08-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant