CN109599157A

CN109599157A - A precision intelligent diagnosis and treatment big data system

Info

Publication number: CN109599157A
Application number: CN201811444715.0A
Authority: CN
Inventors: 周小波
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2018-11-29
Filing date: 2018-11-29
Publication date: 2019-04-09
Anticipated expiration: 2038-11-29
Also published as: CN109599157B

Abstract

The present invention relates to a precision intelligent diagnosis and treatment big data system, comprising: a centralized data management module, which centrally manages clinical electronic medical record data and omics data from multiple medical institutions; a data preprocessing module, which preprocesses the centrally managed data and establishes a relationship dependency network based on biomedical features; a marker extraction module, which extracts patient characteristic genes from the preprocessed data to obtain a marker set; a subtype classification module, which classifies patients into subtypes and determines the group to which each patient belongs; and a drug response prediction module, which establishes a drug response prediction model and uses it to predict a patient's response to different drugs. Compared with the prior art, the invention enables effective management of medical data and drug response prediction, making diagnosis and treatment intelligent.

Description

A precision intelligent diagnosis and treatment big data system

Technical field

The present invention relates to the field of big data technology, and more particularly to a precision intelligent diagnosis and treatment big data system.

Background art

In 2015, China had 4.292 million new cancer cases and 2.814 million cancer deaths, accounting for 22% and 27% of the world totals, respectively, causing a huge social burden and economic loss. Lung cancer and breast cancer are the most common cancer types among Chinese men and women, respectively. Because diseases such as cancer are heterogeneous and variable, cancer drugs are effective in only 25% of cases, and individualized precision medicine has become the only way to make further progress against cancer.

"Precision medicine" refers to a customized medical model that, based on the individual genome and combined with related information such as the proteome and metabolome, tailors the optimal treatment plan to the patient so as to maximize therapeutic effect and minimize side effects. The progress of modern genomics can provide the pharmaceutical industry with the latest genetic and molecular basis of disease pathology, giving technical support for the development of highly effective drugs and for personalized medicine. In tumor treatment in particular, unlike the conventional approach of typing patients and formulating treatment plans from tumor histology, new molecular detection methods can better determine a patient's disease course through precise detection of an individual's genes, proteins, signal transduction, and cancer cell mutations, and thereby propose the most effective treatment recommendations. In the long run, personalized precision medicine predicts the risk of potential disease through more accurate diagnosis, provides more effective and more targeted treatment, prevents the occurrence of certain diseases, and saves treatment costs.

Comprehensive large-scale population genomics research, accurate and timely molecular marker detection, individualized precise diagnosis that combines complex clinical features with multi-omics features, and targeted drug development for specific molecular pathological mechanisms are the key links of precision medicine, while bioinformatics and big data technology form the skeleton that supports the entire precision medicine system. The complexity of disease treatment at the molecular biology scale far exceeds what traditional medical statistics and clinical pathway guidelines can characterize, and to some extent even challenges the diagnostic model that relies mainly on physicians' experience. From the discovery and optimization of molecular markers, to the establishment of disease diagnosis and drug efficacy prediction models, to the selection of targeted therapeutic drugs and the development of new drug targets, intelligent assisted diagnosis and treatment technology built on omics big data and knowledge engineering is an important supporting tool for realizing precision medicine. The key research and development special guide for precision medicine recently issued by the Ministry of Science and Technology lists "construction of precision medicine big data application technology and a sharing platform" as one of eight major tasks, which shows that the importance of a powerful biological big data and bioinformatics support platform has become an industry consensus in the field of precision medicine.

Building a precision medicine data support platform faces three major challenges: how to overcome the high heterogeneity and dispersion of medical data and achieve effective sharing and fusion of clinical data among multiple medical institutions; how to screen effective markers and model features from the massive number of features in the human genome and a relatively limited number of patient samples, so as to achieve precise patient classification and treatment plan matching at the molecular biology level; and how to overcome the computational complexity brought by massive high-dimensional features, fully mine and establish disease-drug-genome three-way association rules, and effectively predict the effect of medication.

Subtype classification of complex diseases such as cancer is a core task of precision medicine. Traditional subtype classification is mainly based on histological specificity and has significant clinical limitations; in particular, classification-based therapy for end-stage patients often works poorly. With the spread of high-throughput experiments, scientists have begun to classify cancers based on the genome, transcriptome, and epigenome. Large-scale genome projects such as TCGA have collected molecular and genetic features of more than ten thousand tumor samples of various cancer types, which indicates that fine-grained classification of cancer patients is entering an era of major change. Because cancer tissue is a heterogeneous, dynamic, and constantly mutating system, existing research has shown that typing by molecular and genetic features cannot be confined to static classification based on a small number of samples; dynamic analysis based on a large number of patient samples is needed to obtain accurate diagnostic results. Novel big data bioinformatics software packages therefore need to be developed to address the following challenges: integrating and sharing clinical data; representing such heterogeneous relationships in feature and data spaces with clear biological and clinical meaning; screening these high-dimensional feature spaces to measure the strength and understand the properties of these relationships; classifying cancer subtypes; evaluating drug efficacy; and developing personalized treatment prediction models that apply the acquired knowledge to individualized treatment.

Assessing and predicting medication effect for the individual patient is another key challenge of precision medicine. Although the development of targeted drugs has greatly improved how well medication fits the individual, the separation of pharmaceutical development from diagnosis and treatment under the existing medical system means that the clinical population involved in drug research and development is very limited, and the effect of a drug on a large population after it reaches the market often differs considerably from the effect in the experimental stage. Obtaining, from the association of large amounts of clinical medical data and gene data, the potential molecular mechanisms by which gene and phenotype features are closely related to individual differences in drug response and to cancer prognosis, then building prediction models and optimizing clinical diagnosis and treatment according to the characteristics of each patient, is the necessary path to precision medicine. The exponentially growing volume of biomedical big data provides extensive details on how cancer patients differ in drug sensitivity, and extracting this information makes it easy to analyze the effect of a drug from multiple angles. The resulting details on medication suitability and clinical effect provide very valuable information for hospitals standardizing clinical medication and for pharmaceutical manufacturers updating and iterating their products.

Clinical electronic medical records document in detail the clinical features and medical test results over the course of a patient's disease, and are an important link for associating genomic data with clinical diagnosis and treatment and for obtaining first-hand precision diagnosis and treatment data. However, the electronic medical records of the existing medical system are commonly scattered, inconsistent in format, and difficult to share, and the "information island" phenomenon is serious. In addition, the level of information mining is generally low: a large amount of useful information in electronic medical record data cannot be fully extracted and is wasted. Finally, the informatization of most hospitals is concentrated on medical business management and gives insufficient support to scientific research; clinical medical databases in particular struggle to provide comprehensive search functions and to incorporate medical ontology languages for structured information extraction. These problems limit the use of clinical medical databases in clinical decision support systems and clinical trial systems.

Summary of the invention

The object of the present invention is to overcome the above drawbacks of the prior art by providing a precision intelligent diagnosis and treatment big data system.

The object of the present invention can be achieved through the following technical solution:

A precision intelligent diagnosis and treatment big data system, comprising:

A centralized data management module: centrally manages clinical electronic medical record data and omics data from multiple medical institutions;

A data preprocessing module: preprocesses the centrally managed data and establishes a relationship dependency network based on biomedical features;

A marker extraction module: extracts patient characteristic genes from the preprocessed data to obtain a marker set;

A subtype classification module: classifies patients into subtypes and determines the group to which each patient belongs;

A drug response prediction module: establishes a drug response prediction model and predicts the patient's response to different drugs according to that model.

The centralized data management module uses the i2b2 and SCILHS/SHRINE data sharing systems to perform dynamic extraction, dynamic fusion, and dynamic dataset generation on the clinical electronic medical record data and omics data of multiple medical institutions, thereby completing centralized data management.

The relationship dependency network based on biomedical features is a three-part heterogeneous graph of patients, cell lines, and drugs.

The marker set includes molecular, cellular, intracellular, clinical, and demographic features and events.

The subtype classification module performs subtype classification using the H-cube algorithm.

The H-cube algorithm performs subtype classification as follows:

(1) compute the G-Score of each marker for the patient and generate the general marker set, where the G-Score indicates how enriched a marker is in a gene set;

(2) perform hash mapping based on the marker G-Scores and the generated general marker set;

(3) construct a Hasse diagram based on the hash mapping result;

(4) perform biclustering based on Hasse diagram search and fuzzy matching to complete patient subtype classification.

The drug response prediction model is a drug response prediction model based on patient-cell line-drug response three-dimensional clustering.

The drug response prediction model based on patient-cell line-drug response three-dimensional clustering includes the following prediction process:

(1) perform drug response analysis using a signature-guided non-negative matrix factorization algorithm, and identify cell lines and drugs according to their different drug responses;

(2) map each patient to a suitable cell line based on cancer metastasis survival time;

(3) find and select each patient's signature using an exhaustive-search support vector machine, and determine the patient's drug response.
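Step (1) relies on non-negative matrix factorization (NMF). The following is a minimal sketch of plain NMF with multiplicative updates on a toy cell line by drug response matrix; the signature guidance described above is not implemented here, and the matrix values, function names, and rank are illustrative assumptions rather than part of the disclosure.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_error(V, W, H):
    """Squared Frobenius norm of V - WH."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

def scale(M, num, den, eps):
    """Element-wise M * num / (den + eps)."""
    return [[m * a / (b + eps) for m, a, b in zip(rm, ra, rb)]
            for rm, ra, rb in zip(M, num, den)]

def nmf(V, rank, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        Wt = transpose(W)
        # H <- H * (W^T V) / (W^T W H)
        H = scale(H, matmul(Wt, V), matmul(matmul(Wt, W), H), eps)
        Ht = transpose(H)
        # W <- W * (V H^T) / (W H H^T)
        W = scale(W, matmul(V, Ht), matmul(W, matmul(H, Ht)), eps)
    return W, H

# Toy response matrix (made-up values): rows are cell lines, columns are drugs.
response = [
    [0.9, 0.8, 0.1, 0.0],
    [1.0, 0.9, 0.0, 0.1],
    [0.1, 0.0, 0.8, 0.9],
    [0.0, 0.1, 0.9, 1.0],
]
W, H = nmf(response, rank=2)
# Assign each cell line to the factor with the largest loading.
modules = [max(range(2), key=lambda k: row[k]) for row in W]
```

Rows of `W` give each cell line's loading on a factor and rows of `H` give each drug's loading, so reading off the dominant factor per row and per column yields joint cell line and drug groupings.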

Compared with the prior art, the present invention has the following advantages:

(1) the system enables medical data sharing under the SHRINE architecture and achieves big data management;

(2) through knowledge discovery and knowledge representation of disease feature markers in clinical and omics big data, the system uses analysis driven by large amounts of online data to discover disease marker relationship network knowledge in a high-dimensional heterogeneous biomedical data environment, laying the foundation for precise assisted diagnosis and treatment of diseases;

(3) the invention obtains an accurate disease subtype classification structure through supervised and unsupervised deep learning and, taking into account cancer cell heterogeneity, dynamic variation, and polygene and drug interactions, establishes a patient-cell line-drug three-way relationship knowledge structure to accurately predict drug effects for patients.

Brief description of the drawings

Fig. 1 is a structural block diagram of the precision intelligent diagnosis and treatment big data system of the present invention.

In the figure, 1 is the centralized data management module, 2 is the data preprocessing module, 3 is the marker extraction module, 4 is the subtype classification module, and 5 is the drug response prediction module.

Specific embodiment

The present invention is described in detail below with reference to the accompanying drawing and a specific embodiment.

Embodiment

As shown in Fig. 1, a precision intelligent diagnosis and treatment big data system comprises:

Centralized data management module 1: centrally manages clinical electronic medical record data and omics data from multiple medical institutions;

Data preprocessing module 2: preprocesses the centrally managed data and establishes a relationship dependency network based on biomedical features;

Marker extraction module 3: extracts patient characteristic genes from the preprocessed data to obtain a marker set;

Subtype classification module 4: classifies patients into subtypes and determines the group to which each patient belongs;

Drug response prediction module 5: establishes a drug response prediction model and predicts the patient's response to different drugs according to that model.

One, centralized data management module 1

Centralized data management module 1 uses the i2b2 and SCILHS/SHRINE data sharing systems to perform dynamic extraction, dynamic fusion, and dynamic dataset generation on the clinical electronic medical record data and omics data of multiple medical institutions, thereby completing centralized data management.

This embodiment involves the TCGA database, the Wake Forest University (WFU) clinical breast cancer dataset, and the MDACC (MD Anderson Cancer Center) dataset, and fuses data across these databases.

TCGA database: TCGA is one of the key national projects of the United States. Its goal is to comprehensively characterize clinical tumor samples using demographic information, clinical records, and the latest biotechnology. This embodiment focuses on all published high-level breast cancer, lung cancer, and oral cancer datasets (data that have been averaged, segmented, annotated, and described, or cross-correlated data obtained by comparison with the raw data), including: somatic mutation, DNA methylation, gene copy number variation, DNA-Seq, mRNA-Seq, miRNA-Seq, mRNA microarray, demographic, clinical diagnosis, treatment, and follow-up records. The multi-source heterogeneous biological data of individual patients will be used for drug repositioning, optimization of personalized medicine, and drug discovery.

Wake Forest University (WFU) clinical breast cancer dataset: this clinical dataset covers 1954 breast cancer patients who belong to more than 15 cohorts and have 10 years of care history. The dataset specifically includes: 1) gene expression profiles measured on the Affymetrix U133 gene chip microarray platform; 2) clinical diagnosis records, including receptor status, lymph node status, tumor size, and histological grade; 3) treatment records, including treatment type (surgery, adjuvant hormonal therapy, adjuvant chemotherapy); 4) prognosis records, including the time and event of distant metastasis-free survival (DMFS); 5) demographic records (mainly patient age). Based on this dataset, prototypes for disease feature and marker extraction were developed and the personalized medicine method was further improved; the dataset will also be used as a control group for the TCGA dataset.

Public breast cancer brain metastasis datasets: the Salhia 2014 dataset contains mRNA microarray (GEO:GSE5260), methylation (Figshare:862978), and copy number variation (Figshare:855629) data for 35 breast cancer brain metastasis cases. The Silva 2010 dataset contains mRNA microarray data (GEO:GSE14690), somatic mutations, and clinicopathological features for 39 primary breast cancers and their matched brain metastases. The Duchnowska 2015 HER2+ dataset includes 89 brain metastasis tumors and 70 controls. In total, 153 breast cancer brain metastasis cases are included.

Two, data preprocessing module 2

The relationship dependency network based on biomedical features established by this module is a three-part heterogeneous graph of patients, cell lines, and drugs.
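One possible in-memory representation of such a graph is sketched below. The class name, node names, and edge weights are hypothetical, and the rule that edges only connect nodes of different types is an assumption made for this sketch; the disclosure does not specify a data structure.

```python
from collections import defaultdict

class HeterogeneousGraph:
    """Weighted undirected graph whose nodes are patients, cell lines, or drugs."""

    NODE_TYPES = ("patient", "cell_line", "drug")

    def __init__(self):
        self.node_type = {}            # node name -> node type
        self.adj = defaultdict(dict)   # node name -> {neighbor: weight}

    def add_node(self, name, kind):
        if kind not in self.NODE_TYPES:
            raise ValueError(f"unknown node type: {kind}")
        self.node_type[name] = kind

    def add_edge(self, u, v, weight):
        # Assumption for this sketch: edges only link nodes of different types.
        if self.node_type[u] == self.node_type[v]:
            raise ValueError("edges must connect nodes of different types")
        self.adj[u][v] = weight
        self.adj[v][u] = weight

    def neighbors(self, node, kind=None):
        """Neighbors of a node, optionally restricted to one node type."""
        return {v: w for v, w in self.adj[node].items()
                if kind is None or self.node_type[v] == kind}

# Toy graph with made-up similarity and response weights.
g = HeterogeneousGraph()
g.add_node("patient_1", "patient")
g.add_node("cell_line_A", "cell_line")
g.add_node("drug_X", "drug")
g.add_edge("patient_1", "cell_line_A", 0.8)   # patient to cell line similarity
g.add_edge("cell_line_A", "drug_X", 0.6)      # cell line response to drug
```

Keeping the node type alongside the adjacency map lets later steps ask typed questions, such as which drugs are attached to the cell lines most similar to a given patient.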

Large clinical datasets show a typical pattern of being locally dense and globally sparse, and they cover different data types. To illustrate this characteristic, this embodiment selects a small feature set (about 200 features) as a preliminary feasibility study, which corresponds to 44 features. Eight of these features are molecular markers; the rest are demographic or diagnostic features (see the color bar on the right side). The features belong to four data types: numeric, ordinal, nominal, and binary. The set therefore shows the heterogeneity of the features well.

Pairwise feature relationships are learned with a mechanism based on bootstrapping and combined learning. The mechanism addresses both association between variables of different types (numeric, binary, ordinal, and nominal) and association under different sample sizes. The coverage of a pair of features is measured by a data rate, defined as the proportion of all records that have a value in both features. The sample size available for a correlation therefore varies markedly from one feature pair to another. This embodiment describes the overall strength of association between different concepts using methods that correspond to the specific data types. Five different correlation measures are applied to 10 groups of mixed data, each of which contains all 4 data types.
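The data rate defined above can be computed directly. The following minimal sketch assumes each feature is stored as a column of equal length with `None` marking a missing value; the function names and the toy records are illustrative assumptions.

```python
from itertools import combinations

def data_rate(col_a, col_b):
    """Fraction of records that have a value (not None) in both features."""
    if len(col_a) != len(col_b):
        raise ValueError("feature columns must have the same length")
    if not col_a:
        return 0.0
    both = sum(1 for a, b in zip(col_a, col_b) if a is not None and b is not None)
    return both / len(col_a)

def pairwise_data_rates(table):
    """Map every unordered feature pair (names sorted) to its data rate."""
    return {(f, g): data_rate(table[f], table[g])
            for f, g in combinations(sorted(table), 2)}

# Toy records mixing numeric, binary, and ordinal features.
records = {
    "age": [52, 61, None, 47],
    "er_status": [True, None, None, False],
    "grade": ["II", "III", "I", None],
}
rates = pairwise_data_rates(records)
```

A low data rate signals that any correlation computed for that pair rests on few samples, which is why the effective sample size has to be tracked per feature pair.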

Three, marker extraction module 3

Simply put, a feature or marker (signature) is a set of characteristic genes that characterizes a gene or a signaling pathway. The present invention proposes building a signatome, which is a set of features or markers that reflects the current understanding of biological systems. The signatome consists of a representative signature knowledge base covering molecular, cellular, intracellular, clinical, and demographic features and events. The signatome thus provides a unified "knowledge space" and serves as a new kind of measurement criterion that systematically and quantitatively describes the latest knowledge in data samples. The signatures used by the signatome come from three databases: the MSigDB molecular signature collection and the DrugSig and pLINDAW signature set databases. The signatome is highly extensible: new knowledge can be continually integrated into it. These signatures represent the current understanding of biomedical systems at the molecular level. The gene sets therein will be used for gene set enrichment analysis (GSEA) of omics data to determine the importance of the corresponding feature or marker in patient samples.
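To make the idea of projecting a sample into the signatome "knowledge space" concrete, the sketch below scores one sample against each signature. It uses a simple hypergeometric over-representation test as a stand-in for GSEA, which is a different and more elaborate method; the signature names, gene names, and universe size are hypothetical.

```python
import math

def hypergeom_tail(k, universe, sig_size, sample_size):
    """P(X >= k): chance of at least k signature genes among the sample genes."""
    total = math.comb(universe, sample_size)
    upper = min(sig_size, sample_size)
    hits = sum(
        math.comb(sig_size, i) * math.comb(universe - sig_size, sample_size - i)
        for i in range(k, upper + 1)
    )
    return hits / total

def project_to_signatome(sample_genes, signatome, universe):
    """Score one sample against every signature as -log10(p-value)."""
    sample = set(sample_genes)
    scores = {}
    for name, genes in signatome.items():
        sig = set(genes)
        p = hypergeom_tail(len(sample & sig), universe, len(sig), len(sample))
        # p <= 1, so log10(p) <= 0; abs() gives -log10(p) without a negative zero.
        scores[name] = abs(math.log10(p))
    return scores

# Hypothetical two-signature signatome over a universe of 20 genes.
signatome = {
    "sig_a": ["g1", "g2", "g3", "g4", "g5"],
    "sig_b": ["g6", "g7", "g8"],
}
scores = project_to_signatome(["g1", "g2", "g3", "g9"], signatome, universe=20)
```

Each sample becomes a vector with one coordinate per signature, so samples measured on different platforms can be compared in the same signature space.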

MSigDB signatures are a large collection of biomedical gene sets and marker sets for clinical and research use maintained by the MIT-Harvard Broad Institute, known as the Molecular Signatures Database (MSigDB 3.0). The database contains 10295 signatures in total, which were used in the preliminary study in the form of gene sets. They include: 1) gene sets based on genomic position; 2) curated gene sets, for example from chemical and genetic perturbations and canonical molecular pathways, together with gene sets curated from the BioCarta, KEGG, and REACTOME databases; 3) target genes of microRNAs and transcription factors; 4) computationally derived gene sets, including cancer gene neighborhoods and cancer modules; 5) GO gene sets related to biological processes, cellular components, and molecular functions; 6) oncogenic signatures, generated from microarray data in the NCBI GEO database; 7) immunologic signatures produced by the Human Immunology Project Consortium (HIPC).

This embodiment has preliminarily established two signature databases, DrugSig and pLINDAW. They include breast cancer drug markers, potential drug targets, markers of various chemicals computed from the NIH LINCS project, breast cancer metabolism markers, and molecular markers commonly used for breast cancer, such as PAM50, Oncotype DX (a 21-gene marker), a 70-gene marker, and the Rotterdam Signature (a 76-gene marker), as well as other validated markers from the literature. TCGA methylation signatures describe how DNA methylation regulates gene function; copy number variation markers and mutation markers are usually very important hallmarks of cancer. The marker set includes molecular, cellular, intracellular, clinical, and demographic features and events.

Four, Subtypes module 4

The purpose of patient subtype classification is to divide patients into different groups and then provide each group with patient-oriented personalized medical services. Traditional clustering algorithms usually determine a few subtypes from a small number of patient features, typically with minimized overlap between subtypes as the objective function. Such "coarse" subtypes lack the specificity needed to capture important distinctions between patients. Rapid advances in personalized drug research and clinical practice require finer and richer subtypes in order to optimize disease treatment and patient monitoring. To meet this clinical need, this embodiment performs subtype classification with the H-cube algorithm, which systematically identifies, at a fine scale, the similar features shared by patient subgroups using many candidate markers of different natures.

The H-cube algorithm needs to solve the following problems: (1) find "patient-marker" patterns: because some markers are present only in certain patients rather than in all patients, this requires a "marker-patient" biclustering method; (2) provide multiple lines of evidence for a mechanism by exploring the huge feature space: the potential pathogenesis of a subtype may be related to multiple kinds of evidence such as genotype and phenotype expression, for example DNA abnormalities, epigenetic modifications, associated gene expression patterns, signaling pathway activity, receptor status, diagnostic features, population characteristics, and treatment response; (3) characterize, in both directions, the similarity of overlapping clinical subtypes across complex features and patients: several common subtypes share some important features, and the same patient may be associated with multiple subtypes; (4) match different clinical evidence: different subtypes may contribute to different clinical practices, such as diagnosis, risk assessment, drug selection, treatment, and response prediction. Translating newly discovered subtypes into knowledge that is useful in the clinic is the most important point, because only then can it be determined whether these subtypes are relevant to clinical application and which knowledge fits which discovered subtype.

The H-cube algorithm includes three steps: computing the G-Score of each marker (a measure of how enriched a marker is in a gene set), generating the general marker set (the signatome, that is, the collection of the different markers), and projecting the raw data into the knowledge space of the signatome; identifying important patient subtypes through subspace clustering of patients over the general markers; and analyzing the similarity between these subtypes.

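Accordingly, the H-cube algorithm performs subtype classification as follows:

(1) compute the G-Score of each marker for the patient and generate the general marker set;

(2) perform hash mapping based on the marker G-Scores and the generated general marker set;

(3) construct a Hasse diagram based on the hash mapping result;

(4) perform biclustering based on Hasse diagram search and fuzzy matching to complete patient subtype classification.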
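The hash mapping and Hasse diagram steps can be illustrated with a small sketch. It assumes that hashing means grouping patients by the set of markers whose G-Score reaches a threshold, and that the Hasse diagram is the covering relation of those marker sets under set inclusion; the disclosure does not give these details, so the threshold, function names, and toy scores are all assumptions.

```python
from collections import defaultdict

def hash_patients(gscores, threshold):
    """Group patients by the set of markers whose G-Score reaches the threshold."""
    buckets = defaultdict(list)
    for patient, scores in gscores.items():
        key = frozenset(m for m, s in scores.items() if s >= threshold)
        buckets[key].append(patient)
    return dict(buckets)

def hasse_edges(marker_sets):
    """Covering pairs (small, large): small is a proper subset of large
    and no other marker set lies strictly between them."""
    sets = list(marker_sets)
    edges = []
    for a in sets:
        for b in sets:
            if a < b and not any(a < c < b for c in sets):
                edges.append((a, b))
    return edges

# Toy G-Scores (made-up values) for five patients and three markers.
gscores = {
    "P1": {"A": 2.1, "B": 0.3, "C": 0.1},
    "P2": {"A": 1.8, "B": 1.5, "C": 0.2},
    "P3": {"A": 2.4, "B": 1.9, "C": 1.2},
    "P4": {"A": 1.1, "B": 0.0, "C": 0.4},
    "P5": {"A": 0.2, "B": 0.1, "C": 1.7},
}
buckets = hash_patients(gscores, threshold=1.0)
edges = hasse_edges(buckets.keys())
```

Walking the diagram upward from a bucket reaches patient groups that share all of its markers plus more, which is one way a search over the diagram could propose candidate "marker-patient" biclusters.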

Five, drug response prediction module 5

The drug response prediction model established by this module is a drug response prediction model based on patient-cell line-drug response three-dimensional clustering, specifically:

This embodiment takes the following measures: (1) develop a three-part heterogeneous graph model with feature selection for individualized drug response prediction; (2) validate the model's predictive power for drug response and the underlying mechanisms with GEO data and clinical breast cancer biopsy samples. The success of this BDS4PM system will provide a knowledge environment for cancer drug response, transform the current mode of biomedical research and clinical practice, and promote the translation of biomedical big data into individualized treatment. Biologically, signatures can comprehensively describe the mechanisms related to phenotype and to responses to different drugs. When these signatures and drug responses are highly correlated between a patient and a certain class of cell lines, that class of cell lines can represent the patient. Technically, with the arrival of the big data era, the continued accumulation of patient data, cell line data, and related signatures provides enough data to find the association of drug response between patients and cell lines and to explain the mechanisms involved. This embodiment confirmed this rationale by analyzing the similarity between breast cancer cell lines and patients. The present invention then develops a new three-step prediction model, which includes: 1) using a signature-guided non-negative matrix factorization algorithm to identify bi-directional modules of cell lines and drugs according to their different drug responses; 2) mapping each patient to the most suitable cell line module based on cancer metastasis survival time; 3) within each module, using an exhaustive-search support vector machine to find and select the corresponding signatures. The invention proposes a random walk on the heterogeneous graph, which extends the biclustering and feature selection ideas of earlier work: by using a parallel multi-deme genetic algorithm (PMDGA) in the feature space and a random walk on the heterogeneous graph in the data entity space, patient-cell line-drug response three-dimensional clusters are found and used to build personalized treatment models. The goal of the proposed method is to maximize the sum of the standardized purities of the identified three-dimensional clusters, which is done by alternating updates: select features using PMDGA; update the three-part heterogeneous graph based on the selected features; run the random walk on the heterogeneous graph toward three-dimensional clusters; and finally assess the quality of the three-dimensional clusters to adjust the feature selection scheme. In summary, the drug response prediction model is based on patient-cell line-drug response three-dimensional clustering, and its specific prediction process is as follows:

(1) perform drug response analysis using a signature-guided non-negative matrix factorization algorithm, and identify cell lines and drugs according to their different drug responses;

(2) map each patient to a suitable cell line based on cancer metastasis survival time;

(3) find and select each patient's signature using an exhaustive-search support vector machine, and determine the patient's drug response.
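The random walk on the heterogeneous graph can be illustrated with a random walk with restart, a common way to rank nodes by proximity to a starting node. This is a sketch of that general technique only: it does not include the PMDGA feature selection or the purity objective, and the node names, edge weights, and restart probability are made-up assumptions.

```python
def add_edge(adj, u, v, w):
    """Add an undirected weighted edge to an adjacency dict."""
    adj.setdefault(u, {})[v] = w
    adj.setdefault(v, {})[u] = w

def random_walk_with_restart(adj, start, restart=0.3, iters=100):
    """Visiting probabilities of a walker that returns to `start`
    with probability `restart` at every step."""
    nodes = sorted(adj)
    p = {n: 1.0 if n == start else 0.0 for n in nodes}
    for _ in range(iters):
        nxt = {n: (restart if n == start else 0.0) for n in nodes}
        for u in nodes:
            total = sum(adj[u].values())
            if total == 0:
                # A node without usable edges sends its mass back to the start.
                nxt[start] += (1 - restart) * p[u]
                continue
            for v, w in adj[u].items():
                nxt[v] += (1 - restart) * p[u] * w / total
        p = nxt
    return p

# Toy patient, cell line, and drug graph with made-up weights.
adj = {}
add_edge(adj, "patient_1", "cell_line_A", 0.9)   # strong similarity
add_edge(adj, "patient_1", "cell_line_B", 0.1)   # weak similarity
add_edge(adj, "cell_line_A", "drug_X", 1.0)
add_edge(adj, "cell_line_B", "drug_Y", 1.0)
scores = random_walk_with_restart(adj, "patient_1")
```

In this toy graph the walker reaches `drug_X` mostly through the cell line that resembles the patient, so `drug_X` ends up with a higher score than `drug_Y`; nodes that score highly together are candidates for the same patient, cell line, and drug cluster.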

Claims

1. A precision intelligent diagnosis and treatment big data system, characterized in that the system comprises:

a centralized data management module (1): centrally manages clinical electronic medical record data and omics data from multiple medical institutions;

a data preprocessing module (2): preprocesses the centrally managed data and establishes a relationship dependency network based on biomedical features;

a marker extraction module (3): extracts patient characteristic genes from the preprocessed data to obtain a marker set;

a subtype classification module (4): classifies patients into subtypes and determines the group to which each patient belongs;

a drug response prediction module (5): establishes a drug response prediction model and predicts the patient's response to different drugs according to that model.

2. The precision intelligent diagnosis and treatment big data system according to claim 1, characterized in that the centralized data management module (1) uses the i2b2 and SCILHS/SHRINE data sharing systems to perform dynamic extraction, dynamic fusion, and dynamic dataset generation on the clinical electronic medical record data and omics data of multiple medical institutions, thereby completing centralized data management.

3. The precision intelligent diagnosis and treatment big data system according to claim 1, characterized in that the relationship dependency network based on biomedical features is a three-part heterogeneous graph of patients, cell lines, and drugs.

4. The precision intelligent diagnosis and treatment big data system according to claim 1, characterized in that the marker set includes molecular, cellular, intracellular, clinical, and demographic features and events.

5. The precision intelligent diagnosis and treatment big data system according to claim 1, characterized in that the subtype classification module (4) performs subtype classification using the H-cube algorithm.

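6. The precision intelligent diagnosis and treatment big data system according to claim 5, characterized in that the H-cube algorithm performs subtype classification as follows:

(1) compute the G-Score of each marker for the patient and generate the general marker set, where the G-Score indicates how enriched a marker is in a gene set;

(2) perform hash mapping based on the marker G-Scores and the generated general marker set;

(3) construct a Hasse diagram based on the hash mapping result;

(4) perform biclustering based on Hasse diagram search and fuzzy matching to complete patient subtype classification.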

7. The precision intelligent diagnosis and treatment big data system according to claim 1, characterized in that the drug response prediction model is a drug response prediction model based on patient-cell line-drug response three-dimensional clustering.

8. The precision intelligent diagnosis and treatment big data system according to claim 7, characterized in that the drug response prediction model based on patient-cell line-drug response three-dimensional clustering includes the following prediction process:

(1) perform drug response analysis using a signature-guided non-negative matrix factorization algorithm, and identify cell lines and drugs according to their different drug responses;

(2) map each patient to a suitable cell line based on cancer metastasis survival time;

(3) find and select each patient's signature using an exhaustive-search support vector machine, and determine the patient's drug response.