CN110335679B - Survival prediction method and system based on multi-granularity graph pattern mining - Google Patents


Publication number
CN110335679B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910543992.5A
Other languages
Chinese (zh)
Other versions
CN110335679A (en)
Inventor
史玉良
任永健
郑永清
张坤
陈志勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Application filed by Shandong University
Priority to CN201910543992.5A
Publication of CN110335679A
Application granted
Publication of CN110335679B
Legal status: Active

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G16H50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Data Mining & Analysis (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a survival prediction method and system based on multi-granularity graph pattern mining, comprising the following steps: receiving the medical-visit history data of insured persons and constructing their statistical features; for each insured person, constructing a heterogeneous information network from that person's medical-visit history data; mining medical-visit patterns from the heterogeneous information network with a multi-granularity graph pattern mining algorithm and using them as pattern features; estimating the partial regression coefficients of the statistical features and the pattern features with a Cox survival prediction model, and computing the risk of each feature under the proportional hazards assumption as an enhanced feature; and merging the statistical features, pattern features, and enhanced features into a complete feature set and performing survival prediction with a random forest whose scoring function is augmented with a term that evaluates how well censored data are predicted.

Description

Survival prediction method and system based on multi-granularity graph pattern mining
Technical Field
The disclosure belongs to the technical field of electronic health data processing and relates to a survival prediction method and system based on multi-granularity graph pattern mining.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Survival prediction studies use follow-up information to predict how long after a given point in time an event of interest (also called a "failure event") will occur. In general, the main goals of survival prediction are to explore the role of prognostic factors and to predict the outcomes of new patients. It therefore helps physicians answer practical questions such as how much a patient will benefit from a given treatment or how long the patient is expected to live after treatment. The event of interest is not limited to "death"; it can also be disease onset, readmission, discharge, an in-person visit, and so on.
Today, the ever-growing volume of electronic health data makes survival prediction more challenging.
Typically, the variables used for prediction are statistics gathered over the follow-up period, such as age, gender, and blood test results. Estimating the regression coefficients of these factors is an important aspect of survival prediction. In terms of parameterization, the models include parametric, non-parametric, and semi-parametric models. In terms of whether the coefficients change over time, they are divided into models with time-invariant coefficients and models with time-varying coefficients.
However, the inventors found during their research that, unlike conventional information collection at relatively fixed times, electronic health data provide records whose times of occurrence are essentially arbitrary. These time series constitute a behavioral record of the patient (including which diseases occurred and which treatments were received). Traditional feature construction methods have difficulty defining features that describe how a patient's health condition evolves.
Deep learning models (in particular recurrent neural networks such as LSTM and GRU) are one important direction for addressing this problem. At this stage, however, the features and temporal patterns learned by recurrent neural networks remain difficult to explain, yet explaining predictions is important in fields such as medicine.
Continuous-time Bayesian networks (CTBN) are another important direction. A CTBN models the survival probability as an exponential function of time and describes the impact of different events on survival time through conditional probabilities. The model is interpretable to some extent. However, the Markov assumption limits the expressive power of a CTBN to modeling durations as exponentially distributed. To allow more expressive distributions, hidden states must be introduced either explicitly (through hidden variables) or implicitly (through delayed or mixed exponentials). These extensions are not simple.
Disclosure of Invention
To address the shortcomings of the prior art, one or more embodiments of the present disclosure provide a survival prediction method and system based on multi-granularity graph pattern mining, which model the medical data of insured persons through multi-granularity graph pattern mining and then predict survival.
According to an aspect of one or more embodiments of the present disclosure, there is provided a survival prediction method based on multi-granularity graph pattern mining.
A survival prediction method based on multi-granularity graph pattern mining comprises the following steps:
receiving the medical-visit history data of insured persons and constructing their statistical features;
for each insured person, constructing a heterogeneous information network from that person's medical-visit history data;
mining medical-visit patterns from the heterogeneous information network with a multi-granularity graph pattern mining algorithm and using them as pattern features;
estimating the partial regression coefficients of the statistical features and the pattern features with a Cox survival prediction model, and computing the risk of each feature under the proportional hazards assumption as an enhanced feature;
and merging the statistical features, pattern features, and enhanced features into a complete feature set, and performing survival prediction with a random forest whose scoring function is augmented with a term that evaluates how well censored data are predicted.
Further, in the method, the medical-visit history data of each insured person are the medical records of that person at different hospitals, and the statistical features comprise the disease name and the total number of samples for that disease, the total number of medical event records and the number of medical item categories of all insured persons suffering from the disease, the total number of censored samples, and the proportion of censored samples.
Further, in the method, the heterogeneous information network describes the insured person's medical-visit behavior record and comprises: disease events, medical items, the sequence relation between two adjacent disease events, and the use relation between a disease event and a medical item;
a disease event comprises the time of the event and the disease;
the medical items include drugs, examination items, and specific treatment items.
Further, in the method, mining medical-visit patterns from the heterogeneous information network with the multi-granularity graph pattern mining algorithm as pattern features specifically comprises:
executing the graph pattern mining algorithm at multiple time granularities, mining medical-visit patterns at each specific time granularity, and thereby obtaining medical-visit patterns at different time granularities;
selecting the top-k patterns by support using a greedy beam search algorithm;
and testing different graph objects for isomorphism using the minimum DFS coding technique, and screening the top-k medical-visit patterns as pattern features.
Further, when the graph pattern mining algorithm is executed at multiple time granularities, graph pattern mining is performed level by level; a beam width threshold is set, and beam search is used to prune the mining process.
Further, in the method, when survival prediction is performed with the random forest, the scoring function of the random forest is modified to suit the survival prediction problem by adding a violation-rate term; the violation-rate term evaluates how severely the labels of censored data conflict with the model's predicted values.
Further, the method also comprises: merging the statistical features, pattern features, and enhanced features into a complete feature set, and training on it a random forest whose scoring function is augmented with a term that evaluates how well censored data are predicted, to obtain a survival prediction model; and receiving the data of an insured person to be predicted and predicting survival with the survival prediction model.
According to an aspect of one or more embodiments of the present disclosure, there is provided a computer-readable storage medium.
A computer-readable storage medium storing a plurality of instructions, the instructions being adapted to be loaded by a processor of a terminal device to execute the survival prediction method based on multi-granularity graph pattern mining.
According to an aspect of one or more embodiments of the present disclosure, there is provided a terminal device.
A terminal device comprising a processor and a computer-readable storage medium, the processor being configured to implement instructions; the computer-readable storage medium stores a plurality of instructions adapted to be loaded by the processor to execute the survival prediction method based on multi-granularity graph pattern mining.
According to an aspect of one or more embodiments of the present disclosure, there is provided a survival prediction apparatus based on multi-granularity graph pattern mining.
A survival prediction apparatus based on multi-granularity graph pattern mining, built on the survival prediction method described above, comprises:
a statistical feature extraction module configured to receive the medical-visit history data of insured persons and construct their statistical features;
a heterogeneous information network construction module configured to construct, for each insured person, a heterogeneous information network from that person's medical-visit history data;
a multi-granularity medical-visit pattern mining module configured to mine medical-visit patterns from the heterogeneous information network with a multi-granularity graph pattern mining algorithm and use them as pattern features;
an enhanced feature generation module configured to estimate the partial regression coefficients of the statistical features and the pattern features with a Cox survival prediction model, and to compute the risk of each feature under the proportional hazards assumption as an enhanced feature;
and a random forest prediction module configured to merge the statistical features, pattern features, and enhanced features into a complete feature set and perform survival prediction with a random forest whose scoring function is augmented with a term that evaluates how well censored data are predicted.
The beneficial effect of this disclosure:
(1) To make full use of a patient's historical medical records and of the medical items used at each visit, the survival prediction method and system based on multi-granularity graph pattern mining model the patient's historical medical records as a heterogeneous information network. Frequent patterns are found by mining the patients' medical records, the frequent patterns are treated as features, and a random forest is used for prediction. The advantage of using frequent patterns and a random forest is that the prediction results are easier to interpret.
(2) When mining frequent patterns, the method and system consider not only the types of diseases and the medical items used but also the time intervals between medical events. Specifically: (a) a "multi-granularity pattern mining" process is proposed, enabling the model to find useful medical-visit patterns at a more appropriate time granularity; (b) the scoring function of the pattern mining process is adjusted so that mining considers not only the support of a pattern but also its predictive power; (c) the depth-first gSpan algorithm is changed into a level-wise mining algorithm, which makes it convenient to apply greedy beam search to reduce the time complexity of pattern mining.
(3) The method and system use a random forest for regression prediction, so no assumption needs to be made about the mapping between the features and the survival time. For the survival prediction task, the cost function of the classification and regression tree (CART) in the random forest is improved so that it can exploit censored data. In addition, inspired by deep learning models, the Cox model is used to generate enhanced features (namely, the risk of each feature) from the original features, and the enhanced features are concatenated with the original features to form the complete feature set used by the random forest for prediction.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a flow diagram of a multi-granular graph pattern mining based survival prediction method in accordance with one or more embodiments;
FIG. 2 is an overall flow diagram of a particular multi-granular graph pattern mining-based survival prediction method in accordance with one or more embodiments;
FIG. 3 is an illustration of a HIN structure pattern in accordance with one or more embodiments;
FIG. 4 is an example of level-wise graph pattern mining in accordance with one or more embodiments;
FIG. 5 is an overall flow diagram of a graph pattern mining based survival prediction apparatus in accordance with one or more embodiments.
Detailed Description
Technical solutions in one or more embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in one or more embodiments of the present disclosure, and it is to be understood that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments, which can be derived by one of ordinary skill in the art from one or more embodiments of the disclosure without making any creative effort, shall fall within the scope of protection of the disclosure.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the example embodiments of the present application. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
It is noted that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods and systems according to various embodiments of the present disclosure. It should be noted that each block in the flowchart or block diagrams may represent a module, a segment, or a portion of code, which may comprise one or more executable instructions for implementing the logical function specified in the respective embodiment. It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Without conflict, the embodiments and features of the embodiments in the present disclosure may be combined with each other, and the present disclosure will be further described with reference to the drawings and the embodiments.
Interpretation of terms:
Minimum DFS code. DFS code stands for Depth-First Search code, an encoding technique for describing graph structures in the field of graph pattern mining. A graph structure may have multiple DFS codes, which can be ordered lexicographically; among them, the minimum DFS code exists and is unique. Therefore, a minimum DFS code can represent a graph structure, and testing whether different graph objects are isomorphic is equivalent to comparing the minimum DFS codes of those graph objects.
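The equivalence between isomorphism testing and comparing canonical codes can be illustrated with a small Python sketch. Note the substitution: this is not gSpan's minimum DFS code. It uses a brute-force canonical form (the lexicographically smallest encoding over all vertex orderings), which is only practical for graphs with a handful of vertices. The function names and the graph representation are assumptions made for illustration.

```python
from itertools import permutations

def canonical_code(vertex_labels, edges):
    """Smallest encoding of a small undirected labeled graph.

    vertex_labels: list of labels, index = vertex id.
    edges: iterable of (u, v, edge_label).
    Brute force over vertex orderings; a stand-in for a minimum DFS code.
    """
    best = None
    for order in permutations(range(len(vertex_labels))):
        pos = {v: i for i, v in enumerate(order)}  # new id of each vertex
        labels = tuple(vertex_labels[v] for v in order)
        recoded = tuple(sorted(
            (min(pos[u], pos[v]), max(pos[u], pos[v]), lab)
            for u, v, lab in edges
        ))
        code = (labels, recoded)
        if best is None or code < best:
            best = code
    return best

def is_isomorphic(g1, g2):
    """Two graphs are isomorphic iff their canonical codes are equal."""
    return canonical_code(*g1) == canonical_code(*g2)

# Same structure, different vertex numbering.
g1 = (["flu", "drug_a", "xray"], [(0, 1, "uses"), (0, 2, "uses")])
g2 = (["xray", "flu", "drug_a"], [(1, 0, "uses"), (1, 2, "uses")])
# Different structure: the second edge joins drug_a and xray.
g3 = (["flu", "drug_a", "xray"], [(0, 1, "uses"), (1, 2, "uses")])
```

Here `is_isomorphic(g1, g2)` holds while `is_isomorphic(g1, g3)` does not. A minimum DFS code serves the same purpose without enumerating every vertex ordering.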
Embodiment 1
According to an aspect of one or more embodiments of the present disclosure, there is provided a survival prediction method based on multi-granularity graph pattern mining.
Step S1: receiving the medical-visit history data of insured persons and constructing their statistical features;
Step S2: for each insured person, constructing a heterogeneous information network from that person's medical-visit history data;
Step S3: mining medical-visit patterns from the heterogeneous information network with a multi-granularity graph pattern mining algorithm and using them as pattern features;
Step S4: estimating the partial regression coefficients of the statistical features and the pattern features with a Cox survival prediction model, and computing the risk of each feature under the proportional hazards assumption as an enhanced feature;
Step S5: merging the statistical features, pattern features, and enhanced features into a complete feature set, and performing survival prediction with a random forest whose scoring function is augmented with a term that evaluates how well censored data are predicted.
In step S1 of one or more embodiments of the present disclosure, the "statistical features" of the insured persons are constructed using statistical methods.
Specifically, medical-visit history data obtained from the social security system of a prefecture-level city serve as the application example. The system's data contain the same patient's medical records from different hospitals, so a more complete medical history can be obtained for each patient. Six diseases with large numbers of patients were selected as "basic diseases" to generate the data sets for 6 experiments. The selected diseases were dynamic undetermined tumor (DUT), lung cancer (LUN), pneumonia (PNE), coronary heart disease (COR), cerebral infarction (CER), and diabetes (DIA). Taking dynamic undetermined tumor as an example, every patient in that data set has had this disease, and all of the historical medical records of each such patient are included in the data set. Table 1 shows the statistics of the data sets corresponding to the 6 diseases. The total number of medical records and the number of medical item types cover not only the records and items of the basic disease but also those of the other diseases suffered by the patients in the data set.
TABLE 1 Data set statistics
[Table 1 appears only as an image in the original publication (Figure GDA0002417558340000091).]
In step S2 of one or more embodiments of the present disclosure, a Heterogeneous Information Network (HIN) is constructed for each insured person from their medical records; the HIN describes the insured person's medical-visit behavior record.
Specifically, the present embodiment creates a separate HIN for each patient, as shown in FIG. 2. FIG. 3 shows the HIN of one patient. A rectangle represents a disease event, uniquely identified by the event time and the disease name. A circle represents a medical item, uniquely identified by a medical item number. The medical items include drugs, examinations, and specific treatment means (e.g., intravenous injection, continuous low-flow oxygen inhalation, etc.). The relation between 2 adjacent events is labeled with the time interval between the two events. As can be seen in FIG. 3, the patient suffered from 2 diseases in succession, and the 2nd and 3rd disease events share 2 medical items.
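As an illustration of the HIN structure just described, the following Python sketch builds the node and edge lists for one patient. The input format (a list of visits with a date, a disease name, and medical item identifiers), the dictionary layout, and all identifiers are assumptions for illustration; the source does not specify a data format.

```python
from datetime import date

def build_hin(visits):
    """Build a per-patient HIN from (visit_date, disease_name, [item_ids]) tuples."""
    visits = sorted(visits, key=lambda v: v[0])
    # A disease event is identified by (event time, disease name).
    events = [(d, disease) for d, disease, _ in visits]
    # A medical item is identified by its item number.
    items = sorted({item for _, _, its in visits for item in its})
    # Sequence edges link adjacent events and carry the interval in days.
    sequence = [
        (events[i], events[i + 1], (events[i + 1][0] - events[i][0]).days)
        for i in range(len(events) - 1)
    ]
    # Use edges link a disease event to each medical item used in it.
    uses = [((d, disease), item) for d, disease, its in visits for item in its]
    return {"events": events, "items": items, "sequence": sequence, "uses": uses}

hin = build_hin([
    (date(2018, 3, 1), "pneumonia", ["drug_17", "xray_02"]),
    (date(2018, 1, 10), "diabetes", ["drug_03"]),
    (date(2018, 3, 20), "pneumonia", ["drug_17", "xray_02"]),
])
```

In this toy record the 2nd and 3rd events share two medical items, mirroring the situation described for FIG. 3.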
In step S3 of one or more embodiments of the present disclosure, medical-visit patterns are mined from the HIN using multi-granularity graph pattern mining.
The meanings of the important symbols used in the multi-granularity graph pattern mining process are first given in Table 2.
TABLE 2 Meanings of important symbols in the multi-granularity graph pattern mining algorithm
[Table 2 appears only as images in the original publication (Figures GDA0002417558340000092 and GDA0002417558340000101).]
In the present disclosure, a "minimum DFS code" represents a graph structure; when the graph structure satisfies the minimum support set in this embodiment, it is regarded as a medical-visit pattern. Thus, in this embodiment, the elements of the pattern set $P$ and of the lists $L_c$ and $L_n$ are all minimum DFS codes, i.e., character sequences that represent a graph structure and are generated by the DFS coding technique. In what follows, the elements of $L_c$ and $L_n$ are called "minimum DFS codes" and the elements of the pattern set $P$ are called "medical-visit patterns".
The specific steps of mining medical-visit patterns from the HIN comprise:
Step S31: multi-granularity graph pattern mining, i.e., performing graph pattern mining multiple times at multiple time granularities, as shown in Algorithm 1. FPWG(P, gr) is an algorithm, described in detail in step S32, that supplements or updates the existing medical-visit pattern set $P$ at a specific time granularity gr. The final output is the medical-visit pattern set $P$, each element of which is treated as a pattern feature.
[Algorithm 1 appears only as images in the original publication (Figures GDA0002417558340000102 and GDA0002417558340000111).]
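Because Algorithm 1 is only available as an image, the following Python sketch shows one possible reading of the outer loop in step S31. It assumes that a time granularity is applied by bucketing each interval label with integer division and that the single-granularity miner is passed in as a callable. The names `coarsen_interval`, `mine_multi_granularity`, and `toy_fpwg`, as well as the extra edge-list argument to the miner, are hypothetical.

```python
def coarsen_interval(days, granularity_days):
    """Map an interval in days to a bucket index at the given granularity."""
    return days // granularity_days

def mine_multi_granularity(sequence_edges, granularities, fpwg):
    """Run the single-granularity miner once per granularity, carrying P along."""
    patterns = set()  # the pattern set P
    for gr in granularities:
        # Relabel time intervals at the current granularity (assumed behavior).
        relabeled = [(a, b, coarsen_interval(gap, gr)) for a, b, gap in sequence_edges]
        # FPWG supplements or updates the existing pattern set at granularity gr.
        patterns = fpwg(patterns, relabeled, gr)
    return patterns

def toy_fpwg(patterns, edges, gr):
    """Stand-in miner: records every relabeled edge as a 'pattern'."""
    return patterns | {(gr, a, b, label) for a, b, label in edges}

edges = [("diabetes", "pneumonia", 50), ("pneumonia", "pneumonia", 19)]
result = mine_multi_granularity(edges, [7, 30], toy_fpwg)
```

With granularities of 7 and 30 days, the same 50-day gap is labeled 7 at the weekly level and 1 at the monthly level, so a pattern that is too sparse at one granularity may become frequent at another.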
Step S32: mining medical-visit patterns at a specific time granularity (FPWG), as shown in Algorithm 2, comprises:
Step S321: add the minimum DFS codes corresponding to all subgraphs with 1 edge to $L_c$.
Step S322: check whether each minimum DFS code in $L_c$ satisfies the minimum support (min_sup), and remove the minimum DFS codes that do not.
Step S323: if $L_c$ is empty, no new medical-visit pattern can be mined from the graph set G, and the mining process terminates. Otherwise, go to step S324.
Step S324: add the minimum DFS codes in $L_c$ to $P$.
Step S325: if the number of medical-visit patterns in $P$ is larger than a set threshold B, then:
(1) keep only the B patterns with the largest support in $P$ and remove the remaining patterns from $P$;
(2) update $L_c$ so that only the patterns still present in $P$ remain; $L_c$ is then used to generate $L_n$. This step prunes the pattern mining process based on the beam width threshold.
Step S326: obtain $L_n$ from $L_c$ with the algorithm GNLD($L_c$). GNLD($L_c$) generates the minimum DFS codes of the next level; its specific steps are given in step S33.
Step S327: replace $L_c$ with $L_n$ and return to step S323.
[Algorithm 2 appears only as images in the original publication (Figures GDA0002417558340000112 and GDA0002417558340000121).]
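The control flow of steps S321 to S327 can be sketched in Python as follows. To keep the example short and runnable, item sets over transactions are substituted for graph patterns, so this is level-wise frequent itemset mining with beam pruning rather than graph mining; only the loop structure (support filtering, adding to P, truncating P to the beam width B, generating the next level) follows the steps above. All names are assumptions.

```python
def support(pattern, transactions):
    """Number of transactions that contain the pattern."""
    return sum(1 for t in transactions if pattern <= t)

def levelwise_beam_mining(transactions, min_sup, beam_width):
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]        # S321: size-1 candidates
    patterns = {}                                    # P: pattern -> support
    while True:
        scored = {p: support(p, transactions) for p in current}
        current = [p for p in current if scored[p] >= min_sup]   # S322
        if not current:                              # S323: nothing new, stop
            break
        for p in current:                            # S324: add to P
            patterns[p] = scored[p]
        if len(patterns) > beam_width:               # S325: beam pruning
            kept = sorted(patterns, key=lambda p: (-patterns[p], sorted(p)))
            patterns = {p: patterns[p] for p in kept[:beam_width]}
            current = [p for p in current if p in patterns]
        # S326: grow each surviving pattern by one item to form the next level
        grown = {p | {i} for p in current for i in items if i not in p}
        current = sorted(grown, key=sorted)          # S327: L_c <- L_n
    return patterns

transactions = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("a")]
result = levelwise_beam_mining(transactions, min_sup=2, beam_width=3)
```

With a beam width of 3, the single items `b` and `c` are dropped from P once the better-ranked two-item patterns arrive. This is the effect of keeping only the B patterns with the largest support (ties are broken alphabetically here, a choice the source does not specify).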
Step S33: generating the minimum DFS codes of the next level (GNLD), as shown in Algorithm 3, comprises:
Step S331: initialize $L_n$ as an empty list.
Step S332: as shown in FIG. 4, for each graph object $s$ pointed to by each minimum DFS code in $L_c$, a new graph object $s'$ is formed by adding one edge to $s$. Adding different edges to $s$ yields different graph objects $s'$. For each newly generated graph object $s'$, steps S333 and S334 are executed.
Step S333: for each newly generated graph object $s'$, generate its minimum DFS code, denoted $D_{s'}$, using the minimum DFS coding technique.
Step S334: if $L_n$ does not contain $D_{s'}$, add $D_{s'}$ to $L_n$ and attach to $D_{s'}$ a pointer to $s'$. If $L_n$ already contains $D_{s'}$, check whether $s'$ is already among the objects pointed to by $D_{s'}$; if not, attach to $D_{s'}$ a pointer to $s'$.
Step S335: remove from $L_n$ every $D$ that does not satisfy min_sup.
[Algorithm 3 appears only as images in the original publication (Figures GDA0002417558340000131 and GDA0002417558340000141).]
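The bookkeeping in steps S331 to S335 (grow each graph object by one edge, group the results by code, keep a set of pointers per code, then filter by support) can be sketched in Python. Two simplifications are assumed and should be read as such: a graph object is a set of edges inside one data graph, and the sorted list of endpoint-label pairs stands in for the minimum DFS code. That simplified code is not a true canonical form for general graphs. Support is counted as the number of distinct data graphs containing an object.

```python
from collections import defaultdict

def simplified_code(graph, edges):
    """Stand-in for a minimum DFS code: sorted endpoint-label pairs of the edges."""
    labels = graph["labels"]
    return tuple(sorted(tuple(sorted((labels[u], labels[v]))) for u, v in edges))

def generate_next_layer(current, graphs, min_sup):
    """current: dict code -> set of (graph_id, frozenset_of_edges) pointers."""
    nxt = defaultdict(set)                               # S331: empty L_n
    for objects in current.values():                     # S332: extend each object
        for gid, edges in objects:
            graph = graphs[gid]
            touched = {v for e in edges for v in e}
            for e in graph["edges"]:
                if e not in edges and (e[0] in touched or e[1] in touched):
                    grown = edges | {e}
                    code = simplified_code(graph, grown)  # S333: code of s'
                    nxt[code].add((gid, grown))           # S334: set avoids duplicates
    return {code: objs for code, objs in nxt.items()      # S335: support filter
            if len({gid for gid, _ in objs}) >= min_sup}

g = {"labels": {0: "flu", 1: "drugA", 2: "xray"}, "edges": [(0, 1), (0, 2)]}
graphs = {0: g, 1: g, 2: {"labels": {0: "flu", 1: "drugA"}, "edges": [(0, 1)]}}
current = {
    (("drugA", "flu"),): {(gid, frozenset({(0, 1)})) for gid in (0, 1, 2)},
    (("flu", "xray"),): {(gid, frozenset({(0, 2)})) for gid in (0, 1)},
}
next_layer = generate_next_layer(current, graphs, min_sup=2)
```

Both one-edge objects in graphs 0 and 1 grow into the same two-edge object, which is why the pointer set holds each grown object only once.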
In step S4 of one or more embodiments of the present disclosure, the "statistical features" and the "pattern features" are mapped to "enhanced features" using the Cox survival prediction model. Table 3 gives the meanings of the important symbols used in the enhanced feature generation module, where u represents an insured person.
TABLE 3 Meanings of important symbols in the enhanced feature generation module
[Table 3 appears only as an image in the original publication (Figure GDA0002417558340000142).]
The specific steps are as follows:
Step S41: for each feature $f \in F_p \cup F_s$, calculate the partial regression coefficient $\beta_f$ of $f$ using the Cox survival prediction model. For example, the present embodiment uses the Cox regression tool in IBM SPSS Statistics to compute the partial regression coefficient of each feature.
Step S42: calculate the "risk" of each feature. For a feature $f$, let $f_e$ denote its enhanced feature. If the value of sample $A$ on feature $f$ is $x = f(A)$, the embodiment converts this value into the risk that feature $f$ brings to sample $A$ through the exponential function:
$$f_e(A) = \exp(\beta_f \cdot x)$$
Thus, the enhanced features (risks) show a different distribution from the original features.
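The mapping in step S42 can be written as a short Python sketch. The coefficient values below are made up for illustration; in the source they come from a fitted Cox model. The feature names and the `_risk` suffix are hypothetical.

```python
import math

def enhanced_feature(x, beta):
    """Risk contributed by a feature with value x and Cox coefficient beta."""
    return math.exp(beta * x)

def add_enhanced_features(sample, betas):
    """Return the sample extended with one '<name>_risk' column per feature."""
    out = dict(sample)
    for name, beta in betas.items():
        out[name + "_risk"] = enhanced_feature(sample[name], beta)
    return out

# Hypothetical sample and coefficients.
sample = {"age": 60, "pattern_7": 1}
betas = {"age": 0.0, "pattern_7": 0.5}
full = add_enhanced_features(sample, betas)
```

A coefficient of 0 yields a risk of 1 regardless of the feature value, while a positive coefficient scales the risk exponentially with the value. The original columns are kept, matching the later merge of original and enhanced features into one feature set.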
In step S5 of one or more embodiments of the present disclosure, survival prediction is performed with the improved random forest.
A classical regression tree selects the feature and feature value used to split the data set according to the mean-squared-error gain $gain\_\sigma$. Let $S$ denote the complete sample set and $v_j$ a value of feature $f_i$; the value $v_j$ of feature $f_i$ divides the sample set $S$ into $S_1$ and $S_2$. CART uses
$$gain\_\sigma(S \mid f_i, v_j) = \sigma(S_1) + \sigma(S_2)$$
to measure the quality of the split, where $\sigma(S_1)$ is the variance of the elements of $S_1$. The smaller the variance, the better the split. At each split, CART selects the pair $\langle f_i, v_j \rangle$ that yields the smallest $gain\_\sigma$.
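The split criterion above can be sketched in Python as an exhaustive search over features and values. The sketch assumes that a value $v_j$ splits the samples into those with feature value at most $v_j$ and those above it; the source does not state the comparison used. Function names are illustrative.

```python
def variance(values):
    """Population variance; 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(rows, targets):
    """Return (gain_sigma, feature_index, value) with the smallest gain_sigma."""
    best = None
    for i in range(len(rows[0])):
        for v in sorted({r[i] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[i] <= v]
            right = [t for r, t in zip(rows, targets) if r[i] > v]
            if not left or not right:
                continue  # not a real split
            gain = variance(left) + variance(right)  # sigma(S1) + sigma(S2)
            if best is None or gain < best[0]:
                best = (gain, i, v)
    return best

rows = [[1, 5], [2, 5], [3, 9], [4, 9]]
targets = [10, 10, 30, 30]
split = best_split(rows, targets)
```

In this toy data, splitting feature 0 at value 2 separates the targets perfectly, so both child variances are zero.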
However, for the survival prediction task, ignoring censored data may bias model training and outcome prediction. In this embodiment, a sample is denoted by $x$ and a censored sample by $x^*$. To make full use of the censored data, and inspired by the consistency index (CI), we introduce a violation rate $gain\_vr(S \mid f_i, v_j)$:
$gain\_vr(S \mid f_i, v_j) = vr(S_1) + vr(S_2)$
where
Figure GDA0002417558340000154
and $card(A)$ denotes the number of elements in set $A$.
$vr(S)$ describes whether the label of the censored data in set $S$ conflicts with the predicted value (Figure GDA0002417558340000155).
Note that the value range of $gain\_vr(S \mid f_i, v_j)$ is $[0, 1]$. Before $gain\_\sigma(S \mid f_i, v_j)$ and $gain\_vr(S \mid f_i, v_j)$ are considered together, $gain\_\sigma(S \mid f_i, v_j)$ should be normalized. This gives the cost function of a decision tree after splitting on $\langle f_i, v_j \rangle$:
Figure GDA0002417558340000163
where $\alpha$ adjusts the weights of the two terms; in this embodiment, $\alpha = 0.5$.
The smaller the cost, the better the split.
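The following Python sketch shows one way such a weighted cost could be used to choose a split. The exact cost formula is given only in the figure above and is not reproduced here, so the sketch rests on two loud assumptions: the cost is taken to be the convex combination α · normalized gain_σ + (1 − α) · gain_vr, and gain_σ is normalized by min-max scaling over the candidate splits. The names `min_max` and `pick_split` and all numeric values are hypothetical.

```python
def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def pick_split(candidates, alpha=0.5):
    """candidates: list of (label, gain_sigma, gain_vr) tuples.

    Assumed cost: alpha * normalized gain_sigma + (1 - alpha) * gain_vr.
    Returns the label and cost of the cheapest candidate.
    """
    norm = min_max([c[1] for c in candidates])
    costs = [alpha * s + (1 - alpha) * c[2] for s, c in zip(norm, candidates)]
    k = min(range(len(candidates)), key=costs.__getitem__)
    return candidates[k][0], costs[k]

# Hypothetical candidate splits: (label, gain_sigma, gain_vr).
candidates = [("A", 10.0, 0.8), ("B", 20.0, 0.1), ("C", 30.0, 0.0)]
label, cost = pick_split(candidates)
print(label, cost)
```

In this toy case candidate A has the smallest variance term but conflicts with many censored samples, so the combined cost prefers candidate B, which is the trade-off that the weight α is meant to control.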
One or more embodiments of the present disclosure further include step 6: performing survival prediction on the data sets formed from the above 6 diseases, and evaluating the prediction performance of the model by the mean absolute error (MAE) and the consistency index (Concordance Index, CI).
TABLE 4 Mean Absolute Error (MAE)
Figure GDA0002417558340000161
TABLE 5 Consistency Index (CI)
Figure GDA0002417558340000162
Figure GDA0002417558340000171
As can be seen from tables 4 and 5, the present embodiment, namely the survival prediction model (in the tables, denoted by M) based on multi-granular pattern mining, has better prediction capability than the existing model in most cases.
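For reference, the two evaluation measures can be sketched in Python as below. The sketch uses the common textbook definitions: MAE as the mean absolute difference between true and predicted survival times, and the concordance index in Harrell's form, where a pair is comparable only if the sample with the shorter time is uncensored. How the embodiment treats censored samples when computing MAE is not stated, so the MAE function here simply ignores censoring; the function names and toy data are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Mean absolute difference between true and predicted survival times."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def concordance_index(times, preds, observed):
    """Harrell's C-index for predicted survival times.

    A pair (i, j) is comparable when sample i is uncensored and times[i] < times[j].
    It is concordant when the predictions keep the same order; ties count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not observed[i]:
            continue  # a censored sample cannot be the earlier member of a pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if preds[i] < preds[j]:
                    concordant += 1.0
                elif preds[i] == preds[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data: the second sample is censored at time 4.
times = [2.0, 4.0, 6.0, 8.0]
observed = [True, False, True, True]
preds = [3.0, 5.0, 10.0, 9.0]

mae = mean_absolute_error(times, preds)
ci = concordance_index(times, preds, observed)
print(mae, ci)
```

A lower MAE and a higher CI both indicate better predictions; a CI of 0.5 corresponds to random ordering and 1.0 to a perfect ordering of the comparable pairs.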
Example two
According to an aspect of one or more embodiments of the present disclosure, there is provided a computer-readable storage medium.
A computer-readable storage medium having stored therein a plurality of instructions adapted to be loaded by a processor of a terminal device to execute the survival prediction method based on multi-granularity graph mode mining.
Example three
According to an aspect of one or more embodiments of the present disclosure, there is provided a terminal device.
A terminal device comprising a processor and a computer-readable storage medium, the processor being configured to implement instructions; the computer readable storage medium is used for storing a plurality of instructions, and the instructions are suitable for being loaded by a processor and executing the survival prediction method based on multi-granularity graph mode mining.
These computer-executable instructions, when executed in a device, cause the device to perform methods or processes described in accordance with various embodiments of the present disclosure.
Among other things, a computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for carrying out various aspects of the present disclosure. The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, the electronic circuitry can execute computer-readable program instructions to implement aspects of the present disclosure by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA).
Example four
According to an aspect of one or more embodiments of the present disclosure, there is provided a survival prediction apparatus based on multi-granular graph pattern mining.
As shown in fig. 5, a survival prediction apparatus based on multi-granularity graph mode mining, based on the survival prediction method based on multi-granularity graph mode mining, includes:
the statistical feature extraction module, configured to receive the medical history data of insured persons and statistically construct the statistical features of the insured persons;
the heterogeneous information network construction module, configured to construct a heterogeneous information network according to the medical history data of each insured person;
the multi-granularity medical pattern mining module, configured to mine medical patterns from the heterogeneous information network as pattern features by using a multi-granularity graph pattern mining algorithm;
the enhanced feature generation module, configured to evaluate the partial regression coefficients of the statistical features and the pattern features respectively by using a Cox survival prediction model, and to calculate the risk of each feature as an enhanced feature based on the proportional hazards assumption;
and the random forest prediction module, configured to combine the statistical features, the pattern features and the enhanced features into a complete feature set and to perform survival prediction by using a random forest whose scoring function is augmented with a term evaluating the prediction of censored data.
It should be noted that although several modules or sub-modules of the device are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more of the modules described above may be embodied in one module. Conversely, the features and functions of one module described above may be further divided so as to be embodied by a plurality of modules.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A survival prediction method based on multi-granularity graph mode mining, characterized by comprising the following steps:
receiving medical history data of insured persons, and statistically constructing statistical features of the insured persons;
for each insured person, constructing a heterogeneous information network according to the medical history data of the insured person;
mining medical patterns from the heterogeneous information network as pattern features by using a multi-granularity graph pattern mining algorithm;
evaluating partial regression coefficients of the statistical features and the pattern features respectively by using a Cox survival prediction model, and calculating the risk of each feature as an enhanced feature based on the proportional hazards assumption;
combining the statistical features, the pattern features and the enhanced features into a complete feature set, and performing survival prediction by using a random forest whose scoring function is augmented with a term evaluating the prediction of censored data;
wherein mining medical patterns from the heterogeneous information network (HIN) by multi-granularity graph pattern mining specifically comprises the following steps:
step S31: multi-granularity graph pattern mining, namely performing graph pattern mining multiple times at multiple time granularities;
step S32: mining the medical patterns at a particular time granularity, comprising:
step S321: adding the minimum DFS codes corresponding to all subgraphs having one edge to $L_c$;
step S322: for the list $L_c$, checking whether each minimum DFS code in $L_c$ satisfies the minimum support, and removing the minimum DFS codes that do not;
step S323: if $L_c$ is empty, no new medical pattern can be mined from the graph set G and the mining process terminates; otherwise, going to step S324;
step S324: adding the minimum DFS codes in $L_c$ to P;
step S325: if the number of medical patterns in P is greater than a set threshold B, then:
retaining in P only the B patterns with the largest support, and removing the excess patterns from P;
updating $L_c$ so that only the patterns present in P remain, $L_c$ being used to generate $L_n$, thereby pruning the pattern mining based on the beam width threshold;
step S326: obtaining $L_n$ from $L_c$ by the algorithm GNLD, wherein GNLD is used to generate the minimum DFS codes of the next layer, the specific steps of which are given in step S33;
step S327: updating $L_c$ with $L_n$, and going to step S323;
step S33: generating the minimum DFS codes of the next layer, comprising:
step S331: initializing $L_n$ as an empty list;
step S332: for each graph object $s$ pointed to by each minimum DFS code in $L_n$, forming a new graph object $s'$ by adding one edge to $s$, wherein $s$ can form different graph objects $s'$ by adding different edges; for each newly generated graph object $s'$, performing step S333 and step S334;
step S333: for each newly generated graph object $s'$, generating the minimum DFS code of $s'$, denoted $D_{s'}$, using the minimum DFS coding technique;
step S334: if $L_n$ does not contain $D_{s'}$, adding $D_{s'}$ to $L_n$ and adding to $D_{s'}$ a pointer to $s'$; if $L_n$ already contains $D_{s'}$, determining whether $s'$ already exists among the objects pointed to by $D_{s'}$, and if not, adding to $D_{s'}$ a pointer to $s'$;
step S335: removing from $L_n$ the codes $D$ that do not satisfy min_sup.
2. The survival prediction method based on multi-granularity graph mode mining as claimed in claim 1, wherein the medical history data of each insured person are the medical records of that insured person in different hospitals, and the statistical features of the insured persons comprise the disease names and the sample count corresponding to each disease name, the total number of medical event records and the medical item categories of all insured persons with the disease, the total number of censored samples, and the proportion of censored samples.
3. The survival prediction method based on multi-granularity graph mode mining as claimed in claim 1, wherein in the method, the heterogeneous information network is used for describing the hospitalization behavior records of the insured person, and comprises: the disease event, the medical item, the sequence relation between two adjacent disease events and the use relation between the disease event and the medical item;
the disease event comprises the disease time and the disease;
the medical items include drugs, examination items, and specific treatment items.
4. The survival prediction method based on multi-granularity graph mode mining as claimed in claim 1, wherein mining medical patterns from the heterogeneous information network as pattern features by using a multi-granularity graph pattern mining algorithm comprises the following specific steps:
executing a graph pattern mining algorithm at a plurality of time granularities, mining medical patterns at each particular time granularity, and obtaining medical patterns at different time granularities;
adopting beam search, a greedy-strategy algorithm, to select the top-k patterns based on support;
and performing isomorphism testing on different graph objects by using the minimum DFS coding technique, and selecting the top-k medical patterns as the pattern features.
5. The survival prediction method based on multi-granularity graph mode mining as claimed in claim 4, further comprising, when executing the graph pattern mining algorithm at a plurality of time granularities, performing graph pattern mining layer by layer; and setting a beam width threshold, and pruning the graph pattern mining process by beam search.
6. The survival prediction method based on multi-granularity graph mode mining as claimed in claim 1, wherein, when a random forest is used for survival prediction, the scoring function of the random forest is improved to make it suitable for the survival prediction problem by adding a violation rate term; and the violation rate term is used to evaluate the severity of the conflict between the labels of the censored data and the model's predicted values.
7. The survival prediction method based on multi-granularity graph mode mining as claimed in claim 1, wherein the method further comprises: combining the statistical features, the pattern features and the enhanced features into a complete feature set, and training on the feature set by using a random forest whose scoring function is augmented with a term evaluating the prediction of censored data, to obtain a survival prediction model; and receiving the data of an insured person to be predicted, and performing survival prediction according to the survival prediction model.
8. A computer-readable storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor of a terminal device and to execute a multi-granular graph pattern mining based survival prediction method according to any one of claims 1 to 7.
9. A terminal device comprising a processor and a computer-readable storage medium, the processor being configured to implement instructions; a computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform a multi-granular graph pattern mining based survival prediction method according to any one of claims 1 to 7.
10. A survival prediction device based on multi-granularity graph mode mining, which is based on the survival prediction method based on multi-granularity graph mode mining as claimed in any one of claims 1 to 7, and comprises:
the statistical feature extraction module, configured to receive the medical history data of insured persons and statistically construct the statistical features of the insured persons;
the heterogeneous information network construction module, configured to construct a heterogeneous information network according to the medical history data of each insured person;
the multi-granularity medical pattern mining module, configured to mine medical patterns from the heterogeneous information network as pattern features by using a multi-granularity graph pattern mining algorithm;
the enhanced feature generation module, configured to evaluate the partial regression coefficients of the statistical features and the pattern features respectively by using a Cox survival prediction model, and to calculate the risk of each feature as an enhanced feature based on the proportional hazards assumption;
the random forest prediction module, configured to combine the statistical features, the pattern features and the enhanced features into a complete feature set and to perform survival prediction by using a random forest whose scoring function is augmented with a term evaluating the prediction of censored data;
wherein mining medical patterns from the heterogeneous information network (HIN) by multi-granularity graph pattern mining specifically comprises the following steps:
step S31: multi-granularity graph pattern mining, namely performing graph pattern mining multiple times at multiple time granularities;
step S32: mining the medical patterns at a particular time granularity, comprising:
step S321: adding the minimum DFS codes corresponding to all subgraphs having one edge to $L_c$;
step S322: for the list $L_c$, checking whether each minimum DFS code in $L_c$ satisfies the minimum support, and removing the minimum DFS codes that do not;
step S323: if $L_c$ is empty, no new medical pattern can be mined from the graph set G and the mining process terminates; otherwise, going to step S324;
step S324: adding the minimum DFS codes in $L_c$ to P;
step S325: if the number of medical patterns in P is greater than a set threshold B, then:
retaining in P only the B patterns with the largest support, and removing the excess patterns from P;
updating $L_c$ so that only the patterns present in P remain, $L_c$ being used to generate $L_n$, thereby pruning the pattern mining based on the beam width threshold;
step S326: obtaining $L_n$ from $L_c$ by the algorithm GNLD, wherein GNLD is used to generate the minimum DFS codes of the next layer, the specific steps of which are given in step S33;
step S327: updating $L_c$ with $L_n$, and going to step S323;
step S33: generating the minimum DFS codes of the next layer, comprising:
step S331: initializing $L_n$ as an empty list;
step S332: for each graph object $s$ pointed to by each minimum DFS code in $L_n$, forming a new graph object $s'$ by adding one edge to $s$, wherein $s$ can form different graph objects $s'$ by adding different edges; for each newly generated graph object $s'$, performing step S333 and step S334;
step S333: for each newly generated graph object $s'$, generating the minimum DFS code of $s'$, denoted $D_{s'}$, using the minimum DFS coding technique;
step S334: if $L_n$ does not contain $D_{s'}$, adding $D_{s'}$ to $L_n$ and adding to $D_{s'}$ a pointer to $s'$; if $L_n$ already contains $D_{s'}$, determining whether $s'$ already exists among the objects pointed to by $D_{s'}$, and if not, adding to $D_{s'}$ a pointer to $s'$;
step S335: removing from $L_n$ the codes $D$ that do not satisfy min_sup.
CN201910543992.5A 2019-06-21 2019-06-21 Survival prediction method and system based on multi-granularity graph mode mining Active CN110335679B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910543992.5A CN110335679B (en) 2019-06-21 2019-06-21 Survival prediction method and system based on multi-granularity graph mode mining

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910543992.5A CN110335679B (en) 2019-06-21 2019-06-21 Survival prediction method and system based on multi-granularity graph mode mining

Publications (2)

Publication Number Publication Date
CN110335679A CN110335679A (en) 2019-10-15
CN110335679B true CN110335679B (en) 2020-05-05

Family

ID=68140361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910543992.5A Active CN110335679B (en) 2019-06-21 2019-06-21 Survival prediction method and system based on multi-granularity graph mode mining

Country Status (1)

Country Link
CN (1) CN110335679B (en)


Also Published As

Publication number Publication date
CN110335679A (en) 2019-10-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant