CN114121296B - Data-driven clinical information rule extraction method, storage medium and equipment - Google Patents
- Publication number
- CN114121296B (application number CN202111500068.2A)
- Authority
- CN
- China
- Prior art keywords
- rule
- data
- rule set
- optimal
- clinical information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
Abstract
The invention provides a data-driven clinical information rule extraction method, a storage medium and equipment, wherein the data-driven clinical information rule extraction method comprises the following steps: obtaining patient sample data, the patient sample data including individual clinical features of a patient; generating an initial rule set from the patient sample data; screening the initial rule set based on the time sequence characteristics in the initial rule set to obtain a universality rule set; and determining an optimal rule set through the accuracy and the interpretability of each rule in the universality rule set. According to the invention, a series of rules with high confidence and accuracy can be mined from clinical information on the premise of ensuring accuracy, so that a clear conclusion path can be effectively obtained, and a doctor can be assisted to make a decision to a certain extent.
Description
Technical Field
The invention belongs to the technical field of data mining, relates to a rule extraction method, and in particular relates to a data-driven clinical information rule extraction method, a storage medium and equipment.
Background
At present, with the development of intelligent medical technology, medical rules play an important role in the processes of risk prediction, clinical diagnosis and the like of diseases, wherein mining rules with high confidence in data such as clinical diagnosis information, demographic information and the like can assist doctors in decision making to a certain extent.
Existing disease risk and clinical diagnosis rules mostly come from medical scales and machine learning predictive models. (1) A medical scale quantifies a patient's clinical information, demographic information, daily habits and the like, assigns different scores to different features, and finally measures the degree of disease, the disease risk and so on in the form of a score. However, most existing medical scales were developed abroad and often ignore factors such as race, daily habits and individual differences, which affects the accuracy of scale-based evaluation. (2) Machine learning models can improve prediction and diagnostic accuracy to some extent. However, most existing machine learning models cannot directly provide interpretable decision rules.
Therefore, how to provide a method, a storage medium and a device for extracting clinical information rules based on data driving, so as to solve the defects that the prior art cannot provide a rule extraction scheme with high accuracy and interpretability, and the like, is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, an object of the present invention is to provide a method, a storage medium and a device for extracting clinical information rules based on data driving, which are used for solving the problem that the prior art cannot provide a rule extraction scheme with high accuracy and interpretability.
To achieve the above and other related objects, an aspect of the present invention provides a data-driven based clinical information rule extraction method, including: obtaining patient sample data, the patient sample data including individual clinical features of a patient; generating an initial rule set from the patient sample data; screening the initial rule set based on the time sequence characteristics in the initial rule set to obtain a universality rule set; and determining an optimal rule set through the accuracy and the interpretability of each rule in the universality rule set.
In one embodiment of the present invention, the patient sample data is table data without missing values, wherein each row of the table data represents a patient sample and each column represents a feature of the patient.
In one embodiment of the present invention, the step of generating an initial rule set from the patient sample data comprises: preprocessing the patient sample data; for the preprocessed patient sample data, rule extraction is carried out on each node in each generated tree by utilizing a tree model; and generating the initial rule set according to the rule extraction result.
In an embodiment of the present invention, the step of screening the initial rule set based on the time-sequence characteristics in the initial rule set to obtain a universality rule set includes: acquiring the time frequency with which rules occur on each node by using a time-series statistical method; and screening out the rules whose time frequency meets the user's preset requirement as the universality rule set.
In an embodiment of the present invention, the step of determining the optimal rule set according to the accuracy and the interpretability of each rule in the universality rule set includes: determining an optimal solution by a multi-objective optimization algorithm aiming at each rule in the universality rule set; and determining the combination of all the optimal solution components as the optimal rule set.
In an embodiment of the present invention, the step of determining the optimal solution by the multi-objective optimization algorithm includes: taking the accuracy and the interpretability of each rule as two optimization targets; randomly initializing a particle swarm for the optimization targets; determining the fitness of each particle in the particle swarm; updating the speed and the position of the particles according to the fitness; judging whether the maximum number of iterations has been reached or the global optimal position satisfies the minimum bound; and if yes, determining the Pareto optimal solution.
In an embodiment of the present invention, after the step of determining an optimal rule set by accuracy and interpretability of each rule in the universality rule set, the data-driven clinical information rule extraction method further includes: acquiring prediction data of clinical decisions required by a user; all the obtained prediction data form a prediction data set; and comparing the predicted data with the rules in the optimal rule set one by one, and obtaining the rule which is met by the predicted data set according to the matching result of the predicted data and the optimal rule set.
In an embodiment of the present invention, the optimal rule set includes a first rule, a second rule, and a third rule; the step of comparing the predicted data with the rules in the optimal rule set one by one, and obtaining the rule which the predicted data set accords with according to the matching result of the predicted data and the optimal rule set comprises the following steps: and determining the user illness probability corresponding to the prediction data set in response to the prediction data simultaneously meeting the first rule, the second rule and the third rule, wherein the user illness probability is used for providing auxiliary judgment information for a doctor in the process of disease diagnosis of the doctor.
To achieve the above and other related objects, the present invention provides in another aspect a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the data-driven based clinical information rule extraction method.
To achieve the above and other related objects, a final aspect of the present invention provides an electronic device, including: a processor and a memory; the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory so as to enable the electronic equipment to execute the data-driven clinical information rule extraction method.
As described above, the data-driven clinical information rule extraction method, the storage medium and the device according to the present invention have the following beneficial effects:
According to the invention, an initial rule set is generated from patient sample data, universal rules are then screened according to time-sequence characteristics, and the accuracy and the interpretability of each rule are used to determine an optimal rule set. The problems of the low prediction accuracy of medical scales and the poor interpretability of traditional machine learning models are thereby addressed, and the data-driven rule extraction scheme provided by the invention can mine a series of rules with high confidence and accuracy from clinical information on the premise of ensuring accuracy. A clear conclusion path can be effectively obtained, and the doctor is assisted in making decisions to a certain extent.
Drawings
FIG. 1 is a schematic flow chart of a data-driven clinical information rule extraction method according to an embodiment of the invention.
FIG. 2 is a flowchart of determining an optimal rule set according to an embodiment of the data-driven clinical information rule extraction method of the present invention.
FIG. 3 is a flowchart illustrating the calculation of the optimal solution according to an embodiment of the method for extracting clinical information rules based on data driving.
FIG. 4 is a flowchart of a method for extracting clinical information rules based on data driving according to an embodiment of the invention.
Fig. 5 is a schematic structural connection diagram of an electronic device according to an embodiment of the invention.
Description of element reference numerals
5 Electronic device
51 Processor
52 Memory
S11 to S16 Steps
S141 to S142 Steps
S141A to S141F Steps
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present invention with reference to specific examples. The invention may also be practiced or carried out in other embodiments, and the details of the present description may be modified or varied without departing from the spirit and scope of the present invention. It should be noted that the following embodiments and the features in the embodiments may be combined with each other where no conflict arises.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the present invention by way of illustration, and only the components related to the present invention are shown in the illustrations, not according to the number, shape and size of the components in actual implementation, and the form, number and proportion of each component in actual implementation may be arbitrarily changed, and the layout of the components may be more complex.
According to the data-driven-based clinical information rule extraction method, the storage medium and the device, a series of rules with high confidence and accuracy can be mined from clinical information on the premise of ensuring accuracy, so that a clear conclusion path can be effectively obtained, and a doctor can be assisted in making a decision to a certain extent.
The principle and implementation of the data-driven clinical information rule extraction method, storage medium and apparatus of the present embodiment will be described in detail below with reference to fig. 1 to 5, so that those skilled in the art can understand the data-driven clinical information rule extraction method, storage medium and apparatus of the present embodiment without creative effort.
Referring to fig. 1, a schematic flow chart of a data-driven clinical information rule extraction method according to an embodiment of the invention is shown. As shown in fig. 1, the method for extracting clinical information rules based on data driving specifically includes the following steps:
S11, acquiring patient sample data, wherein the patient sample data comprises various clinical characteristics of a patient.
In one embodiment of the present invention, the patient sample data is table data without missing values, wherein each row of the table data represents a patient sample and each column represents a feature of the patient.
In practice, taking pulmonary embolism as an example, laboratory test data for a collection of patients with outcome variables are taken from a hospital-related department as patient sample data.
S12, generating an initial rule set according to the patient sample data.
In one embodiment, S12 specifically includes the following steps:
(1) Preprocessing the patient sample data.
Specifically, the preprocessing includes existing preprocessing means such as data cleaning, data merging, data transformation, data normalization and the like, so as to improve the usability of the patient sample data.
(2) For the preprocessed patient sample data, performing rule extraction on each node in each generated tree by using a tree model.
In particular, the tree model may be any robust model such as a decision tree, random forest, GBDT (Gradient Boosting Decision Tree) or XGBoost.
In practical application, rule extraction is performed on each node in each generated tree by using a random forest algorithm. The random forest is a stable ensemble learning model: it adopts the idea of bagging, uses the bootstrap method to generate multiple training sets, constructs a decision tree for each training set, and finally combines the classification results of the decision-tree-based classifiers to obtain a better prediction model.
Specifically, given a dataset D with feature vectors X and corresponding labels y, let D = {(x_i, y_i)}, i = 1, 2, …, n, where x_i ∈ X, x_i = (x_i1, x_i2, …, x_im), m is the number of features, and y_i ∈ y = {0, 1, …}. Gini(D) is defined to measure the purity of D and can be expressed as follows:
$$\mathrm{Gini}(D) = \sum_{k=1}^{K} \sum_{k' \neq k} p_k \, p_{k'} = 1 - \sum_{k=1}^{K} p_k^{2} \qquad (1)$$
In Equation 1, p_k (k = 1, 2, …, K) represents the proportion of class-k samples in the current dataset, and k' denotes a class other than k. The smaller Gini(D), the higher the purity of dataset D. Assuming that feature m has V possible values {m1, m2, …, mV}, splitting dataset D on feature m generates V different branch nodes, where the v-th branch is denoted D^v. Gini_index(D, m) is defined to represent the uncertainty of feature m in D and can be expressed as:
$$\mathrm{Gini\_index}(D, m) = \sum_{v=1}^{V} \frac{|D^{v}|}{|D|} \, \mathrm{Gini}(D^{v}) \qquad (2)$$
For training set D, the learning algorithm that constructs the decision tree can be represented as a mapping from X to y that recursively splits D into subsets on the feature with the lowest Gini index after the split, thereby forming a tree. The selected feature m* is expressed as:
$$m^{*} = \arg\min_{m} \ \mathrm{Gini\_index}(D, m) \qquad (3)$$
The classification result is then obtained by integrating the weighted outputs of all decision trees (Equation 4), where ω_h represents the weight of the h-th tree. A sample is then classified according to Equation 5, where S represents the number of trees.
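The Gini impurity and the Gini-index split criterion described above can be sketched as follows. This is a minimal illustration in plain Python, not the patent's implementation; the feature names and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_index(rows, labels, feature):
    """Weighted Gini impurity of the subsets produced by splitting on `feature`."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    return sum(len(group) / n * gini(group) for group in groups.values())

def best_feature(rows, labels, features):
    """Feature with the lowest Gini index after the split."""
    return min(features, key=lambda f: gini_index(rows, labels, f))

# Toy data: hypothetical binary features and a binary outcome.
rows = [
    {"varicose": 1, "sex": 0},
    {"varicose": 1, "sex": 1},
    {"varicose": 0, "sex": 0},
    {"varicose": 0, "sex": 1},
]
labels = [1, 1, 0, 0]

print(gini(labels))
print(best_feature(rows, labels, ["varicose", "sex"]))
```

In this toy dataset the feature `varicose` separates the two classes perfectly, so its Gini index is zero and it is selected for the split.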
(3) Generating the initial rule set according to the rule extraction result.
Specifically, the initial rule set is obtained as follows: the random forest algorithm traverses the path from the root node to each leaf node in each decision tree; the features of the nodes along a path give the conditions of the rule, and the class of the leaf node gives the conclusion of the rule.
In practical applications, when performing disease prediction or medical diagnosis tasks, the class output by the tree model is determined by the outputs of the individual trees. Since the tree model is a "white box model" that provides a clear path for each conclusion, the rules for all nodes on each tree in the tree model are output as the initial rule set.
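The root-to-leaf traversal can be sketched as below. The nested-dictionary tree layout, the feature names and the thresholds are assumptions made for illustration; a real random forest library stores its trees differently, and for a forest the same function would be applied to every tree.

```python
def extract_rules(node, conditions=()):
    """Yield (conditions, predicted_class) for every root-to-leaf path."""
    if "label" in node:
        # Leaf node: its class is the conclusion of the rule.
        yield list(conditions), node["label"]
        return
    feature, threshold = node["feature"], node["threshold"]
    # Internal node: each branch adds one condition to the rule.
    yield from extract_rules(node["left"], conditions + ((feature, "<=", threshold),))
    yield from extract_rules(node["right"], conditions + ((feature, ">", threshold),))

# Hypothetical decision tree with two internal nodes and three leaves.
tree = {
    "feature": "varicose_vein_diagnosis",
    "threshold": 0.5,
    "left": {"label": 0},
    "right": {
        "feature": "visit_count",
        "threshold": 1.5,
        "left": {"label": 1},
        "right": {"label": 0},
    },
}

rules = list(extract_rules(tree))
for conditions, label in rules:
    print(conditions, "->", label)
```

Each yielded pair is one candidate rule: a conjunction of threshold conditions and the class it concludes.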
S13, screening the initial rule set based on the time-sequence characteristics in the initial rule set to obtain the universality rule set. By screening rules on indicators such as their time frequency, non-universal rules that correspond to black swan events can be effectively avoided.
In one embodiment, S13 specifically includes the following steps:
(1) Obtaining the time frequency with which rules occur on each node by using a time-series statistical method.
In particular, the time-series statistical method may be a time-series statistics function or any other implementation that provides the same function.
In practical application, for the statistical analysis of time-series data in the rules, the samples on each node are grouped and aggregated by time frequency using the Python pandas package, for example by counting information with time-frequency attributes such as the number of days, weeks, months or years in which samples appear on the node, or their start and end times.
(2) Screening out the rules whose time frequency meets the user's preset requirement as the universality rule set.
Specifically, suppose the user's preset requirement is 1 year. If the patient sample data for a rule appear only within a 2-week period, the rule extracted from those data is not universal; if the patient sample data appear over a 2-year period, the rule extracted from those data is universal.
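The screening step can be sketched as follows. The sketch assumes that the "time frequency" is the span between the first and last sample matched by a rule, which is one reading of the 2-week versus 2-year example above. It uses only the standard library rather than pandas, and the rule names and dates are hypothetical.

```python
from datetime import date

def time_span_days(dates):
    """Days between the first and last sample matched by a rule."""
    return (max(dates) - min(dates)).days

def screen_rules(rule_dates, min_span_days):
    """Keep the rules whose matched samples span at least `min_span_days`."""
    return [rule for rule, dates in rule_dates.items()
            if time_span_days(dates) >= min_span_days]

# Hypothetical visit dates of the samples that fall on each rule's node.
rule_dates = {
    "rule_a": [date(2020, 3, 1), date(2020, 3, 14)],                    # about 2 weeks
    "rule_b": [date(2019, 1, 10), date(2020, 6, 2), date(2021, 1, 9)],  # about 2 years
}

universal = screen_rules(rule_dates, min_span_days=365)
print(universal)
```

With a preset requirement of one year, only the rule whose samples are spread over roughly two years is kept.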
S14, determining an optimal rule set according to the accuracy and the interpretability of each rule in the universality rule set.
Referring to fig. 2, a flowchart of determining an optimal rule set according to an embodiment of the data-driven clinical information rule extraction method of the present invention is shown. As shown in fig. 2, S14 specifically includes the following steps:
S141, determining an optimal solution by a multi-objective optimization algorithm for each rule in the universality rule set. The multi-objective optimization algorithm is used to balance the accuracy and the interpretability of the rules.
Specifically, the multi-objective optimization algorithm may be any algorithm capable of optimizing two or more objectives, such as a multi-objective particle swarm algorithm, a non-dominated sorting genetic algorithm or a multi-objective evolutionary algorithm.
Referring to fig. 3, a flowchart of an optimal solution calculation according to an embodiment of the data-driven clinical information rule extraction method of the present invention is shown. As shown in fig. 3, S141 specifically includes the following steps:
S141A, the accuracy and the interpretability of each rule are taken as two optimization targets.
To guarantee the accuracy of the rule sets, the accuracy of each rule set, namely the proportion of samples it predicts correctly, is calculated. Rule accuracy is defined by Equation 6, in which Q_ACC represents the accuracy of the rule set, Q represents the number of samples, and x_i represents the i-th sample.
The interpretability of a rule is defined by Equation 7, in which Q_FEA, Q_COV and Q_CNT represent the complexity of the rule, the coverage of the rule and the quality of the rule, respectively, and α, β and γ are their weights, which can be set according to actual conditions. Specifically, Q_FEA reflects the number of features in each rule: if the average number of features involved in the rule is small, the Q_FEA value is larger. Q_COV represents the coverage of each rule: when a rule has strong applicability, Q_COV is larger. Q_CNT measures the quality of the rule. They are defined by Equations 8 to 10.
Equation 8 uses the active features in the i-th rule, and Equation 9 uses the number of samples matching the i-th rule. In Equation 10, Rule_selected represents the number of rules derived by the algorithm, and Z is the number of candidate rules generated. Q_FEA = 1 indicates that the rule contains only one feature, and Q_FEA = 0 indicates that the rule contains all features. That is, the fewer features a rule involves, the easier it is for a physician to understand during diagnosis.
S141B, randomly initializing a particle swarm aiming at the optimization target.
The invention regards the solution in the optimization problem as "particles", all of which are searched in the N-dimensional space, each particle having only two attributes: position and velocity, velocity representing the speed of movement and position representing the direction of movement. The current position of the particle is a candidate solution to the optimization problem, and the flying process of the particle is the searching process of the individual.
S141C, determining the fitness of each particle in the particle swarm.
Specifically, a fitness function is defined that determines the individual optimal solution of each particle, and the global optimal value is found from the individual optimal solutions.
S141D, updating the speed and the position of the particles according to the fitness.
Specifically, the flight speed of a particle is dynamically adjusted based on the historical optimal position of the particle and the historical optimal position of the swarm. The speed and position of the particles are updated according to the fitness.
S141E, judging whether the maximum number of iterations has been reached or the global optimal position satisfies the minimum bound.
The optimal solution found by each particle is called an individual extremum, and the best individual extremum in the particle swarm is taken as the current global optimal solution. The iteration continues, updating the speed and the position, until an optimal solution meeting the termination condition is obtained. If the maximum number of iterations has not been reached and the global optimal position does not satisfy the minimum bound, the process returns to step S141C.
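The loop formed by steps S141B to S141E can be sketched with a basic single-objective particle swarm. This is a simplified stand-in, not the multi-objective variant the method calls for: the inertia weight `w`, the acceleration coefficients `c1` and `c2`, the search range, the tolerance `tol` and the `sphere` test function are all assumptions chosen for illustration.

```python
import random

def pso(fitness, dim, n_particles=20, max_iter=100, tol=1e-8,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimise `fitness` with a basic particle swarm; returns (best_pos, best_val)."""
    rng = random.Random(seed)
    # S141B: random initial positions, zero initial velocities.
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    # S141C: initial fitness, individual bests and global best.
    pbest = [p[:] for p in pos]
    pbest_val = [fitness(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(max_iter):            # S141E: iteration limit
        if gbest_val <= tol:             # S141E: bound on the global best
            break
        for i in range(n_particles):
            for d in range(dim):         # S141D: velocity and position update
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = fitness(pos[i])        # S141C: re-evaluate fitness
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)

best_pos, best_val = pso(sphere, dim=2)
print(best_pos, best_val)
```

A multi-objective version would replace the single global best with an archive of non-dominated positions, from which each particle's guide is drawn.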
S141F, if yes, determining the Pareto optimal solution.
For the particles obtained when the maximum number of iterations has been reached or the global optimal position satisfies the minimum bound, the Pareto optimal solution in the final population is determined by using a fast non-dominated sorting method.
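The selection of Pareto optimal solutions can be sketched as below. The code extracts only the first non-dominated front, which is the rank-1 front a non-dominated sort would return; the candidate names and their (accuracy, interpretability) scores are hypothetical, and both objectives are assumed to be maximised.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on every objective and better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(candidates):
    """Names of the candidates that no other candidate dominates."""
    return [name for name, score in candidates.items()
            if not any(dominates(other, score)
                       for other_name, other in candidates.items()
                       if other_name != name)]

# Hypothetical (accuracy, interpretability) scores of candidate rule sets.
candidates = {
    "rule_set_1": (0.92, 0.40),
    "rule_set_2": (0.88, 0.70),
    "rule_set_3": (0.85, 0.65),  # dominated by rule_set_2
    "rule_set_4": (0.80, 0.90),
}

front = pareto_front(candidates)
print(front)
```

Every member of the front trades accuracy against interpretability differently, so none can be improved on one objective without losing on the other.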
S142, determining the combination of all the optimal solutions as the optimal rule set.
Specifically, for pulmonary arterial embolism, the optimal rule set is: "lower limb varicose vein_diagnosis_any > 0.5 within 1 month, gender_visit_count <= 1.5 within 10000 days, age_visit_last <= 26373.0 within 10000 days".
When all three conditions of this rule set hold, the probability of the patient suffering from VTE is determined to be 90% or more.
Referring to fig. 4, a flowchart of the prediction data matching process according to an embodiment of the invention is shown. As shown in fig. 4, after step S14, the data-driven clinical information rule extraction method further includes the following steps:
S15, obtaining prediction data for the clinical decisions required by a user; all the acquired prediction data constitute a prediction data set.
S16, comparing the predicted data with the rules in the optimal rule set one by one, and obtaining the rule which the predicted data set accords with according to the matching result of the predicted data and the optimal rule set.
In one embodiment, the optimal rule set includes a first rule, a second rule, and a third rule.
And determining the user illness probability corresponding to the prediction data set in response to the prediction data simultaneously meeting the first rule, the second rule and the third rule, wherein the user illness probability is used for providing auxiliary judgment information for a doctor in the process of disease diagnosis of the doctor.
Specifically, for pulmonary arterial embolism, the optimal rule set is: "lower limb varicose vein_diagnosis_any > 0.5 within 1 month, gender_visit_count <= 1.5 within 10000 days, age_visit_last <= 26373.0 within 10000 days". The first rule is lower limb varicose vein_diagnosis_any > 0.5 within 1 month, the second rule is gender_visit_count <= 1.5 within 10000 days, and the third rule is age_visit_last <= 26373.0 within 10000 days. When the prediction data corresponding to a patient satisfy all three rules simultaneously, the analysis concludes that the probability of the patient suffering from pulmonary artery embolism is more than 90%. Knowing this, the doctor can take the information into account when diagnosing the patient.
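The one-by-one comparison of steps S15 and S16 can be sketched as follows. The thresholds are taken from the rule set above, but the feature keys and the two patient records are hypothetical stand-ins for whatever names and values the real data use.

```python
import operator

OPS = {">": operator.gt, "<=": operator.le}

# Thresholds from the example rule set; the feature keys are hypothetical.
optimal_rule_set = [
    ("varicose_vein_diagnosis_any_1_month", ">", 0.5),
    ("gender_visit_count_10000_days", "<=", 1.5),
    ("age_visit_last_10000_days", "<=", 26373.0),
]

def matched_rules(record, rules):
    """Rules whose condition holds for this patient record."""
    return [(feature, op, threshold) for feature, op, threshold in rules
            if feature in record and OPS[op](record[feature], threshold)]

def satisfies_all(record, rules):
    """True when the record meets every rule in the set simultaneously."""
    return len(matched_rules(record, rules)) == len(rules)

patient_a = {
    "varicose_vein_diagnosis_any_1_month": 1.0,
    "gender_visit_count_10000_days": 1.0,
    "age_visit_last_10000_days": 21000.0,
}
# Same patient without a varicose vein diagnosis in the last month.
patient_b = dict(patient_a, varicose_vein_diagnosis_any_1_month=0.0)

print(satisfies_all(patient_a, optimal_rule_set))
print(satisfies_all(patient_b, optimal_rule_set))
```

Only a record that meets all three rules would be flagged with the high disease probability; a partial match returns the subset of rules that held, which can still be shown to the doctor.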
The following compares the effect of the present invention with that of an existing machine learning model. Taking a proportional hazards regression model as an example, such a model evaluates the influence of multiple factors on disease risk or diagnosis outcome simultaneously, and obtains a predictive and diagnostic function by weighting the factors and applying a nonlinear mapping. Taking chronic kidney disease as an example, to predict the probability of developing renal failure within five years, the following proportional hazards regression model can be obtained:
An accurate prediction result can be obtained through this function, but the result of weighting or nonlinearly combining factors such as GFR (glomerular filtration rate), ACR (urine albumin-to-creatinine ratio) and age is not interpretable. By contrast, the present invention uses a multi-objective optimization algorithm to mine a series of rules with high confidence and accuracy from clinical information while ensuring accuracy.
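The fitted model referred to above is not reproduced in this text. For orientation only, the general textbook form of a proportional hazards regression model is sketched below. It is not taken from the patent, and the symbols $h_0$, $\beta_j$ and $x_j$ are assumptions introduced for illustration.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p\right)
```

Here $h_0(t)$ is a baseline hazard, each $x_j$ is one factor (for example GFR or ACR) and each $\beta_j$ is its fitted weight. The prediction comes from an exponentiated weighted sum rather than from explicit, readable conditions, which is the interpretability limitation the comparison points to.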
The protection scope of the data-driven clinical information rule extraction method is not limited to the order of execution of the steps listed in this embodiment. Any scheme obtained by adding, removing or replacing steps using the prior art in accordance with the principles of the invention falls within the protection scope of the invention.
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the data-driven-based clinical information rule extraction method.
Those of ordinary skill in the art will appreciate that all or part of the steps for implementing the method embodiments described above may be performed by hardware under the control of a computer program. The computer program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments described above. The computer-readable storage medium includes any medium that can store program code, such as ROM, RAM, magnetic disks or optical disks.
Referring to fig. 5, a schematic structural connection diagram of an electronic device according to an embodiment of the invention is shown. As shown in fig. 5, the present embodiment provides an electronic device 5, specifically including: a processor 51 and a memory 52; the memory 52 is configured to store a computer program, and the processor 51 is configured to execute the computer program stored in the memory 52, so that the electronic device 5 performs the steps of the data-driven clinical information rule extraction method.
The processor 51 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP for short), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field-programmable gate array (Field Programmable Gate Array, FPGA for short) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The memory 52 may include a random access memory (Random Access Memory, abbreviated as RAM) and may further include a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory.
In practical applications, the electronic device may be a computer including all or part of the following components: a memory, a memory controller, one or more processing units (CPUs), a peripheral interface, an RF circuit, an audio circuit, a speaker, a microphone, an input/output (I/O) subsystem, a display screen, other output or control devices, and an external port. The computer includes, but is not limited to, a personal computer such as a desktop computer, a notebook computer, a tablet computer, a smart phone, a personal digital assistant (Personal Digital Assistant, PDA for short), and the like. In other embodiments, the electronic device may also be a server, which may be deployed on one or more physical servers according to factors such as function and load, or may be a cloud server formed by a distributed or centralized server cluster; this embodiment is not limited in this respect.
In summary, the data-driven clinical information rule extraction method, storage medium and device generate an initial rule set from patient sample data, screen it for universality rules according to time-series characteristics, and determine the optimal rule set using the accuracy and interpretability of each rule. This addresses the low prediction accuracy of medical scales and the poor interpretability of traditional machine learning models: the data-driven rule extraction scheme provided by the invention mines a series of rules with high confidence and accuracy from clinical information while ensuring accuracy. A clear path to each conclusion can thus be obtained, which assists doctors in making decisions to a certain extent. The invention effectively overcomes various defects in the prior art and has high industrial utilization value.
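To make the trade-off between accuracy and interpretability concrete, the following minimal Python sketch keeps only the rules that are not dominated on both objectives. It is an illustration under stated assumptions, not the patented procedure: the method described here searches with a multi-objective particle swarm algorithm, whereas this sketch swaps in an exhaustive non-dominated filter over a small, made-up candidate list. The `ScoredRule` type, the rule names and all scores are hypothetical.

```python
from typing import List, NamedTuple


class ScoredRule(NamedTuple):
    name: str
    accuracy: float          # higher is better
    interpretability: float  # higher is better; hypothetical score


def dominates(a: ScoredRule, b: ScoredRule) -> bool:
    """a dominates b if it is at least as good on both objectives and better on one."""
    at_least_as_good = (a.accuracy >= b.accuracy
                        and a.interpretability >= b.interpretability)
    strictly_better = (a.accuracy > b.accuracy
                       or a.interpretability > b.interpretability)
    return at_least_as_good and strictly_better


def pareto_front(rules: List[ScoredRule]) -> List[ScoredRule]:
    """Keep every rule that no other rule dominates."""
    return [r for r in rules if not any(dominates(o, r) for o in rules)]


# Made-up candidate rules for illustration only.
candidates = [
    ScoredRule("r1", 0.92, 0.40),
    ScoredRule("r2", 0.85, 0.80),
    ScoredRule("r3", 0.80, 0.70),  # worse than r2 on both objectives
    ScoredRule("r4", 0.70, 0.95),
]

front = pareto_front(candidates)
print([r.name for r in front])  # r3 is dropped because r2 dominates it
```

The exhaustive filter is quadratic in the number of candidates, so it only suits small rule sets; a swarm-based search is one way to explore a larger space without scoring every combination.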
The above embodiments merely illustrate the principles of the present invention and its effects, and are not intended to limit the invention. Those skilled in the art may modify or vary the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications and variations made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the invention shall be covered by the claims of the invention.
Claims (9)
1. A data-driven clinical information rule extraction method, characterized by comprising the following steps:
obtaining patient sample data, the patient sample data including individual clinical features of a patient; the patient sample data is table data without missing values;
generating an initial rule set from the patient sample data; wherein the patient sample data is pre-processed; for the preprocessed patient sample data, rule extraction is carried out on each node in each generated tree by utilizing a tree model; generating the initial rule set according to the rule extraction result;
screening the initial rule set based on the time sequence characteristics in the initial rule set to obtain a universality rule set;
and determining an optimal rule set through the accuracy and the interpretability of each rule in the universality rule set.
2. The data-driven clinical information rule extraction method according to claim 1, wherein:
the patient sample data is tabular data without missing values, wherein each row of the tabular data represents a patient sample and each column represents a characteristic of the patient.
3. The method for extracting clinical information rules based on data driving according to claim 1, wherein the step of screening the initial rule set based on the time sequence features in the initial rule set to obtain a universality rule set comprises:
acquiring, by a time-series statistical method, the time frequency with which the rule on each node occurs;
and screening out the rules whose time frequency meets the user's preset requirement as the universality rule set.
4. The data-driven clinical information rule extraction method according to claim 1, wherein the step of determining an optimal rule set by accuracy and interpretability of each rule in the universality rule set comprises:
determining an optimal solution by a multi-objective optimization algorithm aiming at each rule in the universality rule set;
and determining the combination formed by all the optimal solutions as the optimal rule set.
5. The method of claim 4, wherein the determining the optimal solution by a multi-objective optimization algorithm comprises:
taking the accuracy and the interpretability of each rule as two optimization targets;
randomly initializing a particle swarm aiming at the optimization target;
determining the fitness of each particle in the particle swarm;
updating the speed and the position of the particles according to the fitness;
judging whether the maximum number of iterations is reached or the global optimal position satisfies the minimum bound;
if yes, determining the Pareto optimal solution.
6. The data-driven clinical information rule extraction method according to claim 1, wherein after the step of determining an optimal rule set by accuracy and interpretability of each rule in the universality rule set, the data-driven clinical information rule extraction method further comprises:
acquiring prediction data of clinical decisions required by a user; all the obtained prediction data form a prediction data set;
and comparing the predicted data with the rules in the optimal rule set one by one, and obtaining the rule which is met by the predicted data set according to the matching result of the predicted data and the optimal rule set.
7. The data-driven based clinical information rule extraction method according to claim 6, wherein the optimal rule set includes a first rule, a second rule, and a third rule; the step of comparing the predicted data with the rules in the optimal rule set one by one, and obtaining the rule which the predicted data set accords with according to the matching result of the predicted data and the optimal rule set comprises the following steps:
and determining the user illness probability corresponding to the prediction data set in response to the prediction data simultaneously meeting the first rule, the second rule and the third rule, wherein the user illness probability is used for providing auxiliary judgment information for a doctor in the process of disease diagnosis of the doctor.
8. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the data-driven based clinical information rule extraction method according to any one of claims 1 to 7.
9. An electronic device, comprising: a processor and a memory;
the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, to cause the electronic device to execute the data-driven-based clinical information rule extraction method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111500068.2A CN114121296B (en) | 2021-12-09 | 2021-12-09 | Data-driven clinical information rule extraction method, storage medium and equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114121296A (en) | 2022-03-01 |
CN114121296B (en) | 2024-02-02 |
Family
ID=80364078
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111500068.2A Active CN114121296B (en) | 2021-12-09 | 2021-12-09 | Data-driven clinical information rule extraction method, storage medium and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114121296B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117059214A (en) * | 2023-07-21 | 2023-11-14 | 南京智慧云网络科技有限公司 | Clinical scientific research data integration and intelligent analysis system and method based on artificial intelligence |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103326353A (en) * | 2013-05-21 | 2013-09-25 | 武汉大学 | Environmental economic power generation dispatching calculation method based on improved multi-objective particle swarm optimization algorithm |
CN111489827A (en) * | 2020-04-10 | 2020-08-04 | 吉林大学 | Thyroid disease prediction modeling method based on associative decision tree |
CN112071420A (en) * | 2020-08-12 | 2020-12-11 | 福建中榕数据科技有限公司 | Clinical aid decision making method, system, equipment and medium based on real-time data |
AU2020103709A4 (en) * | 2020-11-26 | 2021-02-11 | Daqing Oilfield Design Institute Co., Ltd | A modified particle swarm intelligent optimization method for solving high-dimensional optimization problems of large oil and gas production systems |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11538586B2 (en) * | 2019-05-07 | 2022-12-27 | International Business Machines Corporation | Clinical decision support |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||