CN116486939B - Data mining method and system for medicine knowledge graph and electronic equipment - Google Patents
Data mining method and system for medicine knowledge graph and electronic equipment
- Publication number
- CN116486939B (CN202211670808.1A)
- Authority
- CN
- China
- Prior art keywords
- diagnosis
- medicine
- data
- drug
- prescriptions
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Public Health (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Bioinformatics & Computational Biology (AREA)
- Pathology (AREA)
- Biomedical Technology (AREA)
- Epidemiology (AREA)
- Primary Health Care (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Crystallography & Structural Chemistry (AREA)
- Chemical & Material Sciences (AREA)
- Computational Linguistics (AREA)
- Medical Treatment And Welfare Office Work (AREA)
Abstract
The invention relates to the processing of medical health information data and discloses a data mining method, system and electronic device for a medicine knowledge graph. The method comprises a data acquisition step, a feature extraction step, a rule judgment step, a model training step, a data fusion step, an iteration step and a model identification step. According to the invention, features related to medicines are mined from prescription big data, and accurate judgment logic and judgment conditions are designed from those features. The correct medicine relation data obtained through the judgment logic and judgment conditions serve as verification samples and supplementary training samples for the algorithm model, and results that the model judges to be wrong are re-labeled and then added to the training set. This ensures that the training samples learned by the algorithm model are sufficiently accurate and comprehensive, so that the final algorithm model achieves the required recognition accuracy.
Description
Technical Field
The present invention relates to processing of medical health information data, and in particular, to a method, a system, and an electronic device for mining data of a drug knowledge graph.
Background
The medicine knowledge graph is essentially a knowledge base of medicines. It generally covers dimensions such as indications, indicated ages, usage, dosage, medication frequency, course of treatment, interactions and adverse reactions. Constructing a medicine knowledge graph mainly means mining knowledge along these dimensions. Existing methods mainly extract knowledge from medicine package inserts or books: manual curation is inefficient, while extraction with artificial intelligence techniques is relatively inaccurate and requires complex modeling. Hospitals and Internet medical servers store a large number of medical prescriptions, which cover a large number of common medicines and common knowledge. Indications extracted from prescriptions are more accurate than those extracted from package inserts and books, and the approach is easier to put into practice.
However, diagnoses in medical prescriptions are often described irregularly, and medicines and diagnoses are frequently mismatched or missing. When the medicines and the diagnoses in a prescription do not match, the error can be viewed from two angles: from the perspective of the diagnosis, the medicine was prescribed wrongly; from the perspective of the medicine, the diagnosis entered by the doctor was wrong. There is therefore an urgent need for a method or system that can accurately extract correct medicine and diagnosis data from prescription data.
Disclosure of Invention
In view of the above-mentioned drawbacks or shortcomings in the prior art, the present invention provides a data mining method, system and electronic device for drug knowledge graph, which can quickly mine correct drug and diagnosis knowledge from prescription big data, thereby accurately and efficiently constructing drug knowledge graph.
The invention provides a data mining method of a medicine knowledge graph, which comprises the following steps:
a data acquisition step of acquiring structured prescription data from a database, the prescription data including patient information, hospital information, and drug information;
a feature extraction step of extracting feature data related to a drug in the prescription data;
a rule judging step of determining a preset rule set according to the extracted characteristic data related to the medicine, and determining a correct medicine relation data set according to the preset rule set;
a model training step, namely extracting part of the characteristic data related to the medicines for marking, taking the characteristic data related to the medicines as the characteristics of a judgment model to be trained, and training to obtain the judgment model according to the marked result and the characteristics of the judgment model to be trained;
a data fusion step, namely re-judging the data in the correct medicine relation data set by using the judgment model obtained through training, re-labeling the medicine relation data with the judgment result being wrong, supplementing the re-labeled data into part of labeled characteristic data, and re-training the judgment model;
an iteration step, wherein the model training step and the data fusion step are repeatedly executed until the judging result and the labeling result of the judging model on the correct medicine relation data set determined by the preset rule set are within a preset error, so as to obtain a final judging model;
and a model identification step, namely identifying the characteristic data related to the medicines which are not marked by using a final judgment model.
Further, the step of re-judging the correct medicine relation data set by using the judgment model obtained by training and re-labeling medicine relation data with the wrong judgment result specifically includes the following steps:
and scoring the correct medicine relation data set determined by the preset rule set by using the judgment model obtained through training, and marking the correct medicine relation data set corresponding to the score smaller than the preset threshold value again.
Further, the feature data related to the medicine is feature data related to the medicine and the diagnosis, and the feature extraction step further includes:
calculating confidence scores for the drugs:
$$h = \sum_{s=1}^{S} h'_s, \qquad h'_s = (1-\beta)^{n};$$
where h is the total confidence of the drug, S is the total number of prescriptions in which the drug appears, $h'_s$ is the confidence score of the drug in each prescription, β represents the weight attenuation value, and n represents the number of drugs;
and calculating a prescription confidence score i according to the difference between the frequency b of each medicine occurrence in all prescriptions and the confidence h of the medicine.
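The confidence computation above can be sketched as follows. This is a minimal illustration, assuming that the total confidence h is the sum of the per-prescription scores over the S prescriptions containing the drug, and that n is the number of distinct drugs on each prescription. The function name `drug_confidence`, the input layout and the default β = 0.3 (the empirical value mentioned in the detailed description) are assumptions for illustration, not part of the claimed method.

```python
from collections import defaultdict

def drug_confidence(prescriptions, beta=0.3):
    """Return {drug: (b, h, i)} for a list of prescriptions.

    Each prescription is a list of drug names (hypothetical layout).
    b = number of prescriptions containing the drug,
    h = sum over those prescriptions of (1 - beta) ** n,
        where n is the number of distinct drugs on the prescription,
    i = b - h (prescription confidence score).
    """
    b = defaultdict(int)
    h = defaultdict(float)
    for drugs in prescriptions:
        unique = set(drugs)
        score = (1 - beta) ** len(unique)  # per-prescription score h'_s
        for drug in unique:
            b[drug] += 1
            h[drug] += score
    return {drug: (b[drug], h[drug], b[drug] - h[drug]) for drug in b}

# Hypothetical sample: one drug appears alone, with one other drug, and with two others.
sample = [
    ["amoxicillin"],
    ["amoxicillin", "ibuprofen"],
    ["ibuprofen", "loratadine", "amoxicillin"],
]
scores = drug_confidence(sample)
```

A drug that mostly appears on crowded prescriptions accumulates a small h and therefore a large i relative to b.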
Further, the feature data related to the medicine is feature data related to the medicine and the diagnosis, and the feature extraction step further includes:
calculating a prescription quality score rate p according to the ratio of the prescription confidence score i to the frequency b of each medicine in all prescriptions;
calculating the probability c of each diagnosis in all prescriptions according to the ratio of the times a of each diagnosis in all prescriptions to the number z of all prescriptions;
calculating the probability d of each medicine in all prescriptions according to the ratio of the times b of each medicine in all prescriptions to the number z of all prescriptions;
the ratio e of the probability d of each drug occurrence and the probability c of each diagnosis occurrence is calculated.
Further, the feature data related to the medicine is feature data related to the medicine and the diagnosis, and the feature extraction step further includes:
calculating the probability g of the simultaneous occurrence of the medicine and the diagnosis according to the ratio of the number f of simultaneous occurrences of the medicine and the diagnosis in the prescriptions to the number a of occurrences of each diagnosis in all prescriptions;
counting the number j of different hospitals when each medicine and diagnosis appear in different hospital prescriptions;
counting the simultaneous occurrence number of medicines, diagnoses and hospital names, and counting the maximum occurrence number k of the hospital names in each medicine and diagnosis;
and calculating the ratio L of the maximum times of the hospital names in each medicine and diagnosis according to the ratio of the maximum times k of the hospital names in each medicine and diagnosis to the times f of the simultaneous occurrence of the medicine and the diagnosis in the prescription.
Further, the feature data related to the medicine is feature data related to the medicine and the diagnosis, and the feature extraction step further includes:
counting the number m of different departments when each medicine and diagnosis are in different department prescriptions;
counting the simultaneous occurrence number of the medicines, the diagnoses and the department names, and counting the maximum occurrence number n of the department names in each medicine and the diagnoses;
and calculating the ratio o of the maximum number of occurrences of a department name for each medicine and diagnosis according to the ratio of the maximum number n of occurrences of a department name for each medicine and diagnosis to the number f of simultaneous occurrences of the medicine and the diagnosis in the prescriptions.
Further, a preset rule set is set according to the extracted characteristic data related to the medicine and diagnosis, and a preliminary correct medicine diagnosis data set and a preliminary incorrect medicine diagnosis data set are obtained according to the preset rule set.
Further, a data set which belongs to the preliminary correct medicine relation data set and does not belong to the preliminary incorrect medicine diagnosis data set is used as a correct medicine diagnosis data set determined according to the preset rule set.
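The selection described above is a set difference. A minimal sketch, assuming each record is identified by a hypothetical (drug, diagnosis) tuple:

```python
def rule_confirmed_pairs(preliminary_correct, preliminary_incorrect):
    """Keep pairs flagged as correct by the rule set and not flagged as incorrect."""
    return set(preliminary_correct) - set(preliminary_incorrect)

# Hypothetical example pairs.
correct = {("drug_a", "cough"), ("drug_b", "fever"), ("drug_c", "cough")}
incorrect = {("drug_c", "cough"), ("drug_d", "rash")}
confirmed = rule_confirmed_pairs(correct, incorrect)
```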
The second aspect of the present invention also provides a data mining system for drug knowledge graph, comprising:
a data acquisition module configured to acquire structured prescription data from a database, the prescription data including patient information, hospital information, and drug information;
a feature extraction module configured to extract feature data related to a drug in the prescription data;
the rule judging module is configured to determine a preset rule set according to the extracted characteristic data related to the medicines and determine a correct medicine relation data set according to the preset rule set;
the model training module is configured to extract part of the characteristic data related to the medicines for marking, take the characteristic data related to the medicines as the characteristics of a judging model to be trained, and train to obtain the judging model according to the marked result and the characteristics of the judging model to be trained;
the data fusion module is configured to re-judge the data in the correct medicine relation data set by using the judgment model obtained through training, re-label the medicine relation data with the judgment result being wrong, and supplement the re-labeled data into part of the labeled characteristic data, and re-train the judgment model;
the iteration module is configured to repeatedly execute the model training step and the data fusion step until the judgment result of the judgment model on the correct medicine relation data set determined by the preset rule set is the same as the labeling result, so as to obtain a final judgment model;
and the model identification module is configured to identify the characteristic data related to the medicine which is not marked by the final judgment model.
In a third aspect of the present invention, there is also provided an electronic apparatus, including:
one or more processors;
storage means for storing one or more computer programs;
the one or more computer programs, when executed by the one or more processors, cause the one or more processors to implement the data mining method of the drug knowledge graph described above.
The data mining method, the system and the electronic equipment for the medicine knowledge graph can rapidly mine correct medicine and diagnosis knowledge from prescription big data, so that the medicine knowledge graph can be constructed accurately and efficiently.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings, in which:
fig. 1 is a flow chart of a data mining method of a drug knowledge graph according to an embodiment of the present application;
FIG. 2 is a schematic diagram of prescription structuring data used in a method for mining data of a drug knowledge graph according to one embodiment of the present application;
FIG. 3 is a schematic diagram of feature data related to medicine and diagnosis mined in a data mining method of medicine knowledge graph according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a data mining system for drug knowledge-graph according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that although the terms first, second, third, etc. may be used in embodiments of the present invention to describe the acquisition modules, these acquisition modules should not be limited to these terms. These terms are only used to distinguish the acquisition modules from each other.
Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to a determination" or "in response to detection". Similarly, the phrase "if determined" or "if detected (stated condition or event)" may be interpreted as "when determined" or "in response to determination" or "when detected (stated condition or event)" or "in response to detection (stated condition or event)", depending on the context.
It should be noted that, the terms "upper", "lower", "left", "right", and the like in the embodiments of the present invention are described in terms of the angles shown in the drawings, and should not be construed as limiting the embodiments of the present invention. In addition, in the context, it will also be understood that when an element is referred to as being formed "on" or "under" another element, it can be directly formed "on" or "under" the other element or be indirectly formed "on" or "under" the other element through intervening elements.
Diagnoses in medical prescriptions are often described irregularly, and medicines and diagnoses are frequently mismatched or missing. When the medicines and the diagnoses in a prescription do not match, the error can be viewed from two angles: from the perspective of the diagnosis, the medicine was prescribed wrongly; from the perspective of the medicine, the diagnosis entered by the doctor was wrong. With traditional manual curation and machine extraction it is difficult to obtain an accurate medicine knowledge graph quickly. To overcome these problems in the prior art, the invention provides a data mining method, system and electronic device for a medicine knowledge graph. The method can accurately extract correct medicine and diagnosis data from the big data of medical prescriptions, so that an effective and usable medicine knowledge graph is formed.
Referring to fig. 1, an embodiment of the present invention provides a data mining method for a drug knowledge graph, including the following steps:
a data acquisition step S101 of acquiring structured prescription data including patient information, hospital information, and drug information from a database.
Specifically, this embodiment obtains a large amount of structured prescription data from a prescription big data database. The prescription data of the invention can be understood informally as the data on the prescription form issued when a patient visits a hospital to obtain medicine. Generally, prescription data includes, but is not limited to, the patient's age, sex, department, hospital name, doctor name, diagnosis, medicine name, medication method, single dose, medication time, medication frequency, total amount of prescribed medicine, and allergy information. Referring to fig. 2, the structured prescription data contains fields for the hospital, the department, the pregnancy status, the prescription, the approval number of the medicine (cfda_code), and the like.
A data cleaning step S102 of cleaning and filtering out unqualified data in the prescription data to obtain qualified prescription data.
Specifically, not all medical prescription content is statistically meaningful, so unqualified prescriptions need to be cleaned out. These include prescriptions without a patient age, prescriptions without a diagnosis, prescriptions with more medicines than a preset number (e.g., 5), prescriptions without a single medication, prescriptions in which the same medicine name appears twice, and the like. In addition, since one prescription may contain several diagnoses, the diagnosis field needs to be split on delimiters (comma, enumeration comma, semicolon, etc.). Note in particular that if a prescription contains more than two medicine names and two or more diagnoses, it needs to be filtered out, because with several diagnoses and several medicine names the correspondence between medicine and diagnosis cannot be determined.
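The cleaning rules above can be sketched as a simple filter. This is an illustrative sketch only: the record keys (`age`, `diagnosis`, `drugs`), the function names and the exact delimiter set are assumptions, and dropping prescriptions with an empty drug list is one possible reading of "prescriptions without a single medication".

```python
import re

# Delimiters that separate several diagnoses in one field (assumed set).
DELIMITERS = re.compile(r"[,，、;；]")

def split_diagnoses(text):
    """Split a raw diagnosis field into a list of trimmed, non-empty diagnoses."""
    return [part.strip() for part in DELIMITERS.split(text or "") if part.strip()]

def clean_prescriptions(prescriptions, max_drugs=5):
    """Return only the prescriptions that pass the cleaning rules."""
    kept = []
    for rx in prescriptions:
        drugs = rx.get("drugs") or []
        diagnoses = split_diagnoses(rx.get("diagnosis"))
        if rx.get("age") is None or not diagnoses:
            continue  # missing patient age or missing diagnosis
        if not drugs or len(drugs) > max_drugs:
            continue  # no drug at all (assumption) or too many drugs
        if len(set(drugs)) != len(drugs):
            continue  # the same drug name appears twice
        if len(drugs) > 2 and len(diagnoses) >= 2:
            continue  # drug-to-diagnosis correspondence is ambiguous
        kept.append({**rx, "diagnoses": diagnoses})
    return kept
```

Each surviving record carries a `diagnoses` list, so later counting steps do not need to parse the raw field again.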
A feature extraction step S103 of extracting feature data related to the medicine from the prescription data.
Specifically, the characteristic data related to the medicine in the prescription includes: characteristic data related to medicines and diagnostics, characteristic data related to medicines and medicines, characteristic data related to medicines and age, characteristic data related to medicines and gender, characteristic data related to medicines and modes of administration, and the like. The present embodiment will be described taking feature data related to medicine and diagnosis as an example.
In this step, the features of the relationship between medicine and diagnosis are mined with statistical methods guided by expert experience. The final result of feature mining is a data table with a number of columns (see fig. 3). The primary key of the table is the medicine name together with the diagnosis, and the remaining columns are the features related to the medicine and the diagnosis.
The statistical features mined in this embodiment are described below. These features differ from the prior art and play an important role in the subsequent rule judgment and model recognition, making the identified medicine and diagnosis data more effective and accurate.
(1) The number z of all prescriptions is counted.
(2) The number a of occurrences of each diagnosis in all prescriptions is counted.
(3) The number b of occurrences of each medicine in all prescriptions is counted.
(4) The probability c of each diagnosis occurring in all prescriptions is calculated, where c = a/z.
(5) The probability d of each medicine occurring in all prescriptions is calculated, where d = b/z.
(6) The ratio e of the probability of each medicine occurring to the probability of each diagnosis occurring is calculated, where e = d/c.
(7) The number f of co-occurrences (simultaneous occurrences) of the medicine and the diagnosis is counted.
(8) The probability g of the medicine and the diagnosis co-occurring is calculated, where g = f/a.
(9) Calculating confidence scores for the drugs:
$$h = \sum_{s=1}^{S} h'_s, \qquad h'_s = (1-\beta)^{n};$$
where h is the total confidence of the drug, S is the total number of prescriptions in which the drug appears, $h'_s$ is the confidence score of the drug in each prescription, 1 represents the original score weight, β represents the weight attenuation value (its empirical value is 0.3), and n represents the number of drugs.
(10) A prescription confidence score i is calculated, where i = b-h.
In the above formula, b is the number of occurrences of the medicine and can be regarded as the medicine's confidence score when every prescription carries a score weight of 1. The difference b - h is therefore the gap between the score at weight 1 and the score obtained after the weight is attenuated according to the number of medicines in each prescription. The larger i is, the more of the prescriptions containing this medicine also contain many other medicines. In general, since a single disease is usually treated with two medicines, a large number of medicines in a prescription indicates that each medicine has the same efficacy for the diagnosis, which may occur when auxiliary medicines are used or when fewer diagnoses are written than apply.
(11) The number j of different hospitals in whose prescriptions each medicine and diagnosis pair appears is counted.
(12) The co-occurrences of medicine, diagnosis and hospital name are counted, and the maximum number k of occurrences of any one hospital name is taken for each medicine and diagnosis pair.
(13) The ratio L of the maximum number of occurrences of a hospital name for each medicine and diagnosis pair is calculated, where L = k/f.
(14) The number m of different departments in whose prescriptions each medicine and diagnosis pair appears is counted.
(15) The co-occurrences of medicine, diagnosis and department name are counted, and the maximum number n of occurrences of any one department name is taken for each medicine and diagnosis pair.
(16) The ratio o of the maximum number of occurrences of a department name for each medicine and diagnosis pair is calculated, where o = n/f.
(17) A prescription quality score rate p is calculated, where p=i/b.
The characteristic data related to medicines and diagnosis, which are excavated according to the method, are the basis for subsequent judgment and calculation.
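The counting features listed above can be sketched as a single pass over the cleaned prescriptions. This is a minimal illustration, not the patented implementation: the record keys (`hospital`, `department`, `drugs`, `diagnoses`) and the function name `pair_features` are assumptions, and the confidence features h, i and p are left out for brevity.

```python
from collections import Counter, defaultdict

def pair_features(prescriptions):
    """Build a feature table keyed by (drug, diagnosis)."""
    z = len(prescriptions)                  # (1) number of prescriptions
    a, b, f = Counter(), Counter(), Counter()
    by_hospital = defaultdict(Counter)      # pair -> hospital name -> count
    by_department = defaultdict(Counter)    # pair -> department name -> count
    for rx in prescriptions:
        drugs, diagnoses = set(rx["drugs"]), set(rx["diagnoses"])
        a.update(diagnoses)                 # (2) occurrences of each diagnosis
        b.update(drugs)                     # (3) occurrences of each drug
        for drug in drugs:
            for diag in diagnoses:
                pair = (drug, diag)
                f[pair] += 1                # (7) co-occurrence count
                by_hospital[pair][rx["hospital"]] += 1
                by_department[pair][rx["department"]] += 1
    table = {}
    for pair, count in f.items():
        drug, diag = pair
        c, d = a[diag] / z, b[drug] / z     # (4), (5)
        k = max(by_hospital[pair].values())     # (12)
        n = max(by_department[pair].values())   # (15)
        table[pair] = {
            "c": c, "d": d, "e": d / c,     # (6)
            "f": count, "g": count / a[diag],   # (8)
            "j": len(by_hospital[pair]), "k": k, "L": k / count,    # (11), (13)
            "m": len(by_department[pair]), "n": n, "o": n / count,  # (14), (16)
        }
    return table
```

A value of L or o close to 1 means that nearly all co-occurrences of the pair come from a single hospital or department.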
A rule judging step S104 of determining a preset rule set according to the extracted feature data related to the medicine, and determining a correct medicine relation data set according to the preset rule set.
Specifically, this embodiment is described taking a correct medicine diagnosis relation data set as an example. To obtain a correct medicine diagnosis data set from the statistical features, traditional methods either find it through precise logic judgment or judge with an artificial intelligence algorithm model. Logic judgment, also called designing a rule set, means designing logic or conditions from the feature data related to medicine and diagnosis such that correct medicine diagnosis data satisfy them and incorrect medicine diagnosis data do not. In the model judgment method, part of the correct and incorrect medicine diagnosis data is labeled manually or by machine, and the features related to medicine and diagnosis are used as the features of the model to be trained. A machine learning model or deep learning model is then trained. After training, the model scores the unlabeled data, and data whose score exceeds a threshold are taken as correct medicine diagnosis data.
However, the content, items and types of medical prescriptions are complex, the correspondence between medicine and diagnosis requires strict logic judgment, and the prior art offers no practical and effective judgment logic. Although an artificial intelligence judgment model (a machine learning model or deep learning model) can imitate the human brain in learning the correct correspondence between medicines and diagnoses in prescriptions, it depends entirely on the features of the model and on the kinds of sample data labeled in advance. If the labeled sample data are too few, not comprehensive or incorrect, the artificial intelligence judgment model still has difficulty reaching the required judgment accuracy.
To overcome these defects, the invention fuses the condition judgment method with the algorithm model identification method. On the one hand, a set of accurate judgment logic and judgment conditions is designed. On the other hand, the correct medicine diagnosis data obtained from the judgment conditions and judgment logic are verified again with the algorithm model, and results verified to be wrong are re-labeled and then added to the training set. This ensures that the training samples learned by the algorithm model are sufficiently accurate and comprehensive, so that the final algorithm model achieves the required recognition accuracy. That is, the judgment logic and judgment conditions are not ultimately used to identify correct medicine diagnosis data. Instead, the correct medicine diagnosis data they produce serve as verification samples and supplementary training samples for the algorithm model.
In this step, correct drug diagnosis data are obtained according to the following judgment logic and judgment conditions, and these data are used as verification samples and supplementary training samples for the subsequent algorithm model.
Specifically, the judgment logic and judgment conditions for the correct drug diagnosis data are as follows:
When the same drug and diagnosis appear in prescriptions from different departments and different hospitals, if the number m of different departments is greater than a first preset value and the number j of different hospitals is greater than a second preset value, the drug and diagnosis in the prescriptions are regarded as preliminary correct drug diagnosis data. For example, m >= 4 and j >= 3 indicates a drug diagnosis pair that appears in at least four departments and at least three hospitals. Since this condition implies that at least three doctors prescribed the drug for the diagnosis, erroneous data are generally unlikely, so drug diagnosis data with m >= 4 and j >= 3 should be correct.
If the prescription quality score rate p is smaller than a third preset value and the number f of times the drug and the diagnosis occur together in prescriptions is greater than a fourth preset value, the drug and diagnosis in the prescriptions are regarded as preliminary correct drug diagnosis data. For example, p < 0.2 and f > 100 indicates that the drug and the diagnosis co-occur more than 100 times and that few of the prescriptions in which they co-occur contain a large number of kinds of drugs, so the drug diagnosis data should be correct.
If the number f of times the drug and the diagnosis occur together in prescriptions is greater than the fourth preset value and the probability c of the diagnosis occurring in prescriptions is smaller than a fifth preset value, the drug and diagnosis in the prescriptions are regarded as preliminary correct drug diagnosis data. For example, f > 100 and c < 0.0005 indicates that the drug and the diagnosis co-occur more than 100 times and that the diagnosis is relatively rare, i.e., it occurs infrequently overall but frequently together with this drug, so the drug diagnosis data should be correct.
The set of data satisfying the above conditions is taken as a preliminary correct drug diagnosis data set.
Further, the judgment logic and judgment conditions of the erroneous drug diagnosis data are as follows:
If the ratio e of the probability d of occurrence of a drug to the probability c of occurrence of a diagnosis is greater than a sixth preset value, the drug and diagnosis in the prescription are identified as preliminary erroneous drug diagnosis data. For example, e > 0.5 indicates that a frequently occurring drug is used for a less frequently occurring diagnosis, in which case the drug diagnosis data are likely to be erroneous.
If the number b of occurrences of the drug in all prescriptions is smaller than a seventh preset value, the drug and the corresponding diagnosis are identified as preliminary erroneous drug diagnosis data. For example, b < 5 indicates a drug that occurs fewer than 5 times, so the data for that drug are not statistically significant.
The set of data satisfying the above conditions is taken as a preliminary erroneous drug diagnosis data set.
Further, some drug diagnosis data may belong to both the preliminary correct drug diagnosis data set and the preliminary erroneous drug diagnosis data set. In this case, to ensure the accuracy of the final data, the data that belong to the preliminary correct drug diagnosis data set and do not belong to the preliminary erroneous drug diagnosis data set are taken as the correct drug diagnosis data set determined according to the judgment logic and judgment conditions.
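The judgment conditions above can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the `PairFeatures` class and the function names are hypothetical, the thresholds are the example values quoted in the text, and combining the three "correct" conditions with a logical OR (any one condition suffices) is an assumption, since the text does not state how they are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairFeatures:
    """Statistical features of one (drug, diagnosis) pair; names follow the text."""
    m: int    # number of different departments prescribing the pair
    j: int    # number of different hospitals prescribing the pair
    p: float  # prescription quality score rate
    f: int    # number of prescriptions in which drug and diagnosis co-occur
    c: float  # probability of the diagnosis occurring in all prescriptions
    e: float  # ratio of drug probability d to diagnosis probability c
    b: int    # number of occurrences of the drug in all prescriptions


def is_preliminary_correct(x: PairFeatures) -> bool:
    """Assumed OR of the three example conditions for correct data."""
    return (
        (x.m >= 4 and x.j >= 3)
        or (x.p < 0.2 and x.f > 100)
        or (x.f > 100 and x.c < 0.0005)
    )


def is_preliminary_wrong(x: PairFeatures) -> bool:
    """Either example condition for erroneous data."""
    return x.e > 0.5 or x.b < 5


def rule_correct_set(pairs: dict) -> set:
    """Keys that are preliminarily correct and not preliminarily wrong."""
    return {
        key
        for key, x in pairs.items()
        if is_preliminary_correct(x) and not is_preliminary_wrong(x)
    }
```

A pair that satisfies both a "correct" and an "erroneous" condition is excluded by the final set difference, which matches the conservative choice described in the preceding paragraph.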
Model training step S105: part of the feature data related to the medicine is labeled, the feature data related to the medicine are taken as the features of a judgment model to be trained, and the judgment model is obtained by training according to the partial labeling results and the features of the judgment model to be trained.
Specifically, this step is described with reference to the feature data related to drugs and diagnoses and the correct drug diagnosis data set. A portion of the correct and erroneous drug diagnosis data is labeled manually to serve as the training data set, and the features related to drugs and diagnoses mined in step S103 are used as the features of the training model. The training model may be an xgboost model with a learning rate of 0.01, a tree depth of 5, row and column random sampling parameters of 0.8, and an early_stop parameter of 100. The judgment model actually used is obtained once model training is completed.
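As a hedged illustration, the settings quoted above might be written for the xgboost Python package as shown below. The mapping of "row random" and "column random" to `subsample` and `colsample_bytree`, the binary logistic objective, and the helper name `xgboost_settings` are assumptions, not statements from the text.

```python
def xgboost_settings():
    """Return (params, early_stopping_rounds) mirroring the values in the text.

    The key names follow xgboost's documented parameters; which keys the
    original implementation actually used is an assumption.
    """
    params = {
        "objective": "binary:logistic",  # assumption: correct vs. erroneous label
        "eta": 0.01,                     # learning rate
        "max_depth": 5,                  # tree depth
        "subsample": 0.8,                # assumed meaning of "row random"
        "colsample_bytree": 0.8,         # assumed meaning of "column random"
    }
    early_stopping_rounds = 100          # the "early_stop parameter"
    return params, early_stopping_rounds


# With xgboost installed, these could be passed along the lines of:
#   xgboost.train(params, dtrain, evals=[(dvalid, "valid")],
#                 early_stopping_rounds=early_stopping_rounds)
# where dtrain and dvalid are hypothetical DMatrix objects.
```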
Data fusion step S106: the data in the correct medicine relation data set are judged again using the judgment model obtained through training, the medicine relation data judged to be wrong are labeled again, the re-labeled data are added to the partially labeled feature data, and the judgment model is trained again.
Specifically, the correct drug diagnosis data set obtained in step S104 is scored using the trained judgment model. Data scoring below the threshold (i.e., data that the judgment model regards as erroneous diagnosis indications) are labeled manually or by machine as correct or erroneous; once labeling is finished, they are added to the training data set and the model is retrained.
Iteration step S107: the model training step S105 and the data fusion step S106 are repeated until the judgment result of the judgment model on the correct medicine relation data set determined by the preset rule set is the same as the labeling result, yielding the final judgment model.
Specifically, the model training step S105 and the data fusion step S106 are repeated until the data that the judgment model scores below the threshold (i.e., the data it regards as erroneous drug diagnosis data) are, for the most part, also judged manually to be erroneous drug diagnosis data (i.e., the accuracy of the model reaches a certain value), yielding the final judgment model.
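The interplay of steps S105 to S107 can be sketched as a simple loop. This is a minimal sketch under stated assumptions: `train` and `relabel` are hypothetical callables standing in for model training and manual (or machine) re-labeling, the threshold of 0.5 and the round limit are illustrative, and the stopping test (no flagged sample is re-labeled as correct) is a strict reading of "the judgment result is the same as the labeling result".

```python
from typing import Callable, Dict, Hashable, Iterable, Tuple

Scorer = Callable[[Hashable], float]


def refine_model(
    labeled: Dict[Hashable, bool],
    rule_correct: Iterable[Hashable],
    train: Callable[[Dict[Hashable, bool]], Scorer],
    relabel: Callable[[Hashable], bool],
    threshold: float = 0.5,
    max_rounds: int = 10,
) -> Tuple[Scorer, Dict[Hashable, bool]]:
    """Repeat training (S105) and data fusion (S106) until the model agrees
    with the re-labeling on the rule-derived correct set (S107)."""
    labeled = dict(labeled)            # do not mutate the caller's data
    rule_correct = list(rule_correct)
    model = train(labeled)             # S105: initial training
    for _ in range(max_rounds):
        # S106: score the rule-derived set and collect low-scoring samples.
        flagged = [x for x in rule_correct if model(x) < threshold]
        # Re-label each flagged sample: True = correct, False = erroneous.
        verdicts = {x: relabel(x) for x in flagged}
        labeled.update(verdicts)
        # S107: stop once every flagged sample is confirmed erroneous.
        if not any(verdicts.values()):
            break
        model = train(labeled)         # retrain on the supplemented set
    return model, labeled
```

In practice the stopping test would more likely compare an agreement rate against a preset tolerance, as the phrase "basically" in the text suggests; the strict version is used here only to keep the sketch short.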
Model identification step S108: the unlabeled feature data related to the medicine are identified by the final judgment model.
Specifically, the final judgment model scores the unlabeled data; data whose score is higher than the threshold are regarded as correct drug diagnosis data.
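The identification step amounts to thresholding model scores. A short sketch follows, in which the function name `identify_correct`, the scoring callable and the default threshold of 0.5 are illustrative assumptions:

```python
from typing import Callable, Hashable, Iterable, List


def identify_correct(
    score: Callable[[Hashable], float],
    unlabeled: Iterable[Hashable],
    threshold: float = 0.5,
) -> List[Hashable]:
    """Return the unlabeled samples whose score is strictly above the threshold."""
    return [x for x in unlabeled if score(x) > threshold]
```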
According to the method, features related to drugs and diagnoses are mined from prescription big data, and a complete set of precise judgment logic and judgment conditions is designed from these features. The correct drug diagnosis data obtained from the judgment logic and judgment conditions are used as verification samples and supplementary training samples for the algorithm model, and results verified as incorrect are re-labeled and then added to the training set. This ensures that the training samples learned by the algorithm model are sufficiently accurate and comprehensive, so that the final algorithm model reaches the required identification accuracy.
Referring to fig. 4, another embodiment of the present invention further provides a data mining system 200 for drug knowledge graph, which includes a data acquisition module 201, a feature extraction module 202, a rule judgment module 203, a model training module 204, a data fusion module 205, an iteration module 206, and a model identification module 207. The data mining system 200 of the drug knowledge graph can perform the data mining method in the above-described method embodiment.
Specifically, the data mining system 200 includes:
a data acquisition module 201 configured to acquire structured prescription data from a database, the prescription data including patient information, hospital information, and drug information;
a feature extraction module 202 configured to extract feature data related to a drug in the prescription data;
a rule judging module 203 configured to determine a preset rule set according to the extracted feature data related to the medicine, and determine a correct medicine relationship data set according to the preset rule set;
the model training module 204 is configured to perform partial labeling on the feature data related to the medicine, take the feature data related to the medicine as the feature of a judgment model to be trained, and train to obtain the judgment model according to the result of the partial labeling and the feature of the judgment model to be trained;
the data fusion module 205 is configured to re-judge the data in the correct medicine relation data set by using the judgment model obtained by training, re-label the medicine relation data with the judgment result being wrong, supplement the re-labeled data into part of the labeled characteristic data, and re-train the judgment model;
the iteration module 206 is configured to repeatedly execute the model training step and the data fusion step until the judging result of the judging model on the correct medicine relation data set determined by the preset rule set is the same as the labeling result, so as to obtain a final judging model;
the model identification module 207 is configured to identify the feature data related to the drug, which is not labeled, by using a final judgment model.
It should be noted that the data mining system 200 provided in this embodiment may be used to execute the technical solutions of the method embodiments; its implementation principle and technical effects are similar to those of the method and are not repeated here.
In another embodiment of the present invention, there is provided an electronic device for performing the steps of the above method embodiments, including one or more processors; storage means for storing one or more computer programs; the one or more computer programs, when executed by the one or more processors, cause the one or more processors to implement the method as in the method embodiments.
Referring specifically to fig. 5, a schematic structural diagram of an electronic device 500 is shown. The electronic device 500 in the present embodiment may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), wearable electronic devices, and the like, and fixed terminals such as digital TVs, desktop computers, smart home devices, and the like. The electronic device shown in fig. 5 is only an example and should not be construed as limiting the functionality and scope of use of the embodiments of the present invention.
As shown in fig. 5, the electronic device 500 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 501 that may perform various suitable actions and processes to implement the methods of the embodiments as described herein, according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the electronic apparatus 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
In general, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 507 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 508 including, for example, magnetic tape, hard disk, etc.; and communication means 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 shows an electronic device 500 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer means may alternatively be implemented or provided.
The foregoing description is only of the preferred embodiments of the invention. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to in the present invention is not limited to the specific combinations of technical features described above, but also covers other technical solutions formed by any combination of the technical features described above or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present invention.
Claims (8)
1. The data mining method of the medicine knowledge graph is characterized by comprising the following steps of:
a data acquisition step of acquiring structured prescription data from a database, the prescription data including patient information, hospital information, and drug information;
a feature extraction step of extracting feature data related to medicines and diagnosis from the prescription data; calculating confidence scores for the drugs:
[formulas not reproduced in the source text]
where h is the total confidence of the drug, S is the total number of prescriptions for the drug, the per-prescription term (symbol not reproduced) is the confidence score for the drug in each prescription, β represents the weight attenuation value, and n represents the number of drugs;
calculating a prescription confidence score i according to the difference between the occurrence times b of each medicine in all prescriptions and the confidence h of the medicine;
calculating a prescription quality score rate p according to the ratio of the prescription confidence score i to the frequency b of each medicine in all prescriptions;
calculating the probability c of each diagnosis in all prescriptions according to the ratio of the times a of each diagnosis in all prescriptions to the number z of all prescriptions;
calculating the probability d of each medicine in all prescriptions according to the ratio of the times b of each medicine in all prescriptions to the number z of all prescriptions;
calculating the ratio e of the probability d of each drug occurrence to the probability c of each diagnosis occurrence;
a model training step, namely extracting part of the characteristic data related to the medicines and the diagnosis for marking, taking the characteristic data related to the medicines and the diagnosis as the characteristics of a judgment model to be trained, and training to obtain the judgment model according to the result of the part marking and the characteristics of the judgment model to be trained;
a rule judging step of determining a preset rule set according to the extracted characteristic data related to the medicine and diagnosis, and determining a correct medicine diagnosis data set according to the preset rule set; the preset rule set is a logic judgment for determining a correct drug diagnosis data set;
a data fusion step, namely re-judging the data in the correct medicine diagnosis data set by using the judgment model obtained by training, manually marking the medicine diagnosis data with the judgment result of error again, marking the correct or error, supplementing the data after the manual marking again into part of marked characteristic data, and re-training the judgment model;
an iteration step, wherein the model training step and the data fusion step are repeatedly executed until the judging result and the labeling result of the judging model on the correct medicine diagnosis data set determined by the preset rule set are within a preset error, so as to obtain a final judging model;
and a model identification step, namely identifying the characteristic data related to the medicines and the diagnosis which are not marked by using a final judgment model.
2. The method for mining data of a drug knowledge graph according to claim 1, wherein the step of re-judging the correct drug diagnosis data set by using the trained judgment model and re-manually labeling the drug diagnosis data with the incorrect judgment result comprises the following steps:
and scoring the correct medicine diagnosis data set determined by the preset rule set by using the judgment model obtained through training, and manually marking the correct medicine diagnosis data set corresponding to the score smaller than the preset threshold value again.
3. The method for mining data of a drug knowledge graph according to claim 1, wherein the feature extraction step further comprises:
calculating the probability g of the simultaneous occurrence of the medicine and the diagnosis according to the ratio of the number f of times the medicine and the diagnosis occur simultaneously in prescriptions to the number a of times each diagnosis occurs in all prescriptions;
counting the number j of different hospitals when each medicine and diagnosis appear in different hospital prescriptions;
counting the simultaneous occurrence number of medicines, diagnoses and hospital names, and counting the maximum occurrence number k of the hospital names in each medicine and diagnosis;
and calculating the ratio L of the maximum times of the hospital names in each medicine and diagnosis according to the ratio of the maximum times k of the hospital names in each medicine and diagnosis to the times f of the simultaneous occurrence of the medicine and the diagnosis in the prescription.
4. The method for mining data of a drug knowledge graph according to claim 1, wherein the feature extraction step further comprises:
counting the number m of different departments when each medicine and diagnosis are in different department prescriptions;
counting the simultaneous occurrence number of the medicines, the diagnoses and the department names, and counting the maximum occurrence number n of the department names in each medicine and the diagnoses;
and calculating the proportion o of the maximum number of occurrences of a department name in each medicine and diagnosis according to the ratio of the maximum number n of occurrences of a department name in each medicine and diagnosis to the number f of times the medicine and the diagnosis occur simultaneously in prescriptions.
5. The data mining method of a drug knowledge graph according to any one of claims 1-4, wherein:
and setting a preset rule set according to the extracted characteristic data related to the medicine and diagnosis, and obtaining a preliminary correct medicine diagnosis data set and a preliminary incorrect medicine diagnosis data set according to the preset rule set.
6. The data mining method of a drug knowledge graph according to claim 5, wherein:
and taking the data set which belongs to the preliminary correct medicine diagnosis data set and does not belong to the preliminary incorrect medicine diagnosis data set as the correct medicine diagnosis data set determined according to the preset rule set.
7. A data mining system for a drug knowledge graph, comprising:
a data acquisition module configured to acquire structured prescription data from a database, the prescription data including patient information, hospital information, and drug information;
a feature extraction module configured to extract feature data related to a drug and a diagnosis in the prescription data; calculating confidence scores for the drugs:
[formulas not reproduced in the source text]
where h is the total confidence of the drug, S is the total number of prescriptions for the drug, the per-prescription term (symbol not reproduced) is the confidence score for the drug in each prescription, β represents the weight attenuation value, and n represents the number of drugs;
calculating a prescription confidence score i according to the difference between the occurrence times b of each medicine in all prescriptions and the confidence h of the medicine;
calculating a prescription quality score rate p according to the ratio of the prescription confidence score i to the frequency b of each medicine in all prescriptions;
calculating the probability c of each diagnosis in all prescriptions according to the ratio of the times a of each diagnosis in all prescriptions to the number z of all prescriptions;
calculating the probability d of each medicine in all prescriptions according to the ratio of the times b of each medicine in all prescriptions to the number z of all prescriptions;
calculating the ratio e of the probability d of each drug occurrence to the probability c of each diagnosis occurrence;
the model training module is configured to extract part of the characteristic data related to the medicines and the diagnosis for marking, take the characteristic data related to the medicines and the diagnosis as the characteristics of a judgment model to be trained, and train to obtain the judgment model according to the result of the part marking and the characteristics of the judgment model to be trained;
a rule judging module configured to determine a preset rule set according to the extracted characteristic data related to the medicine and the diagnosis, and determine a correct medicine diagnosis data set according to the preset rule set; the preset rule set is a logic judgment for determining a correct drug diagnosis data set;
the data fusion module is configured to re-judge the data in the correct medicine diagnosis data set by using the judgment model obtained by training, re-label the medicine diagnosis data with the judgment result of error manually, label the correct or error, supplement the data with the re-manually labeled to the part of the labeled characteristic data, and re-train the judgment model;
the iteration module is configured to repeatedly execute the model training step and the data fusion step until the judgment result of the judgment model on the correct medicine diagnosis data set determined by the preset rule set is the same as the labeling result, so as to obtain a final judgment model;
and the model identification module is configured to identify the characteristic data related to the medicine and the diagnosis which are not marked by the final judgment model.
8. An electronic device, comprising:
one or more processors;
storage means for storing one or more computer programs;
the one or more computer programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211670808.1A CN116486939B (en) | 2022-12-26 | 2022-12-26 | Data mining method and system for medicine knowledge graph and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116486939A CN116486939A (en) | 2023-07-25 |
CN116486939B true CN116486939B (en) | 2024-01-23 |
Family
ID=87222015
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211670808.1A Active CN116486939B (en) | 2022-12-26 | 2022-12-26 | Data mining method and system for medicine knowledge graph and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116486939B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111191020A (en) * | 2019-12-27 | 2020-05-22 | 江苏省人民医院(南京医科大学第一附属医院) | Prescription recommendation method and system based on machine learning and knowledge graph |
CN111221979A (en) * | 2019-12-31 | 2020-06-02 | 北京左医健康技术有限公司 | Medicine knowledge graph construction method and system |
CN111445976A (en) * | 2020-03-24 | 2020-07-24 | 屹嘉智创(厦门)科技有限公司 | Intelligent rational medication system and method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3223177A1 (en) * | 2016-03-24 | 2017-09-27 | Fujitsu Limited | System and method to aid diagnosis of a patient |
Also Published As
Publication number | Publication date |
---|---|
CN116486939A (en) | 2023-07-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||