CN117473104A

CN117473104A - Knowledge graph construction method based on chronic disease management

Info

Publication number: CN117473104A
Application number: CN202311651846.7A
Authority: CN
Inventors: 马晓媛; 买名洋; 朱艳春; 高伟; 罗阳星; 韦佳威; 魏雪敏; 宋家炯; 黄欣欣; 徐甜甜; 荣艳
Original assignee: China Unicom Guangdong Industrial Internet Co Ltd
Current assignee: China Unicom Guangdong Industrial Internet Co Ltd
Priority date: 2023-12-04
Filing date: 2023-12-04
Publication date: 2024-01-30

Abstract

The invention provides a knowledge graph construction method based on chronic disease management, comprising the following steps: collecting and organizing data related to chronic diseases to obtain chronic disease data; constructing an a priori entity word set and an a priori relation word set related to chronic disease management; performing entity extraction on the chronic disease data to obtain entity matrix data; mining the causal structure of the entity matrix data to obtain a causal graph structure reflecting the causal relationships among the entities; and constructing a knowledge graph using the causal graph structure. Because causal relationships are integrated into the knowledge graph, the causal relationships of chronic diseases are easier to find during chronic disease management and the relationships involved can be described more accurately, which facilitates the management of chronic diseases. This solves the problem that existing knowledge graphs for chronic disease management do not adequately reflect the causal relationships among entities, which limits their usefulness in chronic disease management.

Description

Knowledge graph construction method based on chronic disease management

Technical Field

The invention relates to the technical field of medicine and data processing, in particular to a knowledge graph construction method based on chronic disease management.

Background

Chronic disease is a general term for non-infectious diseases in which long-term accumulation leads to morphological damage. Common chronic diseases mainly include cardiovascular and cerebrovascular diseases, cancers, diabetes mellitus, and chronic respiratory diseases. Chronic diseases mainly damage important organs such as the brain, heart, and kidneys, readily cause disability, impair working capacity and quality of life, and incur extremely high medical costs, increasing the economic burden on society and families.

At present, with the continuous improvement of medical care, chronic diseases are usually managed through a combination of active prevention and passive treatment. In the prevention and management of chronic diseases, the basic ideas and theoretical foundations of traditional Chinese medicine and Western medicine conflict little, and their effects can complement each other, so chronic diseases are usually managed by combining traditional Chinese and Western medicine.

Although both Chinese and Western medicine in China currently have ICD codes, medical institutions such as hospitals have accumulated a certain amount of medical data, and some institutions have even built their own medical information systems, the knowledge graphs in existing medical information systems do not adequately reflect the causal relationships among entities. Causal relationships are highly valuable in chronic disease management, because they allow medical staff to provide patients with more accurate health management plans. Therefore, the existing knowledge graph architecture cannot fully serve its purpose in the face of growing chronic disease management demands.

Disclosure of Invention

The invention aims to provide a knowledge graph construction method based on chronic disease management, so as to solve the problem that existing knowledge graphs for chronic disease management do not adequately reflect the causal relationships among entities, which limits their usefulness in chronic disease management.

In order to solve the technical problems, the invention provides a knowledge graph construction method based on chronic disease management, which comprises the following steps:

collecting and sorting data related to chronic diseases to obtain chronic disease data;

constructing a priori entity word set and a priori relation word set related to chronic disease management;

extracting the entity of the chronic disease data to obtain entity matrix data;

mining a causal structure of the entity matrix data to obtain a causal graph structure reflecting causal relations among the entities;

and constructing a knowledge graph by utilizing a causal graph structure.

Optionally, in the knowledge graph construction method based on chronic disease management, the method for collecting and sorting data related to chronic disease to obtain chronic disease data includes:

obtaining authorized chronic disease medical data from a medical institution, the chronic disease medical data including disease characteristics of the chronic disease, complications, treatment regimens, therapeutic drugs, and diagnosis and treatment records of patients;

obtaining chronic disease public data published on the Internet, wherein the chronic disease public data comprises medical inquiry records, medical inquiry response records and medical books;

and performing quality control on the acquired chronic disease medical data and chronic disease public data to obtain chronic disease data.

Optionally, in the method for constructing a knowledge graph based on chronic disease management, the method for quality controlling the obtained chronic disease medical data and the obtained chronic disease public data to obtain the chronic disease data includes:

removing data which are irrelevant to chronic diseases from the acquired chronic disease medical data and chronic disease public data;

eliminating data which are obviously out of compliance with medical common sense from the acquired chronic disease medical data and chronic disease public data;

and confirming and organizing the chronic disease medical data and chronic disease public data that remain after removal, with reference to professional books and professionals, to obtain the chronic disease data.

Optionally, in the knowledge graph construction method based on chronic disease management, the method for constructing a priori entity word set and a priori relationship word set related to chronic disease management includes:

obtaining a medical term standard library;

text word segmentation and part-of-speech recognition are carried out on text contents in a medical term standard library so as to extract keywords;

screening nouns and verbs from the keywords, and taking the nouns as candidate entity words and the verbs as candidate relationship words;

screening and classifying the candidate entity words and the candidate relation words to obtain a priori entity word set and a priori relation word set.

Optionally, in the method for constructing a knowledge graph based on chronic disease management, the method for extracting entities from chronic disease data to obtain entity matrix data includes:

preprocessing chronic disease data;

performing a priori entity extraction on the preprocessed chronic disease data by text matching;

manually labeling a portion of high-quality samples and fine-tuning a BiLSTM-CRF model to perform model entity extraction;

and combining the prior entity extraction result and the model entity extraction result to obtain entity matrix data.

Optionally, in the method for constructing a knowledge graph based on chronic disease management, the method for preprocessing chronic disease data includes:

and removing stray URLs and unusual symbols and characters from the chronic disease data to obtain the preprocessed chronic disease data.

Optionally, in the method for constructing a knowledge graph based on chronic disease management, the method for performing model entity extraction by manually labeling a portion of high-quality samples and fine-tuning the BiLSTM-CRF model includes:

fine-tuning a BERT model in an unsupervised manner to obtain a vector for each word of the text;

manually labeling part of high-quality entity extraction task samples, wherein the manually labeled entity extraction task samples comprise chronic disease medical data and chronic disease public data;

and performing model entity extraction on all text data by using the trained BiLSTM-CRF model.

Optionally, in the method for constructing a knowledge graph based on chronic disease management, the method for mining a causal structure of entity matrix data to obtain a causal graph structure reflecting causal relationships between entities includes:

and mining a causal structure of the entity matrix data by using a causal discovery PC algorithm to obtain a causal graph structure reflecting causal relations among the entities.

Optionally, in the method for constructing a knowledge graph based on chronic disease management, the method for constructing a knowledge graph by using a causal graph structure includes:

extracting a causal event set of a chronic disease entity with obvious causal relation from a causal graph structure;

extracting relationships between entities from the entity matrix data;

and constructing a knowledge graph according to the relation between the chronic disease entity causal event set and the entity.

Optionally, in the method for constructing a knowledge graph based on chronic disease management, the method for constructing a knowledge graph according to a relation between a causal event set of a chronic disease entity and the entity includes:

constructing a basic knowledge graph according to the relation between the entities;

and taking the causal event set of the chronic disease entity as a calibration set, and finely adjusting a DNN relation classification model to fuse the causal event set of the chronic disease entity into a basic knowledge graph so as to obtain a final knowledge graph.

Drawings

Fig. 1 is a flowchart of a knowledge graph construction method based on chronic disease management provided in this embodiment;

Fig. 2 is an implementation block diagram of a knowledge graph construction method based on chronic disease management provided in this embodiment;

Fig. 3 is a schematic structural diagram of an LSTM model provided in this embodiment;

Fig. 4 is a schematic diagram of the causal discovery PC algorithm provided in this embodiment;

Fig. 5 is a (partial) visual display effect diagram of the knowledge graph provided in this embodiment.

Detailed Description

The knowledge graph construction method based on chronic disease management provided by the invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be noted that the drawings are in a highly simplified form and are not drawn to precise scale; they serve only to describe the embodiments of the invention conveniently and clearly. Furthermore, the structures shown in the drawings are often only part of the actual structures. In particular, the drawings place emphasis on different aspects in order to illustrate the various embodiments.

It is noted that "first", "second", etc. in the description and claims of the present invention and the accompanying drawings are used to distinguish similar objects when describing embodiments of the present invention, and not to describe a specific order or sequence. It should be understood that the structures so used may be interchanged under appropriate circumstances. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The embodiment provides a knowledge graph construction method based on chronic disease management, as shown in fig. 1, including:

S1, collecting and organizing data related to chronic diseases to obtain chronic disease data;

S2, constructing an a priori entity word set and an a priori relation word set related to chronic disease management;

S3, performing entity extraction on the chronic disease data to obtain entity matrix data;

S4, mining the causal structure of the entity matrix data to obtain a causal graph structure reflecting causal relations among the entities;

S5, constructing a knowledge graph using the causal graph structure.

In the knowledge graph construction method based on chronic disease management provided by this embodiment, causal relationships are integrated into the knowledge graph, so that the causal relationships of chronic diseases are easier to find during chronic disease management and the relationships involved can be described more accurately, which facilitates the management of chronic diseases. This solves the problem that existing knowledge graphs for chronic disease management do not adequately reflect the causal relationships among entities, which limits their usefulness in chronic disease management.

It should be noted that, in the knowledge graph construction method provided in the embodiment, the order among the steps may be adjusted according to the actual situation, or other steps may be added between the steps to optimize the effect. Other steps and order of steps may be added without departing from the spirit of the present application.

In a specific embodiment, as shown in fig. 2, the method for constructing a knowledge graph based on chronic disease management provided in this embodiment generally includes: collecting chronic disease related data from clinical guidelines, medical documents, health books, network resources and the like, and obtaining chronic disease data by auditing the collected data; then obtaining text labels and vectors by means of manual knowledge arrangement and manual label determination or means of network health knowledge and keyword extraction; finally, the text labels and vectors are processed by using a program to obtain a final knowledge graph, and the knowledge graph can be stored in a database.

Further, in the present embodiment, step S1, the method for collecting and sorting the data related to chronic diseases to obtain chronic disease data includes:

S11, acquiring authorized chronic disease medical data from medical institutions, wherein the chronic disease medical data includes, but is not limited to, disease characteristics, complications, treatment regimens, therapeutic drugs and diagnosis and treatment records of patients.

Specifically, in this embodiment, chronic disease medical data may be obtained through an API interface from medical institutions such as hospitals, outpatient clinics and disease control departments. The chronic disease medical data obtained includes both traditional Chinese medicine data and Western medicine data, so that traditional Chinese and Western medicine can be combined to provide more effective guidance for chronic disease management.

S12, acquiring chronic disease public data published on the Internet, wherein the chronic disease public data includes, but is not limited to, medical inquiry records, medical inquiry response records and medical books.

Specifically, in this embodiment, the chronic disease public data published on the Internet may be obtained by a web crawler.

And S13, performing quality control on the acquired chronic disease medical data and chronic disease public data to obtain chronic disease data.

Specifically, in this embodiment, the method for quality control of the acquired chronic disease medical data and chronic disease public data includes: removing data irrelevant to chronic diseases from the acquired chronic disease medical data and chronic disease public data; removing data that obviously contradicts medical common sense from the acquired chronic disease medical data and chronic disease public data; and confirming and organizing the chronic disease medical data and chronic disease public data that remain after removal, with reference to professional books and professionals, to obtain the chronic disease data.

In practical application, the removal of data irrelevant to chronic diseases and of data that contradicts medical common sense may be performed manually or with the aid of a computer-based intelligent system. The confirmation and organization of the remaining chronic disease medical data and chronic disease public data is generally done manually to improve accuracy; at the same time, minor errors (such as slips of the pen, nonstandard expressions and miswritten characters) can be corrected by hand.

Further, in this embodiment, step S2, the method for constructing the prior entity word set and the prior relationship word set related to chronic disease management includes:

S21, obtaining a medical term standard library.

Specifically, in this embodiment, the medical term standard library may be drawn from representative medical terminology standards at home and abroad, such as the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT), the OMAHA "tangram" medical term set, the Unified Medical Language System (UMLS), and the Chinese Unified Medical Language System (CUMLS).

S22, text word segmentation and part-of-speech recognition are carried out on the text content in the medical term standard library so as to extract keywords.

Specifically, text word segmentation and part-of-speech recognition can be performed manually or by using a natural language processing model. Of course, the text word segmentation and part of speech recognition can be performed by using the natural language processing model, and then the processing result of the natural language processing model is confirmed by using manpower, so that the accuracy of the text word segmentation and part of speech recognition is improved. Methods for text segmentation and part-of-speech recognition using natural language processing models are well known to those skilled in the art and are not described in detail herein.

S23, nouns and verbs are screened from the keywords, and the nouns are used as candidate entity words and the verbs are used as candidate relation words.

Likewise, this step may be performed manually or using a natural language processing model. Of course, the keyword screening can also be performed first by the natural language processing model and its results then confirmed manually, so as to improve the accuracy of the keyword screening.

S24, screening and classifying the candidate entity words and the candidate relation words to obtain an a priori entity word set and an a priori relation word set.

Preferably, this step is performed by manual screening: experts with medical knowledge screen and classify the candidate entity words and candidate relation words, remove unreasonable candidate words, determine whether each candidate word belongs to the entity word set or the relation word set, and classify the resulting words by category, thereby obtaining the a priori entity word set and the a priori relation word set.
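The noun and verb screening described above can be illustrated with a minimal sketch. It assumes the terminology text has already been segmented and part-of-speech tagged by some tool; the tag prefixes `n` and `v`, the sample tokens, and the function name `split_candidates` are illustrative assumptions and are not specified by the embodiment.

```python
def split_candidates(tagged_tokens):
    """Split (word, pos) pairs into candidate entity words (nouns) and relation words (verbs)."""
    entity_words, relation_words = [], []
    for word, pos in tagged_tokens:
        if pos.startswith("n") and word not in entity_words:
            entity_words.append(word)      # nouns become candidate entity words
        elif pos.startswith("v") and word not in relation_words:
            relation_words.append(word)    # verbs become candidate relation words
    return entity_words, relation_words


# Hypothetical tagger output; a real pipeline would obtain this from an NLP model.
tagged = [("diabetes", "n"), ("causes", "v"), ("retinopathy", "n"),
          ("often", "d"), ("causes", "v"), ("insulin", "n")]
entities, relations = split_candidates(tagged)
print(entities)   # ['diabetes', 'retinopathy', 'insulin']
print(relations)  # ['causes']
```

The candidate lists produced this way would still go through the expert screening and classification described above before becoming the a priori word sets.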

Further, in the embodiment, step S3, the method for extracting the entity from the chronic disease data to obtain the entity matrix data includes:

S31, preprocessing chronic disease data.

Specifically, in this embodiment, the method for preprocessing chronic disease data includes: removing stray URLs and unusual symbols and characters from the chronic disease data to obtain the preprocessed chronic disease data. Preprocessing the chronic disease data further improves the efficiency and accuracy of subsequent data processing.
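As a minimal sketch of this kind of cleanup, the following uses regular expressions to strip URLs and unusual symbols. The exact patterns, the set of punctuation that is kept, and the function name `preprocess` are assumptions chosen for illustration; the embodiment does not define them.

```python
import re

# Anything that looks like a web address.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
# Keep word characters (including CJK), whitespace and common punctuation; drop the rest.
UNUSUAL_PATTERN = re.compile(r"[^\w\s.,;:!?%()/\-，。；：！？、（）]")


def preprocess(text):
    """Remove URLs and unusual symbols, then collapse repeated whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = UNUSUAL_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


print(preprocess("Type 2 diabetes ★★ see https://example.com/a?b=1 for diet advice ◆"))
# Type 2 diabetes see for diet advice
```

Which characters count as "unusual" depends on the corpus, so the character class would normally be tuned against real samples.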

S32, performing a priori entity extraction on the preprocessed chronic disease data by text matching.
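A minimal sketch of text matching against the a priori entity word set is given below. It assumes plain substring matching with a longest-match-first rule, which is one common choice and not necessarily the one used in the embodiment; the function name `match_prior_entities` and the sample words are hypothetical.

```python
def match_prior_entities(text, entity_words):
    """Return (start, end, word) for each a priori entity word found, preferring longer words."""
    # Longest words first, so "type 2 diabetes" wins over "diabetes" at the same position.
    words = sorted({w for w in entity_words if w}, key=len, reverse=True)
    matches, i = [], 0
    while i < len(text):
        for word in words:
            if text.startswith(word, i):
                matches.append((i, i + len(word), word))
                i += len(word)  # skip past the match so entities do not overlap
                break
        else:
            i += 1  # no entity word starts here
    return matches


text = "type 2 diabetes may lead to diabetic nephropathy"
prior_words = ["diabetes", "type 2 diabetes", "diabetic nephropathy"]
for start, end, word in match_prior_entities(text, prior_words):
    print(start, end, word)
```

For large word sets, a trie or an Aho-Corasick automaton would be the usual replacement for the inner loop.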

S33, manually labeling a portion of high-quality samples and fine-tuning the BiLSTM-CRF model to perform model entity extraction.

Specifically, in this embodiment, the method for performing model entity extraction by manually labeling a portion of high-quality samples and fine-tuning the BiLSTM-CRF model includes: first, fine-tuning the BERT model in an unsupervised manner to obtain a vector for each word of the text; then, manually labeling a portion of high-quality entity extraction task samples, wherein the manually labeled entity extraction task samples comprise chronic disease medical data and chronic disease public data; and finally, performing model entity extraction on all text data by using the trained BiLSTM-CRF model.

In this embodiment, the BERT model is fine-tuned for 10 epochs. The manual labeling scheme is "BIO", and the number of labeled samples is 5000. The BiLSTM-CRF model is an existing conventional BiLSTM-CRF model. Of course, in practical application, the number of training epochs, the labeling scheme and the number of labeled samples can be set according to actual needs. Model training itself is well known to those skilled in the art and is not described in detail in this application.

Conventional Cox regression analysis has difficulty with irregular follow-up data. For example, some patients have 7 follow-up blood glucose records while others have only 3, and such unevenly shaped data is not conducive to analysis. Therefore, this embodiment uses the BiLSTM-CRF model for analysis, where the LSTM model structure is shown in Fig. 3. The main body of the BiLSTM-CRF model consists of a bidirectional long short-term memory network and a conditional random field; the model input is character features, and the output is a predicted label for each character. By adopting the BiLSTM-CRF model, this embodiment can distinguish different populations for testing, can present a prediction model that fully predicts individual risk, and can analyze the interactions of important predictive factors.
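Since the model outputs one label per character under the "BIO" scheme, the predicted labels must be decoded into entity spans. The following is a minimal sketch of such decoding; the entity type `DIS` and the function name `decode_bio` are illustrative assumptions, and the tags are written by hand rather than produced by a trained model.

```python
def decode_bio(chars, tags):
    """Turn per-character BIO tags into (entity_text, entity_type) pairs."""
    entities, current, current_type = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if current:  # close the entity that was open
                entities.append(("".join(current), current_type))
            current, current_type = [ch], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(ch)  # continue the open entity
        else:
            if current:  # an "O" tag or an inconsistent "I-" tag ends the entity
                entities.append(("".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append(("".join(current), current_type))
    return entities


# Hand-written tags for the sentence "diabetes causes kidney disease" in Chinese.
chars = list("糖尿病引起肾病")
tags = ["B-DIS", "I-DIS", "I-DIS", "O", "O", "B-DIS", "I-DIS"]
print(decode_bio(chars, tags))  # [('糖尿病', 'DIS'), ('肾病', 'DIS')]
```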

S34, combining the prior entity extraction result and the model entity extraction result to obtain entity matrix data.
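The embodiment does not spell out the layout of the entity matrix data. As one plausible reading, the sketch below assumes one row per text record and one column per entity, with 1 marking that the entity was found in that record by either extractor. The function name `build_entity_matrix` and the sample records are hypothetical.

```python
def build_entity_matrix(prior_results, model_results):
    """Union the two extraction results per record and encode them as a 0/1 matrix."""
    # One set of entity names per record, combining both extraction routes.
    merged = [set(p) | set(m) for p, m in zip(prior_results, model_results)]
    columns = sorted(set().union(*merged))
    matrix = [[1 if name in found else 0 for name in columns] for found in merged]
    return columns, matrix


prior = [["diabetes"], ["hypertension"]]                # text-matching results per record
model = [["diabetes", "retinopathy"], ["stroke"]]       # model extraction results per record
columns, matrix = build_entity_matrix(prior, model)
print(columns)  # ['diabetes', 'hypertension', 'retinopathy', 'stroke']
print(matrix)   # [[1, 0, 1, 0], [0, 1, 0, 1]]
```

A matrix of this shape is convenient because each column can then be treated as a variable by a causal discovery algorithm.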

Still further, in this embodiment, step S4, the method for mining a causal structure of entity matrix data to obtain a causal graph structure reflecting causal relationships between entities includes:

mining the causal structure of the entity matrix data by using a causal discovery PC algorithm to obtain a causal graph structure reflecting causal relations among the entities.

The causal discovery PC algorithm is a basic causal learning algorithm. By performing multiple rounds of iterative fitting on the observed data, it identifies the variables in the observed data that have causal relationships and presents them in the data structure of a directed graph. As shown in Fig. 4, each edge in the causal graph structure, together with the entities at its two ends, forms a candidate causal event triplet <entity 1, causal relationship, entity 2>, where entity 1 is the tail of the directed edge and entity 2 is the head of the directed edge (the end indicated by the arrow).
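To give a feel for how the PC algorithm prunes edges, the following is a minimal sketch of its skeleton (edge removal) phase only; edge orientation is omitted. It assumes continuous-valued columns without perfectly collinear pairs and uses a Fisher z-test on partial correlations as the conditional independence test. Both are assumptions made for illustration: the embodiment does not state which independence test is used, and binary entity data would normally call for a different test. All function names are hypothetical.

```python
import itertools
import math


def corr(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def partial_corr(data, i, j, cond):
    """Partial correlation of columns i and j given the columns in cond (recursive formula)."""
    if not cond:
        return corr(data[i], data[j])
    k, rest = cond[0], cond[1:]
    r_ij = partial_corr(data, i, j, rest)
    r_ik = partial_corr(data, i, k, rest)
    r_jk = partial_corr(data, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))


def independent(data, i, j, cond, alpha=0.05):
    """Fisher z-test: True when i and j look conditionally independent given cond."""
    n = len(data[i])
    r = max(min(partial_corr(data, i, j, cond), 0.999999), -0.999999)
    z = abs(math.atanh(r)) * math.sqrt(n - len(cond) - 3)
    p_value = math.erfc(z / math.sqrt(2))  # two-sided p-value under N(0, 1)
    return p_value > alpha


def pc_skeleton(data, names, alpha=0.05):
    """Start from a complete graph and delete every edge that tests as independent."""
    nodes = range(len(names))
    adj = {i: set(nodes) - {i} for i in nodes}
    size = 0  # size of the conditioning set, grown round by round
    while any(len(adj[i]) - 1 >= size for i in nodes):
        for i in nodes:
            for j in list(adj[i]):
                others = sorted(adj[i] - {j})
                for cond in itertools.combinations(others, size):
                    if independent(data, i, j, cond, alpha):
                        adj[i].discard(j)
                        adj[j].discard(i)
                        break
        size += 1
    return {frozenset((names[i], names[j])) for i in nodes for j in adj[i]}


# Toy chain X -> Y -> Z built from a balanced design, so that the sample
# partial correlation of X and Z given Y is zero up to rounding.
combos = list(itertools.product((-1.0, 1.0), repeat=3)) * 10
x = [a for a, _, _ in combos]
y = [a + b for a, b, _ in combos]
z = [a + b + c for a, b, c in combos]
skeleton = pc_skeleton([x, y, z], ["X", "Y", "Z"])
print(sorted(sorted(edge) for edge in skeleton))  # [['X', 'Y'], ['Y', 'Z']]
```

On the toy chain, the X and Z edge survives the unconditional round but is removed once Y is conditioned on, which is the behavior that lets PC separate direct from indirect dependence. A full implementation would then orient the remaining edges to obtain the directed graph from which candidate triplets are read.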

After obtaining the causal graph structure, in this embodiment, step S5, the method for constructing a knowledge graph by using the causal graph structure includes:

S51, extracting a chronic disease entity causal event set with significant causal relationships from the causal graph structure.

Specifically, in this embodiment, for each candidate causal event triplet, causal identification and do-calculus causal effect estimation are performed to calculate the average causal effect, and the magnitude of the average causal effect is taken as the confidence. Finally, only causal event triplets with a confidence greater than 0.05 are retained, forming the chronic disease entity causal event set with significant causal relationships.

In this embodiment, the causal event triples with a confidence level greater than 0.05 are considered as causal event triples with significant causal relationships. Of course, in other embodiments, different confidence thresholds may also be set to quantify the significance of the causal relationship.
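The thresholding on the average causal effect can be sketched as follows. The sketch assumes binary cause and effect variables and estimates the effect by backdoor adjustment over a given adjustment set, which is one standard estimator that do-calculus can justify; the embodiment does not say which estimator it uses. The records, variable names and function names are toy assumptions.

```python
from collections import defaultdict


def average_causal_effect(records, cause, effect, adjust):
    """Backdoor-adjusted effect of a binary cause on a binary effect, stratified on `adjust`."""
    strata = defaultdict(lambda: {0: [], 1: []})
    for rec in records:
        key = tuple(rec[name] for name in adjust)
        strata[key][rec[cause]].append(rec[effect])
    total, weighted = 0, 0.0
    for arms in strata.values():
        if not arms[0] or not arms[1]:
            continue  # stratum lacks one arm, so no contrast can be formed
        size = len(arms[0]) + len(arms[1])
        diff = sum(arms[1]) / len(arms[1]) - sum(arms[0]) / len(arms[0])
        weighted += size * diff  # weight each stratum by its share of the records
        total += size
    return weighted / total if total else 0.0


def keep_significant(candidates, records, threshold=0.05):
    """Keep candidate triplets whose effect magnitude (the confidence) exceeds the threshold."""
    kept = []
    for cause, effect, adjust in candidates:
        confidence = abs(average_causal_effect(records, cause, effect, adjust))
        if confidence > threshold:
            kept.append((cause, "causes", effect, confidence))
    return kept


# Toy records: (obesity, diabetes, elderly) as 0/1 flags.
rows = [(1, 1, 0), (1, 0, 0), (0, 0, 0), (0, 0, 0),
        (1, 1, 1), (1, 1, 1), (0, 1, 1), (0, 0, 1)]
records = [{"obesity": o, "diabetes": d, "elderly": e} for o, d, e in rows]
print(keep_significant([("obesity", "diabetes", ["elderly"])], records))
# [('obesity', 'causes', 'diabetes', 0.5)]
```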

S52, extracting the relation among the entities from the entity matrix data.

Specifically, in this embodiment, the relationships between entities are extracted by training a DNN entity relationship classification model, as follows. First, on the basis of the entity matrix data, high-quality relation extraction classification task samples are manually labeled (in this embodiment, 5000 such samples are manually labeled); the selection criteria for the sample data are the same as those for entity extraction, and the same pieces of data may be used. Then, the input of the DNN model is a graph triplet entity relationship pair <entity 1, relation, entity 2>, where entity 1 and entity 2 are each text vectors, and the relation is either a text vector of the context in which entity 1 and entity 2 appear or a one-hot encoded vector of the relation label R_Label of entity 1 and entity 2; the text vectors are extracted from the BERT model, and the relation labels are those manually labeled in the previous step. Next, when predicting entity relationships, candidate entity pairs <entity 1, entity 2> need to be constructed for the DNN model: two adjacent sentences (sentences are split at the Chinese or English full stop) are taken as one sample, entity pairs are constructed among all entities in the sample, and the relationship of each entity pair is predicted by the trained DNN classification model. Finally, based on the prediction results of the DNN entity relationship classification model, only high-confidence graph triplets are retained; in this embodiment, these are the entity relationship pairs whose relation label is not "unknown" and whose classification probability is greater than 0.9.

Of course, in practical application, the criterion for high confidence may be set according to actual requirements; for example, entity relationship pairs with a classification probability greater than 0.85 may be accepted as high-confidence graph triplets.
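The candidate pair construction and the final filtering can be sketched without the DNN itself. The sketch below assumes entities are located by substring lookup and that pairs are unordered combinations, both of which are simplifications; the prediction tuples are made up for illustration, and the function names are hypothetical.

```python
import itertools
import re


def candidate_pairs(text, entities):
    """Build entity pairs from every window of two adjacent sentences."""
    # Split at the Chinese or English full stop (this naive split also breaks decimals).
    sentences = [s.strip() for s in re.split(r"[。.]", text) if s.strip()]
    pairs = []
    for first, second in zip(sentences, sentences[1:] or [""]):
        window = first + " " + second
        found = [e for e in entities if e in window]
        for pair in itertools.combinations(found, 2):
            if pair not in pairs:
                pairs.append(pair)
    return pairs


def keep_high_confidence(predictions, min_prob=0.9):
    """Keep (e1, relation, e2) where the label is not 'unknown' and probability > min_prob."""
    return [(e1, rel, e2) for e1, rel, e2, prob in predictions
            if rel != "unknown" and prob > min_prob]


text = "Hypertension damages the kidney. Diabetes worsens it. Exercise helps."
print(candidate_pairs(text, ["Hypertension", "kidney", "Diabetes"]))
# [('Hypertension', 'kidney'), ('Hypertension', 'Diabetes'), ('kidney', 'Diabetes')]

# Made-up classifier outputs: (entity 1, label, entity 2, probability).
predicted = [("Hypertension", "damages", "kidney", 0.97),
             ("Hypertension", "unknown", "Diabetes", 0.99),
             ("kidney", "affected_by", "Diabetes", 0.62)]
print(keep_high_confidence(predicted))  # [('Hypertension', 'damages', 'kidney')]
```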

And S53, constructing a knowledge graph according to the relation between the causal event set of the chronic disease entity and the entity.

Specifically, in this embodiment, the method for constructing the knowledge graph includes: first, constructing a basic knowledge graph according to the relationships between entities; then, taking the chronic disease entity causal event set as a calibration set and fine-tuning the trained DNN relation classification model, so as to fuse the chronic disease entity causal event set into the basic knowledge graph and obtain the final knowledge graph. The procedure is as follows: the mined causal event triplet set <entity 1, causal relation, entity 2> is taken as the ground-truth check set group_set; the predicted triplet set <entity 1, predicted relation, entity 2> is taken as the prediction set prediction_set; the relation labels in the check set group_set and the prediction set prediction_set are compared for each causal event, where consistent labels indicate that the DNN model predicted the causal event correctly and inconsistent labels indicate a misprediction, yielding a correctly predicted set and an incorrectly predicted set of the causal event set; the DNN relation classification model is fine-tuned on the correctly and incorrectly predicted sets, where in this embodiment the sample weight of the incorrectly predicted set is set to 2 times that of the correctly predicted set, so that the model pays more attention to the mispredicted samples during fine-tuning; and all triplets of the constructed basic knowledge graph are re-predicted with the fine-tuned DNN model, the causal event triplets are merged into the prediction results, and where the two are inconsistent the causal event result prevails, thereby obtaining the final knowledge graph.
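The bookkeeping around the fine-tuning step, namely splitting causal events into correctly and incorrectly predicted sets, weighting the samples, and merging with the causal result taking precedence, can be sketched as follows. The model training itself is omitted. Representing each triplet set as a dictionary keyed by the entity pair is an assumption made for brevity, and the function names and sample triplets are hypothetical.

```python
def split_and_weight(ground_truth, predicted, error_weight=2.0):
    """Compare predicted labels with the causal triplets and attach fine-tuning sample weights."""
    samples = []
    for (e1, e2), true_rel in ground_truth.items():
        correct = predicted.get((e1, e2)) == true_rel
        # Mispredicted causal events get a larger weight so fine-tuning focuses on them.
        samples.append((e1, true_rel, e2, 1.0 if correct else error_weight))
    return samples


def merge_graph(repredicted, causal):
    """Merge re-predicted triplets with causal triplets; the causal label wins on conflict."""
    merged = dict(repredicted)
    merged.update(causal)
    return [(e1, rel, e2) for (e1, e2), rel in merged.items()]


group_set = {("obesity", "diabetes"): "causes", ("smoking", "copd"): "causes"}
prediction_set = {("obesity", "diabetes"): "associated_with", ("smoking", "copd"): "causes"}
print(split_and_weight(group_set, prediction_set))
# [('obesity', 'causes', 'diabetes', 2.0), ('smoking', 'causes', 'copd', 1.0)]

repredicted = {("obesity", "diabetes"): "associated_with",
               ("diabetes", "retinopathy"): "complication"}
print(merge_graph(repredicted, {("obesity", "diabetes"): "causes"}))
# [('obesity', 'causes', 'diabetes'), ('diabetes', 'complication', 'retinopathy')]
```

The weighted samples would be passed to whatever training routine fine-tunes the DNN, for example as per-sample loss weights.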

In this embodiment, the DNN model used in this step and the DNN model used in step S52 are the same model, which improves model utilization and saves hardware resources. When fine-tuning the DNN model, the weight settings can be adjusted according to actual needs, and the scope of protection of this application is not limited in this respect.

Preferably, the Neo4j graph database may be used to visualize the chronic disease management knowledge graph.
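As a minimal sketch of loading the triplets into Neo4j, the following renders each triplet as a Cypher `MERGE` statement string without connecting to a database. The node label `Entity`, the property `name`, the fallback relationship type `RELATED_TO`, and the function name `to_cypher` are assumptions for illustration; production code would normally pass names as query parameters through a driver instead of building strings.

```python
import re


def to_cypher(triples):
    """Render (entity1, relation, entity2) triplets as Cypher MERGE statements."""
    statements = []
    for e1, rel, e2 in triples:
        # Relationship types cannot be parameterized in Cypher, so sanitize them here.
        rel_type = re.sub(r"\W+", "_", rel).strip("_").upper() or "RELATED_TO"
        # Escape backslashes and single quotes inside the string literals.
        a, b = (name.replace("\\", "\\\\").replace("'", "\\'") for name in (e1, e2))
        statements.append(
            f"MERGE (a:Entity {{name: '{a}'}}) "
            f"MERGE (b:Entity {{name: '{b}'}}) "
            f"MERGE (a)-[:{rel_type}]->(b)"
        )
    return statements


for line in to_cypher([("obesity", "causes", "diabetes")]):
    print(line)
# MERGE (a:Entity {name: 'obesity'}) MERGE (b:Entity {name: 'diabetes'}) MERGE (a)-[:CAUSES]->(b)
```

Using `MERGE` rather than `CREATE` keeps the load idempotent, so re-running it does not duplicate nodes or relationships.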

In the knowledge graph construction method based on chronic disease management provided by this embodiment, the knowledge graph incorporates the causal event relationships among entities, which helps both to further discover the causal relationships of chronic diseases and to prevent and treat them. Causal event mining uses causal inference methods, namely causal discovery and causal effect estimation, so the mining results are more accurate and more interpretable. More accurate relationship descriptions can therefore be achieved in chronic disease management, and the method has a wide range of application, supporting scenarios such as chronic disease management, proactive health, and AI-based question answering.

In this specification, the embodiments are described in a progressive manner, each focusing on its differences from the others; for parts that are the same or similar, the embodiments may be referred to one another.

The knowledge graph construction method based on chronic disease management provided by this embodiment comprises the following steps: collecting and organizing data related to chronic diseases to obtain chronic disease data; constructing an a priori entity word set and an a priori relation word set related to chronic disease management; performing entity extraction on the chronic disease data to obtain entity matrix data; mining the causal structure of the entity matrix data to obtain a causal graph structure reflecting the causal relationships among the entities; and constructing a knowledge graph using the causal graph structure. Because causal relationships are integrated into the knowledge graph, the causal relationships of chronic diseases are easier to find during chronic disease management and the relationships involved can be described more accurately, which facilitates the management of chronic diseases. This solves the problem that existing knowledge graphs for chronic disease management do not adequately reflect the causal relationships among entities, which limits their usefulness in chronic disease management.

The above description is only illustrative of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention, and any alterations and modifications made by those skilled in the art based on the above disclosure shall fall within the scope of the appended claims.

Claims

1. The knowledge graph construction method based on chronic disease management is characterized by comprising the following steps of:

extracting the entity of the chronic disease data to obtain entity matrix data;

and constructing a knowledge graph by utilizing a causal graph structure.

2. The knowledge graph construction method based on chronic disease management according to claim 1, wherein the method for collecting and sorting data related to chronic disease to obtain chronic disease data comprises:

3. The knowledge graph construction method based on chronic disease management according to claim 2, wherein the method for quality controlling the acquired chronic disease medical data and chronic disease public data to obtain chronic disease data comprises:

4. The knowledge graph construction method based on chronic disease management according to claim 1, wherein the method for constructing prior entity word sets and prior relationship word sets related to chronic disease management comprises:

obtaining a medical term standard library;

5. The knowledge graph construction method based on chronic disease management according to claim 1, wherein the method for performing entity extraction on chronic disease data to obtain entity matrix data comprises:

preprocessing chronic disease data;

6. The knowledge graph construction method based on chronic disease management according to claim 5, wherein the method for preprocessing chronic disease data comprises:

7. The knowledge graph construction method based on chronic disease management according to claim 5, wherein the method for performing model entity extraction by manually labeling part of high-quality sample fine tuning BiLSTM-CRF model comprises:

8. The knowledge graph construction method based on chronic disease management according to claim 1, wherein the method for mining a causal structure of entity matrix data to obtain a causal graph structure reflecting causal relationships between entities comprises:

9. The knowledge graph construction method based on chronic disease management according to claim 1, wherein the method for constructing a knowledge graph using a causal graph structure comprises:

extracting relationships between entities from the entity matrix data;

10. The knowledge graph construction method based on chronic disease management according to claim 9, wherein the method for constructing a knowledge graph according to the relation between a set of causal events of a chronic disease entity and the entity comprises: