CN111028952B

CN111028952B - Method and device for constructing Chinese medical implication knowledge graph

Info

Publication number: CN111028952B
Application number: CN201911179731.6A
Authority: CN
Inventors: 史亚飞
Original assignee: Unisound Intelligent Technology Co Ltd
Current assignee: Unisound Intelligent Technology Co Ltd
Priority date: 2019-11-27
Filing date: 2019-11-27
Publication date: 2023-08-04
Anticipated expiration: 2039-11-27
Also published as: CN111028952A

Abstract

The invention provides a method and a device for constructing a Chinese medical implication knowledge graph. The method comprises the following steps: acquiring a first medical entity and a second medical entity; integrating the first medical entity and the second medical entity, and performing de-duplication filtering on the first medical entity and the second medical entity and the entity in a preset medical knowledge graph to obtain a third medical entity which does not exist in the preset medical knowledge graph and needs to be aligned; fine tuning the pre-trained model to obtain a medical entity implication model; inputting the third medical entity and the entity with the highest similarity into the medical entity implication model, and determining the relationship between the entities; and updating the preset medical knowledge graph according to the relation between the third medical entity and the entity.

Description

Method and device for constructing Chinese medical implication knowledge graph

Technical Field

The invention relates to the technical field of Internet, in particular to a method and a device for constructing a Chinese medical implication knowledge graph.

Background

As more and more semantic web data is opened on the Internet, Internet search engine companies at home and abroad have begun to construct knowledge graphs based on this data to improve service quality, such as the Google Knowledge Graph and Baidu's "Zhixin". A knowledge graph is essentially a semantic network whose nodes represent entities or concepts and whose links represent the semantic relationships between those entities or concepts. It is a form of knowledge management that connects trivial and scattered knowledge from various fields into a large, networked knowledge system built on a semantic network framework. Knowledge graphs are now applied in intelligent systems such as comprehensive knowledge retrieval, question answering, and decision support.

At present, when a medical knowledge graph is constructed, all entities extracted from the data source are input into a neural network model. This requires a great deal of work and reduces the efficiency of constructing the medical knowledge graph. How to improve this efficiency is a technical problem to be solved urgently.

Disclosure of Invention

The invention provides a method for constructing a Chinese medical implication knowledge graph, which comprises the following steps:

acquiring a first medical entity and a second medical entity;

integrating the first medical entity and the second medical entity, and performing de-duplication filtering on the first medical entity and the second medical entity and the entity in a preset medical knowledge graph to obtain a third medical entity which does not exist in the preset medical knowledge graph and needs to be aligned;

fine tuning the pre-trained model to obtain a medical entity implication model;

inputting the third medical entity and the entity with the highest similarity into the medical entity implication model, and determining the relationship between the entities;

and updating the preset medical knowledge graph according to the relation between the third medical entity and the entity.

The beneficial effects of this embodiment are as follows: de-duplication filtering is performed on the acquired medical entities, leaving only the third medical entities, which are not in the existing medical knowledge graph and need to be aligned. The entity in the existing medical knowledge graph with the highest similarity to the third medical entity is obtained through retrieval and calculation, and only the third medical entity and that most similar entity are input into the model. Other medical entities need not be input into the model, so the number of entities to be input is greatly reduced and efficiency is improved.

Specifically, the acquiring the first medical entity and the second medical entity includes:

acquiring data from the network as a data source;

extracting related data in the medical field from the data source;

performing medical named entity identification by using the finely tuned deep learning pre-trained model to obtain the first medical entity;

the second medical entity is obtained from a structured medical document.

Specifically, the method for obtaining the medical entity implication model by fine tuning the pre-trained model includes:

acquiring a labeling data set from the preset medical knowledge graph;

extracting a training data set and a test data set required by constructing a medical entity implication model from the labeling data set;

and placing the training data set and the test data set in a pre-trained model for training and testing in a fine tuning mode to obtain the medical entity implication model.

Specifically, inputting the third medical entity and the entity with the highest similarity into the medical entity implication model includes:

searching the preset medical knowledge base according to a preset algorithm to obtain the entity with the highest similarity to the third medical entity.

Specifically, the updating the preset medical knowledge graph according to the relationship between the third medical entity and the entity includes:

judging whether the relationship between the third medical entity and the entity is an implication relationship, and if so, updating the preset medical knowledge graph.

The invention also provides a device for constructing the Chinese medical implication knowledge graph, which comprises:

the acquisition module is used for acquiring the first medical entity and the second medical entity;

the screening module is used for integrating the first medical entity with the second medical entity, de-duplicating and filtering the first medical entity and the second medical entity with the entity in the preset medical knowledge graph to obtain a third medical entity which does not exist in the preset medical knowledge graph and needs to be aligned;

the fine tuning module is used for fine tuning the pre-trained model to obtain a medical entity implication model;

the determining module is used for inputting the third medical entity and the entity with the highest similarity into the medical entity implication model to determine the relationship between the entities;

and the updating module is used for updating the preset medical knowledge graph according to the relation between the third medical entity and the entity.

Specifically, the acquisition module includes:

the first acquisition submodule is used for acquiring data from the network as a data source;

the extraction submodule is used for extracting relevant data in the medical field in the data source;

the identification sub-module is used for carrying out medical named entity identification by using the finely-tuned deep learning pre-trained model to obtain the first medical entity;

and the second acquisition sub-module is used for acquiring the second medical entity from the structured medical document.

Specifically, the fine tuning module includes:

the third acquisition sub-module is used for acquiring a labeling data set from the preset medical knowledge graph;

the extraction sub-module is used for extracting a training data set and a test data set required by constructing a medical entity implication model from the labeling data set;

and the fine tuning sub-module is used for placing the training data set and the test data set in a pre-trained model for training and testing in a fine tuning mode to obtain the medical entity implication model.

Specifically, the determining module includes:

and the retrieval sub-module is used for retrieving the preset medical knowledge base according to a preset algorithm to obtain an entity with the highest similarity to the third medical entity.

Specifically, the updating module includes:

and the judging sub-module is used for judging whether the relation between the third medical entity and the entity is an implication relation or not, and updating the preset medical knowledge graph if the relation is implication relation.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.

The technical scheme of the invention is further described in detail through the drawings and the embodiments.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:

FIG. 1 is a flowchart of a method for constructing a knowledge graph of Chinese medical implications according to an embodiment of the present invention;

FIG. 2 is a flowchart of a method for constructing a knowledge graph of Chinese medical implications according to an embodiment of the invention;

FIG. 3 is a block diagram of a device for constructing a knowledge graph of Chinese medical implications according to an embodiment of the invention;

FIG. 4 is a block diagram of a device for constructing a knowledge graph of Chinese medical implications according to an embodiment of the invention.

Detailed Description

The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.

FIG. 1 is a flowchart of a method for constructing a knowledge graph of Chinese medical implications according to an embodiment of the invention. As shown in FIG. 1, the method may be implemented as steps S11-S15 as follows:

in step S11, a first medical entity and a second medical entity are acquired;

in step S12, integrating the first medical entity with the second medical entity, and performing duplication elimination and filtration on the first medical entity and the second medical entity with the entity in the preset medical knowledge graph to obtain a third medical entity which does not exist in the preset medical knowledge graph and needs to be aligned;

in step S13, a medical entity implication model is obtained by fine tuning the pre-trained model;

in step S14, inputting the third medical entity and the entity with the highest similarity into the medical entity implication model, and determining the relationship between the entities;

in step S15, the preset medical knowledge graph is updated according to the relationship between the third medical entity and the entity.

In this embodiment, the preset medical knowledge graph may be an existing medical knowledge graph. The first medical entity is obtained from network and real-world data, and the second medical entity is obtained from a structured medical document. The first medical entity and the second medical entity are integrated and compared with the existing medical knowledge graph; the medical entities that do not exist in the existing knowledge graph and need to be aligned are retained as the third medical entity. A general model for the medical field is pre-trained on a large-scale medical corpus and then fine-tuned with labeled data to obtain the medical entity implication model. The third medical entity is compared with the entities in the existing medical knowledge graph through retrieval and calculation to obtain the entity with the highest similarity to the third medical entity. The third medical entity and the entity with the highest similarity are input into the medical entity implication model, which outputs the implication relationship between them. The existing medical knowledge graph is then updated with this implication relationship to obtain a new medical knowledge graph.

For example, the medical entities may be "diabetes", "diuresis", and "insulin injection", and the relationships between entities may be that "diabetes" has the symptom "diuresis" and that "insulin injection" treats "diabetes".

It should be noted that the "first medical entity", "second medical entity", and "third medical entity" do not each refer to a single entity; there may be one or more entities with the highest similarity to the third medical entity; medical entities requiring alignment are entities that have different identifiers but represent the same object, i.e. alignment merges such entities into one entity with a unique identifier.
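The integration and de-duplication filtering of step S12 can be sketched as a set operation. This is only a minimal illustration, assuming that entities are plain strings and that light normalization (trimming whitespace, folding case) is enough to detect duplicates; the function names `normalize` and `filter_new_entities` are hypothetical and do not come from the patent.

```python
from typing import Iterable, List, Set


def normalize(name: str) -> str:
    """Hypothetical normalization: trim whitespace and fold case."""
    return name.strip().casefold()


def filter_new_entities(
    first: Iterable[str],
    second: Iterable[str],
    graph_entities: Iterable[str],
) -> List[str]:
    """Merge both entity sources, drop duplicates, and keep only the
    entities that are absent from the preset knowledge graph."""
    known: Set[str] = {normalize(e) for e in graph_entities}
    seen: Set[str] = set()
    third: List[str] = []
    for entity in list(first) + list(second):
        key = normalize(entity)
        if not key or key in known or key in seen:
            continue  # empty, already in the graph, or a repeat
        seen.add(key)
        third.append(entity.strip())
    return third


# Toy data: entity names are illustrative only.
web_entities = ["type 2 diabetes", "Diuresis ", "insulin injection"]
document_entities = ["diuresis", "diabetic nephropathy"]
graph = ["diabetes", "insulin injection"]
third_entities = filter_new_entities(web_entities, document_entities, graph)
print(third_entities)
```

Only the entities returned here would go on to the retrieval and implication steps, which is where the claimed reduction in model inputs comes from.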

In one embodiment, the above step S11 may be implemented as steps A1-A4 as follows:

in step A1, acquiring data from the network as a data source;

in step A2, extracting relevant data in the medical field in a data source;

in step A3, performing medical named entity recognition by using a fine-tuned deep learning pre-trained model to obtain a first medical entity;

in step A4, a second medical entity is obtained from the structured medical document.

For example, network-crawled data (medical encyclopedias, medical websites), medical documents (clinical guidelines, medical textbooks), and unstructured clinical medical records are taken as data sources; data related to the medical field is acquired from these data sources, and the fine-tuned deep learning pre-trained model BERT is used to perform medical named entity recognition, yielding the first medical entity. The second medical entity is obtained from medical documents that have already been structured.
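The patent does not say how the output of the named entity recognition model is turned into entity strings. One common approach, shown here purely as an assumption, is for the model to emit one BIO tag per character, which is then decoded into spans. The tag names (`B-DIS`, `I-SYM`, and so on) and the function `decode_bio` are hypothetical.

```python
from typing import List, Tuple


def decode_bio(chars: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity_text, entity_type) spans from per-character BIO tags."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type = ""
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A new span starts; close any span that is still open.
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [ch], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(ch)
        else:
            # "O" or an inconsistent "I-" tag closes the open span.
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [], ""
    if current:
        entities.append(("".join(current), current_type))
    return entities


# Assumed model output for a short clinical phrase (tags are made up).
chars = list("糖尿病患者多尿")
tags = ["B-DIS", "I-DIS", "I-DIS", "O", "O", "B-SYM", "I-SYM"]
first_medical_entities = decode_bio(chars, tags)
print(first_medical_entities)
```

In a real pipeline the tags would come from the fine-tuned BERT model rather than being written by hand.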

In one embodiment, as shown in FIG. 2, the above step S13 may be implemented as steps S21-S23 as follows:

in step S21, a labeling data set is obtained from a preset medical knowledge graph;

in step S22, a training data set and a test data set required for constructing a medical entity implication model are extracted from the labeling data set;

in step S23, the training data set and the test data set are placed in the pre-trained model for training and testing by means of fine tuning, so as to obtain the medical entity implication model.

In this embodiment, a general model for the medical field is pre-trained on a large-scale medical corpus, and the medical entity implication model is then obtained by fine-tuning the general model with labeled data. Specifically, the labeling data set required by the implication model is constructed from the upper-lower (hypernym-hyponym) relationships and synonym relationships between entities in the existing medical knowledge graph.

Constructing the labeling data set required by the implication model comprises:

constructing positive examples of the implication model labeling data set by randomly selecting an entity together with one of its direct upper entities or synonymous entities as a positive sample;

and constructing negative examples of the implication model labeling data set by randomly selecting an entity together with one of its direct lower entities as a negative sample.

70% of the labeling data set is taken as the training data set and 30% as the test data set; the training data set and the test data set are then placed in the pre-trained model BERT for training and testing by means of fine tuning to obtain the medical entity implication model.

Note that the implication relationship is defined as follows: for an entity A and an entity B, if entity A is a lower entity (hyponym) or a synonym of entity B, then entity A implies entity B.
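The construction of the labeling data set and the 70/30 split can be sketched as follows. This is a hedged illustration, not the patent's implementation: it assumes the existing graph is available as three dictionaries mapping an entity to its direct upper entities, synonyms, and direct lower entities, and it enumerates every pair instead of sampling randomly. The names `build_pairs` and `split_70_30` are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Pair = Tuple[str, str, int]  # (entity_a, entity_b, label)


def build_pairs(
    hypernyms: Dict[str, List[str]],
    synonyms: Dict[str, List[str]],
    hyponyms: Dict[str, List[str]],
) -> List[Pair]:
    """Label 1: entity paired with a direct upper or synonymous entity.
    Label 0: entity paired with one of its direct lower entities."""
    pairs: List[Pair] = []
    for entity, uppers in hypernyms.items():
        pairs += [(entity, upper, 1) for upper in uppers]
    for entity, syns in synonyms.items():
        pairs += [(entity, syn, 1) for syn in syns]
    for entity, lowers in hyponyms.items():
        pairs += [(entity, lower, 0) for lower in lowers]
    return pairs


def split_70_30(pairs: List[Pair], seed: int = 0) -> Tuple[List[Pair], List[Pair]]:
    """Shuffle with a fixed seed and split into 70% training, 30% test."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Toy graph fragments; entity names are illustrative only.
pairs = build_pairs(
    hypernyms={"type 2 diabetes": ["diabetes"]},
    synonyms={"diabetes": ["diabetes mellitus"]},
    hyponyms={"diabetes": ["type 2 diabetes"]},
)
train_set, test_set = split_70_30(pairs)
```

Each pair can then be fed to a sentence-pair classifier, with label 1 meaning that the first entity implies the second.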

In one embodiment, the above step S14 may be implemented as follows:

The final similarity score between the third medical entity Q and a medical entity D in the existing medical knowledge graph is calculated according to the following formula:

$$\mathrm{Score}(D, Q) = \sum_{i} \mathrm{IDF}_i \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$

wherein q_i represents an element obtained by word segmentation of the medical entity D, f(q_i, D) represents the word frequency of q_i in entity D, |D| represents the number of words that medical entity D contains, avgdl represents the average number of words contained in the entities of all medical knowledge graphs, k_1 and b represent freely adjustable parameters, by default k_1 ∈ [1.2, 2.0] and b = 0.75; Score(D, Q) is the final similarity score; IDF represents the inverse text frequency index, which is calculated as follows:

$$\mathrm{IDF}_i = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$$

wherein IDF_i is the inverse text frequency index of the i-th word, N is the total number of medical entities D in the existing medical knowledge graph, and n(q_i) is the number of medical entities D containing the i-th word of the retrieval entity.
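The quantities defined above (word frequency, |D|, avgdl, k_1, b, and IDF) match those of the Okapi BM25 ranking function, so the retrieval step can be sketched under that assumption. The sketch also assumes that entities are already segmented into word lists and, following the usual BM25 convention, that the scored terms are the words of the retrieval entity Q. The names `bm25_scores` and `top_x` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def bm25_scores(
    query_terms: Sequence[str],
    docs: Dict[str, List[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> Dict[str, float]:
    """Score every graph entity (a list of words) against the query terms."""
    if not docs:
        return {}
    n_docs = len(docs)
    avgdl = sum(len(terms) for terms in docs.values()) / n_docs
    if avgdl == 0:
        avgdl = 1.0  # guard against a graph of empty entities
    # n(q_i): number of entities that contain each word.
    doc_freq = Counter(t for terms in docs.values() for t in set(terms))
    scores: Dict[str, float] = {}
    for name, terms in docs.items():
        tf = Counter(terms)
        score = 0.0
        for q in query_terms:
            f = tf[q]
            if f == 0:
                continue  # word absent from this entity
            idf = math.log((n_docs - doc_freq[q] + 0.5) / (doc_freq[q] + 0.5))
            norm = f + k1 * (1 - b + b * len(terms) / avgdl)
            score += idf * f * (k1 + 1) / norm
        scores[name] = score
    return scores


def top_x(scores: Dict[str, float], x: int) -> List[Tuple[str, float]]:
    """Return the x highest-scoring entities."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:x]


# Toy graph with whitespace-segmented entity names (illustrative only).
graph_docs = {
    "diabetes": ["diabetes"],
    "gestational diabetes": ["gestational", "diabetes"],
    "hypertension": ["hypertension"],
    "asthma": ["asthma"],
    "anemia": ["anemia"],
}
scores = bm25_scores(["gestational", "diabetes", "mellitus"], graph_docs)
best = top_x(scores, 1)
```

With this IDF form, a word that occurs in more than half of the entities receives a negative weight; some implementations clamp or shift the IDF to avoid that, which the patent text does not address.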

In one embodiment, the above step S15 may be implemented as follows:

In this embodiment, the implication model is applied to the third medical entity Q and the X entities with the highest similarity to it, so as to obtain the upper-lower relationship or the synonymous relationship between the third medical entity Q and the X entities with the highest similarity.

For q_i ∈ Q and x_i ∈ X, the detailed rules are as follows:

if q_i implies x_i and x_i implies q_i, then q_i and x_i are in a synonymous relationship;

if q_i implies x_i but x_i does not imply q_i, then q_i is a lower entity of x_i;

if q_i does not imply x_i but x_i implies q_i, then q_i is an upper entity of x_i;

if q_i does not imply x_i and x_i does not imply q_i, then q_i and x_i have no relationship.

The relationship between the third medical entity and each of the X entities with the highest similarity is judged. If one of the above relationships holds, the existing medical knowledge graph is updated with the relationship between that third medical entity and the X entities with the highest similarity; if not, the third medical entity is discarded.
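The four cases above can be sketched as a function of two directed implication checks. The predicate `implies` stands in for the fine-tuned implication model and is an assumption of this sketch, as are the name `classify_relation` and the string labels it returns.

```python
from typing import Callable, Optional


def classify_relation(
    q: str,
    x: str,
    implies: Callable[[str, str], bool],
) -> Optional[str]:
    """Map the two directed implication results onto a relation label.

    Returns the relation of q with respect to x, or None if there is none.
    """
    q_to_x = implies(q, x)
    x_to_q = implies(x, q)
    if q_to_x and x_to_q:
        return "synonym"
    if q_to_x:
        return "hyponym"   # q is a lower entity of x
    if x_to_q:
        return "hypernym"  # q is an upper entity of x
    return None


# Stand-in for the model: a fixed set of (premise, hypothesis) pairs.
known_implications = {
    ("type 2 diabetes", "diabetes"),
    ("diabetes", "diabetes mellitus"),
    ("diabetes mellitus", "diabetes"),
}


def toy_implies(a: str, b: str) -> bool:
    return (a, b) in known_implications


relation = classify_relation("type 2 diabetes", "diabetes", toy_implies)
```

A caller would add an edge to the graph when the result is not `None` and drop the candidate entity otherwise.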

FIG. 3 is a block diagram of a device for constructing a knowledge graph of Chinese medical implications according to an embodiment of the present invention. As shown in FIG. 3, the device may include the following modules:

an acquisition module 31 for acquiring a first medical entity and a second medical entity;

a screening module 32, configured to integrate the first medical entity with the second medical entity, and perform de-duplication filtering on the first medical entity and the second medical entity against the entities in a preset medical knowledge graph to obtain a third medical entity that does not exist in the preset medical knowledge graph and needs to be aligned;

a fine tuning module 33 for obtaining a medical entity implication model by fine tuning the pre-trained model;

a determining module 34, configured to input the third medical entity and the entity with the highest similarity into the medical entity implication model, and determine a relationship between the entities;

and an updating module 35, configured to update the preset medical knowledge graph according to the relationship between the third medical entity and the entity.

In one embodiment, as shown in FIG. 4, the acquisition module 31 includes:

a first obtaining sub-module 41, configured to obtain data from the network as a data source;

an extraction sub-module 42 for extracting data related to the medical field in the data source;

an identification sub-module 43, configured to use the fine-tuned deep learning pre-trained model to perform medical named entity identification, so as to obtain a first medical entity;

a second acquisition sub-module 44 for acquiring a second medical entity from the structured medical document.

In one embodiment, the fine tuning module includes:

the third acquisition sub-module is used for acquiring a labeling data set from a preset medical knowledge graph;

the extraction sub-module is used for extracting a training data set and a test data set required by constructing the medical entity implication model from the labeling data set;

and the fine tuning sub-module is used for placing the training data set and the test data set in a pre-trained model for training and testing in a fine tuning mode to obtain a medical entity implication model.

In one embodiment, the determining module includes:

and the retrieval sub-module is used for retrieving a preset medical knowledge base according to a preset algorithm to obtain an entity with the highest similarity to the third medical entity.

In one embodiment, the update module includes:

and the judging sub-module is used for judging whether the relationship between the third medical entity and the entity is an implication relationship or not, and updating the preset medical knowledge graph if the relationship is implication relationship.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. The method for constructing the Chinese medical implication knowledge graph is characterized by comprising the following steps of:

acquiring a first medical entity and a second medical entity, comprising:

acquiring data from the network as a data source;

extracting related data in the medical field from the data source;

obtaining the second medical entity from a structured medical document;

fine tuning the pre-trained model to obtain a medical entity implication model;

updating the preset medical knowledge graph according to the relation between the third medical entity and the entity;

searching a preset medical knowledge base according to a preset algorithm to obtain an entity with the highest similarity to the third medical entity, wherein the searching comprises the following steps:

$$\mathrm{Score}(D, Q) = \sum_{i} \mathrm{IDF}_i \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$;

wherein q_i represents the elements after word segmentation of the medical entity D in the existing medical knowledge graph, f(q_i, D) represents the word frequency of q_i in the medical entity D in the existing medical knowledge graph, |D| represents the number of words contained in the medical entity D in the existing medical knowledge graph, avgdl represents the average number of words contained in the entities of all medical knowledge graphs, k_1 and b represent freely adjustable parameters, by default k_1 ∈ [1.2, 2.0] and b = 0.75; IDF represents the inverse text frequency index, which is calculated as follows:

$$\mathrm{IDF}_i = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$$;

wherein IDF_i is the inverse text frequency index of the i-th word, N is the total number of medical entities D in the existing medical knowledge graph, and n(q_i) represents the number of medical entities D in the existing medical knowledge graph containing the i-th word of the retrieval entity.

2. The method of claim 1, wherein the obtaining the medical entity implication model by fine-tuning the pre-trained model comprises:

acquiring a labeling data set from the preset medical knowledge graph;

3. The method of claim 1, wherein updating the preset medical knowledge-graph according to the relationship between the third medical entity and the entity comprises:

4. A device for constructing a Chinese medical implication knowledge graph, characterized by comprising:

an acquisition module for acquiring a first medical entity and a second medical entity, comprising:

a second acquisition sub-module for acquiring the second medical entity from the structured medical document;

the updating module is used for updating the preset medical knowledge graph according to the relation between the third medical entity and the entity;

；

5. The apparatus of claim 4, wherein the fine tuning module comprises:

6. The apparatus of claim 4, wherein the update module comprises: