CN113821597A

CN113821597A - Entity linking method and system for natural language text and medical knowledge graph

Info

Publication number: CN113821597A
Application number: CN202111052099.6A
Authority: CN
Inventors: 刘鹏; 王则远
Original assignee: Lingxi Quantum Beijing Medical Technology Co ltd
Current assignee: Lingxi Quantum Beijing Medical Technology Co ltd
Priority date: 2021-09-08
Filing date: 2021-09-08
Publication date: 2021-12-21

Abstract

The invention provides an entity linking method for natural language text and a medical knowledge graph. The method comprises: obtaining a natural language text related to medical knowledge; identifying a target medical entity in the natural language text using a named entity recognition model, and obtaining the category to which the target medical entity belongs; acquiring, from the medical knowledge graph, a plurality of candidate medical entities corresponding to that category; and calculating the similarity between each candidate medical entity and the target medical entity through a similarity scoring model, and linking a candidate medical entity to the target medical entity according to the ranking of the similarities. Because the similarity is calculated by a similarity scoring model, it describes how alike the two entities are at the semantic level. Linking according to the similarity ranking improves linking precision, and the resulting medical knowledge graph is easier to maintain.

Description

Entity linking method and system for natural language text and medical knowledge graph

Technical Field

The present invention relates to the field of medical knowledge graph processing, and in particular to an entity linking method, system, device, non-transitory computer-readable storage medium, and computer program product for natural language text and a medical knowledge graph.

Background

With the development of the medical and health field, the amount of medical knowledge keeps growing. At the same time, as digitization accelerates, massive amounts of medical data are generated on the internet and in various information systems. Constructing a medical knowledge graph is an efficient way to organize and use this information. Traditional methods of linking entities to a knowledge graph usually match a target entity directly against entities in the knowledge graph using a knowledge base or keywords. Such methods depend on the quality of the knowledge base and keywords and do not account for the complexity of real scenarios. For example, when the semantics of a target entity are complex, the target entity is likely to correspond to several entities in the knowledge graph, and it is unclear how it should be matched and linked precisely. If the target entity is simply linked to multiple entities in the knowledge graph, the resulting medical knowledge graph is imprecise and hard to maintain.

Disclosure of Invention

The invention provides an entity linking method, system, device, non-transitory computer-readable storage medium, and computer program product for natural language text and a medical knowledge graph. The similarity between a candidate medical entity and a target medical entity is calculated through a similarity scoring model, so that the similarity describes how alike the two entities are at the semantic level. The candidate medical entity is linked to the target medical entity according to the ranking of the similarities, which improves linking precision and makes the resulting medical knowledge graph easier to maintain.

The invention provides an entity linking method for natural language text and a medical knowledge graph, comprising the following steps:

obtaining a natural language text related to medical knowledge;

identifying a target medical entity in the natural language text using a named entity recognition model, and obtaining the category to which the target medical entity belongs;

acquiring, from a medical knowledge graph, a plurality of candidate medical entities corresponding to the category to which the target medical entity belongs;

and calculating the similarity between each candidate medical entity and the target medical entity through a similarity scoring model, and linking a candidate medical entity to the target medical entity according to the ranking of the similarities.

According to the entity linking method provided by the invention, calculating the similarity between each candidate medical entity and the target medical entity through a similarity scoring model and linking a candidate medical entity to the target medical entity according to the ranking of the similarities comprises the following steps:

constructing a plurality of candidate entity relationship pairs by pairing each candidate medical entity with the target medical entity;

calculating, through the similarity scoring model, the similarity between the candidate medical entity and the target medical entity in each of the candidate entity relationship pairs;

and selecting the candidate entity relationship pair with the highest similarity, and aligning the candidate medical entity in that pair with the target medical entity, thereby linking the candidate medical entity to the target medical entity.

According to the entity linking method provided by the invention, the similarity scoring model comprises a first similarity calculation model and a second similarity calculation model. Calculating the similarity between the candidate medical entity and the target medical entity in each candidate entity relationship pair specifically comprises: calculating the similarity of the pair through the first similarity calculation model to obtain a first similarity; calculating the similarity of the same pair through the second similarity calculation model to obtain a second similarity; and computing a weighted sum of the first similarity and the second similarity to obtain the final similarity.

According to the entity linking method provided by the invention, obtaining the first similarity through the first similarity calculation model specifically comprises: obtaining the vector of the candidate entity relationship pair, and applying a sigmoid function to the vector of the candidate entity relationship pair and the linear transformation weight to obtain the first similarity.

According to the entity linking method provided by the invention, obtaining the second similarity through the second similarity calculation model specifically comprises: converting the candidate medical entity and the target medical entity in the same candidate entity relationship pair into a candidate medical entity vector and a target medical entity vector, respectively, and calculating the cosine of the angle between the two vectors to obtain the second similarity.

The invention also provides an entity linking system for natural language text and a medical knowledge graph, comprising:

a natural language text obtaining module, configured to obtain a natural language text related to medical knowledge;

a target medical entity recognition module, configured to identify a target medical entity in the natural language text using a named entity recognition model and to obtain the category to which the target medical entity belongs;

a candidate medical entity acquisition module, configured to acquire, from the medical knowledge graph, a plurality of candidate medical entities corresponding to the category to which the target medical entity belongs;

and an entity similarity calculation module, configured to calculate the similarity between each candidate medical entity and the target medical entity through a similarity scoring model and to link a candidate medical entity to the target medical entity according to the ranking of the similarities.

According to the entity linking system provided by the invention, the entity similarity calculation module comprises:

a candidate entity relationship pair construction module, configured to construct a plurality of candidate entity relationship pairs by pairing each candidate medical entity with the target medical entity;

a similarity calculation module, configured to calculate, through the similarity scoring model, the similarity between the candidate medical entity and the target medical entity in each of the candidate entity relationship pairs;

and an entity linking module, configured to select the candidate entity relationship pair with the highest similarity and to align the candidate medical entity in that pair with the target medical entity, thereby linking the candidate medical entity to the target medical entity.

The invention also provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the entity linking method for natural language text and a medical knowledge graph described in any one of the above.

The invention also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the entity linking method for natural language text and a medical knowledge graph described in any one of the above.

The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the entity linking method for natural language text and a medical knowledge graph described in any one of the above.

With the entity linking method, system, device, non-transitory computer-readable storage medium, and computer program product provided by the invention, a target medical entity is identified in the natural language text by a named entity recognition model and its category is obtained; a plurality of candidate medical entities corresponding to that category are then acquired from the medical knowledge graph; the similarity between each candidate medical entity and the target medical entity is calculated one by one through a similarity scoring model; and a candidate medical entity is linked to the target medical entity according to the ranking of the similarities. Because the similarity is calculated by a similarity scoring model, it describes how alike the two entities are at the semantic level. Linking according to the similarity ranking improves linking precision, and the resulting medical knowledge graph is easier to maintain.

Drawings

In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a flowchart of the entity linking method for natural language text and a medical knowledge graph provided by the invention.

FIG. 2 is a block diagram of the entity linking system for natural language text and a medical knowledge graph provided by the invention.

FIG. 3 is a schematic structural diagram of the electronic device provided by the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

An entity linking method for natural language text and a medical knowledge graph, as shown in FIG. 1, includes:

S1: Obtain a natural language text related to medical knowledge.

S2: Identify a target medical entity in the natural language text using a named entity recognition model, and obtain the category to which the target medical entity belongs.

Specifically, the named entity recognition model is an XLNet-based named entity recognition model.

XLNet is a generalized autoregressive pre-training model. It learns bidirectional context by maximizing the expected log-likelihood over all possible permutations of the factorization order, and, because it is autoregressive, it avoids the discrepancy between pre-training and fine-tuning that can affect other pre-trained language models. XLNet is also well suited to long texts and therefore better meets the needs of complex scenarios. XLNet thus extends context modeling from unidirectional to bidirectional and from short-range to long-range, making it one of the most refined models for context modeling at present.

The volume of pre-training data used by XLNet is probably the largest among existing models. When model capacity is large enough, performance improves roughly in proportion to the logarithm of the data volume within a certain range, so the XLNet-based named entity recognition model achieves higher accuracy.

S3: Acquire, from the medical knowledge graph, a plurality of candidate medical entities corresponding to the category to which the target medical entity belongs.

Specifically, the categories to which the target medical entity belongs include diagnosis, disease, symptom, examination items, and the like.

For example, when the identified target medical entity is "late stage of pregnancy" and its category is "diagnosis", all entities in the knowledge base under the category "diagnosis", such as "pregnancy", "diabetes", and "hypertension", are candidate medical entities.
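The category-based candidate lookup in S3 can be sketched as follows. This is a minimal illustration, assuming the knowledge graph is available as a flat list of (entity, category) records; the names `KG_ENTITIES`, `build_category_index`, and `get_candidates` are hypothetical and do not come from the patent.

```python
from collections import defaultdict

# Hypothetical toy knowledge graph: (entity name, category) records.
KG_ENTITIES = [
    ("pregnancy", "diagnosis"),
    ("diabetes", "diagnosis"),
    ("hypertension", "diagnosis"),
    ("headache", "symptom"),
    ("blood glucose test", "examination item"),
]


def build_category_index(entities):
    """Group entity names by category so candidates can be looked up quickly."""
    index = defaultdict(list)
    for name, category in entities:
        index[category].append(name)
    return index


def get_candidates(index, category):
    """Return every knowledge-graph entity filed under the given category."""
    return list(index.get(category, []))


index = build_category_index(KG_ENTITIES)
candidates = get_candidates(index, "diagnosis")
print(candidates)
```

Restricting candidates to the recognized category keeps the later pairwise scoring step small, at the cost of depending on the category predicted in S2.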

S4: Calculate the similarity between each candidate medical entity and the target medical entity through a similarity scoring model, and link a candidate medical entity to the target medical entity according to the ranking of the similarities.

The similarity scoring model produces a score by calculating the similarity between the vector of the candidate medical entity and the vector of the target medical entity; the higher the similarity, the higher the score.

Specifically, S4 includes:

S41: Construct a plurality of candidate entity relationship pairs by pairing each candidate medical entity with the target medical entity.

S42: Calculate, through the similarity scoring model, the similarity between the candidate medical entity and the target medical entity in each of the candidate entity relationship pairs.

Specifically, the similarity scoring model includes a first similarity calculation model and a second similarity calculation model. The similarity between the candidate medical entity and the target medical entity in a candidate entity relationship pair is calculated through the first similarity calculation model to obtain a first similarity, the similarity of the same pair is calculated through the second similarity calculation model to obtain a second similarity, and a weighted sum of the first similarity and the second similarity gives the final similarity.

The first similarity calculation model adopts a BERT model, and the second similarity calculation model adopts a bag-of-words model.

The final similarity is calculated as $c = w_1 a + w_2 b$, with $w_1 + w_2 = 1$, where $c$ denotes the final similarity between the candidate medical entity and the target medical entity in the candidate entity relationship pair, $a$ denotes the first similarity, $b$ denotes the second similarity, $w_1$ denotes the weight of the first similarity, and $w_2$ denotes the weight of the second similarity.

Preferably, $w_1$ and $w_2$ are both 0.5. The weights may also be adjusted as appropriate for the actual use case.
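The weighted sum above can be written as a short function. This is a sketch only; the function name `final_similarity` and the tolerance used to check that the weights sum to 1 are assumptions made for illustration.

```python
def final_similarity(a, b, w1=0.5, w2=0.5):
    """Combine the first similarity a and second similarity b as c = w1*a + w2*b.

    The weights default to the preferred value of 0.5 each and must sum to 1.
    """
    if abs(w1 + w2 - 1.0) > 1e-9:  # tolerance is an illustrative assumption
        raise ValueError("w1 and w2 must sum to 1")
    return w1 * a + w2 * b


# Equal weights give the plain average of the two scores.
c = final_similarity(0.8, 0.6)
print(round(c, 3))
```

Shifting weight toward `w1` favors the BERT-based score, while shifting it toward `w2` favors the bag-of-words score.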

Further, calculating the similarity between the candidate medical entity and the target medical entity in the candidate entity relationship pair through the first similarity calculation model to obtain the first similarity includes:

S421: Obtain the vector of the candidate entity relationship pair.

For example, the candidate medical entity and the target medical entity in a candidate entity relationship pair are both sentences. Before the pair is input into the first similarity calculation model, a [CLS] token is added at the head of the pair and a [SEP] token is inserted between the two sentences as a separator. The output of the first similarity calculation model at the [CLS] position is then taken as the vector of the candidate entity relationship pair, which represents the semantics of the whole pair to a certain extent.
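The input layout described above can be sketched as plain string formatting. This is a minimal illustration under stated assumptions: the helper name `build_pair_input` is hypothetical, and placing the target before the candidate is an arbitrary choice, since the text does not fix the order. In practice a BERT tokenizer typically inserts these special tokens itself and also appends a trailing [SEP].

```python
def build_pair_input(target, candidate):
    """Lay out one candidate entity relationship pair as a single input string.

    [CLS] goes at the head and [SEP] separates the two entity sentences,
    following the layout described in the text.
    """
    return f"[CLS] {target} [SEP] {candidate}"


pair_text = build_pair_input("late stage of pregnancy", "pregnancy")
print(pair_text)
```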

S422: Apply a sigmoid function to the vector of the candidate entity relationship pair and the linear transformation weight to obtain the first similarity, as given by formula 2: $P = \mathrm{sigmoid}(cW')$, where $P$ denotes the first similarity, $c$ denotes the vector of the candidate entity relationship pair (not the final similarity, which is also written $c$ in the weighted-sum formula), and $W'$ denotes the linear transformation weight.
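Formula 2 can be illustrated with a small standard-library sketch. It assumes the pair vector and the linear transformation weight are equal-length lists of floats, so that their product reduces to a dot product; the function name `first_similarity` is hypothetical, and a real system would take the pair vector from the BERT [CLS] output rather than from hand-written numbers.

```python
import math


def first_similarity(pair_vector, weights):
    """Compute P = sigmoid(c . W') for one candidate entity relationship pair."""
    if len(pair_vector) != len(weights):
        raise ValueError("pair_vector and weights must have the same length")
    logit = sum(c_i * w_i for c_i, w_i in zip(pair_vector, weights))
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


# Toy vector and weights, chosen only for illustration.
p = first_similarity([0.2, -0.1, 0.4], [1.0, 2.0, 0.5])
print(0.0 < p < 1.0)
```

The sigmoid maps the linear score into the interval (0, 1), which puts it on a scale comparable to the cosine-based second similarity before the weighted sum.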

Calculating the similarity between the candidate medical entity and the target medical entity in the same candidate entity relationship pair through the second similarity calculation model to obtain the second similarity is specifically S423: ignoring the order of the words in each entity sentence and considering only how often each word occurs, convert the candidate medical entity and the target medical entity into a candidate medical entity vector and a target medical entity vector, respectively, and calculate the cosine of the angle between the two vectors to obtain the second similarity.
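The bag-of-words cosine similarity in S423 can be sketched as follows. The sketch assumes whitespace tokenization of English text; Chinese text would first need word segmentation, which is not shown. The function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, ignoring word order."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty entity shares nothing with any other entity
    return dot / (norm_a * norm_b)


def second_similarity(candidate, target):
    """Second similarity: cosine between the bag-of-words vectors of both entities."""
    return cosine_similarity(bag_of_words(candidate), bag_of_words(target))


score = second_similarity("pregnancy", "late stage of pregnancy")
print(round(score, 3))
```

Because only word counts are compared, two entities with the same words in a different order score as identical, which is why this score is combined with the BERT-based first similarity.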

S43: Select the candidate entity relationship pair with the highest similarity, and align the candidate medical entity in that pair with the target medical entity, thereby linking the candidate medical entity to the target medical entity.
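Steps S41 to S43 amount to scoring every pair and keeping the best one, which can be sketched as below. The scoring function here is a toy stand-in based on word overlap, not the combined BERT and bag-of-words model described in the text; `link_entity` and `toy_score` are hypothetical names.

```python
def link_entity(target, candidates, score_fn):
    """Pair each candidate with the target, score each pair, keep the best.

    Returns (best_candidate, best_score), or None if there are no candidates.
    """
    if not candidates:
        return None
    scored = [(score_fn(candidate, target), candidate) for candidate in candidates]
    best_score, best_candidate = max(scored, key=lambda pair: pair[0])
    return best_candidate, best_score


def toy_score(candidate, target):
    """Toy stand-in scorer: fraction of candidate words that appear in the target."""
    cand_words = candidate.split()
    target_words = set(target.split())
    return sum(w in target_words for w in cand_words) / len(cand_words)


best = link_entity(
    "late stage of pregnancy",
    ["pregnancy", "diabetes", "hypertension"],
    toy_score,
)
print(best)
```

Keeping only the single highest-scoring pair is what lets the method avoid linking one target entity to several knowledge-graph entities.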

The invention covers both named entity recognition and entity alignment. It places no strict requirement on the form of the input text and does not need structured text: any medicine-related natural language text that follows everyday language habits can be linked to the medical knowledge graph. The method is therefore better suited to complex business scenarios, enables end-to-end question answering from everyday natural language to professional medical knowledge, and lowers the threshold for using professional medical knowledge.

In the named entity recognition stage, the invention adopts the XLNet pre-trained language model, which is currently among the most advanced and is well suited to long texts. Compared with traditional pre-trained language models, it offers higher precision and handles entities with longer sequence lengths better. In the entity alignment stage, the similarity between entities is calculated by a similarity scoring model comprising a first similarity calculation model and a second similarity calculation model. Compared with common character-matching or statistical methods, the similarity calculated in this way describes how alike the two entities are at the semantic level.

The entity linking system, device, non-transitory computer-readable storage medium, and computer program product for natural language text and a medical knowledge graph described below correspond to the entity linking method described above, and the two descriptions may be read with reference to each other.

The invention also provides an entity linking system for natural language text and a medical knowledge graph, as shown in FIG. 2, comprising:

a natural language text obtaining module 210, configured to obtain a natural language text related to medical knowledge;

a target medical entity recognition module 220, configured to identify a target medical entity in the natural language text using a named entity recognition model, and to obtain the category to which the target medical entity belongs;

a candidate medical entity obtaining module 230, configured to obtain, from the medical knowledge graph, a plurality of candidate medical entities corresponding to the category to which the target medical entity belongs;

and an entity similarity calculation module 240, configured to calculate the similarity between each candidate medical entity and the target medical entity through a similarity scoring model, and to link a candidate medical entity to the target medical entity according to the ranking of the similarities.
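The way modules 220 to 240 fit together can be sketched as a small class that composes three callables. Everything in this sketch is an illustrative assumption: the class and function names are hypothetical, the recognizer is a lexicon lookup standing in for the XLNet-based model, and the scorer is a word-overlap ratio standing in for the similarity scoring model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Mention:
    text: str      # target medical entity found in the input text
    category: str  # category to which the target medical entity belongs


@dataclass
class EntityLinkingSystem:
    recognize: Callable[[str], List[Mention]]   # role of module 220
    get_candidates: Callable[[str], List[str]]  # role of module 230
    score: Callable[[str, str], float]          # role of module 240

    def link(self, text: str) -> List[Tuple[Mention, Optional[str]]]:
        """Link every recognized mention to its highest-scoring candidate."""
        results = []
        for mention in self.recognize(text):
            candidates = self.get_candidates(mention.category)
            best = max(candidates,
                       key=lambda c: self.score(c, mention.text),
                       default=None)
            results.append((mention, best))
        return results


def toy_recognize(text):
    """Lexicon lookup standing in for the named entity recognition model."""
    lexicon = {"late stage of pregnancy": "diagnosis"}
    return [Mention(m, cat) for m, cat in lexicon.items() if m in text]


TOY_KG = {"diagnosis": ["pregnancy", "diabetes", "hypertension"]}


def toy_score(candidate, mention):
    """Word-overlap (Jaccard) ratio standing in for the similarity scoring model."""
    a, b = set(candidate.split()), set(mention.split())
    return len(a & b) / len(a | b)


system = EntityLinkingSystem(toy_recognize, lambda cat: TOY_KG.get(cat, []), toy_score)
result = system.link("The patient is in the late stage of pregnancy.")
print(result)
```

Passing the three stages in as callables mirrors the modular description: any stage can be swapped for a real model without changing the linking loop.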

Further, the entity similarity calculation module 240 includes:

Further, the similarity scoring model of the similarity calculation module comprises a first similarity calculation model and a second similarity calculation model;

the similarity calculation module includes:

the first similarity calculation module is used for calculating the similarity between the candidate medical entity and the target medical entity in the candidate entity relationship pair through the first similarity calculation model to obtain a first similarity;

the second similarity calculation module is used for calculating the similarity between the candidate medical entity and the target medical entity in the same candidate entity relationship pair through the second similarity calculation model to obtain a second similarity;

and the final similarity calculation module is used for carrying out weighted summation on the first similarity and the second similarity to obtain the final similarity.

Further, the first similarity calculation module obtains the first similarity as follows: it obtains the vector of the candidate entity relationship pair, and applies a sigmoid function to the vector of the candidate entity relationship pair and the linear transformation weight to obtain the first similarity.

Further, the second similarity calculation module obtains the second similarity as follows: it converts the candidate medical entity and the target medical entity in the same candidate entity relationship pair into a candidate medical entity vector and a target medical entity vector, respectively, and calculates the cosine of the angle between the two vectors to obtain the second similarity.

FIG. 3 illustrates the physical structure of an electronic device. As shown in FIG. 3, the electronic device may include a processor 810, a communication interface 820, a memory 830, and a communication bus 840, wherein the processor 810, the communication interface 820, and the memory 830 communicate with one another via the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform an entity linking method for natural language text and a medical knowledge graph, the method comprising:

obtaining a natural language text related to medical knowledge;

In addition, the logic instructions in the memory may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, wherein, when the computer program is executed by a processor, the computer is capable of executing the entity linking method for natural language text and a medical knowledge graph provided above, the method comprising: obtaining a natural language text related to medical knowledge;

In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, performs the entity linking method for natural language text and a medical knowledge graph provided above, the method comprising:

obtaining a natural language text related to medical knowledge;

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. An entity linking method for natural language text and a medical knowledge graph, characterized by comprising the following steps:

obtaining a natural language text related to medical knowledge;

2. The method of claim 1, wherein calculating the similarity between each candidate medical entity and the target medical entity through a similarity scoring model and linking a candidate medical entity to the target medical entity according to the ranking of the similarities comprises:

respectively calculating the similarity between the candidate medical entity and the target medical entity in a plurality of candidate entity relationship pairs through a similarity scoring model;

3. The method of claim 2, wherein the similarity score model comprises a first similarity calculation model and a second similarity calculation model;

the calculating the similarity between the candidate medical entity and the target medical entity in the plurality of candidate entity relationship pairs respectively through a similarity scoring model comprises the following steps:

calculating the similarity between the candidate medical entity and the target medical entity in the candidate entity relationship pair through the first similarity calculation model to obtain a first similarity;

calculating the similarity between the candidate medical entity and the target medical entity in the same candidate entity relationship pair through the second similarity calculation model to obtain a second similarity;

and carrying out weighted summation on the first similarity and the second similarity to obtain the final similarity.

4. The method according to claim 3, wherein calculating the similarity between the candidate medical entity and the target medical entity in the candidate entity relationship pair through the first similarity calculation model to obtain a first similarity specifically comprises: obtaining the vector of the candidate entity relationship pair, and applying a sigmoid function to the vector of the candidate entity relationship pair and the linear transformation weight to obtain the first similarity.

5. The method according to claim 3, wherein calculating the similarity between the candidate medical entity and the target medical entity in the same candidate entity relationship pair through the second similarity calculation model to obtain a second similarity specifically comprises: converting the candidate medical entity and the target medical entity in the same candidate entity relationship pair into a candidate medical entity vector and a target medical entity vector, respectively, and calculating the cosine of the angle between the two vectors to obtain the second similarity.

6. An entity linking system for natural language text and a medical knowledge graph, comprising:

7. The entity linking system for natural language text and a medical knowledge graph according to claim 6, wherein the entity similarity calculation module comprises:

8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the entity linking method for natural language text and a medical knowledge graph as claimed in any one of claims 1 to 5.

9. A non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the entity linking method for natural language text and a medical knowledge graph according to any one of claims 1 to 5.

10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the entity linking method for natural language text and a medical knowledge graph according to any one of claims 1 to 5.