CN111415748B - Entity linking method and device

Info

Publication number: CN111415748B
Application number: CN202010099197.4A
Authority: CN
Inventors: 史亚飞
Original assignee: Unisound Intelligent Technology Co Ltd
Current assignee: Unisound Intelligent Technology Co Ltd
Priority date: 2020-02-18
Filing date: 2020-02-18
Publication date: 2023-08-08
Anticipated expiration: 2040-02-18
Also published as: CN111415748A

Abstract

The invention discloses an entity linking method and device. The method comprises the following steps: acquiring a current medical text, and determining a medical term to be linked from the current medical text; obtaining a current word vector based on the medical term to be linked; comparing the current word vector with a preset word vector, and outputting the comparison similarity; determining a current medical entity of the medical term to be linked according to the comparison similarity; and linking the medical term to be linked with the current medical entity. Compared with the semantic components used in the prior art, word vectors are more diverse, so the analysis result is not limited to a single type, and the most suitable result is selected from several candidates as the current medical entity. This avoids the situation in which the semantic components that the CRF identifies for the entity to be linked are too uniform, so that the analysis accuracy is too low and the medical entity either cannot be obtained effectively or is obtained incorrectly, and accuracy is thereby improved.

Description

Entity linking method and device

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for entity linking.

Background

In the processing of clinical medical record big data, differences between regions, hospitals, doctors, standards and the like mean that the same entity is often expressed in many different ways. The data can be effectively counted and computed only if these expressions are accurately identified as the same entity and mapped to a limited entity space. Medical term entity linking is therefore an essential part of the data processing pipeline.

At present, existing entity linking methods generally obtain candidates through an N-gram algorithm, use a CRF to identify the semantic components of the entity to be linked among the candidates and the semantic components of the candidate standard terms to be matched, and finally obtain the standard term with the highest similarity by means of the synonym relations between semantic components in a knowledge graph. However, this approach has the following disadvantage: the semantic components that the CRF identifies for the entity to be linked are too uniform, so the analysis accuracy is too low and the medical entity either cannot be obtained effectively or is obtained incorrectly.

Disclosure of Invention

To address the above problems, the method determines the medical term to be linked from the current medical text, obtains the current word vector of the medical term to be linked, compares the current word vector with a preset word vector, and thereby determines the current medical entity of the medical term to be linked and links the two.

An entity linking method, comprising the steps of:

acquiring a current medical text, and determining medical terms to be linked from the current medical text;

obtaining a current word vector based on the medical terms to be linked;

comparing the similarity of the current word vector and a preset word vector, and outputting comparison similarity;

determining the current medical entity of the medical term to be linked according to the comparison similarity;

and linking the medical term to be linked with the current medical entity.

Preferably, the obtaining the current medical text, determining the medical term to be linked from the current medical text, includes:

extracting all first medical terms from the current medical text;

inputting the first medical term into a preset knowledge graph for retrieval;

and determining the medical terms to be linked through retrieval.

Preferably, the obtaining the current word vector based on the medical terms to be linked includes:

preprocessing the medical terms to be linked, and converting English components in the medical terms to be linked into corresponding Chinese;

calculating the label score of each Chinese in the medical terms to be linked by using the following formula:

$$\mathrm{score}(x, y) = \sum_{i=1}^{n} P_{x_i, y_i} + \sum_{i=1}^{n-1} A_{y_i, y_{i+1}}$$

wherein $x = (x_1, x_2, \ldots, x_n)$ represents the input sequence of characters in the medical term to be linked, $y = (y_1, y_2, \ldots, y_n)$ represents the output label sequence for those characters, $P_{x_i, y_i}$ represents the probability that the input is $x_i$ and the output is label $y_i$, and $A_{y_i, y_{i+1}}$ represents the probability of a transition from label $y_i$ to label $y_{i+1}$;

selecting the output sequence with the highest score as the current label of the medical term to be linked;

extracting n first semantic components of the current tag;

training word vectors of each first semantic component in the n first semantic components by using a preset model;

and determining the word vector of each semantic component as the current word vector.
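The label scoring and selection described above can be illustrated with a minimal sketch. It assumes the emission scores and transition scores are already available as plain nested lists (in the patent they would come from the BiLSTM and CRF layers); the names `sequence_score` and `best_labels` are hypothetical, and dynamic programming (Viterbi decoding) is swapped in here as one common way to find the highest-scoring sequence without enumerating all of them.

```python
def sequence_score(emissions, transitions, labels):
    """Score one label sequence: emission terms plus transition terms.

    emissions[i][y]   : score of label y at position i (assumed given)
    transitions[a][b] : score of moving from label a to label b (assumed given)
    """
    score = sum(emissions[i][y] for i, y in enumerate(labels))
    score += sum(transitions[a][b] for a, b in zip(labels, labels[1:]))
    return score


def best_labels(emissions, transitions):
    """Return (label sequence, score) with the highest sequence_score."""
    n_labels = len(transitions)
    best = list(emissions[0])  # best[y]: best score of a path ending in y
    back = []                  # back-pointers, one list per later position
    for em in emissions[1:]:
        new_best, pointers = [], []
        for y in range(n_labels):
            prev = max(range(n_labels), key=lambda p: best[p] + transitions[p][y])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][y] + em[y])
        best = new_best
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(range(n_labels), key=lambda y: best[y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, max(best)


# Toy example with two labels and three characters (values are made up).
emissions = [[1.0, 0.2], [0.3, 0.9], [0.8, 0.1]]
transitions = [[0.5, -1.0], [0.0, 0.4]]
path, score = best_labels(emissions, transitions)
```

The decoded `path` is the label sequence that would be selected as the current label of the term; `sequence_score` can be used to check any single candidate sequence by hand.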

Preferably, the comparing the similarity of the current word vector and the preset word vector, and outputting the comparison similarity, includes:

determining medical concepts corresponding to the medical terms to be linked;

retrieving all second medical terms related to the medical concept from the preset knowledge graph;

extracting a second semantic component in the second medical term;

training word vectors corresponding to the second semantic components by using the preset model;

determining a word vector corresponding to the second semantic component as a preset word vector;

calculating the similarity between the current word vector and the preset word vector by using the following formula:

$$\cos\theta = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$

wherein $\cos\theta$ is the similarity between the current word vector and the preset word vector, $a_1, a_2, \ldots, a_n$ are the n components of the current word vector, and $b_1, b_2, \ldots, b_n$ are the n components of the preset word vector.

Preferably, the determining the current medical entity of the medical term to be linked according to the comparison similarity includes:

confirming whether the similarity is larger than or equal to a preset threshold value;

if yes, confirming whether the similarity is one hundred percent;

if yes, determining a current medical entity corresponding to the current word vector according to the preset medical entity corresponding to the preset word vector;

otherwise, judging whether the current word vector and the preset word vector meet a preset condition;

if the preset condition is met, determining the current medical entity corresponding to the current word vector according to the preset medical entity corresponding to the preset word vector;

otherwise, prompting that there is no matching current medical entity.

An entity linking apparatus, the apparatus comprising:

the acquisition module is used for acquiring a current medical text and determining medical terms to be linked from the current medical text;

the obtaining module is used for obtaining a current word vector based on the medical terms to be linked;

the comparison module is used for comparing the similarity of the current word vector and the preset word vector and outputting comparison similarity;

the determining module is used for determining the current medical entity of the medical term to be linked according to the comparison similarity;

and the link module is used for linking the medical term to be linked with the current medical entity.

Preferably, the acquiring module includes:

a first extraction sub-module for extracting all first medical terms from the current medical text;

the first retrieval submodule is used for inputting the first medical term into a preset knowledge graph for retrieval;

a first determining sub-module for determining the medical term to be linked by retrieving.

Preferably, the obtaining module includes:

the preprocessing sub-module is used for preprocessing the medical terms to be linked and converting English components in the medical terms to be linked into corresponding Chinese;

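a first calculation sub-module, configured to calculate the label score of each Chinese character in the medical term to be linked by using the following formula:

$$\mathrm{score}(x, y) = \sum_{i=1}^{n} P_{x_i, y_i} + \sum_{i=1}^{n-1} A_{y_i, y_{i+1}}$$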

the selecting submodule is used for selecting the output sequence with the highest score as the current label of the medical term to be linked;

the second extraction submodule is used for extracting n first semantic components of the current tag;

the first training submodule is used for training word vectors of each first semantic component in the n first semantic components by using a preset model;

and the second determining submodule is used for determining the word vector of each semantic component as the current word vector.
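The preprocessing sub-module's conversion of English components can be sketched as a dictionary lookup. This is only an illustration under stated assumptions: the function name `expand_english` and the glossary are hypothetical, the single glossary entry is taken from the "IABP implantation" example in the detailed description, and the English gloss stands in for the Chinese expansion that the patent's setting would use.

```python
import re


def expand_english(term, glossary):
    """Replace each run of Latin letters that appears in the glossary.

    glossary maps an upper-case English component to its expansion
    (a hypothetical lookup table; unknown components are left unchanged).
    """
    def replace(match):
        token = match.group(0)
        return glossary.get(token.upper(), token)

    return re.sub(r"[A-Za-z]+", replace, term)


# Hypothetical glossary with the one example given in the description.
glossary = {"IABP": "aortic balloon counterpulsation"}
expanded = expand_english("IABP implantation", glossary)
```

A table lookup keeps the step deterministic and easy to audit; components missing from the glossary pass through untouched so that later stages can still see them.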

Preferably, the comparing module includes:

a third determining submodule, configured to determine a medical concept corresponding to the medical term to be linked;

the second retrieval sub-module is used for retrieving all second medical terms related to the medical concept from the preset knowledge graph;

a third extraction sub-module for extracting a second semantic component in the second medical term;

the second training submodule is used for training word vectors corresponding to the second semantic components by using the preset model;

a third determining submodule, configured to determine a word vector corresponding to the second semantic component as a preset word vector;

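the second computing sub-module is used for computing the similarity between the current word vector and the preset word vector by using the following formula:

$$\cos\theta = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$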

Preferably, the determining module includes:

the first confirming sub-module is used for confirming whether the similarity is larger than or equal to a preset threshold value;

the second confirming sub-module is used for confirming whether the similarity is hundred percent or not when the first confirming sub-module confirms that the similarity is larger than or equal to the preset threshold value;

a fourth determining submodule, configured to determine, according to a preset medical entity corresponding to the preset word vector, a current medical entity corresponding to the current word vector if the second confirming submodule confirms that the similarity is one hundred percent;

the judging sub-module is used for judging whether the current word vector and the preset word vector meet preset conditions or not when the second confirming sub-module does not confirm that the similarity is hundred percent:

if yes, the fourth determining submodule determines a current medical entity corresponding to the current word vector according to the preset medical entity corresponding to the preset word vector;

and the prompting submodule is used for prompting that the current medical entity is not matched if the judging submodule judges that the current word vector and the preset word vector do not meet the preset condition.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and drawings.

The technical scheme of the invention is further described in detail through the drawings and the embodiments.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention together with the embodiments of the invention and do not constitute a limitation of the invention. In the drawings:

FIG. 1 is a workflow diagram of an entity linking method provided by the present invention;

FIG. 2 is another workflow diagram of an entity linking method provided by the present invention;

FIG. 3 is a block diagram of an entity linking device according to the present invention;

FIG. 4 is another block diagram of an entity linking device according to the present invention.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.

At present, existing entity linking methods generally obtain candidates through an N-gram algorithm, use a CRF to identify the semantic components of the entity to be linked among the candidates and the semantic components of the candidate standard terms to be matched, and finally obtain the standard term with the highest similarity by means of the synonym relations between semantic components in a knowledge graph. However, this approach has the following disadvantage: the semantic components that the CRF identifies for the entity to be linked are too uniform, so the analysis accuracy is too low and the medical entity either cannot be obtained effectively or is obtained incorrectly. To solve this problem, the present embodiment discloses a method that determines the medical term to be linked from a current medical text, obtains the current word vector of that term, compares it with a preset word vector to determine the current medical entity of the term, and links the term with that entity.

An entity linking method, as shown in fig. 1, comprises the following steps:

step S101, acquiring a current medical text, and determining medical terms to be linked from the current medical text;

step S102, obtaining a current word vector based on medical terms to be linked;

step S103, comparing the similarity of the current word vector and the preset word vector, and outputting the comparison similarity;

step S104, determining the current medical entity of the medical term to be linked according to the comparison similarity;

step S105, linking the medical term to be linked with the current medical entity;

In this embodiment, the current medical entity refers to the category to which the medical term to be linked currently belongs; common medical entities include medical history characteristics, hospitalization conditions, treatment items, diagnostic records, and the like. Linking the medical term to be linked with the current medical entity means classifying the term under the current medical entity, so that the term can be displayed simply by clicking the current medical entity on the platform.

The working principle of the technical scheme is as follows: firstly, acquiring a current medical text, determining medical terms to be linked from the current medical text, then acquiring a current word vector according to the medical terms to be linked, comparing the similarity of the current word vector and a preset word vector, determining a current medical entity of the medical terms to be linked according to the compared similarity, and finally linking the medical terms to be linked with the current medical entity.
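The five steps S101 to S105 can be sketched as a small driver function. This is a minimal sketch rather than the patent's implementation: the function name `link_entities` and its four pluggable callables are assumptions introduced for illustration, and the toy callables in the usage lines stand in for the knowledge graph lookup, the word vector model, and the similarity and decision logic.

```python
def link_entities(text, extract_terms, embed, compare, decide):
    """Run the five-step flow on one medical text.

    extract_terms(text) -> list of terms to be linked        (S101)
    embed(term)         -> current word vector               (S102)
    compare(vector)     -> {preset entity: similarity}       (S103)
    decide(similarities)-> chosen entity, or None            (S104)
    All four callables are hypothetical placeholders.
    """
    links = {}
    for term in extract_terms(text):
        vector = embed(term)
        similarities = compare(vector)
        entity = decide(similarities)
        if entity is not None:
            links[term] = entity  # S105: link the term to the entity
    return links


# Toy usage with made-up stand-ins for each stage.
links = link_entities(
    "some medical text",
    extract_terms=lambda text: ["term a", "term b"],
    embed=lambda term: [1.0] if term == "term a" else [0.0],
    compare=lambda vector: {"entity E": vector[0]},
    decide=lambda sims: max(sims, key=sims.get) if max(sims.values()) >= 0.8 else None,
)
```

Passing the stages in as callables keeps the flow itself independent of any particular model or knowledge graph, which matches the way the later embodiments refine each step separately.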

The beneficial effects of the technical scheme are as follows: the current medical entity is determined by comparing the word vector of the medical term to be linked with the preset word vector. Compared with the semantic components used in the prior art, word vectors are more diverse, so the analysis result is not limited to a single type, and the most suitable result is selected from several candidates as the current medical entity. This avoids the prior-art situation in which the semantic components that the CRF identifies for the entity to be linked are too uniform, so that the analysis accuracy is too low and the medical entity either cannot be obtained effectively or is obtained incorrectly, and accuracy is thereby improved.

In one embodiment, as shown in fig. 2, obtaining a current medical text, determining medical terms to be linked from the current medical text, includes:

step S201, extracting all first medical terms from the current medical text;

step S202, inputting a first medical term into a preset knowledge graph for retrieval;

step S203, medical terms to be linked are determined through retrieval.

The beneficial effects of the technical scheme are as follows: the medical terms to be linked are found without anyone having to consult historical medical texts one by one, which reduces the workload of medical staff, and retrieval in the knowledge graph is more accurate than manual lookup and exclusion.
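Steps S201 to S203 can be sketched as a membership check. The sketch assumes, following step 1 of the later embodiment, that a term needs linking when it is not already present in the knowledge graph; the function name `terms_to_link` and the use of a plain set for the knowledge graph's terms are illustrative assumptions.

```python
def terms_to_link(extracted_terms, knowledge_graph_terms):
    """Keep the extracted terms that the knowledge graph does not contain.

    extracted_terms       : first medical terms, in document order
    knowledge_graph_terms : set of terms already in the knowledge graph
    Duplicates are dropped while the original order is preserved.
    """
    seen = set()
    pending = []
    for term in extracted_terms:
        if term not in knowledge_graph_terms and term not in seen:
            seen.add(term)
            pending.append(term)
    return pending


# Made-up example: only "appendectomy" is already a standard term.
pending = terms_to_link(
    ["blepharotomy", "IABP implantation", "blepharotomy", "appendectomy"],
    {"appendectomy"},
)
```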

In one embodiment, obtaining the current word vector based on the medical terms to be linked includes:

the label score for each Chinese character in the medical term to be linked is calculated using the following formula:

$$\mathrm{score}(x, y) = \sum_{i=1}^{n} P_{x_i, y_i} + \sum_{i=1}^{n-1} A_{y_i, y_{i+1}}$$

wherein $x = (x_1, x_2, \ldots, x_n)$ represents the input sequence of characters in the medical term to be linked, $y = (y_1, y_2, \ldots, y_n)$ represents the output label sequence for those characters, $P_{x_i, y_i}$ represents the probability that the input is $x_i$ and the output is label $y_i$, and $A_{y_i, y_{i+1}}$ represents the probability of a transition from label $y_i$ to label $y_{i+1}$;

extracting n first semantic components of the current tag;

the word vector for each semantic component is determined to be the current word vector.

The beneficial effects of the technical scheme are as follows: the current word vector is determined by calculation, which improves the accuracy of the word vector and provides an accurate word vector for the subsequent comparison with the preset word vector. This in turn helps ensure that the current medical entity is determined accurately and avoids obtaining a wrong medical entity.
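Extracting semantic components from the selected label sequence can be sketched as grouping characters by their B/I tags. The sketch assumes a BIO-style tagging scheme like the one in the tag example of the later embodiment; the function name `extract_components`, the tag names `site` and `surgical`, and the Chinese term 眼睑切开术 (assumed here to be the original of "blepharotomy") are illustrative.

```python
def extract_components(chars, tags):
    """Group characters into (component type, text) spans from BIO tags.

    chars : the characters of the term, one per tag
    tags  : labels such as "B-site", "I-site" or "O" (assumed scheme)
    """
    components = []
    label, span = None, []
    for ch, tag in zip(chars, tags):
        prefix, _, tag_label = tag.partition("-")
        continues = prefix == "I" and tag_label == label
        if not continues and span:
            # The running span ends here; store it and start afresh.
            components.append((label, "".join(span)))
            label, span = None, []
        if prefix in ("B", "I"):
            label = tag_label
            span.append(ch)
    if span:
        components.append((label, "".join(span)))
    return components


components = extract_components(
    "眼睑切开术",
    ["B-site", "I-site", "B-surgical", "I-surgical", "I-surgical"],
)
```

Each resulting pair is one semantic component whose text would then be turned into a word vector by the preset model.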

In one embodiment, comparing the similarity of the current word vector and the preset word vector, and outputting the comparison similarity includes:

determining medical concepts corresponding to the medical terms to be linked;

retrieving all second medical terms related to the medical concept from a preset knowledge graph;

extracting a second semantic component in a second medical term;

training word vectors corresponding to the second semantic components by using a preset model;

determining word vectors corresponding to the second semantic components as preset word vectors;

the similarity between the current word vector and the preset word vector is calculated by using the following formula:

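$$\cos\theta = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$

wherein $\cos\theta$ is the similarity between the current word vector and the preset word vector, $a_1, a_2, \ldots, a_n$ are the n components of the current word vector, and $b_1, b_2, \ldots, b_n$ are the n components of the preset word vector.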

The beneficial effects of the technical scheme are as follows: comparing similarities makes it possible to determine whether the knowledge graph contains other medical terms of the same category as the medical term to be linked, and the current medical entity can then be determined from the similarity. A large number of medical entities therefore need not be searched through one by one to find the medical entity corresponding to the medical term to be linked, which saves time.
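The cosine similarity comparison can be sketched in a few lines. The function names `cosine_similarity` and `rank_candidates` are hypothetical, the vectors are plain Python lists rather than Bert outputs, and returning 0.0 for a zero vector is a convention chosen here, not something the patent specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_a * norm_b)


def rank_candidates(current, presets):
    """Sort preset vectors by similarity to the current vector, best first.

    presets maps a candidate name to its preset word vector.
    """
    scored = [(name, cosine_similarity(current, vec)) for name, vec in presets.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up two-dimensional vectors for illustration.
ranking = rank_candidates([1.0, 0.0], {"x": [1.0, 0.0], "y": [0.0, 1.0]})
```

Ranking all candidates first leaves the threshold and matching rules to a separate decision step.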

In one embodiment, determining the current medical entity of the medical term to be linked according to the comparative similarity comprises:

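confirming whether the similarity is greater than or equal to a preset threshold;

if yes, confirming whether the similarity is one hundred percent;

if yes, determining a current medical entity corresponding to the current word vector according to a preset medical entity corresponding to the preset word vector;

otherwise, judging whether the current word vector and the preset word vector meet a preset condition;

if the preset condition is met, determining the current medical entity corresponding to the current word vector according to the preset medical entity corresponding to the preset word vector;

otherwise, prompting that there is no matching current medical entity;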

In this embodiment, the preset threshold may be eighty percent, and the preset condition may be either of the following: (1) the current word vectors include all of the preset word vectors, and there are more current word vectors than preset word vectors; (2) some of the current word vectors differ in surface form from some of the preset word vectors but are upper-level terms (hypernyms) of those preset word vectors, and the remaining current word vectors include the remaining preset word vectors.

The beneficial effects of the technical scheme are as follows: the similarity is checked twice to establish how far the current word vector differs from the preset word vector, which further improves accuracy. If the two are identical, the preset medical entity corresponding to the preset word vector can be directly determined as the current medical entity of the medical term to be linked, which saves matching time and improves efficiency.
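The two-stage check can be sketched as a small decision function. The name `decide_entity` and its parameters are hypothetical; the default threshold of 0.8 follows the eighty percent mentioned in this embodiment, the preset condition is passed in as an already evaluated boolean, and treating a similarity below the threshold as "no match" is an assumption, since the text does not spell out that branch.

```python
import math


def decide_entity(similarity, preset_entity, meets_preset_condition, threshold=0.8):
    """Return the entity to link to, or None when nothing matches.

    similarity             : cosine similarity in [0, 1]
    preset_entity          : entity attached to the preset word vector
    meets_preset_condition : result of the preset-condition check
    """
    if similarity < threshold:
        return None  # assumed: below the threshold means no match
    if math.isclose(similarity, 1.0):
        return preset_entity  # identical vectors: take the preset entity
    if meets_preset_condition:
        return preset_entity
    return None  # caller prompts that there is no matching entity


exact = decide_entity(1.0, "entity E", False)
near = decide_entity(0.9, "entity E", True)
```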

In one embodiment, the method comprises:

step 1: extracting medical terms, including diseases, operations and the like, from the medical text, and determining the medical terms to be linked, namely those that are not present in the knowledge graph;

step 2: preprocessing the medical terms to be linked, and converting English components into corresponding Chinese. For example: "IABP implantation" is converted to "aortic balloon counterpulsation implantation";

step 3: for the preprocessed medical term obtained in step 2, semantic component analysis is performed using Bert+BiLSTM+CRF; for example, "blepharotomy" can be parsed into site-eyelid and surgical-incision. The idea of Bert+BiLSTM+CRF is that word vectors trained by Bert replace the Word2Vec vectors fed to the BiLSTM, the BiLSTM model computes the most probable label for the current character, and the CRF enforces a valid ordering among the labels by using transition features;

the predictive score formula is as follows:

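$$\mathrm{score}(x, y) = \sum_{i=1}^{n} P_{x_i, y_i} + \sum_{i=1}^{n-1} A_{y_i, y_{i+1}}$$

wherein $x = (x_1, x_2, \ldots, x_n)$ represents the input sequence of the BiLSTM and $y = (y_1, y_2, \ldots, y_n)$ represents the output tag sequence; $P_{x_i, y_i}$ represents the probability that, for input $x_i$, the softmax layer of the BiLSTM outputs label $y_i$, and $A_{y_i, y_{i+1}}$ represents the transition probability from label $y_i$ to label $y_{i+1}$;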

the tag sequence with the highest score is selected as the tag of the input sequence, for example:

eye (B-site) eyelid (I-site) cut (B-surgical) open (I-surgical) procedure (I-surgical), where each token is one character of the input term. The semantic components can then be extracted: site-eyelid, surgical-incision;

step 4: training each semantic component (such as eyelid, incision) obtained in step 3 into word vector by using Bert model, for example, eyelid can use m-dimensional word vector A= (a) ₁ ，a ₂ ，..，a _m ) A representation;

step 5: extracting semantic component types (such as parts, operation type and the like) analyzed in the 3 rd step corresponding to medical concepts (such as operations) to be linked with medical terms (such as palpebral incision) in the knowledge graph, and training word vectors by using a Bert model, wherein for example, the eyelid margin can be trained by using an m-dimensional word vector B= (B) ₁ ，b ₂ ，..，b _m ) A representation;

step 6: using cosine similarity, linking the semantic component A from step 4 to a semantic component B of the knowledge graph from step 5. The cosine similarity is given by:

$$\cos\theta = \frac{\sum_{i=1}^{m} a_i b_i}{\sqrt{\sum_{i=1}^{m} a_i^2}\,\sqrt{\sum_{i=1}^{m} b_i^2}}$$

If $\cos\theta > \xi$, where $\xi$ is a threshold, A is considered synonymous with B. If several B satisfy the condition, the B with the highest similarity is selected as the synonym of A. For example, if A is the site eyelid or the surgical attribute incision, the same site eyelid or surgical attribute incision can be found in the knowledge graph;

step 7: linking the medical entities in the corresponding knowledge maps of the entities to be linked based on the ontology reasoning logic;

Ontology reasoning logic means that the relationship between two entities is determined by the inclusion relations and upper-lower relations among the attributes behind the entities. Consider entity P (e.g., blepharotomy, which based on step 6 has the corresponding knowledge graph attribute components site-eyelid and surgical-incision) and entity Q (e.g., the surgical entity eyelid-margin incision in the knowledge graph, with the attributes site-eyelid margin and surgical-incision);

P is synonymous with Q if P and Q have the same number of attributes and the attributes are completely synonymous. Blepharotomy and eyelid-margin incision have the same number of attributes and the same surgical attribute, but the site eyelid is not synonymous with the site eyelid margin, so blepharotomy is not synonymous with eyelid-margin incision;

P is an upper entity of Q if one of the following two conditions is satisfied: (1) the attributes of P include all of the attributes of Q, and P has more attributes than Q; (2) some attributes of P have no synonymous attribute in Q but are upper attributes of attributes of Q, while the remaining attributes of P include the remaining attributes of Q.

The site eyelid of P is an upper attribute of the site eyelid margin of Q, and the remaining attribute of P, surgical-incision, includes the remaining attribute of Q, surgical-incision, so blepharotomy is an upper entity of eyelid-margin incision.

The beneficial effects of the technical scheme are as follows: applying the Bert+BiLSTM+CRF deep learning model to NER extracts more features from the text, so component analysis is more accurate. By means of ontology reasoning over the knowledge graph, entity linking is more accurate and more interpretable.
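The ontology reasoning of step 7 can be sketched as set comparisons over entity attributes. This is one possible reading of the synonym and upper-entity conditions, not the patent's implementation: attributes are modelled as (type, value) pairs, the synonym and upper-attribute relations are supplied as sets of pairs, and all function names and example values are hypothetical.

```python
def same_attr(a, b, synonyms):
    """Two attributes match when they are equal or listed as synonyms."""
    return a == b or (a, b) in synonyms or (b, a) in synonyms


def is_synonymous(p, q, synonyms=frozenset()):
    """Same number of attributes, and every attribute has a match."""
    if len(p) != len(q):
        return False
    return (all(any(same_attr(a, b, synonyms) for b in q) for a in p)
            and all(any(same_attr(a, b, synonyms) for a in p) for b in q))


def is_upper_entity(p, q, hypernyms, synonyms=frozenset()):
    """Check the two conditions under which P is an upper entity of Q.

    hypernyms holds (upper attribute, lower attribute) pairs.
    """
    p_only = [a for a in p if not any(same_attr(a, b, synonyms) for b in q)]
    q_only = [b for b in q if not any(same_attr(a, b, synonyms) for a in p)]
    # Condition 1: P contains every attribute of Q and has more attributes.
    if not q_only and len(p) > len(q):
        return True
    # Condition 2: unmatched attributes of P are upper attributes of Q's,
    # and every unmatched attribute of Q is covered that way.
    if not p_only:
        return False
    return (all(any((a, b) in hypernyms for b in q_only) for a in p_only)
            and all(any((a, b) in hypernyms for a in p_only) for b in q_only))


# Illustrative attribute sets for the eyelid example.
P = {("site", "eyelid"), ("surgical", "incision")}
Q = {("site", "eyelid margin"), ("surgical", "incision")}
hypernyms = {(("site", "eyelid"), ("site", "eyelid margin"))}
p_is_upper = is_upper_entity(P, Q, hypernyms)
```

Keeping the relations as explicit pairs makes each linking decision traceable to a specific attribute, which is the interpretability benefit the embodiment describes.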

The embodiment also discloses an entity linking device, as shown in fig. 3, which comprises:

an acquisition module 301, configured to acquire a current medical text, and determine a medical term to be linked from the current medical text;

an obtaining module 302, configured to obtain a current word vector based on the medical terms to be linked;

the comparison module 303 is configured to compare the similarity between the current word vector and the preset word vector, and output a comparison similarity;

a determining module 304, configured to determine a current medical entity of the medical terms to be linked according to the comparison similarity;

a linking module 305 for linking the medical term to be linked with the current medical entity.

In one embodiment, as shown in fig. 4, the acquisition module includes:

a first extraction sub-module 3011, configured to extract all first medical terms from the current medical text;

the first retrieval submodule 3012 is used for inputting a first medical term into a preset knowledge graph for retrieval;

a first determination submodule 3013 for determining the medical terms to be linked by retrieval.

In one embodiment, the obtaining module comprises:

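a first calculation sub-module for calculating the label score of each Chinese character in the medical term to be linked using the following formula:

$$\mathrm{score}(x, y) = \sum_{i=1}^{n} P_{x_i, y_i} + \sum_{i=1}^{n-1} A_{y_i, y_{i+1}}$$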

the selecting sub-module is used for selecting the output sequence with the highest score as the current label of the medical term to be linked;

In one embodiment, the comparison module comprises:

a third determining submodule, configured to determine medical concepts corresponding to medical terms to be linked;

the second retrieval sub-module is used for retrieving all second medical terms related to the medical concept from a preset knowledge graph;

a third extraction sub-module for extracting a second semantic component in a second medical term;

the second training submodule is used for training word vectors corresponding to the second semantic components by using a preset model;

the third determining submodule is used for determining word vectors corresponding to the second semantic components as preset word vectors;

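the second computing sub-module is used for computing the similarity between the current word vector and the preset word vector by using the following formula:

$$\cos\theta = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$

wherein $\cos\theta$ is the similarity between the current word vector and the preset word vector, $a_1, a_2, \ldots, a_n$ are the n components of the current word vector, and $b_1, b_2, \ldots, b_n$ are the n components of the preset word vector.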

In one embodiment, the determining module includes:

the second confirmation sub-module is used for confirming whether the similarity is hundred percent or not if the similarity confirmed by the first confirmation sub-module is larger than or equal to a preset threshold value;

a fourth determining submodule, configured to determine, according to a preset medical entity corresponding to the preset word vector, a current medical entity corresponding to the current word vector if the second confirming sub-module confirms that the similarity is one hundred percent;

the judging sub-module is used for judging whether the current word vector and the preset word vector meet the preset condition or not when the second confirming sub-module does not confirm that the similarity is hundred percent:

if yes, the fourth determining submodule determines the current medical entity corresponding to the current word vector according to the preset medical entity corresponding to the preset word vector;

and the prompting sub-module is used for prompting that the current medical entity is not matched when the judging sub-module judges that the current word vector and the preset word vector do not meet the preset conditions.

It will be appreciated by those skilled in the art that the first and second aspects of the present invention refer to different phases of application.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

1. An entity linking method, comprising the steps of:

obtaining a current word vector based on the medical terms to be linked;

linking the medical term to be linked with the current medical entity;

the comparing the similarity of the current word vector and the preset word vector, and outputting the comparison similarity, comprises the following steps:

determining medical concepts corresponding to the medical terms to be linked;

extracting a second semantic component in the second medical term;

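$$\cos\theta = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$；

wherein $\cos\theta$ is the similarity of the current word vector and the preset word vector, $a_1, a_2, \ldots, a_n$ are the n components of the current word vector, and $b_1, b_2, \ldots, b_n$ are the n components of the preset word vector;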

the obtaining the current medical text, determining the medical term to be linked from the current medical text, comprises the following steps:

extracting all first medical terms from the current medical text;

inputting the first medical term into a preset knowledge graph for retrieval;

determining the medical term to be linked through retrieval;

the obtaining the current word vector based on the medical term to be linked comprises the following steps:

$$\mathrm{score}(x, y) = \sum_{i=1}^{n} P_{x_i, y_i} + \sum_{i=1}^{n-1} A_{y_i, y_{i+1}}$$；

wherein $x = (x_1, x_2, \ldots, x_n)$ represents the input sequence of characters in the medical term to be linked, $y = (y_1, y_2, \ldots, y_n)$ represents the output label sequence for those characters, $P_{x_i, y_i}$ represents the probability that the input is $x_i$ and the output is label $y_i$, $A_{y_i, y_{i+1}}$ represents the probability of a transition from label $y_i$ to label $y_{i+1}$, n represents the number of labels, and i represents the ith label;

extracting n first semantic components of the current tag;

2. The entity linking method according to claim 1, wherein said determining the current medical entity of the medical term to be linked according to the comparative similarity comprises:

if yes, confirming whether the similarity is hundred percent;

otherwise, prompting that there is no matching current medical entity.

3. An entity linking apparatus, comprising:

a linking module for linking the medical term to be linked with the current medical entity;

the comparison module comprises:


the acquisition module comprises:

a first determination sub-module for determining the medical terms to be linked by retrieving;

the obtaining module comprises:


4. The entity-linking apparatus of claim 3, wherein the determining module comprises: