CN113919356A - Method, device, storage medium and electronic equipment for identifying medical entity - Google Patents

Method, device, storage medium and electronic equipment for identifying medical entity

Info

Publication number
CN113919356A
CN113919356A
Authority
CN
China
Prior art keywords
entity
medical
sample
model
word vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111229618.1A
Other languages
Chinese (zh)
Inventor
孙小婉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp filed Critical Neusoft Corp
Priority to CN202111229618.1A priority Critical patent/CN113919356A/en
Publication of CN113919356A publication Critical patent/CN113919356A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H70/00ICT specially adapted for the handling or processing of medical references
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Abstract

The present disclosure relates to a method, an apparatus, a storage medium, and an electronic device for identifying a medical entity. The method includes: acquiring a medical text to be recognized; inputting the medical text to be recognized into a trained medical entity recognition model to obtain an entity category recognized from the medical text to be recognized and the character starting position and character ending position, in the medical text to be recognized, of the corresponding medical entity under that entity category, wherein the medical entity recognition model includes a coding sub-model and an entity recognition sub-model, the coding sub-model is trained on a sample medical text and a sample question text corresponding to the sample medical text, and the entity recognition sub-model is trained on the feature vector of the sample medical text output by the coding sub-model; and determining the medical entity from the medical text to be recognized according to the starting position and the ending position. The method of the present disclosure can improve the accuracy of medical entity recognition.

Description

Method, device, storage medium and electronic equipment for identifying medical entity
Technical Field
The present disclosure relates to the field of entity identification technologies, and in particular, to a method, an apparatus, a storage medium, and an electronic device for identifying a medical entity.
Background
Medical texts are highly specialized and abstract. In addition to common basic entities, such as short entities like "part", "disease", and "medication", medical texts also contain complex, abstract long entities, such as "surgical approach", "surgical operation", and "surgical margin" in surgical record texts.
In the related art, an LSTM model and a CRF model are used jointly to identify medical named entities; however, this identification method has low accuracy.
Disclosure of Invention
An object of the present disclosure is to provide a method, an apparatus, a storage medium, and an electronic device for identifying a medical entity, so as to solve the problems in the related art.
To achieve the above object, a first aspect of the embodiments of the present disclosure provides a method of identifying a medical entity, the method including:
acquiring a medical text to be identified;
inputting the medical text to be recognized into a trained medical entity recognition model to obtain an entity category recognized from the medical text to be recognized and the character starting position and character ending position, in the medical text to be recognized, of the corresponding medical entity under that entity category, wherein the medical entity recognition model includes a coding sub-model and an entity recognition sub-model, the coding sub-model is trained on a sample medical text and a sample question text corresponding to the sample medical text, and the entity recognition sub-model is trained on the feature vector of the sample medical text output by the coding sub-model;
and determining the medical entity from the medical text to be recognized according to the starting position and the ending position.
Optionally, the entity recognition sub-model is configured to:
after the coding sub-model outputs the feature vector of the medical text to be recognized, determining, for each word vector in the feature vector of the medical text to be recognized, whether the word vector is a starting word vector corresponding to any entity category and whether the word vector is an ending word vector corresponding to any entity category;
for each entity category, determining a set of head-tail vector position pairs, each consisting of the position of a starting word vector and the position of an ending word vector, according to the position of each starting word vector of that entity category in the feature vector of the medical text to be recognized and the position of each ending word vector of that entity category in the feature vector of the medical text to be recognized;
and judging, for each head-tail vector position pair, whether the position of the starting word vector and the position of the ending word vector in the pair represent the positions of the head and tail characters of the same medical entity under that entity category.
Optionally, in the training process of the medical entity recognition model, the coding sub-model to be trained is used for:
coding the sample medical text representing a single sample medical entity and the sample question text corresponding to the single sample medical entity to obtain a sample feature vector;
and deleting the sample question text feature vector corresponding to the sample question text from the sample feature vector to obtain the feature vector of the sample medical text to be input into the entity recognition sub-model to be trained.
Optionally, the entity recognition sub-model includes a start character multi-classifier, an end character multi-classifier, and an entity prediction sub-model; in the training process of the medical entity recognition model, the entity recognition sub-model to be trained is configured to:
for each sample word vector in the feature vector of the sample medical text, inputting the sample word vector into the start character multi-classifier to be trained to obtain a result representing whether the sample word vector is a starting word vector and, in the case that the sample word vector is a starting word vector, the entity category corresponding to the sample word vector; and
for each sample word vector in the feature vector of the sample medical text, inputting the sample word vector into the end character multi-classifier to be trained to obtain a result representing whether the sample word vector is an ending word vector and, in the case that the sample word vector is an ending word vector, the entity category corresponding to the sample word vector;
for each entity category, determining a head-tail vector position pair set consisting of the position of the start word vector and the position of the end word vector according to the position of each start word vector of the entity category in the feature vector of the sample medical text and the position of each end word vector in the feature vector of the sample medical text;
and inputting each head-tail vector position pair into the entity prediction sub-model to be trained to obtain an output result representing whether the position of the initial word vector and the position of the end word vector in the head-tail vector position pair represent the positions of head-tail characters of the same medical entity under the entity category.
Optionally, in the training process of the medical entity recognition model, the method further includes:
calculating first loss information according to an output result of the start character multi-classifier to be trained and a real label representing whether the sample word vector is the first character of the single sample medical entity;
calculating second loss information according to the output result of the end character multi-classifier to be trained and a real label representing whether the sample word vector is the last character of the single sample medical entity;
calculating third loss information according to an output result of the entity prediction sub-model to be trained and a real label for representing whether the head-tail vector position represents the head-tail character position of the single sample medical entity;
and obtaining the trained medical entity recognition model when the weighted sum of the first loss information, the second loss information and the third loss information is minimum.
Optionally, the coding sub-model to be trained is a pre-trained BERT model.
Optionally, the sample question text is constructed based on noun interpretation of entity categories corresponding to the single sample medical entity.
According to a second aspect of embodiments of the present disclosure there is provided an apparatus for identifying a medical entity, the apparatus comprising:
the acquisition module is used for acquiring a medical text to be recognized;
the input module is used for inputting the medical text to be recognized into a trained medical entity recognition model to obtain an entity category recognized from the medical text to be recognized and the character starting position and character ending position, in the medical text to be recognized, of the corresponding medical entity under that entity category, wherein the medical entity recognition model includes a coding sub-model and an entity recognition sub-model, the coding sub-model is trained on a sample medical text and a sample question text corresponding to the sample medical text, and the entity recognition sub-model is trained on the feature vector of the sample medical text output by the coding sub-model;
and the execution module is used for determining the medical entity from the medical text to be recognized according to the starting position and the ending position.
Optionally, the entity recognition sub-model is configured to:
after the coding sub-model outputs the feature vector of the medical text to be recognized, determining, for each word vector in the feature vector of the medical text to be recognized, whether the word vector is a starting word vector corresponding to any entity category and whether the word vector is an ending word vector corresponding to any entity category; for each entity category, determining a set of head-tail vector position pairs, each consisting of the position of a starting word vector and the position of an ending word vector, according to the position of each starting word vector of that entity category in the feature vector of the medical text to be recognized and the position of each ending word vector of that entity category in the feature vector of the medical text to be recognized; and judging, for each head-tail vector position pair, whether the position of the starting word vector and the position of the ending word vector in the pair represent the positions of the head and tail characters of the same medical entity under that entity category.
Optionally, in the training process of the medical entity recognition model, the coding sub-model to be trained is used for:
coding the sample medical text representing a single sample medical entity and the sample question text corresponding to the single sample medical entity to obtain a sample feature vector; and deleting the sample question text feature vector corresponding to the sample question text from the sample feature vector to obtain the feature vector of the sample medical text to be input into the entity recognition sub-model to be trained.
Optionally, the entity recognition sub-model includes a start character multi-classifier, an end character multi-classifier, and an entity prediction sub-model; in the training process of the medical entity recognition model, the entity recognition sub-model to be trained is configured to:
for each sample word vector in the feature vector of the sample medical text, inputting the sample word vector into the start character multi-classifier to be trained to obtain a result representing whether the sample word vector is a starting word vector and, in the case that the sample word vector is a starting word vector, the entity category corresponding to the sample word vector; for each sample word vector in the feature vector of the sample medical text, inputting the sample word vector into the end character multi-classifier to be trained to obtain a result representing whether the sample word vector is an ending word vector and, in the case that the sample word vector is an ending word vector, the entity category corresponding to the sample word vector; for each entity category, determining a set of head-tail vector position pairs, each consisting of the position of a starting word vector and the position of an ending word vector, according to the position of each starting word vector of that entity category in the feature vector of the sample medical text and the position of each ending word vector of that entity category in the feature vector of the sample medical text; and inputting each head-tail vector position pair into the entity prediction sub-model to be trained to obtain an output result representing whether the position of the starting word vector and the position of the ending word vector in the pair represent the positions of the head and tail characters of the same medical entity under that entity category.
Optionally, the apparatus further includes a calculating module configured, during the training of the medical entity recognition model, to calculate first loss information according to the output result of the start character multi-classifier to be trained and a real label representing whether the sample word vector is the first character of the single sample medical entity; calculate second loss information according to the output result of the end character multi-classifier to be trained and a real label representing whether the sample word vector is the last character of the single sample medical entity; calculate third loss information according to the output result of the entity prediction sub-model to be trained and a real label representing whether the head-tail vector position pair represents the positions of the head and tail characters of the single sample medical entity; and obtain the trained medical entity recognition model when the weighted sum of the first loss information, the second loss information, and the third loss information is minimum.
Optionally, the coding sub-model to be trained is a pre-trained BERT model.
Optionally, the sample question text is constructed based on noun interpretation of entity categories corresponding to the single sample medical entity.
A third aspect of the embodiments of the present disclosure provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method according to the first aspect.
A fourth aspect of the embodiments of the present disclosure provides an electronic apparatus, including:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the method according to the first aspect.
By adopting the above technical solution, at least the following beneficial technical effects can be achieved:
the medical entity identification method comprises the steps of obtaining a medical text to be identified, inputting the medical text to be identified into a trained medical entity identification model, obtaining an entity category identified from the medical text to be identified and a character starting position and a character ending position of a corresponding medical entity under the entity category in the medical text to be identified, and determining the medical entity from the medical text to be identified according to the starting position and the ending position. The coding sub-model in the medical entity recognition model is obtained by training according to the sample medical text and the sample problem text corresponding to the sample medical text, and the sample problem text can assist the coding sub-model to better understand the semantics of the sample medical text in the training process, so that the coding sub-model can learn the capability of coding the medical entities expressed by the same meaning (the same entity category) and different characters in the sample medical text into the same or similar feature vectors on the basis of understanding the semantics of the sample medical text. Therefore, the coding sub-model obtained by training in the mode can code the medical text to be recognized into more accurate characteristic vectors, and the entity recognition sub-model can decode the character starting position and the character ending position corresponding to the medical long entity/medical short entity more accurately according to the accurate characteristic vectors, so that the accurate medical entity is obtained. Therefore, by adopting the technical scheme disclosed by the invention, the accuracy of medical entity identification can be improved.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
fig. 1 is a flow chart illustrating a method of identifying a medical entity according to an exemplary embodiment of the present disclosure.
FIG. 2 is a diagram illustrating a medical entity recognition model architecture to be trained in accordance with an exemplary embodiment of the present disclosure.
Fig. 3 is a block diagram illustrating an apparatus for identifying a medical entity according to an exemplary embodiment of the present disclosure.
Fig. 4 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present disclosure.
Fig. 5 is a block diagram illustrating another electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
In the related art, an LSTM model and a CRF model are used jointly to identify medical named entities. The LSTM model is a Long Short-Term Memory model, and the CRF model is a Conditional Random Field model.
Because the LSTM model is trained only on the sample medical text, it may encode the same entity category expressed by different characters into feature vectors with large differences. As a result, the feature vector of a single entity may be split into several sub-vectors by the CRF model during word segmentation, so that a long entity is broken apart and the broken long entity cannot be identified. The accuracy of this entity identification method in the related art is therefore low.
In view of this, the present disclosure provides a method, an apparatus, a storage medium, and an electronic device for identifying a medical entity, so as to improve accuracy of medical entity identification.
Fig. 1 is a flow chart illustrating a method of identifying a medical entity according to an exemplary embodiment of the present disclosure. As shown in fig. 1, the method of identifying a medical entity comprises the steps of:
and S11, acquiring the medical text to be recognized.
The medical text to be recognized includes one or more medical entities. In a possible case, the medical entity may not exist in the medical text to be recognized, and in this case, the medical entity recognition model may output a result representing that the medical entity does not exist.
S12, inputting the medical text to be recognized into the trained medical entity recognition model, and obtaining the entity category recognized from the medical text to be recognized and the corresponding character starting position and ending position of the medical entity under the entity category in the medical text to be recognized.
The medical entity recognition model includes a coding sub-model and an entity recognition sub-model. The coding sub-model is trained on a sample medical text and a sample question text corresponding to the sample medical text, and the entity recognition sub-model is trained on the feature vector of the sample medical text output by the coding sub-model.
In one possible embodiment, the coding sub-model to be trained is a pre-trained BERT model.
It is worth explaining that the BERT model is a transfer learning model. The pre-trained BERT model can be obtained by multi-task learning (including a masked language model training task and a next sentence prediction task) on the basis of a bidirectional deep Transformer network. The pre-trained BERT model employed in the present disclosure may be a pre-trained BERT model already disclosed in the related art.
Transfer training is performed on the pre-trained BERT model according to the sample medical text and the sample question text corresponding to the sample medical text, so that the pre-trained BERT model is adapted to the medical text data set. Training the pre-trained BERT model on the sample medical texts and their corresponding sample question texts in this way is an unsupervised training method.
It should be noted that the principle of training the coding sub-model on the sample medical text and the corresponding sample question text is similar to that of machine reading comprehension in the related art. Machine Reading Comprehension (MRC) is a technique that uses algorithms to enable a computer to understand the semantics of a passage and answer related questions; that is, it gives a machine the ability to understand natural language and answer a given question based on a given context.
The coding sub-model is trained on the sample medical text and the sample question text corresponding to the sample medical text, and its coding space is better suited to encoding the set of medical entities than a coding space obtained by training only on the sample medical text.
S13, determining the medical entity from the medical text to be recognized according to the starting position and the ending position.
When the character starting position and character ending position of a medical entity in the medical text to be recognized are known, the text segment representing the medical entity can be determined from the medical text to be recognized according to the starting position and ending position of the characters.
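Step S13 amounts to simple substring extraction once the boundary positions are known. The following minimal sketch assumes inclusive character indices and uses a helper function named here for illustration only; it is not code from the patent.

```python
def extract_entities(text, spans):
    """Recover entity strings from (entity_category, start, end) predictions.

    Assumes `start` and `end` are inclusive character indices into `text`, i.e. the
    character starting position and character ending position described above.
    """
    return [(category, text[start:end + 1]) for category, start, end in spans]

# Hypothetical usage; the category name and indices below are illustrative only.
print(extract_entities("左半肝切除术", [("resection mode", 0, 5)]))
# [('resection mode', '左半肝切除术')]
```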
With the method of the present disclosure, the medical text to be recognized is acquired and input into the trained medical entity recognition model to obtain the entity category recognized from the medical text to be recognized and the character starting position and character ending position, in the medical text to be recognized, of the corresponding medical entity under that entity category, and the medical entity can be determined from the medical text to be recognized according to the starting position and the ending position. The coding sub-model in the medical entity recognition model is trained on a sample medical text and a sample question text corresponding to the sample medical text. During training, the sample question text helps the coding sub-model better understand the semantics of the sample medical text, so that the coding sub-model can learn, on the basis of understanding those semantics, to encode medical entities that express the same meaning (the same entity category) with different characters into identical or similar feature vectors. A coding sub-model trained in this way can therefore encode the medical text to be recognized into more accurate feature vectors, and the entity recognition sub-model can, from these accurate feature vectors, more accurately decode the character starting position and character ending position of a long or short medical entity, thereby obtaining an accurate medical entity. Consequently, the technical solution of the present disclosure can improve the accuracy of medical entity recognition.
Optionally, in the above step S12, the entity recognition sub-model is configured to perform the following steps:
S121, after the coding sub-model outputs the feature vector of the medical text to be recognized, determining, for each word vector in the feature vector of the medical text to be recognized, whether the word vector is a starting word vector corresponding to any entity category and whether the word vector is an ending word vector corresponding to any entity category.
And inputting the medical text to be recognized into the coding sub-model to obtain the feature vector of the medical text to be recognized output by the coding sub-model. It is easy to understand that the feature vector of the medical text to be recognized includes a word vector of each word in the medical text to be recognized.
For each word vector in the feature vector of the medical text to be recognized, the entity recognition sub-model determines whether the word vector is a starting word vector corresponding to any entity category and whether it is an ending word vector corresponding to any entity category.
Here, any entity category refers to any one or more entity categories. Illustratively, if the medical text to be recognized is "left hemihepatectomy", the word vector corresponding to the character "left" may be the starting word vector of an orientation-category entity in "left hemihepatectomy", or the starting word vector of the resection-mode-category entity "left hemihepatectomy".
Specifically, for each word vector in the feature vector of the medical text to be recognized, the entity recognition sub-model determines whether the word vector is a starting word vector and, if so, which one or more entity categories that starting word vector corresponds to. Similarly, for each word vector in the feature vector of the medical text to be recognized, the entity recognition sub-model determines whether the word vector is an ending word vector and, if so, which one or more entity categories that ending word vector corresponds to.
And S122, for each entity category, determining a head-tail vector position pair set consisting of the position of the start word vector and the position of the end word vector according to the position of each start word vector of the entity category in the feature vector of the medical text to be identified and the position of each end word vector in the feature vector of the medical text to be identified.
In specific implementation, for each entity category, a set of head-tail vector position pairs, each consisting of the position of a starting word vector and the position of an ending word vector, is determined according to the position of each starting word vector and the position of each ending word vector of that entity category in the feature vector of the medical text to be recognized; the set may include zero, one, or multiple head-tail vector position pairs. For example, assume that an entity category has starting word vectors A, B, and C and ending word vectors X and Y. The positions of A, B, and C in the feature vector of the medical text to be recognized are a, b, and c, respectively, and the positions of X and Y are x and y, respectively. In the case where the feature vector of the medical text to be recognized is represented as a matrix, a, b, c, x, and y each identify a certain column (or row) of the matrix. From these positions, the head-tail vector position pairs are determined to be (a, x), (a, y), (b, x), (b, y), (c, x), and (c, y).
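The pair enumeration in the example above can be sketched as follows. This is an illustrative reconstruction rather than code from the patent: per the description, every starting position of a category is paired with every ending position of the same category, and the entity prediction sub-model later decides which pairs are real entities.

```python
from itertools import product

def head_tail_pairs(start_positions, end_positions):
    """Form the set of head-tail vector position pairs for one entity category."""
    return list(product(start_positions, end_positions))

# Start positions a, b, c and end positions x, y yield six candidate pairs:
print(head_tail_pairs(["a", "b", "c"], ["x", "y"]))
# [('a', 'x'), ('a', 'y'), ('b', 'x'), ('b', 'y'), ('c', 'x'), ('c', 'y')]
```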
S123, judging whether the position of the initial word vector and the position of the end word vector in the head-tail vector position pair represent the positions of head-tail characters of the same medical entity under the entity category or not aiming at each head-tail vector position pair.
And judging whether the position of the initial word vector and the position of the ending word vector in each head-tail vector position pair represent the positions of head-tail characters of the same medical entity under the corresponding entity category or not. It is worth explaining here that the same entity class corresponds to a plurality of different medical entities. For example, in the following embodiments, the entity category in table 2 is "resection mode", and the entity category may correspond to the medical entity "left hemihepatectomy", the medical entity "caudate lobe resection", and the like.
When any head-tail vector position pair is determined to represent the positions of the head and tail characters of the same medical entity, the character starting position of the medical entity, corresponding to the position of the starting word vector in that pair, and the character ending position of the medical entity, corresponding to the position of the ending word vector in that pair, are output.
With the entity recognition sub-model of the present disclosure, it is possible to determine, for each word vector, whether it is a starting word vector and which one or more entity categories it corresponds to, and likewise whether it is an ending word vector and which one or more entity categories it corresponds to. Further, for each head-tail vector position pair under any entity category, it is judged whether the pair marks the head and tail positions of the same medical entity under that category. In this way, all long entities and short entities in the medical text to be recognized can be identified accurately; for example, a long entity and a short entity nested inside it can both be identified, and multiple medical entities sharing the same starting character can be identified.
Optionally, in the training process of the medical entity recognition model, the coding sub-model to be trained is used for:
coding the sample medical text representing a single sample medical entity and the sample question text corresponding to the single sample medical entity to obtain a sample feature vector; and deleting the sample question text feature vector corresponding to the sample question text from the sample feature vector to obtain the feature vector of the sample medical text to be input into the entity recognition sub-model to be trained.
Illustratively, referring to FIG. 2, a sample medical text (x1, x2, …) and a sample question text (q1, q2, …) are input into the coding sub-model to be trained, which encodes the sample medical text and the sample question text to obtain a sample feature vector E as shown in FIG. 2. The sample question text feature vector corresponding to the sample question text is deleted from the sample feature vector E, resulting in the feature vector E' of the sample medical text as shown in FIG. 2.
It should be noted that, when the medical entity recognition model is applied, only the medical text to be recognized is input into the coding sub-model; therefore, during application the coding sub-model does not perform the feature-vector deletion step used in training.
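A rough sketch of the encode-then-delete step is given below, assuming a HuggingFace-style pre-trained Chinese BERT is used as the coding sub-model. The library, the model name, and the use of token_type_ids to locate and drop the question segment are assumptions made for illustration; the patent itself only specifies that the question part of the sample feature vector E is deleted to obtain E'.

```python
import torch
from transformers import BertModel, BertTokenizerFast  # assumed tooling, not specified in the patent

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

def encode_sample(sample_question_text, sample_medical_text):
    """Encode [question; medical text] jointly, then keep only the medical-text vectors."""
    enc = tokenizer(sample_question_text, sample_medical_text, return_tensors="pt")
    E = encoder(**enc).last_hidden_state[0]        # sample feature vector E
    # Segment id 0 covers the question (and its special tokens); 1 covers the medical text.
    keep = enc["token_type_ids"][0] == 1
    E_prime = E[keep]                              # feature vector E' of the sample medical text
    return E_prime                                 # (the trailing [SEP] could also be dropped)
```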
Optionally, the sample question text is constructed based on noun interpretation of entity categories corresponding to the single sample medical entity.
Here, a single sample medical entity means that the sample is a text sample corresponding to one medical entity.
In one possible embodiment, for a single sample medical entity, the corresponding sample question text may be constructed according to the noun interpretation of the entity category (i.e., entity concept/entity name) of the single medical entity. Illustratively, the entity categories and corresponding question texts are shown in table 1 below.
Entity category | Question text
Surgical margin | Find the entity in the text that denotes the margin of the surgically excised tissue.
Resection mode | Find the surgical resection mode in the text.
TABLE 1
In another possible embodiment, for a single sample medical entity, the corresponding sample question text may be constructed jointly from the noun explanation of the entity category (i.e., entity concept/entity name) of the single medical entity and the text expression features of the sample medical text characterizing the single sample medical entity. Illustratively, as shown in table 2 below:
[Table content presented as an image in the original publication: example entity categories and the question texts constructed from their definitions together with the text expression features of the sample medical texts.]
TABLE 2
Optionally, the entity recognition sub-model includes a start character multi-classifier, an end character multi-classifier, and an entity prediction sub-model; in the training process of the medical entity recognition model, the entity recognition sub-model to be trained is configured to perform the following steps:
step one, aiming at each sample word vector in the feature vectors of the sample medical texts, inputting the sample word vector into the initial character multi-classifier to be trained to obtain a result representing whether the sample word vector is an initial word vector or not, and the entity category corresponding to the sample word vector under the condition that the sample word vector is the initial word vector.
For example, as shown in fig. 2, the start character multi-classifier and the end character multi-classifier may both adopt a softmax classifier, and each entity classification category y of the start character multi-classifier and the end character multi-classifier satisfies y ∈ Y, where Y denotes the list of all medical entity categories in the softmax classifier.
In specific implementation, for each sample word vector in the feature vector of the sample medical text, the entity recognition sub-model inputs the sample word vector into the start character multi-classifier to be trained for classification, so as to obtain a classification result indicating that the sample word vector is a starting word vector of one or more entity categories, or a classification result indicating that the sample word vector is not a starting word vector.
Step two, for each sample word vector in the feature vector of the sample medical text, inputting the sample word vector into the end character multi-classifier to be trained to obtain a result representing whether the sample word vector is an ending word vector and, in the case that the sample word vector is an ending word vector, the entity category corresponding to the sample word vector.
In an example, for each sample word vector in the feature vector of the sample medical text, the entity recognition sub-model inputs the sample word vector into the end character multi-classifier to be trained for classification, so as to obtain a classification result indicating that the sample word vector is an ending word vector of one or more entity categories, or a classification result indicating that the sample word vector is not an ending word vector.
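As one way to picture the two multi-classifiers, the sketch below uses a per-token linear layer followed by softmax over the entity categories plus an extra "not a start (or end)" class, with argmax decoding to recover positions. The layer shape, the extra class, and the decoding rule are assumptions for illustration and are not specified in the patent.

```python
import torch
import torch.nn as nn

class BoundaryClassifier(nn.Module):
    """Per-token multi-classifier, usable as either the start or the end character classifier."""
    def __init__(self, hidden_size, num_categories):
        super().__init__()
        # Class 0 means "not a starting (or ending) word vector"; classes 1..num_categories
        # correspond to the medical entity categories in Y.
        self.linear = nn.Linear(hidden_size, num_categories + 1)

    def forward(self, token_features):                 # (seq_len, hidden_size)
        return torch.softmax(self.linear(token_features), dim=-1)

def positions_for_category(probs, category_index):
    """Positions whose predicted class equals the given entity category."""
    labels = probs.argmax(dim=-1)                      # predicted class per token
    return [i for i, label in enumerate(labels.tolist()) if label == category_index]
```

Where a single character may start (or end) entities of more than one category, a per-category probability threshold could be used instead of the argmax decoding shown here.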
And step three, for each entity category, determining a head-tail vector position pair set consisting of the position of the start word vector and the position of the end word vector according to the position of each start word vector of the entity category in the feature vector of the sample medical text and the position of each end word vector of the entity category in the feature vector of the sample medical text.
For example, assume that an entity category has starting word vectors A, B, and C and ending word vectors X and Y. The positions of A, B, and C in the feature vector of the sample medical text are a, b, and c, respectively, and the positions of X and Y are x and y, respectively. In the case where the feature vector of the sample medical text is represented as a matrix, the positions a, b, c, x, and y each identify a certain column (or row) of the matrix. From these positions, the head-tail vector position pairs are determined to be (a, x), (a, y), (b, x), (b, y), (c, x), and (c, y).
In an implementable embodiment, the position of any starting word vector in the feature vector of the sample medical text is determined by a formula (presented as an image in the original publication) in which one symbol denotes the starting word vector in the i-th row (or column) of the matrix formed by the starting word vectors, and another symbol denotes the position (or position index) of that starting word vector. Similarly, the position of any ending word vector in the feature vector of the sample medical text is determined by a corresponding formula in which one symbol denotes the ending word vector in the j-th row (or column) of the matrix formed by the ending word vectors, and another symbol denotes the position (or position index) of that ending word vector.
Step four, inputting each head-tail vector position pair into the entity prediction sub-model to be trained to obtain an output result representing whether the position of the starting word vector and the position of the ending word vector in the pair represent the positions of the head and tail characters of the same medical entity under that entity category.
Illustratively, the head-tail vector position pair (a, x) is input into the entity prediction sub-model to be trained to obtain an output result representing whether the position a of the starting word vector and the position x of the ending word vector in the pair (a, x) represent the positions of the head and tail characters of the same medical entity under the corresponding entity category. If they do, the starting word vector A and the ending word vector X corresponding to the pair (a, x) are the encoding vectors of the head and tail characters of the same entity.
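One plausible realization of the entity prediction sub-model is a binary classifier over the concatenated starting and ending word vectors of a candidate pair, as sketched below. The concatenation-plus-linear design is an assumption made for illustration; the patent only states that each head-tail vector position pair is classified as bounding, or not bounding, a single medical entity.

```python
import torch
import torch.nn as nn

class SpanPredictor(nn.Module):
    """Decide whether a (start position, end position) pair bounds one medical entity."""
    def __init__(self, hidden_size):
        super().__init__()
        self.scorer = nn.Linear(2 * hidden_size, 2)   # classes: not an entity span / entity span

    def forward(self, start_vector, end_vector):
        pair = torch.cat([start_vector, end_vector], dim=-1)
        return torch.softmax(self.scorer(pair), dim=-1)

# Hypothetical usage for the pair (a, x) from the example above, with E_prime as the
# feature vector of the text: keep the pair if the "entity span" probability wins.
# probs = span_predictor(E_prime[a], E_prime[x]); is_entity = probs[1] > probs[0]
```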
Optionally, in the training process of the medical entity recognition model, the method further includes the following steps:
And calculating first loss information according to the output result of the start character multi-classifier to be trained and a real label representing whether the sample word vector is the first character of the single sample medical entity.
Illustratively, the formula for calculating the first loss information may be expressed as loss_start = cross_entropy(P_start, Y_start), where P_start = softmax_each_row(E'_n, C_start), E'_n denotes the n-th sample word vector in the feature vector E', C_start denotes the hyper-parameters of the start character multi-classifier, P_start denotes the classification result of the start character multi-classifier on the sample word vectors, and Y_start denotes the real labels of the sample word vectors.
And calculating second loss information according to the output result of the end character multi-classifier to be trained and a real label representing whether the sample word vector is the last character of the single sample medical entity.
Illustratively, the formula for calculating the second loss information may be expressed as loss_end = cross_entropy(P_end, Y_end), where P_end = softmax_each_row(E'_n, C_end), E'_n denotes the n-th sample word vector in the feature vector E', C_end denotes the hyper-parameters of the end character multi-classifier, P_end denotes the classification result of the end character multi-classifier on the sample word vectors, and Y_end denotes the real labels of the sample word vectors.
And calculating third loss information according to an output result of the entity prediction sub-model to be trained and a real label for representing whether the head-tail vector position represents the head-tail character position of the single sample medical entity.
Illustratively, the formula for calculating the third loss information may be expressed as loss_span = cross_entropy(P_start,end, Y_start,end), where P_start,end denotes the output of the entity prediction sub-model computed for each head-tail vector position pair (the explicit formula is presented as an image in the original publication), E'_i_start denotes the position of the i-th starting word vector, E'_j_end denotes the position of the j-th ending word vector, and Y_start,end denotes the real labels of the head-tail vector position pairs.
And obtaining the trained medical entity recognition model when the weighted sum of the first loss information, the second loss information and the third loss information is minimum.
Illustratively, the weighted sum of the first loss information, the second loss information, and the third loss information may be expressed as loss = α·loss_start + β·loss_end + γ·loss_span, where α, β, γ ∈ [0, 1] are hyper-parameters used to control the relative importance of the three loss terms during training. The medical entity recognition model corresponding to the minimum value of this weighted sum during training is the trained medical entity recognition model.
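A compact sketch of the combined objective is given below, assuming the three losses are computed with PyTorch's cross-entropy (which takes raw logits and applies softmax internally, whereas the notation above already applies softmax). The default weights are placeholders; the patent only requires α, β, γ ∈ [0, 1].

```python
import torch.nn.functional as F

def total_loss(start_logits, start_labels, end_logits, end_labels,
               span_logits, span_labels, alpha=0.4, beta=0.4, gamma=0.2):
    """loss = alpha*loss_start + beta*loss_end + gamma*loss_span."""
    loss_start = F.cross_entropy(start_logits, start_labels)  # first loss information
    loss_end = F.cross_entropy(end_logits, end_labels)        # second loss information
    loss_span = F.cross_entropy(span_logits, span_labels)     # third loss information
    return alpha * loss_start + beta * loss_end + gamma * loss_span
```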
Experiments were carried out with the method of the present disclosure and with the related-art LSTM+CRF approach on the same medical text data set to be identified (specifically, a liver cancer surgery record data set), with the following results:

Model | F1
LSTM+CRF | 83.84
Method of the present disclosure | 91.11
TABLE 3

F1 is the harmonic mean of precision and recall. As can be seen from Table 3, the method of identifying a medical entity of the present disclosure performs better than the method in the related art.
Based on the same inventive concept, an embodiment of the present disclosure further provides an apparatus for identifying a medical entity. As shown in fig. 3, the apparatus 300 includes:
an obtaining module 310, configured to obtain a medical text to be recognized;
the input module 320 is configured to input the medical text to be recognized into a trained medical entity recognition model, and obtain an entity category recognized from the medical text to be recognized and a character start position and a character end position of a corresponding medical entity in the medical text to be recognized under the entity category, where the medical entity recognition model includes a coding sub-model and an entity recognition sub-model, the coding sub-model is trained according to a sample medical text and a sample question text corresponding to the sample medical text, and the entity recognition sub-model is trained based on a feature vector of the sample medical text output by the coding sub-model;
an executing module 330, configured to determine the medical entity from the medical text to be recognized according to the starting position and the ending position.
With the apparatus of the present disclosure, the medical text to be recognized is acquired and input into the trained medical entity recognition model to obtain the entity category recognized from the medical text to be recognized and the character starting position and character ending position, in the medical text to be recognized, of the corresponding medical entity under that entity category, and the medical entity can be determined from the medical text to be recognized according to the starting position and the ending position. The coding sub-model in the medical entity recognition model is trained on a sample medical text and a sample question text corresponding to the sample medical text. During training, the sample question text helps the coding sub-model better understand the semantics of the sample medical text, so that the coding sub-model can learn, on the basis of understanding those semantics, to encode medical entities that express the same meaning (the same entity category) with different characters into identical or similar feature vectors. A coding sub-model trained in this way can therefore encode the medical text to be recognized into more accurate feature vectors, and the entity recognition sub-model can, from these accurate feature vectors, more accurately decode the character starting position and character ending position of a long or short medical entity, thereby obtaining an accurate medical entity. Consequently, the technical solution of the present disclosure can improve the accuracy of medical entity recognition.
Optionally, the entity recognition sub-model is configured to:
after the coding sub-model outputs the feature vector of the medical text to be recognized, determining, for each word vector in the feature vector of the medical text to be recognized, whether the word vector is a starting word vector corresponding to any entity category and whether the word vector is an ending word vector corresponding to any entity category; for each entity category, determining a set of head-tail vector position pairs, each consisting of the position of a starting word vector and the position of an ending word vector, according to the position of each starting word vector of that entity category in the feature vector of the medical text to be recognized and the position of each ending word vector of that entity category in the feature vector of the medical text to be recognized; and judging, for each head-tail vector position pair, whether the position of the starting word vector and the position of the ending word vector in the pair represent the positions of the head and tail characters of the same medical entity under that entity category.
Optionally, in the training process of the medical entity recognition model, the coding sub-model to be trained is used for:
coding the sample medical text representing a single sample medical entity and the sample question text corresponding to the single sample medical entity to obtain a sample feature vector; and deleting the sample question text feature vector corresponding to the sample question text from the sample feature vector to obtain the feature vector of the sample medical text to be input into the entity recognition sub-model to be trained.
Optionally, the entity recognition sub-model includes a start character multi-classifier, an end character multi-classifier, and an entity prediction sub-model; in the training process of the medical entity recognition model, the entity recognition sub-model to be trained is configured to:
for each sample word vector in the feature vector of the sample medical text, inputting the sample word vector into the start character multi-classifier to be trained to obtain a result representing whether the sample word vector is a starting word vector and, in the case that the sample word vector is a starting word vector, the entity category corresponding to the sample word vector; for each sample word vector in the feature vector of the sample medical text, inputting the sample word vector into the end character multi-classifier to be trained to obtain a result representing whether the sample word vector is an ending word vector and, in the case that the sample word vector is an ending word vector, the entity category corresponding to the sample word vector; for each entity category, determining a set of head-tail vector position pairs, each consisting of the position of a starting word vector and the position of an ending word vector, according to the position of each starting word vector of that entity category in the feature vector of the sample medical text and the position of each ending word vector of that entity category in the feature vector of the sample medical text; and inputting each head-tail vector position pair into the entity prediction sub-model to be trained to obtain an output result representing whether the position of the starting word vector and the position of the ending word vector in the pair represent the positions of the head and tail characters of the same medical entity under that entity category.
Optionally, the apparatus further includes a calculating module configured, during the training of the medical entity recognition model, to calculate first loss information according to the output result of the start character multi-classifier to be trained and a real label representing whether the sample word vector is the first character of the single sample medical entity; calculate second loss information according to the output result of the end character multi-classifier to be trained and a real label representing whether the sample word vector is the last character of the single sample medical entity; calculate third loss information according to the output result of the entity prediction sub-model to be trained and a real label representing whether the head-tail vector position pair represents the positions of the head and tail characters of the single sample medical entity; and obtain the trained medical entity recognition model when the weighted sum of the first loss information, the second loss information, and the third loss information is minimum.
Optionally, the coding sub-model to be trained is a pre-trained BERT model.
Optionally, the sample question text is constructed based on noun interpretation of entity categories corresponding to the single sample medical entity.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The disclosed embodiments also provide a non-transitory computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements the steps of the method of any of the above embodiments.
Fig. 4 is a block diagram illustrating an electronic device 700 according to an example embodiment. As shown in fig. 4, the electronic device 700 may include: a processor 701 and a memory 702. The electronic device 700 may also include one or more of a multimedia component 703, an input/output (I/O) interface 704, and a communication component 705.
The processor 701 is configured to control the overall operation of the electronic device 700 to complete all or part of the steps of the method for identifying a medical entity. The memory 702 is used to store various types of data to support operation at the electronic device 700, such as instructions for any application or method operating on the electronic device 700 and application-related data, such as contact data, transmitted and received messages, pictures, audio, video, and the like. The Memory 702 may be implemented by any type of volatile or non-volatile Memory device or combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic disk, or optical disk. The multimedia components 703 may include screen and audio components. Wherein the screen may be, for example, a touch screen and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may further be stored in the memory 702 or transmitted through the communication component 705. The audio assembly also includes at least one speaker for outputting audio signals. The I/O interface 704 provides an interface between the processor 701 and other interface modules, such as a keyboard, mouse, buttons, etc. These buttons may be virtual buttons or physical buttons. The communication component 705 is used for wired or wireless communication between the electronic device 700 and other devices. Wireless Communication, such as Wi-Fi, bluetooth, Near Field Communication (NFC), 2G, 3G, 4G, NB-IOT, eMTC, or other 5G, etc., or a combination of one or more of them, which is not limited herein. The corresponding communication component 705 may thus include: Wi-Fi module, Bluetooth module, NFC module, etc.
In an exemplary embodiment, the electronic Device 700 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above-described method for identifying medical entities.
In another exemplary embodiment, a computer-readable storage medium comprising program instructions which, when executed by a processor, carry out the steps of the above-described method of identifying a medical entity is also provided. For example, the computer readable storage medium may be the memory 702 described above comprising program instructions executable by the processor 701 of the electronic device 700 to perform the method of identifying a medical entity described above.
Fig. 5 is a block diagram illustrating an electronic device 1900 according to an example embodiment. For example, the electronic device 1900 may be provided as a server. Referring to fig. 5, an electronic device 1900 includes a processor 1922, which may be one or more in number, and a memory 1932 for storing computer programs executable by the processor 1922. The computer program stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processor 1922 may be configured to execute the computer program to perform the above-described method of identifying a medical entity.
Additionally, the electronic device 1900 may further include a power component 1926 and a communication component 1950. The power component 1926 may be configured to perform power management of the electronic device 1900, and the communication component 1950 may be configured to enable wired or wireless communication of the electronic device 1900. In addition, the electronic device 1900 may also include an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, and so on.
In another exemplary embodiment, a computer-readable storage medium comprising program instructions which, when executed by a processor, carry out the steps of the above-described method of identifying a medical entity is also provided. For example, the non-transitory computer readable storage medium may be the memory 1932 described above that includes program instructions executable by the processor 1922 of the electronic device 1900 to perform the method of identifying a medical entity described above.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-mentioned method of identifying a medical entity when executed by the programmable apparatus.
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings; however, the present disclosure is not limited to the specific details of the above embodiments. Various simple modifications may be made to the technical solution of the present disclosure within the scope of the technical idea of the present disclosure, and these simple modifications all fall within the protection scope of the present disclosure.
It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the invention. In order to avoid unnecessary repetition, various possible combinations will not be separately described in this disclosure.
In addition, any combination of the various embodiments of the present disclosure may be made, and such combinations should likewise be regarded as content disclosed by the present disclosure, as long as they do not depart from the spirit of the present disclosure.

Claims (10)

1. A method of identifying a medical entity, the method comprising:
acquiring a medical text to be recognized;
inputting the medical text to be recognized into a trained medical entity recognition model to obtain an entity category recognized from the medical text to be recognized and a character starting position and a character ending position, in the medical text to be recognized, of a corresponding medical entity under the entity category, wherein the medical entity recognition model comprises a coding sub-model and an entity recognition sub-model, the coding sub-model is obtained by training according to a sample medical text and a sample question text corresponding to the sample medical text, and the entity recognition sub-model is obtained by training based on a feature vector of the sample medical text output by the coding sub-model;
and determining the medical entity from the medical text to be recognized according to the starting position and the ending position.
2. The method of claim 1, wherein the entity recognition sub-model is configured to:
after the coding sub-model outputs the feature vector of the medical text to be recognized, for each word vector in the feature vector of the medical text to be recognized, determining whether the word vector is a start word vector corresponding to any entity category and determining whether the word vector is an end word vector corresponding to any entity category;
for each entity category, determining a head-tail vector position pair set consisting of the position of the start word vector and the position of the end word vector according to the position of each start word vector of the entity category in the feature vector of the medical text to be recognized and the position of each end word vector in the feature vector of the medical text to be recognized;
and for each head-tail vector position pair, judging whether the position of the start word vector and the position of the end word vector in the head-tail vector position pair represent the positions of the head and tail characters of the same medical entity under the entity category.
3. The method of claim 1, wherein during the training of the medical entity recognition model, the coding sub-model to be trained is used to:
coding the sample medical text representing a single sample medical entity and the sample question text corresponding to the single sample medical entity to obtain a sample feature vector;
and deleting the sample question text feature vector corresponding to the sample question text from the sample feature vector to obtain the feature vector of the sample medical text for inputting the entity recognition sub-model to be trained.
4. The method according to claim 3, wherein the entity recognition sub-model comprises a start character multi-classifier, an end character multi-classifier, and an entity prediction sub-model, and wherein during the training of the medical entity recognition model, the entity recognition sub-model to be trained is used for:
for each sample word vector in the feature vector of the sample medical text, inputting the sample word vector into the start character multi-classifier to be trained to obtain a result representing whether the sample word vector is a start word vector and, in the case that the sample word vector is a start word vector, the entity category corresponding to the sample word vector; and
for each sample word vector in the feature vector of the sample medical text, inputting the sample word vector into the end character multi-classifier to be trained to obtain a result representing whether the sample word vector is an end word vector and, in the case that the sample word vector is an end word vector, the entity category corresponding to the sample word vector;
for each entity category, determining a head-tail vector position pair set consisting of the position of the start word vector and the position of the end word vector according to the position of each start word vector of the entity category in the feature vector of the sample medical text and the position of each end word vector in the feature vector of the sample medical text;
and inputting each head-tail vector position pair into the entity prediction sub-model to be trained to obtain an output result representing whether the position of the start word vector and the position of the end word vector in the head-tail vector position pair represent the positions of the head and tail characters of the same medical entity under the entity category.
5. The method of claim 4, wherein the training of the medical entity recognition model further comprises:
calculating first loss information according to an output result of the start character multi-classifier to be trained and a real label representing whether the sample word vector is the first character of the single sample medical entity;
calculating second loss information according to an output result of the end character multi-classifier to be trained and a real label representing whether the sample word vector is the last character of the single sample medical entity;
calculating third loss information according to an output result of the entity prediction sub-model to be trained and a real label representing whether the head-tail vector position pair represents the positions of the head and tail characters of the single sample medical entity;
and obtaining the trained medical entity recognition model when the weighted sum of the first loss information, the second loss information and the third loss information is minimized.
6. The method according to any of claims 1-5, characterized in that the coding sub-model to be trained is a pre-trained BERT model.
7. The method of claim 3, wherein the sample question text is constructed based on a noun interpretation of the entity category to which the single sample medical entity corresponds.
8. An apparatus for identifying a medical entity, the apparatus comprising:
the acquisition module is used for acquiring a medical text to be recognized;
the input module is used for inputting the medical text to be recognized into a trained medical entity recognition model to obtain an entity category recognized from the medical text to be recognized and a character starting position and a character ending position, in the medical text to be recognized, of a corresponding medical entity under the entity category, wherein the medical entity recognition model comprises a coding sub-model and an entity recognition sub-model, the coding sub-model is obtained by training according to a sample medical text and a sample question text corresponding to the sample medical text, and the entity recognition sub-model is obtained by training based on a feature vector of the sample medical text output by the coding sub-model;
and the execution module is used for determining the medical entity from the medical text to be recognized according to the starting position and the ending position.
9. A non-transitory computer readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
10. An electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 1 to 7.
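For illustration only, the span-extraction flow recited in claims 2 and 4 above, in which the start and end character multi-classifiers propose candidate boundary positions and the entity prediction sub-model judges each head-tail vector position pair, can be sketched as follows. The classifier interfaces, the convention that category index 0 means "not a boundary character", the pairing rule and the 0.5 threshold are assumptions rather than details taken from the claims.

    from typing import Dict, List, Tuple
    import torch

    def extract_entities(word_vectors: torch.Tensor,        # (seq_len, hidden) from the coding sub-model
                         start_classifier, end_classifier,   # start/end character multi-classifiers
                         span_predictor,                      # entity prediction sub-model
                         categories: List[str],
                         threshold: float = 0.5) -> Dict[str, List[Tuple[int, int]]]:
        # Each word vector is assigned an entity category index by the start and
        # end character multi-classifiers; index 0 means "not a boundary character".
        start_pred = start_classifier(word_vectors).argmax(dim=-1)   # (seq_len,)
        end_pred = end_classifier(word_vectors).argmax(dim=-1)       # (seq_len,)

        results: Dict[str, List[Tuple[int, int]]] = {}
        for c, category in enumerate(categories, start=1):
            start_positions = (start_pred == c).nonzero(as_tuple=True)[0].tolist()
            end_positions = (end_pred == c).nonzero(as_tuple=True)[0].tolist()

            # Head-tail vector position pair set for this entity category.
            pairs = [(s, e) for s in start_positions for e in end_positions if s <= e]

            # The entity prediction sub-model judges whether each pair marks the
            # head and tail characters of the same medical entity.
            spans = []
            for s, e in pairs:
                pair_feature = torch.cat([word_vectors[s], word_vectors[e]], dim=-1)
                if torch.sigmoid(span_predictor(pair_feature)) > threshold:
                    spans.append((s, e))
            results[category] = spans
        return results
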
CN202111229618.1A 2021-10-21 2021-10-21 Method, device, storage medium and electronic equipment for identifying medical entity Pending CN113919356A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111229618.1A CN113919356A (en) 2021-10-21 2021-10-21 Method, device, storage medium and electronic equipment for identifying medical entity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111229618.1A CN113919356A (en) 2021-10-21 2021-10-21 Method, device, storage medium and electronic equipment for identifying medical entity

Publications (1)

Publication Number Publication Date
CN113919356A true CN113919356A (en) 2022-01-11

Family

ID=79242290

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111229618.1A Pending CN113919356A (en) 2021-10-21 2021-10-21 Method, device, storage medium and electronic equipment for identifying medical entity

Country Status (1)

Country Link
CN (1) CN113919356A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination