CN114169338A - Medical named entity identification method and device and electronic equipment

Info

Publication number: CN114169338A
Application number: CN202210125810.4A
Authority: CN
Inventors: 安波
Original assignee: Beijing Zhiyuan Artificial Intelligence Research Institute
Current assignee: Beijing Zhiyuan Artificial Intelligence Research Institute
Priority date: 2022-02-10
Filing date: 2022-02-10
Publication date: 2022-03-11
Anticipated expiration: 2042-02-10
Also published as: CN114169338B

Abstract

The invention discloses a medical named entity identification method, a medical named entity identification device, and electronic equipment. The method comprises the following steps: training a plurality of named entity recognition (NER) models of different types on a labeled data set; selecting data to be labeled from unlabeled data by an active learning method based on the plurality of NER models; predicting the category of the data to be labeled with each of the NER models; and fusing the predicted results to obtain the category of the data to be labeled. With a small amount of labeled data, the technical scheme achieves performance comparable to that obtained with a large amount of data. Data from actual use show that, with 10% of the labeled data, the method reaches about 90% of the performance obtained with the full data set. The method therefore meets the practical needs of information extraction in medical scenarios that lack sufficient labeled data.

Description

Medical named entity identification method and device and electronic equipment

Technical Field

The invention relates to the technical field of medical data processing, in particular to a medical named entity identification method and device and electronic equipment.

Background

Named entity recognition (NER) in the medical field is the foundation for constructing medical knowledge graphs and medical big data, and an important basis for intelligent case analysis and intelligent healthcare.

At present, medical NER tasks are mainly handled with deep learning, which requires a large amount of labeled data to train a model. Because medical data is private and sensitive, it is scarce, and data labeled for named entity recognition is scarcer still. Deep learning therefore hits a major bottleneck on the medical NER task and cannot meet its needs when only a small amount of labeled data is available.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides the following technical scheme.

The invention provides a medical named entity identification method on one hand, which comprises the following steps:

training by utilizing a labeling data set to obtain a plurality of named entity recognition NER models of different types;

selecting data to be labeled from unlabeled data by using an active learning method based on a plurality of NER models;

predicting the category of the data to be labeled by utilizing a plurality of NER models respectively;

and fusing the predicted results to obtain the category of the data to be labeled.

Preferably, the plurality of named entity recognition NER models of different types comprises: deep learning models, statistical learning models, and/or knowledge-based models.

Preferably, the selecting data to be labeled from unlabeled data by using an active learning method based on the plurality of NER models includes:

respectively predicting the distribution of the unlabeled data in each category by using each NER model;

calculating the distribution consistency of the unlabeled data in each category;

and determining the data to be labeled from all the unlabeled data according to the consistency.

Preferably, the consistency of the distribution of each unlabeled data item in each category is calculated, and the data to be labeled is determined from all the unlabeled data according to the consistency, using the following formula:

$$x^{*} = \arg\max_{x} \frac{1}{M} \sum_{m=1}^{M} D\big(P_{\theta_i}(c_m \mid x) \,\|\, P_{\theta_j}(c_m \mid x)\big)$$

In the formula, $x$ is an unlabeled data item, $c_m$ is the $m$-th entity class, $M$ is the total number of entity classes, $\theta_i$ is the $i$-th NER model, $P_{\theta_i}(c_m \mid x)$ is the probability, predicted by the $i$-th NER model, that $x$ belongs to the $m$-th category, $\theta_j$ is the $j$-th NER model, $P_{\theta_j}(c_m \mid x)$ is the probability, predicted by the $j$-th NER model, that $x$ belongs to the $m$-th category, $D$ is the KL distance between the two distributions, and $x^{*}$ is the data item with the largest KL distance among all the unlabeled data.

Preferably, the predicted results are fused to obtain the category of the data to be labeled, using the following formula:

$$c^{*} = \arg\max_{c_m} \sum_{i=1}^{N} w_i \, P_{\theta_i}(c_m \mid x)$$

In the formula, $c^{*}$ is the final category of the unlabeled data item $x$, $N$ is the number of NER models, $\theta_i$ is the $i$-th NER model, $c_m$ is the $m$-th entity class, $P_{\theta_i}(c_m \mid x)$ is the probability, predicted by the $i$-th NER model, that $x$ belongs to the $m$-th category, and $w_i$ is the weight of the $i$-th NER model; the weights $w_i$ are learnable parameters.

Preferably, the method further comprises the steps of:

and labeling the data to be labeled by using the obtained categories, adding the data to be labeled into the labeled data set, and iteratively training a plurality of NER models.

The invention provides a medical named entity identification method in a second aspect, which comprises the following steps:

inputting data into a plurality of named entity recognition NER models to obtain a plurality of recognition results; a plurality of NER models are obtained by training according to the method;

and fusing the plurality of identification results to obtain a final entity identification result.

A third aspect of the present invention provides a medical named entity recognition apparatus, comprising:

the NER model training module is used for training a plurality of named entity recognition NER models of different types by utilizing the labeling data set;

the to-be-labeled data selection module is used for selecting data to be labeled from the unlabeled data by utilizing an active learning method based on the NER models;

the data to be labeled category prediction module is used for predicting the category of the data to be labeled by utilizing the NER models respectively;

and the prediction result fusion module is used for fusing the prediction result to obtain the category of the data to be labeled.

The invention also provides a memory storing a plurality of instructions for implementing the method as described above.

The invention also provides an electronic device comprising a processor and a memory connected to the processor, wherein the memory stores a plurality of instructions which can be loaded and executed by the processor to enable the processor to execute the method.

The beneficial effects of the invention are as follows. According to the technical scheme provided by the invention, a plurality of NER models are trained on a small amount of labeled medical data; based on these NER models, an active learning method selects the unlabeled data about which the models are most uncertain; labels for that data are given by fusing the prediction results of the NER models; and the newly labeled data is added to the training data set to optimize the models. In this way, a small amount of data achieves performance comparable to that of a large amount of data. Data from actual use show that, with 10% of the labeled data, the method reaches about 90% of the performance obtained with the full data set. The method therefore meets the practical needs of information extraction in medical scenarios that lack sufficient labeled data.

Drawings

FIG. 1 is a schematic flow chart of a medical named entity recognition method according to the present invention;

FIG. 2 is a schematic diagram of an exemplary implementation of the medical named entity recognition method according to the invention;

FIG. 3 is a schematic view illustrating a process of identifying a named entity in unlabeled data according to the present invention;

FIG. 4 is a functional structure diagram of the medical named entity recognition device according to the present invention.

Detailed Description

In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.

The method provided by the invention can be implemented in the following terminal environment, and the terminal can comprise one or more of the following components: a processor, a memory, and a display screen. Wherein the memory has stored therein at least one instruction that is loaded and executed by the processor to implement the methods described in the embodiments described below.

A processor may include one or more processing cores. The processor connects the various parts of the terminal through various interfaces and lines, and performs the functions of the terminal and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory and by calling data stored in the memory.

The memory may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory may be used to store instructions, programs, code sets, or instruction sets.

The display screen is used for displaying user interfaces of all the application programs.

In addition, those skilled in the art will appreciate that the above-described terminal configurations are not intended to be limiting, and that the terminal may include more or fewer components, or some components may be combined, or a different arrangement of components. For example, the terminal further includes a radio frequency circuit, an input unit, a sensor, an audio circuit, a power supply, and other components, which are not described herein again.

Example one

As shown in fig. 1-2, an embodiment of the present invention provides a medical named entity identification method, including:

S101, training by using a labeling data set to obtain a plurality of named entity recognition NER models of different types;

S102, selecting data to be labeled from unlabeled data by using an active learning method based on the NER models;

S103, predicting the category of the data to be labeled by utilizing the NER models respectively;

and S104, fusing the predicted results to obtain the category of the data to be labeled.

At present, owing to the particular nature of the medical industry, the medical named entity recognition task has little data and even less labeled data. Existing models can use only the small amount of labeled data and cannot exploit the large amount of unlabeled data; moreover, they usually rely on a single active learning method, so the advantages of combining different types of models are not fully exploited.

The method provided by the invention addresses the particular nature of medical data and the problems in the prior art. It tackles the shortage of labeled data by making full use of massive unlabeled data and the complementary strengths of multiple models, thereby improving the performance of medical named entity recognition. Specifically, a small amount of labeled data is used to train a plurality of NER models of different types; based on these models, an active learning method selects the most uncertain unlabeled data as the data to be labeled; the prediction results of the NER models are then fused to give the data a label; and finally the labeled data is added to the training data set to optimize the models.

The method uses a small amount of labeled medical data and combines several different active learning strategies to select, from the unlabeled data, the data about which the models are most uncertain; it labels that data by fusing the prediction results of the models and adds the labeled data to the training data set to optimize the models. In this way, a small amount of data achieves performance comparable to that of a large amount of data. Data from actual use show that, with 10% of the labeled data, the method reaches about 90% of the performance obtained with the full data set. The method therefore meets the practical needs of information extraction in medical scenarios that lack sufficient labeled data.

In step S101, the labeled data set is initially small because labeled data in the medical field is scarce. As the method runs, however, each unlabeled data item whose category has been obtained can be labeled and added to the labeled data set, so the labeled data grows and the trained models improve until their performance stabilizes.
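The iteration described above (train, select, predict and fuse, add to the labeled set, retrain until performance stabilizes) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: `train`, `select`, `fuse_label`, and `evaluate` are hypothetical callables supplied by the caller, and the plateau test based on `tol` is an assumed stopping rule, since the text only says that training continues until performance is stable.

```python
from typing import Callable, List, Sequence, Tuple

Sample = str
Label = str


def active_learning_loop(
    labeled: List[Tuple[Sample, Label]],
    unlabeled: List[Sample],
    train: Callable[[Sequence[Tuple[Sample, Label]]], list],
    select: Callable[[list, Sequence[Sample]], int],
    fuse_label: Callable[[list, Sample], Label],
    evaluate: Callable[[list], float],
    tol: float = 1e-3,
    max_rounds: int = 100,
) -> list:
    """Repeat S101-S104 until the evaluation score stops improving."""
    models = train(labeled)                  # S101: train several NER models
    best = evaluate(models)
    for _ in range(max_rounds):
        if not unlabeled:
            break
        idx = select(models, unlabeled)      # S102: index of the most uncertain sample
        sample = unlabeled.pop(idx)
        label = fuse_label(models, sample)   # S103 + S104: predict with each model, then fuse
        labeled.append((sample, label))      # add the newly labeled sample to the labeled set
        models = train(labeled)              # retrain on the enlarged set
        score = evaluate(models)
        if score - best < tol:               # assumed plateau criterion (hypothetical)
            break
        best = score
    return models
```

Both `labeled` and `unlabeled` are modified in place, so the caller can inspect which samples were moved into the labeled set after the loop ends.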

Because little training data is available, in a preferred embodiment of the invention a pre-trained language model + fine-tuning approach is adopted to train the plurality of NER models of different types on the small amount of labeled data.

The plurality of NER models of different types obtained by training may include deep learning models, statistical learning models, and/or knowledge-based models. Each of these NER models may consist of a single type of model, or several types of models may be combined into one NER model. As an example, the plurality of NER models may include: the FCRF model, Emb + MLP, Bert + CRF, Bert + BiLSTM + CRF, the FLAT model, GlobalPointer, and the Prompt model. Here, FCRF is a statistical learning model; Emb + MLP, Bert + CRF, and Bert + BiLSTM + CRF combine a statistical learning model with a deep learning model; and the FLAT model, GlobalPointer, and the Prompt model are deep learning models.

These models or model combinations of different types embody several different active learning strategies, so that the individual strategies complement one another and compensate for the shortage of training data.

The FCRF model combines features based on statistical learning with a CRF. The features can be chosen from common ones, such as context-window words, word length, and other prior information. Emb + MLP can be obtained entirely from the existing training data; Bert + MLP directly uses the information in the pre-trained language model; Bert + CRF uses the CRF to better model the input sequence; Bert + BiLSTM + CRF uses the BiLSTM to better model context information; the FLAT model uses position information to better model lexical context; the GlobalPointer model can model nested and non-nested named entities at the same time; and the Prompt model uses a pre-trained language model (PLM) to convert NER into a generation problem, modeling named entity recognition from the perspective of text generation. Models obtained with these different strategies exploit the training data from different angles, capturing information beyond the data itself, entities of different types and lengths, and so on. These NER models (the FCRF model, Emb + MLP, Bert + CRF, Bert + BiLSTM + CRF, the FLAT model, GlobalPointer, and the Prompt model) are therefore highly complementary. Consequently, when data to be labeled is selected from the unlabeled data by the active learning method based on a plurality of NER models of different types, the complementary strengths of the models can be used to determine the most valuable data to label.

In step S102, selecting data to be labeled from unlabeled data by using an active learning method based on the plurality of NER models may proceed as follows.

Because the trained NER models predict different distributions over the classes for the same unlabeled data, the invention finds the unlabeled data with the highest labeling value by calculating how consistent the distributions predicted by the plurality of models are for the same unlabeled data item. The lower the consistency among the models, the higher the uncertainty of the unlabeled data item and the higher its labeling value.

First, one NER model predicts the probability that a given unlabeled data item belongs to a given class; the other NER models then predict the same probability in turn, yielding the probability distributions predicted by each NER model for that class; finally, the consistency of these probability distributions is calculated, giving the consistency of the probability distributions of that unlabeled data item over the classes. The consistency for every other unlabeled data item is obtained in the same way. Finally, the unlabeled data item with the lowest consistency is taken as the most valuable data to be labeled.

As an embodiment, the plurality of NER models may include at least two of: the FCRF model, Emb + MLP, Bert + CRF, Bert + BiLSTM + CRF, the FLAT model, GlobalPointer, and the Prompt model.

In a preferred embodiment of the present invention, the consistency of the distributions is calculated based on the KL distance. The KL distance is short for the Kullback-Leibler divergence, also called relative entropy. It measures the difference between two probability distributions over the same event space. Therefore, the greater the KL distance, the lower the consistency.
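For reference, the standard definition of the KL distance between two discrete distributions over the same set of outcomes is given below. The symbols $P$, $Q$, and $K$ are generic textbook notation used here for illustration and are not taken from the patent.

```latex
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{k=1}^{K} P(k)\,\log\frac{P(k)}{Q(k)},
\qquad
D_{\mathrm{KL}}(P \,\|\, Q) \ge 0,
\qquad
D_{\mathrm{KL}}(P \,\|\, Q) = 0 \iff P = Q
```

Because the KL distance is not symmetric in $P$ and $Q$, a comparison between two models generally depends on which model's distribution is placed first, unless both orders are evaluated and combined.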

In a preferred embodiment of the present invention, the consistency of the distribution of each unlabeled data item in each category is calculated, and the data to be labeled is determined from all the unlabeled data according to the consistency, using the following formula:

$$x^{*} = \arg\max_{x} \frac{1}{M} \sum_{m=1}^{M} D\big(P_{\theta_i}(c_m \mid x) \,\|\, P_{\theta_j}(c_m \mid x)\big)$$

In the formula, $x$ is an unlabeled data item, $c_m$ is the $m$-th entity class, $M$ is the total number of entity classes, $\theta_i$ is the $i$-th NER model, $P_{\theta_i}(c_m \mid x)$ is the probability, predicted by the $i$-th NER model, that $x$ belongs to the $m$-th category, $\theta_j$ is the $j$-th NER model, and $P_{\theta_j}(c_m \mid x)$ is the probability, predicted by the $j$-th NER model, that $x$ belongs to the $m$-th category. That is, for each data item $x$, the KL distance between the probabilities that different models predict for the $m$-th class of $x$ is calculated, and the result is averaged over all $M$ entity classes. $\arg\max_{x}$ denotes the data item at which the following function attains its maximum, i.e. the data item $x^{*}$ with the largest KL distance.
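The selection rule can be illustrated with a short Python sketch. It is a sketch under assumptions rather than the patented implementation: each model is assumed to output a probability vector over the entity classes for every sample, the per-class KL terms are averaged over the classes as the text describes, and the scores are additionally averaged over all ordered pairs of models, which is an assumption because the text defines the distance only for one pair of models. The names `kl_mean`, `disagreement`, `select_index`, and the constant `EPS` are hypothetical.

```python
import math
from itertools import permutations
from typing import List, Sequence

EPS = 1e-12  # assumed smoothing constant to avoid log(0)


def kl_mean(p: Sequence[float], q: Sequence[float]) -> float:
    """Per-class KL terms p_m * log(p_m / q_m), averaged over the M classes."""
    m = len(p)
    return sum(pm * math.log((pm + EPS) / (qm + EPS)) for pm, qm in zip(p, q)) / m


def disagreement(dists: Sequence[Sequence[float]]) -> float:
    """Average KL distance over all ordered pairs of models (needs >= 2 models)."""
    pairs = list(permutations(range(len(dists)), 2))
    return sum(kl_mean(dists[i], dists[j]) for i, j in pairs) / len(pairs)


def select_index(predictions: List[List[List[float]]]) -> int:
    """predictions[n][i] is model i's class distribution for sample n.

    Returns the index of the sample on which the models disagree the most.
    """
    scores = [disagreement(d) for d in predictions]
    return max(range(len(scores)), key=scores.__getitem__)
```

For example, with two models and three samples, a sample on which the models predict `[0.9, 0.1]` and `[0.1, 0.9]` scores higher than one on which both predict `[0.9, 0.1]`, so it would be chosen for labeling first.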

In step S103, the category of the data to be labeled is predicted by each of the plurality of NER models, so the number of prediction results equals the number of NER models. For example, in a preferred embodiment of the present invention, the plurality of NER models may include 8 models, including the FCRF model, the Emb + MLP model, the Bert + CRF model, the Bert + BiLSTM + CRF model, the FLAT model, the GlobalPointer model, and the Prompt model, so that 8 prediction results are obtained.

In another preferred embodiment of the present invention, a dictionary + rule based method (RULE) is additionally introduced, which determines the names and categories of entities by dictionary retrieval and text similarity calculation.
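A dictionary lookup with a similarity fallback might look like the following. This is only an illustrative sketch: the text does not specify the dictionary contents, the similarity measure, or any threshold, so the toy `MEDICAL_DICT`, the use of `difflib.SequenceMatcher` from the Python standard library, and the `threshold` value are all assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple

# Hypothetical toy dictionary mapping entity names to categories.
MEDICAL_DICT: Dict[str, str] = {
    "diabetes mellitus": "DISEASE",
    "metformin": "DRUG",
    "chest x-ray": "EXAM",
}


def rule_lookup(mention: str, threshold: float = 0.85) -> Optional[Tuple[str, str]]:
    """Return (dictionary entry, category) for an exact or near match, else None."""
    key = mention.strip().lower()
    if key in MEDICAL_DICT:  # exact dictionary retrieval
        return key, MEDICAL_DICT[key]
    # Fall back to string similarity against every dictionary entry.
    best = max(MEDICAL_DICT, key=lambda name: SequenceMatcher(None, key, name).ratio())
    if SequenceMatcher(None, key, best).ratio() >= threshold:
        return best, MEDICAL_DICT[best]
    return None
```

A mention such as `"Metformin"` is resolved by exact retrieval after normalization, while a near-miss spelling is resolved by the similarity fallback if it clears the assumed threshold.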

In step S104, after the plurality of prediction results corresponding to the plurality of models are obtained, the prediction results of all models (the FCRF model, Emb + MLP, Bert + CRF, Bert + BiLSTM + CRF, the FLAT model, GlobalPointer, the Prompt model, and RULE) are fused using the idea of ensemble learning.

In a preferred embodiment of the present invention, the predicted results may be fused by using the following formula:

$$c^{*} = \arg\max_{c_m} \sum_{i=1}^{N} w_i \, P_{\theta_i}(c_m \mid x)$$

In the formula, $c^{*}$ is the final category of the unlabeled data item $x$, $N$ is the number of NER models, $\theta_i$ is the $i$-th NER model, $c_m$ is the $m$-th entity class, $P_{\theta_i}(c_m \mid x)$ is the probability, predicted by the $i$-th NER model, that $x$ belongs to the $m$-th category, and $w_i$ is the weight of the $i$-th NER model; the weights $w_i$ are learnable parameters. $\arg\max_{c_m}$ denotes the class at which the following function attains its maximum.
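The weighted fusion can be sketched in a few lines of Python. The sketch assumes that every model returns a probability vector over the same ordered list of classes and that the weights are supplied as fixed numbers; the text states that the weights are learnable but does not say how they are learned, so that part is not shown. The function name `fuse` is hypothetical.

```python
from typing import List, Sequence


def fuse(probs: Sequence[Sequence[float]], weights: Sequence[float]) -> int:
    """Return the class index m that maximizes sum_i w_i * P_i(class m | x).

    probs[i][m] is model i's probability for class m; weights[i] is model i's weight.
    """
    if len(probs) != len(weights):
        raise ValueError("one weight per model is required")
    num_classes = len(probs[0])
    fused: List[float] = [
        sum(w * p[m] for w, p in zip(weights, probs)) for m in range(num_classes)
    ]
    return max(range(num_classes), key=fused.__getitem__)
```

With three models predicting `[0.7, 0.3]`, `[0.4, 0.6]`, and `[0.2, 0.8]`, equal weights favor the second class, whereas weights of `[0.6, 0.2, 0.2]` favor the first, which shows how the weights let a more trusted model outvote the others.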

In the invention, the fusion result is used as the category of the data to be labeled. The category can further be used to label the data to be labeled; the labeled data is added to the labeled data set, and the data set with the newly labeled data is used as the training set to iteratively train the plurality of NER models of different types until the performance of the NER models is stable and no longer improves.

Example two

As shown in FIG. 3, an embodiment of the present invention provides a medical named entity identification method, including:

inputting data into a plurality of named entity recognition NER models to obtain a plurality of recognition results, wherein the plurality of NER models are trained according to the following method provided in Example one:

labeling the data to be labeled with the result obtained by fusing the predictions of the plurality of NER models, adding the labeled data to the labeled data set, and iteratively training the plurality of NER models of different types with the data set containing the newly labeled data as the training set, until the performance of the NER models is stable and no longer improves.

Specifically, the method described in Example one may be used to fuse the plurality of recognition results obtained with the plurality of NER models into a final entity recognition result, using the following formula:

$$c^{*} = \arg\max_{c_m} \sum_{i=1}^{N} w_i \, P_{\theta_i}(c_m \mid x)$$

In the formula, $c^{*}$ is the final category of the unlabeled data item $x$, $N$ is the number of NER models, $\theta_i$ is the $i$-th NER model, $c_m$ is the $m$-th entity class, $P_{\theta_i}(c_m \mid x)$ is the probability, predicted by the $i$-th NER model, that $x$ belongs to the $m$-th category, and $w_i$ is the weight of the $i$-th NER model.

Example three

As shown in fig. 4, another aspect of the present invention further includes a functional module architecture completely corresponding to the foregoing method flow, that is, an embodiment of the present invention further provides a medical named entity recognition apparatus, including:

the NER model training module 401 is used for training a plurality of named entity recognition NER models of different types by using the labeling data set;

a to-be-labeled data selection module 402, configured to select, based on the plurality of NER models, data to be labeled from unlabeled data by using an active learning method;

a to-be-labeled data category prediction module 403, configured to use the plurality of NER models to respectively predict categories of the to-be-labeled data;

and a prediction result fusion module 404, configured to fuse the prediction results to obtain the category of the data to be labeled.

Wherein, in the NER model training module, the plurality of NER models of different types include: deep learning models, statistical learning models, and/or knowledge-based models.

In the to-be-labeled data selection module, selecting, based on the plurality of NER models, to-be-labeled data from unlabeled data by using an active learning method includes:

calculating the distribution consistency of the unlabeled data in each category, and determining the data to be labeled from all the unlabeled data according to the consistency, using the following formula:

$$x^{*} = \arg\max_{x} \frac{1}{M} \sum_{m=1}^{M} D\big(P_{\theta_i}(c_m \mid x) \,\|\, P_{\theta_j}(c_m \mid x)\big)$$

In the formula, $x$ is an unlabeled data item, $c_m$ is the $m$-th entity class, $M$ is the total number of entity classes, $\theta_i$ is the $i$-th NER model, $P_{\theta_i}(c_m \mid x)$ is the probability, predicted by the $i$-th NER model, that $x$ belongs to the $m$-th category, $\theta_j$ is the $j$-th NER model, $P_{\theta_j}(c_m \mid x)$ is the probability, predicted by the $j$-th NER model, that $x$ belongs to the $m$-th category, $D$ is the KL distance between the two distributions, and $x^{*}$ is the data item with the largest KL distance among all the unlabeled data.

In the prediction result fusion module, the predicted results are fused by using the following formula:

$$c^{*} = \arg\max_{c_m} \sum_{i=1}^{N} w_i \, P_{\theta_i}(c_m \mid x)$$

In the formula, $c^{*}$ is the final category of the unlabeled data item $x$, $N$ is the number of NER models, $\theta_i$ is the $i$-th NER model, $c_m$ is the $m$-th entity class, $P_{\theta_i}(c_m \mid x)$ is the probability, predicted by the $i$-th NER model, that $x$ belongs to the $m$-th category, and $w_i$ is the weight of the $i$-th NER model.

The medical named entity recognition device provided by the embodiment of the invention further comprises a model optimization module, wherein the model optimization module is used for labeling the data to be labeled by utilizing the obtained categories, adding the data to be labeled into the labeled data set, and training a plurality of NER models in an iterative manner until the performance of the NER models is stable.

The device can be implemented by the medical named entity identification method provided in the first embodiment, and specific implementation methods can be referred to the description in the first embodiment and are not described herein again.

The invention also provides a memory storing a plurality of instructions for implementing the method according to the first embodiment.

The invention also provides an electronic device comprising a processor and a memory connected to the processor, wherein the memory stores a plurality of instructions, and the instructions can be loaded and executed by the processor to enable the processor to execute the method according to the first embodiment.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A medical named entity recognition method, comprising:

2. The medical named entity recognition method of claim 1, wherein the plurality of named entity recognition NER models of different types comprises: deep learning models, statistical learning models, and/or knowledge-based models.

3. The medical named entity recognition method of claim 1, wherein selecting data to be labeled from unlabeled data using an active learning method based on a plurality of the NER models comprises:

4. The medical named entity recognition method according to claim 3, wherein the consistency of distribution of each unlabeled data in each category is calculated, and the data to be labeled is determined from all the unlabeled data according to the consistency by using the following formula:

$$x^{*} = \arg\max_{x} \frac{1}{M} \sum_{m=1}^{M} D\big(P_{\theta_i}(c_m \mid x) \,\|\, P_{\theta_j}(c_m \mid x)\big)$$

In the formula, $x$ is an unlabeled data item, $c_m$ is the $m$-th entity class, $M$ is the total number of entity classes, $\theta_i$ is the $i$-th NER model, $P_{\theta_i}(c_m \mid x)$ is the probability, predicted by the $i$-th NER model, that $x$ belongs to the $m$-th category, $\theta_j$ is the $j$-th NER model, $P_{\theta_j}(c_m \mid x)$ is the probability, predicted by the $j$-th NER model, that $x$ belongs to the $m$-th category, $D$ is the KL distance between the two distributions, and $x^{*}$ is the data item with the largest KL distance among all the unlabeled data.

5. The medical named entity recognition method of claim 1, wherein the fusing of the predicted results to obtain the category of the data to be labeled employs the following formula:

$$c^{*} = \arg\max_{c_m} \sum_{i=1}^{N} w_i \, P_{\theta_i}(c_m \mid x)$$

In the formula, $c^{*}$ is the final category of the unlabeled data item $x$, $N$ is the number of NER models, $\theta_i$ is the $i$-th NER model, $c_m$ is the $m$-th entity class, $P_{\theta_i}(c_m \mid x)$ is the probability, predicted by the $i$-th NER model, that $x$ belongs to the $m$-th category, and $w_i$ is the weight of the $i$-th NER model; the weights $w_i$ are learnable parameters.

6. The medical named entity recognition method of claim 1, further comprising the steps of:

7. A medical named entity recognition method, comprising:

inputting data into a plurality of named entity recognition NER models to obtain a plurality of recognition results; a plurality of the NER models are trained according to the method of claim 6;

8. A medical named entity recognition apparatus, comprising:

9. A memory storing a plurality of instructions for implementing the method of any one of claims 1-7.

10. An electronic device comprising a processor and a memory coupled to the processor, the memory storing a plurality of instructions that are loadable and executable by the processor to enable the processor to perform the method according to any of claims 1-7.