CN114169338A - Medical named entity identification method and device and electronic equipment

Info

Publication number
CN114169338A
Authority
CN
China
Prior art keywords
data
ner
labeled
models
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210125810.4A
Other languages
Chinese (zh)
Other versions
CN114169338B (en
Inventor
安波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhiyuan Artificial Intelligence Research Institute
Original Assignee
Beijing Zhiyuan Artificial Intelligence Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhiyuan Artificial Intelligence Research Institute
Priority to CN202210125810.4A
Publication of CN114169338A
Application granted
Publication of CN114169338B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/103: Formatting, i.e. changing of presentation of documents
    • G06F40/117: Tagging; Marking up; Designating a block; Setting of attributes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a medical named entity identification method, a medical named entity identification device, and electronic equipment. The method comprises the following steps: training a plurality of named entity recognition (NER) models of different types on a labeled data set; selecting data to be labeled from unlabeled data by an active learning method based on the plurality of NER models; predicting the category of the data to be labeled with each of the NER models; and fusing the predicted results to obtain the category of the data to be labeled. The technical scheme achieves, with a small amount of labeled data, performance comparable to that obtained with a large amount of data. Data from actual use show that, with 10% of the data labeled, the method reaches about 90% of the performance obtained with the full data. The method therefore meets the practical needs of information extraction applications in medical scenarios that lack sufficient labeled data.

Description

Medical named entity identification method and device and electronic equipment
Technical Field
The invention relates to the technical field of medical data processing, in particular to a medical named entity identification method and device and electronic equipment.
Background
Named entity recognition (NER) in the medical field is a foundation for constructing medical knowledge graphs and medical big data, and an important basis for intelligent case analysis and intelligent medicine.
At present, medical NER is mainly implemented with deep learning, which needs a large amount of labeled data to train the model. Because medical data are private and sensitive, medical data are scarce, and data labeled for named entity recognition are scarce as well. Deep learning therefore hits a major bottleneck on the medical NER task and cannot handle medical NER when only a small amount of labeled data is available.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides the following technical scheme.
In a first aspect, the invention provides a medical named entity identification method, which comprises the following steps:
training a plurality of named entity recognition (NER) models of different types on a labeled data set;
selecting data to be labeled from unlabeled data by an active learning method based on the plurality of NER models;
predicting the category of the data to be labeled with each of the plurality of NER models;
and fusing the predicted results to obtain the category of the data to be labeled.
Preferably, the plurality of named entity recognition NER models of different types comprises: deep learning models, statistical learning models, and/or knowledge-based models.
Preferably, the selecting data to be labeled from unlabeled data by using an active learning method based on the plurality of NER models includes:
respectively predicting the distribution of the unlabeled data in each category by using each NER model;
calculating the distribution consistency of the unlabeled data in each category;
and determining the data to be labeled from all the unlabeled data according to the consistency.
Preferably, the consistency of the distribution of each unlabeled data item over the categories is calculated, and the data to be labeled is determined from all the unlabeled data according to the consistency, using the following formula:
$$x^{*} = \arg\max_{x} \frac{1}{M} \sum_{m=1}^{M} D\big(P_{\theta_i}(c_m \mid x) \,\|\, P_{\theta_j}(c_m \mid x)\big)$$
where $x$ is the unlabeled data, $c_m$ is the m-th entity class, $M$ is the total number of entity classes, $\theta_i$ is the i-th NER model, $P_{\theta_i}(c_m \mid x)$ is the probability, predicted by the i-th NER model, that $x$ belongs to the m-th category, $\theta_j$ is the j-th NER model, $P_{\theta_j}(c_m \mid x)$ is the probability, predicted by the j-th NER model, that $x$ belongs to the m-th category, $D$ is the KL distance between the two distributions, and $x^{*}$ is the data with the largest KL distance among all the unlabeled data.
Preferably, the predicted results are fused to obtain the category of the data to be labeled, using the following formula:
$$\hat{c} = \arg\max_{c_m} \sum_{i=1}^{N} w_i \, P_{\theta_i}(c_m \mid x)$$
where $\hat{c}$ is the final category of the unlabeled data $x$, $N$ is the number of NER models, $\theta_i$ is the i-th NER model, $c_m$ is the m-th entity class, $P_{\theta_i}(c_m \mid x)$ is the probability, predicted by the i-th NER model, that $x$ belongs to the m-th category, and $w_i$ is the weight of the i-th NER model; the weights $w_i$ are learnable parameters.
Preferably, the method further comprises the steps of:
and labeling the data to be labeled by using the obtained categories, adding the data to be labeled into the labeled data set, and iteratively training a plurality of NER models.
In a second aspect, the invention provides a medical named entity identification method, which comprises the following steps:
inputting data into a plurality of named entity recognition (NER) models to obtain a plurality of recognition results, the plurality of NER models having been trained according to the method described above;
and fusing the plurality of recognition results to obtain a final entity recognition result.
A third aspect of the present invention provides a medical named entity recognition apparatus, comprising:
the NER model training module is used for training a plurality of named entity recognition NER models of different types by utilizing the labeling data set;
the to-be-labeled data selection module is used for selecting data to be labeled from the unlabeled data by utilizing an active learning method based on the NER models;
the data to be labeled category prediction module is used for predicting the category of the data to be labeled by utilizing the NER models respectively;
and the prediction result fusion module is used for fusing the prediction result to obtain the category of the data to be labeled.
The invention also provides a memory storing a plurality of instructions for implementing the method as described above.
The invention also provides an electronic device comprising a processor and a memory connected to the processor, wherein the memory stores a plurality of instructions which can be loaded and executed by the processor to enable the processor to execute the method.
The beneficial effects of the invention are as follows. In the technical scheme provided by the invention, a plurality of NER models are trained on a small amount of labeled medical data; based on these NER models, an active learning method selects the unlabeled data about which the models are most uncertain; the prediction results of the NER models are fused to give the data a label; and finally the newly labeled data are added to the training data set to optimize the models. In this way, a small amount of labeled data achieves performance comparable to that obtained with a large amount of data. Data from actual use show that, with 10% of the data labeled, the method reaches about 90% of the performance obtained with the full data. The method therefore meets the practical needs of information extraction applications in medical scenarios that lack sufficient labeled data.
Drawings
FIG. 1 is a schematic flow chart of a medical named entity recognition method according to the present invention;
FIG. 2 is a schematic diagram of an exemplary implementation of the medical named entity recognition method according to the invention;
FIG. 3 is a schematic view illustrating a process of identifying a named entity in unlabeled data according to the present invention;
FIG. 4 is a functional structure diagram of the medical named entity recognition device according to the present invention.
Detailed Description
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
The method provided by the invention can be implemented in the following terminal environment, and the terminal can comprise one or more of the following components: a processor, a memory, and a display screen. Wherein the memory has stored therein at least one instruction that is loaded and executed by the processor to implement the methods described in the embodiments described below.
A processor may include one or more processing cores. The processor connects the various parts of the terminal through various interfaces and lines, and performs the functions of the terminal and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory and by calling data stored in the memory.
The memory may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory may be used to store instructions, programs, code sets, or instruction sets.
The display screen is used for displaying user interfaces of all the application programs.
In addition, those skilled in the art will appreciate that the above-described terminal configurations are not intended to be limiting, and that the terminal may include more or fewer components, or some components may be combined, or a different arrangement of components. For example, the terminal further includes a radio frequency circuit, an input unit, a sensor, an audio circuit, a power supply, and other components, which are not described herein again.
Example one
As shown in fig. 1-2, an embodiment of the present invention provides a medical named entity identification method, including:
S101, training by using a labeling data set to obtain a plurality of named entity recognition NER models of different types;
S102, selecting data to be labeled from unlabeled data by using an active learning method based on the NER models;
S103, predicting the category of the data to be labeled by utilizing the NER models respectively;
and S104, fusing the predicted results to obtain the category of the data to be labeled.
At present, owing to the particularity of the medical industry, the medical named entity recognition task has little data and even less labeled data. Existing models can use only the small amount of labeled data and cannot make full use of the large amount of unlabeled data; moreover, a single active learning method is usually used, so the advantages of combining different types of models are not fully exploited.
The method of the invention is proposed in view of the particularity of medical data and the above problems in the prior art. It addresses the shortage of labeled data by making full use of massive unlabeled data and the complementarity of multiple models, thereby improving the performance of medical named entity recognition. Specifically, a small amount of labeled data is used to train a plurality of NER models of different types; based on these NER models, an active learning method selects the unlabeled data with the strongest uncertainty as the data to be labeled; the prediction results of the NER models are then fused to give the data a label; and finally the newly labeled data are added to the training data set to optimize the models.
The method provided by the invention uses a small amount of labeled medical data and a combination of several different active learning strategies to select, from the unlabeled data, the data about which the models are most uncertain, gives the data a label by fusing the prediction results of the several models, and finally adds the labeled data to the training data set to optimize the models. In this way, a small amount of labeled data achieves performance comparable to that obtained with a large amount of data. Data from actual use show that, with 10% of the data labeled, the method reaches about 90% of the performance obtained with the full data. The method therefore meets the practical needs of information extraction applications in medical scenarios that lack sufficient labeled data.
In step S101, the labeled data set initially contains little labeled data, because labeled data in the medical field are scarce. As the method is carried out, however, once the category of unlabeled data is obtained, the data can be labeled and added to the labeled data set, so the labeled data set keeps growing and the trained models perform better and better until their performance stabilizes.
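The iteration described above (train, select, fuse, add to the labeled set, retrain until performance stops improving) can be sketched as a plain loop. This is a minimal sketch under stated assumptions, not the patented implementation: the names `self_training_loop`, `trainers`, `select`, `fuse`, `evaluate`, and the `patience` stopping rule are all hypothetical, and one item is added per round only for simplicity.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type aliases for this sketch (not taken from the patent text).
Example = str
Label = str
Model = Callable[[Example], Label]                      # a trained predictor
Trainer = Callable[[List[Tuple[Example, Label]]], Model]


def self_training_loop(
    trainers: Sequence[Trainer],
    labeled: List[Tuple[Example, Label]],
    unlabeled: List[Example],
    select: Callable[[Sequence[Model], List[Example]], Example],
    fuse: Callable[[Sequence[Model], Example], Label],
    evaluate: Callable[[Sequence[Model]], float],
    patience: int = 2,
) -> List[Model]:
    """Repeat steps S101-S104 until the score stops improving (assumed stopping rule)."""
    labeled = list(labeled)   # copy so the caller's lists are not modified
    pool = list(unlabeled)
    models = [train(labeled) for train in trainers]      # S101: train every model type
    best, stale = evaluate(models), 0
    while pool and stale < patience:
        x = select(models, pool)                         # S102: pick the data to be labeled
        y = fuse(models, x)                              # S103 + S104: predict and fuse
        pool.remove(x)
        labeled.append((x, y))                           # grow the labeled data set
        models = [train(labeled) for train in trainers]  # retrain on the enlarged set
        score = evaluate(models)
        if score > best:
            best, stale = score, 0
        else:
            stale += 1                                   # no improvement this round
    return models
```

Passing the selection and fusion steps in as callables keeps the loop independent of how disagreement is measured or how the model outputs are combined.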
Because relatively little training data is available, in a preferred embodiment of the invention the plurality of NER models of different types are trained on the small amount of labeled data using a pre-trained language model plus fine-tuning.
The plurality of NER models of different types obtained by training may include deep learning models, statistical learning models, and/or knowledge-based models. Each of these NER models may be of a single type, or may combine several types of model in one NER model. As an example, the plurality of NER models may include: the FCRF model, Emb + MLP, Bert + CRF, Bert + BiLSTM + CRF, the FLAT model, GlobalPointer, and the Prompt model. Of these, FCRF is a statistical learning model; Emb + MLP, Bert + CRF, and Bert + BiLSTM + CRF combine a statistical learning model with a deep learning model; and the FLAT model, GlobalPointer, and Prompt are deep learning models.
These models, or combinations of models, of different types embody several different active learning strategies, so that the individual strategies complement one another and compensate for the shortage of training data.
The FCRF model combines statistical-learning features with a CRF. The features can be chosen from commonly used ones, such as context-window words, word length, and other prior information. Emb + MLP can be obtained entirely from the existing training data; Bert + MLP can directly use the information in the pre-trained language model; Bert + CRF uses the CRF to better model the input sequence; Bert + BiLSTM + CRF uses the BiLSTM to better model context information; the FLAT model can use position information to better model lexical context; the GlobalPointer model can model nested and non-nested named entities at the same time; and the Prompt model can use a pre-trained language model (PLM) to convert NER into a generation problem, modeling named entity recognition from the perspective of text generation. Models obtained by combining these different active learning strategies can exploit the training data from different angles, capturing information beyond the data itself as well as entities of different types and lengths. These NER models (the FCRF model, Emb + MLP, Bert + CRF, Bert + BiLSTM + CRF, the FLAT model, GlobalPointer, and the Prompt model) are therefore highly complementary. Consequently, when data to be labeled is selected from the unlabeled data by an active learning method based on a plurality of NER models of different types, the complementarity among the models can be used to determine the most valuable data to be labeled.
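As an illustration of the kind of hand-crafted features mentioned for the feature + CRF model (context-window words and word length), the sketch below builds one feature dictionary per token. The function names, feature keys, and boundary markers are hypothetical, and the output is only the input that a CRF toolkit would consume; no CRF is trained here.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int, window: int = 1) -> Dict[str, object]:
    """Features for token i: the token itself, its length, and its context-window words."""
    feats: Dict[str, object] = {
        "word": tokens[i],
        "length": len(tokens[i]),
    }
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        key = f"word[{offset:+d}]"   # e.g. "word[-1]" or "word[+1]"
        # Use boundary markers when the window runs past the sentence edges.
        if j < 0:
            feats[key] = "<BOS>"
        elif j >= len(tokens):
            feats[key] = "<EOS>"
        else:
            feats[key] = tokens[j]
    return feats


def sentence_features(tokens: List[str], window: int = 1) -> List[Dict[str, object]]:
    """One feature dictionary per token, in sentence order."""
    return [token_features(tokens, i, window) for i in range(len(tokens))]
```

For Chinese medical text the tokens would typically be single characters or segmented words; the same windowing applies either way.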
In step S102, the selecting data to be labeled from unlabeled data by using an active learning method based on the plurality of NER models may include the following steps:
respectively predicting the distribution of the unlabeled data in each category by using each NER model;
calculating the distribution consistency of the unlabeled data in each category;
and determining the data to be labeled from all the unlabeled data according to the consistency.
Because the NER models obtained by training predict different distributions over the categories for the same unlabeled data, the invention finds the unlabeled data with the highest labeling value by calculating how consistent the distributions predicted by the several models are for the same unlabeled data. The lower the consistency between the models' distributions, the higher the uncertainty of the unlabeled data and the higher its labeling value.
First, one NER model is used to predict the probability that a given unlabeled data item belongs to each category; the other NER models are then used in turn to predict the same probabilities, yielding one probability distribution over the categories per NER model; finally, the consistency of these probability distributions is calculated, giving the consistency of the distribution of that unlabeled data item over the categories. The consistency for the other unlabeled data items is obtained in the same way. Finally, the unlabeled data item with the lowest consistency is taken as the most valuable data to be labeled.
As an embodiment, the plurality of NER models may include at least two of: the FCRF model, Emb + MLP, Bert + CRF, Bert + BiLSTM + CRF, the FLAT model, GlobalPointer, and the Prompt model.
In a preferred embodiment of the present invention, the consistency of the distributions is calculated based on the KL distance. KL distance is short for Kullback-Leibler divergence, also called relative entropy. It measures the difference between two probability distributions over the same event space. Therefore, the greater the KL distance, the lower the consistency.
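For reference, the standard definition of the Kullback-Leibler divergence between two discrete distributions over the same set of classes is given below. The symbols $P$, $Q$, and $c$ are generic notation chosen for this note, not taken from the patent.

```latex
D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \sum_{c} P(c)\,\log\frac{P(c)}{Q(c)}
```

The quantity is non-negative, equals zero only when $P$ and $Q$ are identical, and is not symmetric in its two arguments, so $D_{\mathrm{KL}}(P \,\|\, Q)$ and $D_{\mathrm{KL}}(Q \,\|\, P)$ generally differ.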
In a preferred embodiment of the present invention, the consistency of the distribution of each unlabeled data item over the categories is calculated, and the data to be labeled is determined from all the unlabeled data according to the consistency, using the following formula:
$$x^{*} = \arg\max_{x} \frac{1}{M} \sum_{m=1}^{M} D\big(P_{\theta_i}(c_m \mid x) \,\|\, P_{\theta_j}(c_m \mid x)\big)$$
where $x$ is the unlabeled data, $c_m$ is the m-th entity class, $M$ is the total number of entity classes, $\theta_i$ is the i-th NER model, $P_{\theta_i}(c_m \mid x)$ is the probability, predicted by the i-th NER model, that $x$ belongs to the m-th category, $\theta_j$ is the j-th NER model, $P_{\theta_j}(c_m \mid x)$ is the probability, predicted by the j-th NER model, that $x$ belongs to the m-th category, $D$ is the KL distance between the two distributions, and $x^{*}$ is the data with the largest KL distance among all the unlabeled data.
That is, for each data item $x$, the KL distance between the probabilities that the different models predict for $x$ belonging to the m-th class is calculated and then averaged over all $M$ entity classes. $\arg\max_{x}$ denotes the data at which the function that follows it attains its maximum, i.e., the data $x^{*}$ with the largest KL distance.
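The selection step can be sketched in a few lines of Python. This is a hedged illustration rather than the patented procedure: the text does not say how model pairs are combined, so the sketch assumes the KL distance is averaged over all ordered pairs of models as well as over the entity classes. The names `kl_divergence`, `disagreement`, and `select_most_uncertain` are hypothetical.

```python
import math
from itertools import permutations
from typing import Dict, Sequence

Distribution = Sequence[float]  # probabilities over the M entity classes


def kl_divergence(p: Distribution, q: Distribution, eps: float = 1e-12) -> float:
    """KL distance D(p || q); eps guards against log(0) and division by zero."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def disagreement(predictions: Sequence[Distribution]) -> float:
    """KL distance averaged over classes and over all ordered model pairs (assumption)."""
    if len(predictions) < 2:
        raise ValueError("at least two models are needed to measure disagreement")
    num_classes = len(predictions[0])
    pairs = list(permutations(range(len(predictions)), 2))
    total = sum(kl_divergence(predictions[i], predictions[j]) for i, j in pairs)
    return total / (num_classes * len(pairs))


def select_most_uncertain(pool: Dict[str, Sequence[Distribution]]) -> str:
    """Return the unlabeled item whose model predictions disagree the most."""
    return max(pool, key=lambda x: disagreement(pool[x]))
```

Here `pool` maps each unlabeled item to the list of class distributions predicted for it, one per NER model; the item with the largest averaged KL distance, that is, the lowest consistency, is returned.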
In step S103, the category of the data to be labeled is predicted with each of the NER models, so there are as many prediction results as there are NER models. For example, in a preferred embodiment of the present invention, the plurality of NER models may be 8 models, including an FCRF model, an Emb + MLP model, a Bert + CRF model, a Bert + BiLSTM + CRF model, a FLAT model, a GlobalPointer model, and a Prompt model, giving 8 prediction results.
In another preferred embodiment of the present invention, a dictionary-and-rule-based method (RULE) is additionally introduced, which determines the names and categories of entities by dictionary retrieval and text similarity calculation.
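A dictionary-plus-similarity lookup of this kind might look like the sketch below. The patent does not name a similarity measure, so the use of `difflib.SequenceMatcher` from the Python standard library, the `threshold` value, and the function name `rule_lookup` are assumptions made for illustration only.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


def rule_lookup(
    mention: str,
    dictionary: Dict[str, str],
    threshold: float = 0.8,
) -> Optional[Tuple[str, str]]:
    """Return (entity name, category) for a mention, or None if nothing is close enough."""
    # 1. Dictionary retrieval: an exact hit wins immediately.
    if mention in dictionary:
        return mention, dictionary[mention]
    # 2. Text similarity: fall back to the most similar dictionary entry.
    best_name, best_score = None, 0.0
    for name in dictionary:
        score = SequenceMatcher(None, mention, name).ratio()
        if score > best_score:
            best_name, best_score = name, score
    if best_name is not None and best_score >= threshold:
        return best_name, dictionary[best_name]
    return None
```

With a toy dictionary such as `{"aspirin": "DRUG", "diabetes": "DISEASE"}`, the misspelled mention `"asprin"` resolves to the `"aspirin"` entry, while an unrelated string yields `None`.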
In step S104, after the prediction results of the plurality of models are obtained, the prediction results of all models (the FCRF model, Emb + MLP, Bert + CRF, Bert + BiLSTM + CRF, the FLAT model, GlobalPointer, the Prompt model, and RULE) are fused using the idea of ensemble learning.
In a preferred embodiment of the present invention, the predicted results may be fused using the following formula:
$$\hat{c} = \arg\max_{c_m} \sum_{i=1}^{N} w_i \, P_{\theta_i}(c_m \mid x)$$
where $\hat{c}$ is the final category of the unlabeled data $x$, $N$ is the number of NER models, $\theta_i$ is the i-th NER model, $c_m$ is the m-th entity class, $P_{\theta_i}(c_m \mid x)$ is the probability, predicted by the i-th NER model, that $x$ belongs to the m-th category, and $w_i$ is the weight of the i-th NER model; the weights $w_i$ are learnable parameters. $\arg\max_{c_m}$ denotes the class at which the function that follows it attains its maximum.
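The weighted fusion can be illustrated with a short Python sketch. The function name `fuse_predictions` and the example class names are hypothetical; the weights are simply passed in here, whereas the text treats them as learnable parameters.

```python
from typing import Sequence


def fuse_predictions(
    predictions: Sequence[Sequence[float]],
    weights: Sequence[float],
    classes: Sequence[str],
) -> str:
    """Return the class with the largest weighted sum of per-model probabilities.

    predictions[i][m] is the probability model i assigns to class m.
    """
    if len(predictions) != len(weights):
        raise ValueError("one weight is required per model")
    scores = [
        sum(w * p[m] for w, p in zip(weights, predictions))
        for m in range(len(classes))
    ]
    best = max(range(len(classes)), key=scores.__getitem__)
    return classes[best]
```

With equal weights, three models predicting `[0.6, 0.3, 0.1]`, `[0.2, 0.7, 0.1]`, and `[0.3, 0.6, 0.1]` over the classes `DRUG`, `DISEASE`, `O` fuse to `DISEASE`; raising the first model's weight to 5 flips the result to `DRUG`.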
In the invention, the fusion result is used as the category of the data to be labeled. Furthermore, this category can be used to label the data to be labeled; the labeled data are added to the labeled data set, and the data set with the new labeled data is used as the training set to iteratively train the plurality of NER models of different types until their performance stabilizes and no longer improves.
Example two
As shown in fig. 3, an embodiment of the present invention provides a medical named entity identification method, including:
inputting data into a plurality of named entity recognition (NER) models to obtain a plurality of recognition results, wherein the plurality of NER models are trained according to the method provided in Example one, as follows:
labeling the data to be labeled with the result obtained by fusing the predictions of the plurality of NER models, adding the labeled data to the labeled data set, and iteratively training the plurality of NER models of different types with the data set containing the new labeled data as the training set, until the performance of the NER models stabilizes and no longer improves;
and fusing the plurality of recognition results to obtain a final entity recognition result.
Specifically, the fusion method described in Example one may be used to fuse the plurality of recognition results obtained with the plurality of NER models into the final entity recognition result, using the following formula:
$$\hat{c} = \arg\max_{c_m} \sum_{i=1}^{N} w_i \, P_{\theta_i}(c_m \mid x)$$
where $\hat{c}$ is the final category of the unlabeled data $x$, $N$ is the number of NER models, $\theta_i$ is the i-th NER model, $c_m$ is the m-th entity class, $P_{\theta_i}(c_m \mid x)$ is the probability, predicted by the i-th NER model, that $x$ belongs to the m-th category, and $w_i$ is the weight of the i-th NER model; the weights $w_i$ are learnable parameters. $\arg\max_{c_m}$ denotes the class at which the function that follows it attains its maximum.
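The text states that the model weights are learnable but does not say how they are learned. One possible approach, shown purely as an assumption, is to parameterize the weights as a softmax over free logits and fit them by gradient descent on the negative log-likelihood of the weighted mixture over held-out labeled examples. The function name `learn_fusion_weights` and the hyperparameters `steps` and `lr` are hypothetical.

```python
import math
from typing import List, Sequence


def learn_fusion_weights(
    predictions: Sequence[Sequence[Sequence[float]]],
    gold: Sequence[int],
    steps: int = 200,
    lr: float = 0.5,
) -> List[float]:
    """Fit one non-negative weight per model; the weights sum to 1.

    predictions[i][k] is model i's class distribution for example k,
    gold[k] is the index of the correct class for example k.
    """
    n_models = len(predictions)
    logits = [0.0] * n_models

    def softmax(values: Sequence[float]) -> List[float]:
        exps = [math.exp(v) for v in values]
        total = sum(exps)
        return [e / total for e in exps]

    for _ in range(steps):
        w = softmax(logits)
        # grad_w[i] = d(loss)/d(w_i) for loss = -mean(log mixture probability of gold class)
        grad_w = [0.0] * n_models
        for k, y in enumerate(gold):
            mix = sum(w[i] * predictions[i][k][y] for i in range(n_models)) + 1e-12
            for i in range(n_models):
                grad_w[i] -= predictions[i][k][y] / (mix * len(gold))
        # Chain rule through the softmax: d(loss)/d(logit_i) = w_i * (grad_w[i] - weighted mean)
        mean_grad = sum(w[i] * grad_w[i] for i in range(n_models))
        logits = [a - lr * w[i] * (grad_w[i] - mean_grad) for i, a in enumerate(logits)]
    return softmax(logits)
```

Under this scheme a model that assigns higher probability to the correct classes ends up with a larger weight, and the learned weights can then be used in the weighted fusion of the recognition results.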
Example three
As shown in FIG. 4, another aspect of the present invention provides a functional module architecture corresponding fully to the foregoing method flow; that is, an embodiment of the present invention further provides a medical named entity recognition apparatus, including:
the NER model training module 401 is used for training a plurality of named entity recognition NER models of different types by using the labeling data set;
a to-be-labeled data selection module 402, configured to select, based on the plurality of NER models, data to be labeled from unlabeled data by using an active learning method;
a to-be-labeled data category prediction module 403, configured to use the plurality of NER models to respectively predict categories of the to-be-labeled data;
and a prediction result fusion module 404, configured to fuse the prediction results to obtain the category of the data to be labeled.
Wherein, in the NER model training module, the plurality of NER models of different types include: deep learning models, statistical learning models, and/or knowledge-based models.
In the to-be-labeled data selection module, selecting, based on the plurality of NER models, to-be-labeled data from unlabeled data by using an active learning method includes:
respectively predicting the distribution of the unlabeled data in each category by using each NER model;
calculating the distribution consistency of the unlabeled data in each category;
and determining the data to be labeled from all the unlabeled data according to the consistency.
The distribution consistency of the unlabeled data over the categories is calculated, and the data to be labeled is determined from all the unlabeled data according to that consistency, using the following formula:
[Formula: image not reproduced in this text]
In the formula, the symbols denote, in order of appearance:
the unlabeled data;
the m-th entity class;
the total number of entity classes;
the i-th NER model;
the probability, predicted by the i-th NER model, that the unlabeled data belongs to the m-th class;
the other NER model of the pair being compared;
the probability, predicted by that other NER model, that the unlabeled data belongs to the m-th class;
D, the distance between the two distributions (the name of the distance measure survives only as an image);
the finally selected data, namely the unlabeled data with the largest such distance among all the unlabeled data.
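The selection step described above can be illustrated with a minimal Python sketch. It is not the formula of the patent, whose image is not reproduced here. It assumes that D is the Kullback-Leibler divergence, applied symmetrically and summed over every pair of models; the function names and the toy probabilities are hypothetical.

```python
import math
from itertools import combinations

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same entity classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def disagreement(distributions):
    """Assumed score: symmetric KL distance summed over every pair of models."""
    return sum(
        kl_divergence(p, q) + kl_divergence(q, p)
        for p, q in combinations(distributions, 2)
    )

def select_most_disputed(predictions):
    """predictions[k] holds one class distribution per model for unlabeled item k.

    Returns the index of the item on which the models disagree most.
    """
    scores = [disagreement(dists) for dists in predictions]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy data (hypothetical): two unlabeled items, three models, three classes.
predictions = [
    [[0.90, 0.05, 0.05], [0.85, 0.10, 0.05], [0.90, 0.05, 0.05]],  # models agree
    [[0.80, 0.10, 0.10], [0.10, 0.80, 0.10], [0.30, 0.30, 0.40]],  # models disagree
]
selected = select_most_disputed(predictions)
print(selected)
```

The second item is selected because its three distributions differ sharply, which is the low-consistency case that the active learning step targets.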
In the prediction result fusion module, the prediction results are fused using the following formula:
[Formula: image not reproduced in this text]
In the formula, the symbols denote, in order of appearance:
the final category of the unlabeled data;
the number of NER models;
the i-th NER model;
the m-th entity class;
the probability, predicted by the i-th NER model, that the unlabeled data belongs to the m-th class;
the weight of the i-th NER model, which is a learnable parameter.
argmax_c denotes the class at which the function that follows it takes its maximum value.
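As an illustration of the fusion step, the following Python sketch takes a weighted sum of the per-model class probabilities and returns the class with the largest fused score. The exact form of the patent's formula is not reproduced in this text, so the weighted sum is an assumption drawn from the symbol definitions above. The weights are fixed by hand here, whereas the text describes them as learnable; the class names and numbers are hypothetical.

```python
def fuse_predictions(distributions, weights):
    """Weighted sum of per-model class probabilities, followed by argmax.

    distributions[i][m] is the probability that model i assigns to class m;
    weights[i] is the weight of model i.
    """
    num_classes = len(distributions[0])
    fused = [
        sum(w * dist[m] for w, dist in zip(weights, distributions))
        for m in range(num_classes)
    ]
    best = max(range(num_classes), key=fused.__getitem__)
    return best, fused

classes = ["disease", "drug", "symptom"]  # hypothetical entity classes
distributions = [
    [0.6, 0.3, 0.1],  # e.g. a deep learning model
    [0.2, 0.7, 0.1],  # e.g. a statistical learning model
    [0.5, 0.4, 0.1],  # e.g. a knowledge-based model
]
weights = [0.5, 0.2, 0.3]  # fixed for illustration; learnable in the text

best, fused = fuse_predictions(distributions, weights)
print(classes[best], fused)
```

With equal weights the same inputs would yield "drug" instead of "disease", which shows why the choice of weights matters for the fused category.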
The medical named entity recognition device provided by the embodiment of the invention further comprises a model optimization module, wherein the model optimization module is used for labeling the data to be labeled by utilizing the obtained categories, adding the data to be labeled into the labeled data set, and training a plurality of NER models in an iterative manner until the performance of the NER models is stable.
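The iterative optimization described above can be sketched as a loop: retrain every model, select the most disputed unlabeled item, label it with the fused category, and add it to the labeled set. The sketch below is only an illustration under stated assumptions. `KeywordModel` is a hypothetical stand-in for a real NER model, the disagreement score is a one-directional KL divergence summed over model pairs, and a fixed round budget replaces the "performance is stable" stopping test, which this section does not specify.

```python
import math
from itertools import combinations

class KeywordModel:
    """Hypothetical stand-in for an NER model: scores classes by token overlap."""

    def __init__(self, smoothing):
        self.smoothing = smoothing
        self.classes = []
        self.counts = {}

    def fit(self, labeled):
        self.classes = sorted({label for _, label in labeled})
        self.counts = {c: {} for c in self.classes}
        for text, label in labeled:
            for tok in text.split():
                self.counts[label][tok] = self.counts[label].get(tok, 0) + 1

    def predict_proba(self, text):
        scores = [
            self.smoothing + sum(self.counts[c].get(t, 0) for t in text.split())
            for c in self.classes
        ]
        total = sum(scores)
        return [s / total for s in scores]

def pairwise_kl(dists, eps=1e-12):
    """Assumed disagreement score: KL divergence summed over model pairs."""
    return sum(
        sum(p * math.log((p + eps) / (q + eps)) for p, q in zip(a, b))
        for a, b in combinations(dists, 2)
    )

def active_learning_rounds(models, labeled, unlabeled, weights, max_rounds=10):
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):  # round budget stands in for the stability test
        if not unlabeled:
            break
        for model in models:
            model.fit(labeled)  # retrain every model on the current labeled set
        preds = [[m.predict_proba(x) for m in models] for x in unlabeled]
        # Select the item on which the models disagree most.
        k = max(range(len(unlabeled)), key=lambda i: pairwise_kl(preds[i]))
        # Fuse the per-model distributions and take the best class as the label.
        n = len(preds[k][0])
        fused = [sum(w * d[c] for w, d in zip(weights, preds[k])) for c in range(n)]
        label = models[0].classes[max(range(n), key=fused.__getitem__)]
        labeled.append((unlabeled.pop(k), label))
    return labeled

models = [KeywordModel(0.1), KeywordModel(1.0)]
seed = [("aspirin tablet", "drug"), ("severe headache", "symptom")]
pool = ["aspirin tablet dose", "mild headache"]
final = active_learning_rounds(models, seed, pool, weights=[0.5, 0.5])
result = dict(final)
print(result)
```

In a real system the round budget would be replaced by a check on held-out performance, so that training stops once further rounds no longer change the models' results.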
The device can be implemented using the medical named entity recognition method provided in the first embodiment. For the specific implementation, refer to the description of the first embodiment, which is not repeated here.
The invention also provides a memory storing a plurality of instructions for implementing the method according to the first embodiment.
The invention also provides an electronic device comprising a processor and a memory connected to the processor, wherein the memory stores a plurality of instructions, and the instructions can be loaded and executed by the processor to enable the processor to execute the method according to the first embodiment.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A medical named entity recognition method, comprising:
training a plurality of named entity recognition (NER) models of different types by using a labeled data set;
selecting data to be labeled from unlabeled data by using an active learning method based on a plurality of NER models;
predicting the category of the data to be labeled by utilizing a plurality of NER models respectively;
and fusing the predicted results to obtain the category of the data to be labeled.
2. The medical named entity recognition method of claim 1, wherein the plurality of named entity recognition NER models of different types comprises: deep learning models, statistical learning models, and/or knowledge-based models.
3. The medical named entity recognition method of claim 1, wherein selecting data to be labeled from unlabeled data using an active learning method based on a plurality of the NER models comprises:
respectively predicting the distribution of the unlabeled data in each category by using each NER model;
calculating the distribution consistency of the unlabeled data in each category;
and determining the data to be labeled from all the unlabeled data according to the consistency.
4. The medical named entity recognition method according to claim 3, wherein the distribution consistency of the unlabeled data over the categories is calculated, and the data to be labeled is determined from all the unlabeled data according to the consistency, by using the following formula:
[Formula: image not reproduced in this text]
in the formula, the symbols denote, in order of appearance:
the unlabeled data;
the m-th entity class;
the total number of entity classes;
the i-th NER model;
the probability, predicted by the i-th NER model, that the unlabeled data belongs to the m-th class;
the other NER model of the pair being compared;
the probability, predicted by that other NER model, that the unlabeled data belongs to the m-th class;
D, the distance between the two distributions (the name of the distance measure survives only as an image);
the finally selected data, namely the unlabeled data with the largest such distance among all the unlabeled data.
5. The medical named entity recognition method of claim 1, wherein the fusing of the predicted results to obtain the category of the data to be labeled employs the following formula:
[Formula: image not reproduced in this text]
in the formula, the symbols denote, in order of appearance:
the final category of the unlabeled data;
the number of NER models;
the i-th NER model;
the m-th entity class;
the probability, predicted by the i-th NER model, that the unlabeled data belongs to the m-th class;
the weight of the i-th NER model, which is a learnable parameter.
6. The medical named entity recognition method of claim 1, further comprising the steps of:
and labeling the data to be labeled by using the obtained categories, adding the data to be labeled into the labeled data set, and iteratively training a plurality of NER models.
7. A medical named entity recognition method, comprising:
inputting data into a plurality of named entity recognition (NER) models to obtain a plurality of recognition results, the plurality of NER models being trained according to the method of claim 6;
and fusing the plurality of identification results to obtain a final entity identification result.
8. A medical named entity recognition apparatus, comprising:
the NER model training module is used for training a plurality of named entity recognition NER models of different types by utilizing the labeling data set;
the to-be-labeled data selection module is used for selecting data to be labeled from the unlabeled data by utilizing an active learning method based on the NER models;
the data to be labeled category prediction module is used for predicting the category of the data to be labeled by utilizing the NER models respectively;
and the prediction result fusion module is used for fusing the prediction result to obtain the category of the data to be labeled.
9. A memory storing a plurality of instructions for implementing the method of any one of claims 1-7.
10. An electronic device comprising a processor and a memory coupled to the processor, the memory storing a plurality of instructions that are loadable and executable by the processor to enable the processor to perform the method according to any of claims 1-7.
CN202210125810.4A 2022-02-10 2022-02-10 Medical named entity identification method and device and electronic equipment Active CN114169338B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210125810.4A CN114169338B (en) 2022-02-10 2022-02-10 Medical named entity identification method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210125810.4A CN114169338B (en) 2022-02-10 2022-02-10 Medical named entity identification method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN114169338A true CN114169338A (en) 2022-03-11
CN114169338B CN114169338B (en) 2022-05-17

Family

ID=80489602

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210125810.4A Active CN114169338B (en) 2022-02-10 2022-02-10 Medical named entity identification method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114169338B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114580422A (en) * 2022-03-14 2022-06-03 昆明理工大学 Named entity identification method combining two-stage classification of neighbor analysis
CN117577348A (en) * 2024-01-15 2024-02-20 中国医学科学院医学信息研究所 Identification method and related device for evidence-based medical evidence

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062215A (en) * 2019-12-10 2020-04-24 金蝶软件(中国)有限公司 Named entity recognition method and device based on semi-supervised learning training
CN111797629A (en) * 2020-06-23 2020-10-20 平安医疗健康管理股份有限公司 Medical text data processing method and device, computer equipment and storage medium
CN112001177A (en) * 2020-08-24 2020-11-27 浪潮云信息技术股份公司 Electronic medical record named entity identification method and system integrating deep learning and rules
CN113343696A (en) * 2021-05-31 2021-09-03 郑州大学第一附属医院 Electronic medical record named entity identification method, device, remote terminal and system
WO2021218024A1 (en) * 2020-04-29 2021-11-04 平安科技(深圳)有限公司 Method and apparatus for training named entity recognition model, and computer device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062215A (en) * 2019-12-10 2020-04-24 金蝶软件(中国)有限公司 Named entity recognition method and device based on semi-supervised learning training
WO2021218024A1 (en) * 2020-04-29 2021-11-04 平安科技(深圳)有限公司 Method and apparatus for training named entity recognition model, and computer device
CN111797629A (en) * 2020-06-23 2020-10-20 平安医疗健康管理股份有限公司 Medical text data processing method and device, computer equipment and storage medium
CN112001177A (en) * 2020-08-24 2020-11-27 浪潮云信息技术股份公司 Electronic medical record named entity identification method and system integrating deep learning and rules
CN113343696A (en) * 2021-05-31 2021-09-03 郑州大学第一附属医院 Electronic medical record named entity identification method, device, remote terminal and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
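Zeng Yuting (曾钰婷): "Chinese Medical Entity Recognition Method Based on Active Learning", China Excellent Doctoral and Master's Theses Full-text Database (Master's), Medicine and Health Sciences *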

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114580422A (en) * 2022-03-14 2022-06-03 昆明理工大学 Named entity identification method combining two-stage classification of neighbor analysis
CN117577348A (en) * 2024-01-15 2024-02-20 中国医学科学院医学信息研究所 Identification method and related device for evidence-based medical evidence
CN117577348B (en) * 2024-01-15 2024-03-29 中国医学科学院医学信息研究所 Identification method and related device for evidence-based medical evidence

Also Published As

Publication number Publication date
CN114169338B (en) 2022-05-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant