CN115859984B - Medical named entity recognition model training method, device, equipment and medium


Info

Publication number
CN115859984B
Authority
CN
China
Prior art keywords
text data
named entity
training data
medical
training
Prior art date
Legal status
Active
Application number
CN202211656436.7A
Other languages
Chinese (zh)
Other versions
CN115859984A (en)
Inventor
刘京华
左塞
Current Assignee
Beijing Yiyong Technology Co., Ltd.
Original Assignee
Beijing Yiyong Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Beijing Yiyong Technology Co., Ltd.
Priority to CN202211656436.7A
Publication of CN115859984A
Application granted
Publication of CN115859984B
Legal status: Active

Abstract

The present disclosure provides a semi-supervised medical named entity recognition model training method, apparatus, device, and medium. The method includes: constructing a large amount of unlabeled classification training data; inputting the classification training data into a classification model comprising a first encoding module and a first loss function module for training; training the classification model multiple times based on the results of the first loss function module to optimize a first encoding parameter set of the first encoding module; generating a small amount of annotated training data, where the annotations mark the medical named entities in the data; inputting the annotated training data into a named entity recognition model comprising a second encoding module and a second loss function module for training, where the second encoding module uses the optimized first encoding parameter set as the initial value of its second encoding parameter set; and training the named entity recognition model multiple times with the annotated training data, based on the results of the second loss function module, to obtain an optimized second encoding parameter set.

Description

Medical named entity recognition model training method, device, equipment and medium
Technical Field
The present disclosure relates to the field of data processing, and more particularly, to a semi-supervised medical named entity recognition model training method, apparatus, device, and medium.
Background
Medical texts typically contain medical named entities and therefore often serve as core data for constructing medical information systems. How to structure and normalize medical text is thus fundamental to data processing in the medical field.
With the rapid development of big data and artificial intelligence technology and the gradual maturing of related applications, medical data analysis has made new progress. Techniques such as medical health information extraction and knowledge discovery are important research directions for data processing in the medical field, and for these techniques it is essential to identify medical named entities in medical text data.
Because medical text data is generally multi-source, heterogeneous, complex, and massive, quickly and accurately identifying medical named entities in it to meet the information needs of different clinics and users is a great challenge. Existing medical named entity recognition methods apply natural language processing to the medical field. On the one hand, the conventional machine learning models employed in such methods typically require a large amount of manually labeled data to perform well. On the other hand, labeling medical named entities is generally more difficult than labeling named entities in the general domain, mainly because: (1) the medical field contains a large number of entity concepts, making the entity recognition task heavy; (2) entity concepts are strongly context-constrained, and the same entity word may have different entity types in different contexts; (3) entity lengths vary greatly; some disease and drug names are very long, even exceeding 10 characters, while other entities contain only 1 character; and (4) entities may contain or overlap one another. Therefore, in the prior art, large amounts of data must be annotated at high labor cost to train a machine learning model that recognizes medical named entities well.
Therefore, a new medical named entity recognition method is needed to solve the above technical problems.
Disclosure of Invention
In view of the above problems, the present disclosure provides a semi-supervised medical named entity recognition model training method, apparatus, device, and medium. A large amount of unlabeled classification training data is used to train a classification model to obtain good classification model parameters, and those parameters are transferred to a medical named entity recognition model, so that a named entity recognition model with good recognition performance can be obtained even when only a small amount of labeled data is available for training it, saving labor cost.
According to one aspect of the present disclosure, there is provided a medical named entity recognition model training method based on semi-supervised learning, including: acquiring a first text data set, and performing first preprocessing on each text data in the first text data set to construct a first number of unlabeled classification training data; inputting the first quantity of classification training data into a classification model for training to obtain a trained classification model, wherein the classification model comprises a first coding module and a first loss function module; training the classification model using the first amount of classification training data multiple times to optimize a first set of encoding parameters for the first encoding module based on the results of the first loss function module; obtaining a second text data set, performing a second pre-process on each text data in the second text data set to generate a second number of annotated training data, wherein the annotations are annotations of medical named entities for each text data in the second text data set, the first number being greater than the second number; inputting the second quantity of labeled training data into a named entity recognition model for training to obtain a trained named entity recognition model, wherein the named entity recognition model comprises a second coding module and a second loss function module, and the second coding module uses the optimized first coding parameter set as an initial value of a second coding parameter set of the second coding module; and training the named entity recognition model using the second amount of annotated training data multiple times based on the results of the second loss function module to obtain an optimized second set of encoding parameters.
According to some embodiments of the disclosure, the classification model classifies based on whether the classification training data includes a medical named entity.
According to some embodiments of the present disclosure, performing a first pre-process on each text data in the first set of text data to construct a first number of unlabeled classification training data includes: acquiring a medical named entity knowledge conceptual graph, wherein the medical named entity knowledge conceptual graph has a predefined medical named entity; performing maximum text matching on the medical named entity knowledge conceptual graph and each text data in the first text data set to determine whether each text data in the first text data set contains one or more medical named entities; the first number of unlabeled classification training data is constructed based on text data in the first text data set that includes one or more medical named entities.
According to some embodiments of the present disclosure, constructing the first number of unlabeled classification training data based on text data in the first text data set comprising one or more medical named entities comprises: determining, for first text data comprising medical named entities in the first text data set, that the first text data comprises L medical named entities, wherein L is greater than or equal to 1; splitting the first text data to generate L unlabeled classified training data, wherein each classified training data comprises a medical named entity; and constructing a first number of unlabeled classification training data based on each text data in the first set of text data that includes a medical named entity.
According to some embodiments of the disclosure, the result of the first loss function module includes the probability that the first number of unlabeled classification training data contains medical named entities; and training the classification model using the first number of classification training data multiple times to optimize the first encoding parameter set of the first encoding module based on the results of the first loss function module comprises: training the classification model using the first number of classification training data multiple times to increase the probability; and when the probability exceeds a predetermined threshold, taking the current first encoding parameter set as the optimized first encoding parameter set.
According to some embodiments of the present disclosure, performing a second pre-process on each text data in the second set of text data to generate a second number of annotated training data comprises: dividing each text data in the second text data set into one or more words based on a predetermined word division rule; performing maximum text matching on each text data in the divided second text data set and a medical knowledge database to generate a second number of roughly marked training data; receiving annotation information for the second quantity of coarsely annotated training data, wherein the annotation information includes a particular label for a character in the second quantity of coarsely annotated training data; and generating the second amount of annotated training data based on the annotation information.
According to some embodiments of the disclosure, the specific tag comprises: a first label for each character in the non-medical named entity; a second label for a start character in the medical named entity; a third label for an intermediate character in the medical named entity; and a fourth label for an ending character in the medical named entity.
According to some embodiments of the present disclosure, the second loss function module includes a conditional random field model structure, wherein the random field model structure constrains the output result of the second encoding module based on a predetermined constraint.
According to some embodiments of the disclosure, the predetermined constraint comprises: the start tag of each entity in the training data is constrained to be either a first tag or a second tag; the next tag after the second tag is constrained to be either a third tag or a fourth tag; and the next tag after the third tag is constrained to be either the third tag or the fourth tag.
According to some embodiments of the disclosure, the first encoding module encodes classification training data to output a vectorized representation of the classification training data; and the second encoding module encodes training data to output a vectorized representation of the training data.
According to some embodiments of the disclosure, the first encoding module and the second encoding module comprise the same depth model structure, wherein the depth model structure comprises at least one of BiLSTM, LSTM, TextCNN, Transformer, or BERT.
According to some embodiments of the disclosure, the total amount of parameters of the first and second encoding parameter sets depends on the dimensions of the input vector of the depth model structure and the dimensions of the hidden layer.
According to some embodiments of the disclosure, when the dimension of the input vector of the depth model structure is n and the dimension of the hidden layer is m, the total amount of parameters is 8(m² + 2m + mn).
According to another aspect of the present disclosure, there is provided a medical named entity recognition method based on semi-supervised learning, including: inputting unlabeled text data into a trained named entity recognition model to recognize the medical named entities in the text data, wherein the trained named entity recognition model is obtained by the foregoing method.
According to another aspect of the present disclosure, there is provided a medical named entity recognition model training apparatus based on semi-supervised learning, including: a classification training data construction unit configured to acquire a first text data set, perform a first preprocessing on each text data in the first text data set to construct a first number of unlabeled classification training data; a classification model training unit configured to input the first number of classification training data into a classification model for training to obtain a trained classification model, wherein the classification model comprises a first coding module and a first loss function module; a first encoding parameter set optimizing unit configured to train the classification model using the first number of classification training data a plurality of times to optimize a first encoding parameter set of the first encoding module based on a result of the first loss function module; a training data generation unit configured to obtain a second set of text data, perform a second pre-processing on each text data in the second set of text data to generate a second number of annotated training data, wherein the annotations are annotations of medical named entities for each text data in the second set of text data, the first number being greater than the second number; a named entity recognition model training unit configured to input the second amount of annotated training data into a named entity recognition model for training to obtain a trained named entity recognition model, the named entity recognition model comprising a second encoding module and a second loss function module, wherein the second encoding module uses the optimized first encoding parameter set as an initial value of a second encoding parameter set of the second encoding module; and a second encoding parameter set optimizing unit configured to train the named entity recognition model using the second amount of labeled training data a plurality of times based on a result of the second loss function module to obtain an optimized second encoding parameter set.
According to some embodiments of the disclosure, the classification model classifies based on whether the classification training data includes a medical named entity.
According to some embodiments of the disclosure, the classification training data construction unit further comprises: a medical named entity knowledge concept graph acquisition component configured to acquire a medical named entity knowledge concept graph, wherein the medical named entity knowledge concept graph has a predefined medical named entity; a maximum text matching component configured to maximum text match the medical named entity knowledge conceptual graph with each text data in the first text data set to determine whether each text data in the first text data set contains one or more medical named entities; a classification training data construction component configured to construct the first quantity of unlabeled classification training data based on text data in the first text data set comprising one or more medical named entities.
According to some embodiments of the disclosure, the classification training data construction component is further configured to: determining, for first text data comprising medical named entities in the first text data set, that the first text data comprises L medical named entities, wherein L is greater than or equal to 1; splitting the first text data to generate L unlabeled classified training data, wherein each classified training data comprises a medical named entity; and constructing a first number of unlabeled classification training data based on each text data in the first set of text data that includes a medical named entity.
According to some embodiments of the disclosure, the result of the first loss function module includes the probability that the first number of unlabeled classification training data contains medical named entities; and the first encoding parameter set optimizing unit is further configured to: train the classification model using the first number of classification training data multiple times to increase the probability; and, when the probability exceeds a predetermined threshold, take the current first encoding parameter set as the optimized first encoding parameter set.
According to some embodiments of the disclosure, the training data generation unit is further configured to: dividing each text data in the second text data set into one or more words based on a predetermined word division rule; performing maximum text matching on each text data in the divided second text data set and a medical knowledge database to generate a second number of roughly marked training data; receiving annotation information for the second quantity of coarsely annotated training data, wherein the annotation information includes a particular label for a character in the second quantity of coarsely annotated training data; and generating the second amount of annotated training data based on the annotation information.
According to some embodiments of the disclosure, the specific tag comprises: a first label for each character in the non-medical named entity; a second label for a start character in the medical named entity; a third label for an intermediate character in the medical named entity; and a fourth label for an ending character in the medical named entity.
According to some embodiments of the present disclosure, the second loss function module includes a conditional random field model structure, wherein the random field model structure constrains the output result of the second encoding module based on a predetermined constraint.
According to some embodiments of the disclosure, the predetermined constraint comprises: the start tag of each entity in the training data is constrained to be either a first tag or a second tag; the next tag after the second tag is constrained to be either a third tag or a fourth tag; and the next tag after the third tag is constrained to be either the third tag or the fourth tag.
According to some embodiments of the disclosure, the first encoding module encodes classification training data to output a vectorized representation of the classification training data; and the second encoding module encodes training data to output a vectorized representation of the training data.
According to some embodiments of the disclosure, the first encoding module and the second encoding module comprise the same depth model structure, wherein the depth model structure comprises at least one of BiLSTM, LSTM, TextCNN, Transformer, or BERT.
According to some embodiments of the disclosure, the total amount of parameters of the first and second encoding parameter sets depends on the dimensions of the input vector of the depth model structure and the dimensions of the hidden layer.
According to some embodiments of the disclosure, when the dimension of the input vector of the depth model structure is n and the dimension of the hidden layer is m, the total amount of parameters is 8(m² + 2m + mn).
According to another aspect of the present disclosure, there is provided a medical named entity recognition device based on semi-supervised learning, including: a medical named entity recognition unit configured to input unlabeled text data into a trained named entity recognition model to recognize the medical named entities in the text data, wherein the trained named entity recognition model is obtained by the foregoing method.
According to another aspect of the present disclosure, there is provided an electronic device including: a processor; and a memory, wherein the memory has stored therein computer readable code which, when executed by the processor, implements the foregoing method.
According to another aspect of the present disclosure there is provided a non-transitory computer readable storage medium storing computer readable instructions, wherein the computer readable instructions, when executed by a processor, implement the foregoing method.
Therefore, with the semi-supervised medical named entity recognition model training method, apparatus, electronic device, and medium of the present disclosure, a large amount of unlabeled classification training data is constructed and used to train a classification model, yielding a good encoding parameter set; that parameter set is then transferred to a named entity recognition model. As a result, even when only a small amount of labeled training data is available, a named entity recognition model initialized with the encoding parameters of the classification model can be trained to achieve good recognition performance. This avoids acquiring a large amount of labeled training data for the named entity recognition model, ensuring medical named entity recognition accuracy while saving the labor cost of obtaining labeled training data.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are required to be used in the description of the embodiments will be briefly described below. It should be apparent that the drawings in the following description are only some exemplary embodiments of the present disclosure, and that other drawings may be obtained from these drawings by those of ordinary skill in the art without undue effort.
FIG. 1 illustrates a framework diagram of a semi-supervised based medical named entity recognition model training method, according to some embodiments of the present disclosure;
FIG. 2 illustrates a flow chart of a semi-supervised based medical named entity recognition model training method, according to some embodiments of the present disclosure;
FIG. 3 illustrates a schematic diagram of a locally optimal solution and a globally optimal solution according to some embodiments of the present disclosure;
FIG. 4 illustrates a block diagram of a semi-supervised based medical named entity recognition model training apparatus, according to some embodiments of the present disclosure;
FIG. 5 illustrates a block diagram of a classification training data construction unit, according to some embodiments of the present disclosure;
fig. 6 illustrates a block diagram of an electronic device, according to some embodiments of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present disclosure. It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be made by one of ordinary skill in the art without the need for inventive faculty, are within the scope of the present disclosure, based on the described embodiments of the present disclosure.
Unless defined otherwise, technical or scientific terms used in this disclosure should be given the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The terms "first," "second," and the like, as used in this disclosure, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed. In order to keep the following description of the embodiments of the present disclosure clear and concise, the present disclosure omits a detailed description of some known functions and known components.
A flowchart is used in this disclosure to describe the steps of a method according to an embodiment of the present disclosure. It should be understood that the steps need not be performed exactly in the order shown; rather, various steps may be processed in reverse order or in parallel, and other operations may be added to or removed from these processes.
In the description and drawings of the present disclosure, elements are described in the singular or plural form according to an embodiment. However, the singular and plural forms are properly selected for the proposed case only for convenience of explanation and are not intended to limit the present disclosure thereto. Accordingly, the singular may include the plural and the plural may include the singular unless the context clearly indicates otherwise.
The method, the device, the equipment and the medium for identifying the medical named entity based on semi-supervision provided by the present disclosure will be described in detail below with reference to the accompanying drawings.
< first embodiment >
Figs. 1 and 2 illustrate an architecture diagram and a flowchart, respectively, of a semi-supervised medical named entity recognition model training method according to some embodiments of the present disclosure. As shown in Fig. 1, the architecture of the semi-supervised medical named entity recognition model of the present application may have at least a data layer, an encoder layer, a loss layer, and an output result layer.
The architecture shown in Fig. 1 will be described in detail below in conjunction with Fig. 2. First, for the binary classification model portion, at step S202, a first text data set S102 may be acquired, and a first preprocessing may be performed on each text data in the first text data set S102 to construct a first amount of unlabeled classification training data S104.
According to some embodiments of the present disclosure, the classification model may classify based on whether the classification training data includes a medical named entity. In one example, the first text data set is a collection of medical text data, which may include one or more medical text data. These medical text data may be obtained through an enterprise master patient index (EMPI) based on the patient's identity information. An EMPI cross-indexes the different IDs that refer to the same patient. Using the EMPI to retrieve the medical data related to a user from the user's identity information helps ensure the security of patient privacy.
In one example, since no labeling is required, the first amount of unlabeled classification training data S104 is preferably massive; training on such a large corpus yields a better-trained classification model.
In one example, each text data in the first set of text data may have a text title and text content, as shown in table 1. In the text data, a text title is used as an index of the text data, so that the text data of the same type can be conveniently acquired for training. In another example, the text data may also have text content alone, or other structures that facilitate classification training.
TABLE 1
Based on the example of Table 1, performing the first preprocessing on the text data in the first text data set S102 to construct the massive unlabeled classification training data S104 may consist of confirming that the text data shown in Table 1 contains the medical named entity "force naive". That is, "unlabeled classification training data" means text data confirmed to contain a specific medical named entity, without per-character labels.
In another example, the term "breast cancer" encompasses the medical named entity "breast". Since classification training data only needs to indicate whether a medical named entity is contained, and the classification results are only 0 (not contained) and 1 (contained), the result for "breast cancer" containing the "breast" named entity is 1. Further, the output of the classification model may be a probability between 0 and 1 representing the probability that the text data contains a medical named entity. In some other examples, other representations may be used, for example 0 for contained and 1 for not contained. According to one example of the present disclosure, each piece of generated unlabeled classification training data includes the raw text data and a classification tag indicating whether the text data includes a medical named entity.
On the other hand, when "breast cancer" is used as training data for medical named entity recognition, each character needs to be labeled: for example, "breast" is labeled "B-SITE", "gland" is labeled "E-SITE", and "cancer" is labeled "B-DIAG", where "B" denotes the beginning character of an entity, "E" denotes its ending character, "SITE" denotes an anatomical-site entity, and "DIAG" denotes a diagnosis entity. Those skilled in the art should note that "unlabeled" and "labeled" here refer to whether the characters in the data carry such per-character labels.
According to some embodiments of the present disclosure, performing a first pre-process on each text data in the first text data set S102 to construct a massive amount of unlabeled classification training data S104 may include: acquiring a medical named entity knowledge conceptual diagram, wherein the medical named entity knowledge conceptual diagram has a predefined medical named entity; performing maximum text matching on the medical named entity knowledge conceptual graph and each text data in the first text data set S102 to determine whether each text data in the first text data set S102 contains one or more medical named entities; the mass of unlabeled classified training data S104 is constructed based on the text data comprising one or more medical named entities in the first text data set S102.
In one example, the medical named entity knowledge conceptual graph may be one for identifying chemotherapeutic drugs. For example, the medical named entity knowledge conceptual graph for the example of Table 1 may include the contents of Table 2 below. Maximum text matching of the conceptual graph of Table 2 against the text data of Table 1 identifies the entity the text contains, yielding one piece of unlabeled classification training data.
TABLE 2
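To make the maximum text matching step concrete, the following is a minimal Python sketch of forward maximum matching of a concept dictionary against text, under the assumption that the entity names in the knowledge conceptual graph can be flattened into a set of strings; the function name, dictionary contents, and example sentence are illustrative, not from the patent.

```python
# Minimal sketch of maximum text matching against a medical concept dictionary.
# The dictionary entries and function name are illustrative assumptions.

def max_text_match(text: str, concepts: set[str]) -> list[tuple[int, int, str]]:
    """Scan text and greedily match the longest concept starting at each position."""
    max_len = max((len(c) for c in concepts), default=0)
    matches = []
    i = 0
    while i < len(text):
        hit = None
        # Try the longest candidate substring first (maximum matching).
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in concepts:
                hit = (i, j, text[i:j])
                break
        if hit:
            matches.append(hit)
            i = hit[1]          # continue after the matched entity
        else:
            i += 1
    return matches

# Usage: any text data containing at least one match becomes a
# candidate for unlabeled classification training data.
concepts = {"carboplatin", "paclitaxel"}   # stand-ins for graph concepts
print(max_text_match("injected carboplatin 300mg", concepts))
# -> [(9, 20, 'carboplatin')]
```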
According to one embodiment of the present disclosure, constructing a mass of unlabeled classification training data S104 based on text data in the first text data set S102 containing one or more medical named entities may include: for first text data containing medical named entities in the first text data set S102, determining that the first text data contains L medical named entities, wherein L is greater than or equal to 1; splitting the first text data to generate L unlabeled class training data, wherein each class training data comprises a medical named entity; and constructing a mass of unlabeled classified training data S104 based on each text data in the first text data set S102 that contains a medical named entity.
In one example, as above, the text data shown in Table 1 may be maximum-text-matched against the medical named entity knowledge conceptual graph shown in Table 2 to determine that it contains one medical named entity. One piece of unlabeled classification training data may then be constructed based on that determination.
In another example, another example text data is shown in Table 3 below. As shown in Table 3, the text content has 2 medical named entities, "force naive" and "carboplatin". Maximum text matching of the text data of Table 3 against the medical named entity knowledge conceptual graph of Table 2 determines that the text data has 2 medical named entities. In this case, two pieces of classification training data may be generated from the text data: one determined from the text containing the medical named entity "force naive", comprising the text data of Table 3 and a classification label indicating that it contains the medical named entity "force naive"; and another determined from the text containing the medical named entity "carboplatin", comprising the text data of Table 3 and a classification label indicating that it contains the medical named entity "carboplatin".
TABLE 3
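A short sketch of this splitting step, reusing the `max_text_match` helper above; the per-example data layout is an assumption, while the one-example-per-entity rule follows the description:

```python
def build_classification_examples(text, concepts):
    """Generate one classification example per matched entity.

    Each example pairs the raw text with one contained entity and the
    label 1 ("contains a medical named entity"); L matches yield L examples.
    """
    return [
        {"text": text, "entity": ent, "label": 1}
        for (_, _, ent) in max_text_match(text, concepts)
    ]

examples = build_classification_examples(
    "given carboplatin and paclitaxel", {"carboplatin", "paclitaxel"}
)
print(len(examples))  # -> 2, one example per entity
```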
In one example, the medical named entity knowledge conceptual graph may hold a large number of medical named entity concepts rather than just the two shown in Table 2, enabling matching of medical named entities of a particular class. The graph may also hold several different types of medical named entities for matching against unknown text data, so that a more robust medical named entity recognition model can be trained.
After obtaining the massive unlabeled classification training data S104, in step S204, the massive classification training data S104 may be input into a classification model to be trained to obtain a trained classification model, where the classification model includes a first coding module S106 and a first loss function module S108.
In one example, the purpose of training the binary classification model is to maximize the probability that the model, having been trained on classification data known to contain medical named entities, correctly determines that unknown text data contains a medical named entity.
According to some embodiments of the present disclosure, the result S110 of the first loss function module may include a probability that the massive unlabeled classification training data S104 includes a medical named entity.
In one example, the first encoding module S106 may generally employ a depth model structure, such as BiLSTM, LSTM, TextCNN, Transformer, BERT, or another depth model.
In one example, the first encoding module S106 may encode the classification training data to output a vectorized representation of the classification training data.
In one example, the first loss function module S108 may employ a fully connected binary classification network structure to maximize the probability of correctly determining whether the unknown text data contains a medical named entity.
It should be appreciated by those skilled in the art that the first encoding module S106 and the first loss function module S108 may have other network structures for further optimizing the model in addition to the above-described structure, and are not limited herein accordingly.
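For illustration, the following is a minimal PyTorch sketch of one possible binary classification model with a BiLSTM first encoding module and a fully connected binary head as the first loss-function module; all layer sizes, the pooling choice, and the use of PyTorch are assumptions rather than requirements of the patent.

```python
import torch
import torch.nn as nn

class BinaryClassifier(nn.Module):
    """First encoding module (BiLSTM) plus first loss-function module
    (fully connected binary classification head). Sizes are illustrative."""

    def __init__(self, vocab_size=5000, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # First encoding module: outputs a vectorized representation.
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True,
                               bidirectional=True)
        # First loss-function module: fully connected binary head.
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, token_ids):
        emb = self.embed(token_ids)
        enc, _ = self.encoder(emb)           # (batch, seq, 2*hidden)
        pooled = enc.mean(dim=1)             # simple mean pooling (assumption)
        return torch.sigmoid(self.head(pooled)).squeeze(-1)  # P(contains entity)
```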
While training the binary classification model, at step S206, the classification model may be trained multiple times using the massive classification training data S104, based on the result S110 of the first loss function module, to optimize the first encoding parameter set of the first encoding module S106. Through repeated training, the first encoding parameter set can reach an optimal solution, maximizing the probability with which the binary classification model identifies text data that contains medical named entities.
According to some embodiments of the present disclosure, step S206 may further include: training the classification model multiple times using the massive unlabeled classification training data S104 to increase the probability; and, when the probability exceeds a predetermined threshold, taking the current first encoding parameter set as the optimized first encoding parameter set, thereby avoiding spending further computing resources on relatively small gains in recognition probability.
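A sketch of this repeated-training step with the threshold stopping rule, continuing the PyTorch sketch above; the threshold value, optimizer, and loss are illustrative assumptions:

```python
def train_classifier(model, loader, threshold=0.95, max_epochs=50):
    """Train until the mean predicted probability exceeds the threshold,
    then return the current (optimized) encoder parameters."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    bce = nn.BCELoss()
    for _ in range(max_epochs):
        probs = []
        for token_ids, labels in loader:
            opt.zero_grad()
            p = model(token_ids)
            loss = bce(p, labels.float())
            loss.backward()
            opt.step()
            probs.append(p.detach().mean().item())
        if sum(probs) / len(probs) > threshold:
            break   # stop: further training buys little probability gain
    # Optimized first encoding parameter set, used to initialize the NER encoder.
    return model.encoder.state_dict()
```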
After obtaining the preferred first set of encoding parameters, a second set of text data S112 may be obtained for the named entity recognition model in step S208, and a second pre-processing may be performed on each text data in the second set of text data S112 to generate a second number of annotated training data S114, wherein the annotation is an annotation of the medical named entity for each text data in the second set of text data S112.
In one example, the text data in the first text data set S102 and the second text data set S112 may be for the same type of medical named entity to better promote the quality of the training data so that a better named entity recognition training model may be trained. For example, the text data in the first text data set S102 and the second text data set S112 are both for drugs, or the text data in the first text data set S102 and the second text data set S112 are for diseases. Those skilled in the art will appreciate that the first text dataset S102 and the second text dataset S112 may be obtained with one or more named entity models of the same type, as desired.
In one example, the text data in the first text data set may be completely different from the text data in the second text data set.
In one example, the first amount of unlabeled categorized training data may be substantially greater than the second amount of labeled training data. For example, the number of labeled training data may be only a small amount, e.g., hundreds, while the number of unlabeled classified training data may be hundreds or thousands of times the number of labeled training data.
The labels are labels of medical named entities for each text data in the second set of text data S112, which are explained below by way of example.
For example, for text data as shown in table 4, it may be determined that blood routine and urine routine are contained in the text data. Thus, annotations for the text data may be received to generate annotated training data.
TABLE 4 Table 4
Performing the second preprocessing on each text data in the second text data set S112 to generate the small amount of annotated training data S114 may further comprise: dividing each text data in the second text data set S112 into one or more words based on a predetermined word division rule; performing maximum text matching between each divided text data in the second text data set S112 and a medical knowledge database to generate a small amount of coarsely annotated training data; receiving annotation information for the coarsely annotated training data, wherein the annotation information includes specific labels for the characters in the coarsely annotated training data; and generating the small amount of annotated training data S114 based on the annotation information.
In one example, one example of a medical knowledge database may be as shown in table 5 below.
With continued reference to the text data of Table 4, the text data of Table 4 may be divided based on a predetermined word division rule, with the division result shown in Table 6.
TABLE 6
Further, by maximum text matching of the medical knowledge database of Table 5 against each of the divided words in Table 6 to determine the medical named entities present in the text data, the coarsely labeled training data shown in Table 7 can be generated.
TABLE 7
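The coarse labeling step can be sketched as projecting dictionary matches onto per-character tags, using the B/M/E/O character labels defined below and the `max_text_match` helper from the earlier sketch; the English example string stands in for the Chinese text:

```python
def coarse_bmeo_tags(text: str, concepts: set[str]) -> list[str]:
    """Project dictionary matches onto per-character tags.

    Characters outside any matched entity get "O"; a matched entity's
    first, middle, and last characters get "B", "M", "E" respectively.
    Entities missed by the dictionary keep "O" and must be fixed in
    the manual annotation step.
    """
    tags = ["O"] * len(text)
    for start, end, _ in max_text_match(text, concepts):
        tags[start] = "B"
        for k in range(start + 1, end - 1):
            tags[k] = "M"
        if end - start > 1:
            tags[end - 1] = "E"
    return tags

print(coarse_bmeo_tags("do blood routine", {"blood routine"}))
# -> ['O', 'O', 'O', 'B', 'M', ..., 'M', 'E']
```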
However, although text matching against a medical knowledge database reduces the amount of labeling work, it is usually still necessary to receive manual annotation information for the text data. Medical named entities have the following problems that prevent a medical knowledge database from labeling them accurately: (1) the medical field contains a large number of entity concepts, making the entity recognition task heavy; (2) entity concepts are strongly context-constrained, and the same entity word may have different entity types in different contexts; (3) entity lengths vary greatly; some disease and drug names are very long, even exceeding 10 characters, while other entities contain only 1 character; and (4) entities may contain or overlap one another.
In one example, the medical knowledge database may also include the medical named entity knowledge conceptual graph, which may be matched against each text data in the second text data set to generate the coarsely annotated training data.
Accordingly, annotation information for the small amount of coarsely annotated training data needs to be received to generate the annotated training data. For example, continuing the example above, the training data shown in Table 8 may be generated by manually labeling the missing named entity "urine routine".
TABLE 8
According to some embodiments of the present disclosure, a particular tag may include: a first label for each character in the non-medical named entity; a second label for a start character in the medical named entity; a third label for an intermediate character in the medical named entity; and a fourth label for an ending character in the medical named entity.
For example, in the annotation of the fragment "do blood routine examination" in the text data, the first label may be "O", applied to each character outside a medical named entity, such as "do" and the characters of "examination"; the second label may be "B", applied to the beginning character of the medical named entity "blood routine"; the third label may be "M", applied to the middle characters of "blood routine"; and the fourth label may be "E", applied to the ending character of "blood routine".
In one example, other similar labels may also be used to label the characters in the text data, such as in the example previously described for labeling "breast cancer". Further, the characters employed by the above labels are merely examples, and any suitable label symbols may be used instead.
After generating the small amount of labeled training data S114, in step S210, the small amount of labeled training data S114 may be input into a named entity recognition model for training to obtain a trained named entity recognition model, where the named entity recognition model includes a second encoding module S116 and a second loss function module S118, and the second encoding module S116 uses the optimized first encoding parameter set as an initial value of the second encoding parameter set of the second encoding module.
In one example, the second encoding module S116 may generally employ a depth model structure, such as BiLSTM, LSTM, TextCNN, Transformer, BERT, or another depth model. In one example, the second encoding module may encode the training data to output a vectorized representation of the training data.
In the technical scheme of the present disclosure, the first encoding module and the second encoding module may have the same depth model structure, so that the first encoding parameter set can serve directly as the initial value of the second encoding parameter set. This migrates the features learned by the classification model from a large amount of unlabeled classification training data into the named entity recognition model, avoiding the need for a large amount of labeled training data for the named entity model, ensuring the accuracy of medical named entity recognition, saving labor cost, and improving recognition efficiency.
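In an implementation along the lines of the PyTorch sketches above, this migration is a direct parameter copy; `load_state_dict` here is an assumed implementation detail, not something prescribed by the patent:

```python
# cls_model is a trained BinaryClassifier from the earlier sketch; the NER
# model's second encoding module is built with the identical BiLSTM structure,
# so its parameter tensors align one-to-one with the first encoding module.
ner_encoder = nn.LSTM(128, 256, batch_first=True, bidirectional=True)
ner_encoder.load_state_dict(cls_model.encoder.state_dict())
```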
Specifically, on the one hand, the first parameter set produced by the classification training has already learned substantial semantic information, which shortens the training time of the named entity recognition model. For example, assume the target values of the training parameter set are 0.2, 0.8, and 0.2. Without parameters from the first encoding parameter set, the second encoding parameter set is typically initialized randomly, e.g., to 0.5, 0.5, 0.5, while the first encoding parameter set migrated from the trained binary classification model might be, e.g., 0.1, 0.9, 0.3. With a learning step size of 0.1, the randomly initialized parameter set needs 9 learning steps to reach the targets, while the migrated first encoding parameter set needs only 3. Thus, using the first encoding parameter set as the initial value of the second encoding parameter set effectively reduces the training time of the named entity recognition model and avoids the use of a large amount of labeled training data.
On the other hand, using values from the first encoding parameter set as the initial values of the second encoding parameter set avoids the situation where the named entity recognition model, trained on little data, converges only to a locally optimal solution. Fig. 3 illustrates a schematic diagram of a locally optimal solution and a globally optimal solution according to some embodiments of the present disclosure. As shown in Fig. 3, with little training data the named entity recognition model may learn only the locally optimal solution x1. When the first encoding parameter set of the trained binary classification model is used, however, the initial point already lies near x2, so the model avoids settling at x1 and finds the globally optimal solution x2 directly.
Furthermore, since the classification model requires no data labeling, the text data in the first and second text data sets S102, S112 may target the same type of medical named entity. Compared with the prior-art practice of migrating the parameter set of a named entity model trained on a different entity type, the first encoding parameter set of the present application is derived from a binary classification model trained on massive unlabeled classification training data for the same type of medical named entity, so its parameters are more reliable. The technical scheme of the present application therefore generally obtains a better medical named entity recognition model from the same amount of labeled data.
According to some embodiments of the disclosure, the total amount of parameters of the first encoding parameter set and the second encoding parameter set depends on the dimensions of the input vector of the depth model structure and the dimensions of the hidden layer.
According to some embodiments of the disclosure, when the dimension of the input vector of the depth model structure is n and the dimension of the hidden layer is m, the total amount of parameters is 8(m² + 2m + mn).
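This formula equals the parameter count of a single-layer bidirectional LSTM with four gates and two bias vectors per direction, which can be checked directly; the concrete dimensions below are illustrative:

```python
import torch.nn as nn

n, m = 128, 256  # illustrative input and hidden dimensions
lstm = nn.LSTM(input_size=n, hidden_size=m, bidirectional=True)
total = sum(p.numel() for p in lstm.parameters())
assert total == 8 * (m * m + 2 * m + m * n)   # 8(m^2 + 2m + mn)
print(total)  # -> 790528
```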
According to some embodiments of the present disclosure, the second loss function module may include a conditional random field (CRF) model structure, wherein the CRF model structure constrains the output result of the second encoding module based on predetermined constraints. By employing a conditional random field in the loss layer, the medical named entity recognition model can learn sentence-level constraints, ensuring that the labels generated when identifying medical named entities are valid.
According to some embodiments of the present disclosure, the predetermined constraint may include: the start tag of each entity in the training data is constrained to be either a first tag or a second tag; the next tag after the second tag is constrained to be either the third tag or the fourth tag; and the next tag after the third tag is constrained to be either the third tag or the fourth tag.
For example, referring to the earlier description of the specific labels, the tag sequence of any text data should start with "B" or "O"; starting with "M" or "E" is prohibited. In addition, the next tag after "B" must be "M" or "E", and the tag after "M" must be "M" or "E". Other constraints may also be imposed on the conditional random field.
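These rules can be encoded as a transition mask consulted by the CRF when scoring tag sequences. In the sketch below, the B and M rows follow the constraints stated above, while the O, E, and start rules are the natural completion of the scheme rather than text from the patent:

```python
TAGS = ["O", "B", "M", "E"]
ALLOWED = {            # tag -> tags permitted to follow it
    "O": {"O", "B"},   # outside an entity: stay outside or start one
    "B": {"M", "E"},   # after a beginning character: middle or end
    "M": {"M", "E"},   # after a middle character: middle or end
    "E": {"O", "B"},   # an entity has closed: outside or a new entity
}
START_ALLOWED = {"O", "B"}  # a sequence may not begin mid-entity

def transition_mask():
    """Boolean matrix: mask[i][j] is True iff TAGS[j] may follow TAGS[i]."""
    return [[TAGS[j] in ALLOWED[TAGS[i]] for j in range(len(TAGS))]
            for i in range(len(TAGS))]

for row in transition_mask():
    print(row)
```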
Then, when training the named entity recognition model, the model may be trained multiple times using the second amount of labeled training data, based on the results of the second loss function module, to obtain an optimized second encoding parameter set. A named entity recognition model with a good second encoding parameter set can accurately recognize the named entities in medical text data, and since no large amount of labeled training data is needed during training, labor cost is significantly reduced and recognition efficiency improved while medical named entity recognition accuracy is maintained.
In one example, the result of the second loss function module is the sequence positions of the labeled entities in the text data. Training the named entity recognition model multiple times with the second amount of labeled training data, based on this result, to obtain the optimized second encoding parameter set therefore improves the accuracy with which the model locates entity positions.
According to one embodiment of the disclosure, unlabeled text data may be input into a trained named entity recognition model to identify a medical named entity in the text data, where the trained named entity recognition model is obtained based on the above-described semi-supervised learning-based medical named entity recognition model training method.
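At inference time the decoded tag sequence is converted back into entity spans. A minimal sketch of that post-processing, assuming the O/B/M/E labels described above:

```python
def decode_entities(text: str, tags: list[str]) -> list[str]:
    """Convert a decoded O/B/M/E tag sequence back into entity strings."""
    entities, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            start = i
        elif tag == "E" and start is not None:
            entities.append(text[start:i + 1])
            start = None
        elif tag == "O":
            start = None
    return entities

print(decode_entities("do blood routine",
                      ["O"] * 3 + ["B"] + ["M"] * 11 + ["E"]))
# -> ['blood routine']
```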
As detailed above with reference to Figs. 1 and 2, the semi-supervised medical named entity recognition model training method constructs a large amount of unlabeled classification training data, trains the classification model with it to obtain a good encoding parameter set, and then transfers that parameter set to the named entity recognition model. Even when only a small amount of labeled training data is available, a named entity recognition model initialized with the classification model's encoding parameters can thus be trained to achieve good recognition performance, avoiding the need to obtain a large amount of labeled training data and saving the labor cost of annotation while ensuring medical named entity recognition accuracy.
< second embodiment >
In addition to the above-mentioned semi-supervised medical named entity recognition model training method, the present disclosure also provides a semi-supervised medical named entity recognition model training apparatus, which will be described in detail below with reference to fig. 4 and 5.
Fig. 4 illustrates a block diagram of a semi-supervised based medical named entity recognition model training apparatus, according to some embodiments of the present disclosure. As shown in fig. 4, the medical named entity recognition model training apparatus 400 based on semi-supervised learning according to the present disclosure may include a classification training data constructing unit 410, a classification model training unit 420, a first encoding parameter set optimizing unit 430, a training data generating unit 440, a named entity recognition model training unit 450, and a second encoding parameter set optimizing unit 460.
According to some embodiments of the present disclosure, the classification model may classify based on whether the classification training data includes a medical named entity.
Fig. 5 illustrates a block diagram of a classification training data construction unit according to some embodiments of the present disclosure. As shown in fig. 5, the classification training data construction unit 410 may further include: a medical named entity knowledge conceptual graph acquisition component 510 that may be configured to acquire a medical named entity knowledge conceptual graph, wherein the medical named entity knowledge conceptual graph has a predefined medical named entity; a maximum text matching component 520 that can be configured to maximum text match the medical named entity knowledge conceptual graph with each text data in the first set of text data to determine whether each text data in the first set of text data contains one or more medical named entities; and a classification training data construction component 530 that may be configured to construct a mass of unlabeled classification training data based on the text data in the first text data set that includes one or more medical named entities.
According to some embodiments of the present disclosure, the classification training data construction component 530 may be further configured to: for first text data containing medical named entities in a first text data set, determining that the first text data contains L medical named entities, wherein L is greater than or equal to 1; splitting the first text data to generate L unlabeled class training data, wherein each class training data comprises a medical named entity; and constructing a mass of unlabeled class training data based on each text data in the first text data set that includes the medical named entity.
According to some embodiments of the present disclosure, the results of the first loss function module may include the probability that the massive unlabeled classification training data contains medical named entities; and the first encoding parameter set optimizing unit 430 may be further configured to: train the classification model multiple times using the massive classification training data to increase the probability; and, when the probability exceeds a predetermined threshold, take the current first encoding parameter set as the optimized first encoding parameter set.
According to some embodiments of the present disclosure, the training data generation unit 440 may be further configured to: dividing each text data in the second text data set into one or more words based on a predetermined word division rule; performing maximum text matching on each text data in the divided second text data set and the medical knowledge database to generate a second number of roughly marked training data; receiving annotation information for the second quantity of coarsely annotated training data, wherein the annotation information includes a particular label for a character in the second quantity of coarsely annotated training data; and generating a second amount of annotated training data based on the annotation information.
According to some embodiments of the present disclosure, the specific tag may include: a first label for each character in the non-medical named entity; a second label for a start character in the medical named entity; a third label for an intermediate character in the medical named entity; and a fourth label for an ending character in the medical named entity.
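These four labels amount to an O/B/I/E character-tagging scheme; the letter names below are assumed for readability, since the disclosure itself only numbers the labels. A minimal sketch of assigning them from matched entity spans, which could also serve the coarse annotation step described above:

```python
def tag_characters(text: str, spans: list[tuple[int, int]]) -> list[str]:
    """Assign the first label ("O") to every character outside an entity,
    and the second/third/fourth labels ("B"/"I"/"E") to the start,
    intermediate, and end characters of each entity span (end exclusive)."""
    tags = ["O"] * len(text)
    for start, end in spans:
        tags[start] = "B"
        for k in range(start + 1, end - 1):
            tags[k] = "I"
        tags[end - 1] = "E"  # a single-character entity would need its own label
    return tags

print(tag_characters("患者既往高血压病史", [(4, 7)]))
# ['O', 'O', 'O', 'O', 'B', 'I', 'E', 'O', 'O']
```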
According to some embodiments of the present disclosure, the second loss function module may comprise a conditional random field model structure, wherein the conditional random field model structure may constrain the output result of the second encoding module based on predetermined constraints.
According to some embodiments of the present disclosure, the predetermined constraints may include: the start tag of each training data sequence is constrained to be either the first tag or the second tag; the next tag after the second tag is constrained to be either the third tag or the fourth tag; and the next tag after the third tag is constrained to be either the third tag or the fourth tag.
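One common way to realize such constraints is to mask the conditional random field's transition scores so that forbidden transitions receive an effectively infinite penalty. The sketch below uses the O/B/I/E naming from earlier; the rows for "O" and "E", and the use of -1e9 as a stand-in for minus infinity, are assumptions needed to complete the matrix.

```python
import numpy as np

TAGS = ["O", "B", "I", "E"]   # first..fourth labels (names assumed)
IDX = {t: i for i, t in enumerate(TAGS)}
NEG = -1e9                    # stands in for minus infinity

ALLOWED = {
    "B": {"I", "E"},          # after the second tag: third or fourth
    "I": {"I", "E"},          # after the third tag: third or fourth
    "O": {"O", "B"},          # assumption: outside may continue or open an entity
    "E": {"O", "B"},          # assumption: after an entity ends, outside or a new entity
}
START_ALLOWED = {"O", "B"}    # a sequence starts with the first or second tag

transition_mask = np.full((len(TAGS), len(TAGS)), NEG)
for prev, nxts in ALLOWED.items():
    for nxt in nxts:
        transition_mask[IDX[prev], IDX[nxt]] = 0.0
start_mask = np.array([0.0 if t in START_ALLOWED else NEG for t in TAGS])
# Adding these masks to the CRF's start and transition scores gives any
# constraint-violating tag path a vanishing probability.
```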
According to some embodiments of the present disclosure, the first encoding module may encode the classification training data to output a vectorized representation of the classification training data; and the second encoding module may encode the training data to output a vectorized representation of the training data.
According to some embodiments of the present disclosure, the first encoding module and the second encoding module may comprise the same depth model structure, wherein the depth model structure may comprise at least one of BiLSTM, LSTM, TextCNN, Transformer, or BERT.
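Because the two encoding modules share the same depth model structure, the optimized first encoding parameter set can be copied directly into the second encoding module as the initial value of its parameter set. A minimal PyTorch sketch with a BiLSTM encoder (dimensions hypothetical):

```python
import torch.nn as nn

n, m = 128, 256  # hypothetical input and hidden dimensions
first_encoder = nn.LSTM(input_size=n, hidden_size=m, bidirectional=True)
second_encoder = nn.LSTM(input_size=n, hidden_size=m, bidirectional=True)

# ... the first encoder is trained inside the classification model ...

# Identical structure means the state dicts line up one-to-one, so the
# optimized first encoding parameter set initializes the second.
second_encoder.load_state_dict(first_encoder.state_dict())
```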
According to some embodiments of the present disclosure, the total amount of parameters of the first and second encoding parameter sets may depend on the dimension of the input vector of the depth model structure and the dimension of the hidden layer.
According to some embodiments of the present disclosure, when the dimension of the input vector of the depth model structure is n and the dimension of the hidden layer is m, the total amount of parameters may be 8(m² + 2m + mn).
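This count matches a bidirectional LSTM under the usual parameterization: each direction has four gates, each gate carrying m×n input weights, m×m recurrent weights, and two bias vectors of length m, and the two directions contribute the overall factor of 8. A quick check against PyTorch's nn.LSTM (dimensions hypothetical):

```python
import torch.nn as nn

n, m = 100, 200  # hypothetical input and hidden dimensions
bilstm = nn.LSTM(input_size=n, hidden_size=m, bidirectional=True)
total = sum(p.numel() for p in bilstm.parameters())

# per direction: 4 gates x (m*n + m*m weights + 2m biases); x2 directions
assert total == 8 * (m**2 + 2 * m + m * n)  # 483200 for n=100, m=200
```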
According to some embodiments of the present disclosure, there is also disclosed another medical named entity recognition device based on semi-supervised learning, which may include: a medical named entity recognition unit that may be configured to input unlabeled text data into a trained named entity recognition model to identify medical named entities in the text data, wherein the trained named entity recognition model is obtained based on the aforementioned semi-supervised learning-based medical named entity recognition model training method.
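Once the trained model has produced a tag sequence for a piece of unlabeled text, the medical named entities can be read off the B...E spans. A minimal decoding sketch, again under the assumed O/B/I/E naming:

```python
def extract_entities(text: str, tags: list[str]) -> list[str]:
    """Convert the per-character tags decoded by the trained named entity
    recognition model back into entity strings."""
    entities, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            start = i
        elif tag == "E" and start is not None:
            entities.append(text[start:i + 1])
            start = None
    return entities

print(extract_entities("患者既往高血压病史",
                       ["O", "O", "O", "O", "B", "I", "E", "O", "O"]))
# ['高血压']
```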
For some specific details regarding the semi-supervised learning based medical named entity recognition model training apparatus shown in fig. 4-5, reference may also be made to the content of the semi-supervised learning based medical named entity recognition model training method shown in fig. 1-2.
Fig. 6 illustrates a block diagram of an electronic device, according to some embodiments of the present disclosure.
Referring to fig. 6, an electronic device 600 may include a processor 601 and a memory 602. The processor 601 and the memory 602 may be connected by a bus 603. The electronic device 600 may be any type of portable device (e.g., smart camera, smart phone, tablet, etc.) or any type of stationary device (e.g., desktop computer, server, etc.).
The processor 601 may perform various actions and processes according to programs stored in the memory 602. In particular, the processor 601 may be an integrated circuit chip having signal processing capabilities. The processor may be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and may implement or perform the methods, steps, and logical blocks disclosed in the embodiments of the present disclosure. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like, and may be of the X86 architecture or the ARM architecture.
The memory 602 stores computer executable instructions that, when executed by the processor 601, implement the above-described semi-supervised learning-based medical named entity recognition model training method and semi-supervised learning-based medical named entity recognition method. The memory 602 may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct Rambus random access memory (DR RAM). It should be noted that the memory of the methods described herein is intended to include, without being limited to, these and any other suitable types of memory.
Further, the semi-supervised learning-based medical named entity recognition model training method and the semi-supervised learning-based medical named entity recognition method according to the present disclosure may be recorded in a computer-readable recording medium. In particular, according to the present disclosure, there may be provided a computer-readable recording medium storing computer-executable instructions that, when executed by a processor, cause the processor to perform the semi-supervised learning-based medical named entity recognition model training method and the semi-supervised learning-based medical named entity recognition method as described above.
It is noted that the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises at least one executable instruction for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In general, the various example embodiments of the disclosure may be implemented in hardware or special purpose circuits, software, firmware, logic, or any combination thereof. Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While aspects of the embodiments of the present disclosure are illustrated or described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The foregoing is illustrative of the present disclosure and is not to be construed as limiting thereof. Although a few exemplary embodiments of this disclosure have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this disclosure. Accordingly, all such modifications are intended to be included within the scope of this disclosure as defined in the claims. It is to be understood that the foregoing is illustrative of the present disclosure and is not to be construed as limited to the specific embodiments disclosed, and that modifications to the disclosed embodiments, as well as other embodiments, are intended to be included within the scope of the appended claims. The disclosure is defined by the claims and their equivalents.

Claims (28)

1. A medical named entity recognition model training method based on semi-supervised learning, the method comprising:
acquiring a first text data set, and performing a first preprocessing on each text data in the first text data set to construct a first number of unlabeled classification training data;
inputting the first number of classification training data into a classification model for training to obtain a trained classification model, wherein the classification model classifies based on whether the classification training data includes a medical named entity, and wherein the classification model includes a first encoding module and a first loss function module;
training the classification model using the first number of classification training data multiple times to optimize a first encoding parameter set of the first encoding module based on the result of the first loss function module;
acquiring a second text data set, and performing a second preprocessing on each text data in the second text data set to generate a second number of annotated training data, wherein the annotations are annotations of medical named entities for each text data in the second text data set, the first number being greater than the second number;
inputting the second number of annotated training data into a named entity recognition model for training to obtain a trained named entity recognition model, wherein the named entity recognition model includes a second encoding module and a second loss function module, and the second encoding module uses the optimized first encoding parameter set as an initial value of a second encoding parameter set of the second encoding module; and
training the named entity recognition model using the second number of annotated training data multiple times, based on the result of the second loss function module, to obtain an optimized second encoding parameter set.
2. The method of claim 1, wherein performing the first preprocessing on each text data in the first text data set to construct the first number of unlabeled classification training data comprises:
acquiring a medical named entity knowledge conceptual graph, wherein the medical named entity knowledge conceptual graph has predefined medical named entities;
performing maximum text matching on the medical named entity knowledge conceptual graph and each text data in the first text data set to determine whether each text data in the first text data set contains one or more medical named entities; and
constructing the first number of unlabeled classification training data based on text data in the first text data set that includes one or more medical named entities.
3. The method of claim 2, wherein constructing the first number of unlabeled classification training data based on text data in the first text data set that includes one or more medical named entities comprises:
determining, for first text data comprising medical named entities in the first text data set, that the first text data comprises L medical named entities, wherein L is greater than or equal to 1;
splitting the first text data to generate L unlabeled classification training data, wherein each classification training data comprises a medical named entity; and
constructing the first number of unlabeled classification training data based on each text data in the first text data set that includes a medical named entity.
4. The method of claim 2, wherein the result of the first loss function module comprises: a probability that the first number of unlabeled classification training data includes a medical named entity; and
training the classification model using the first number of classification training data multiple times to optimize the first encoding parameter set of the first encoding module based on the result of the first loss function module comprises:
training the classification model using the first number of classification training data multiple times to increase the probability;
and when the probability exceeds a preset threshold value, taking the current first encoding parameter set as the optimized first encoding parameter set.
5. The method of claim 1, wherein performing the second preprocessing on each text data in the second text data set to generate the second number of annotated training data comprises:
dividing each text data in the second text data set into one or more words based on a predetermined word division rule;
performing maximum text matching on each text data in the divided second text data set and a medical knowledge database to generate a second number of coarsely annotated training data;
receiving annotation information for the second number of coarsely annotated training data, wherein the annotation information includes a particular label for a character in the second number of coarsely annotated training data; and
generating the second number of annotated training data based on the annotation information.
6. The method of claim 5, wherein the particular tag comprises:
a first label for each character in the non-medical named entity;
a second label for a start character in the medical named entity;
a third label for an intermediate character in the medical named entity; and
a fourth label for an ending character in the medical named entity.
7. The method of claim 6, wherein the second loss function module comprises a conditional random field model structure, wherein the conditional random field model structure constrains the output result of the second encoding module based on predetermined constraints.
8. The method of claim 7, wherein the predetermined constraints comprise:
the start tag of each training data sequence is constrained to be either the first tag or the second tag;
the next tag after the second tag is constrained to be either the third tag or the fourth tag; and
the next tag after the third tag is constrained to be either the third tag or the fourth tag.
9. The method of claim 1, wherein the first encoding module encodes classification training data to output a vectorized representation of the classification training data; and
the second encoding module encodes training data to output a vectorized representation of the training data.
10. The method of claim 9, wherein the first encoding module and the second encoding module comprise a same depth model structure, wherein the depth model structure comprises at least one of BiLSTM, LSTM, TextCNN, Transformer, or BERT.
11. The method of claim 10, wherein a total amount of parameters of the first and second encoding parameter sets depends on a dimension of an input vector of the depth model structure and a dimension of a hidden layer.
12. The method of claim 11, wherein the total amount of parameters is 8(m² + 2m + mn), where n is the dimension of the input vector of the depth model structure and m is the dimension of the hidden layer.
13. A medical named entity recognition method based on semi-supervised learning, comprising:
inputting unlabeled text data into a trained named entity recognition model to identify a medical named entity in the text data,
wherein the trained named entity recognition model is obtained based on the semi-supervised learning-based medical named entity recognition model training method of any of claims 1-12.
14. A medical named entity recognition model training device based on semi-supervised learning, comprising:
a classification training data construction unit configured to acquire a first text data set, perform a first preprocessing on each text data in the first text data set to construct a first number of unlabeled classification training data;
a classification model training unit configured to input the first number of classification training data into a classification model for training to obtain a trained classification model, wherein the classification model classifies based on whether the classification training data comprises a medical named entity, and wherein the classification model comprises a first encoding module and a first loss function module;
a first encoding parameter set optimizing unit configured to train the classification model using the first number of classification training data a plurality of times to optimize a first encoding parameter set of the first encoding module based on a result of the first loss function module;
a training data generation unit configured to acquire a second text data set, and perform a second preprocessing on each text data in the second text data set to generate a second number of annotated training data, wherein the annotations are annotations of medical named entities for each text data in the second text data set, the first number being greater than the second number;
a named entity recognition model training unit configured to input the second number of annotated training data into a named entity recognition model for training to obtain a trained named entity recognition model, the named entity recognition model comprising a second encoding module and a second loss function module, wherein the second encoding module uses the optimized first encoding parameter set as an initial value of a second encoding parameter set of the second encoding module; and
a second encoding parameter set optimizing unit configured to train the named entity recognition model using the second number of annotated training data a plurality of times based on a result of the second loss function module to obtain an optimized second encoding parameter set.
15. The apparatus of claim 14, wherein the classification training data construction unit further comprises:
a medical named entity knowledge conceptual graph acquisition component configured to acquire a medical named entity knowledge conceptual graph, wherein the medical named entity knowledge conceptual graph has predefined medical named entities;
a maximum text matching component configured to perform maximum text matching on the medical named entity knowledge conceptual graph and each text data in the first text data set to determine whether each text data in the first text data set contains one or more medical named entities; and
a classification training data construction component configured to construct the first number of unlabeled classification training data based on text data in the first text data set that includes one or more medical named entities.
16. The apparatus of claim 15, wherein the classification training data construction component is further configured to:
determining, for first text data comprising medical named entities in the first text data set, that the first text data comprises L medical named entities, wherein L is greater than or equal to 1;
splitting the first text data to generate L unlabeled classification training data, wherein each classification training data comprises a medical named entity; and
constructing the first number of unlabeled classification training data based on each text data in the first text data set that includes a medical named entity.
17. The apparatus of claim 15, wherein the result of the first loss function module comprises: a probability that the first number of unlabeled classification training data includes a medical named entity; and
the first encoding parameter set optimizing unit is further configured to:
training the classification model using the first number of classification training data multiple times to increase the probability;
and when the probability exceeds a preset threshold value, taking the current first encoding parameter set as the optimized first encoding parameter set.
18. The apparatus of claim 14, wherein the training data generation unit is further configured to:
dividing each text data in the second text data set into one or more words based on a predetermined word division rule;
performing maximum text matching on each text data in the divided second text data set and a medical knowledge database to generate a second number of coarsely annotated training data;
receiving annotation information for the second number of coarsely annotated training data, wherein the annotation information includes a particular label for a character in the second number of coarsely annotated training data; and
generating the second number of annotated training data based on the annotation information.
19. The apparatus of claim 18, wherein the particular tag comprises:
a first label for each character in the non-medical named entity;
a second label for a start character in the medical named entity;
a third label for an intermediate character in the medical named entity; and
a fourth label for an ending character in the medical named entity.
20. The apparatus of claim 19, wherein the second loss function module comprises a conditional random field model structure, wherein the conditional random field model structure constrains the output result of the second encoding module based on predetermined constraints.
21. The apparatus of claim 20, wherein the predetermined constraints comprise:
the start tag of each training data sequence is constrained to be either the first tag or the second tag;
the next tag after the second tag is constrained to be either the third tag or the fourth tag; and
the next tag after the third tag is constrained to be either the third tag or the fourth tag.
22. The apparatus of claim 14, wherein the first encoding module encodes classification training data to output a vectorized representation of the classification training data; and
the second encoding module encodes training data to output a vectorized representation of the training data.
23. The apparatus of claim 22, wherein the first encoding module and the second encoding module comprise a same depth model structure, wherein the depth model structure comprises at least one of BiLSTM, LSTM, TextCNN, Transformer, or BERT.
24. The apparatus of claim 23, wherein a total amount of parameters of the first and second encoding parameter sets depends on a dimension of an input vector of the depth model structure and a dimension of a hidden layer.
25. The apparatus of claim 24, wherein the total amount of parameters is 8(m² + 2m + mn), where n is the dimension of the input vector of the depth model structure and m is the dimension of the hidden layer.
26. A medical named entity recognition device based on semi-supervised learning, comprising:
a medical named entity recognition unit configured to input unlabeled text data into a trained named entity recognition model to identify a medical named entity in the text data, wherein the trained named entity recognition model is obtained based on the semi-supervised learning-based medical named entity recognition model training method of any of claims 1-12.
27. An electronic device, comprising:
a processor; and
a memory, wherein the memory has stored therein computer readable code which, when executed by the processor, implements the semi-supervised learning based medical named entity recognition model training method of any of claims 1-12 or the semi-supervised learning based medical named entity recognition method of claim 13.
28. A non-transitory computer readable storage medium storing computer readable instructions, wherein the computer readable instructions, when executed by a processor, implement the semi-supervised learning based medical named entity recognition model training method of any of claims 1-12 or the semi-supervised learning based medical named entity recognition method of claim 13.
CN202211656436.7A 2022-12-22 2022-12-22 Medical named entity recognition model training method, device, equipment and medium Active CN115859984B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211656436.7A CN115859984B (en) 2022-12-22 2022-12-22 Medical named entity recognition model training method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN115859984A CN115859984A (en) 2023-03-28
CN115859984B true CN115859984B (en) 2024-01-23

Family

ID=85653865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211656436.7A Active CN115859984B (en) 2022-12-22 2022-12-22 Medical named entity recognition model training method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN115859984B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118173215A (en) * 2024-05-14 2024-06-11 北京壹永科技有限公司 Small model training method, method for treating tumor clinical record data and device thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11681944B2 (en) * 2018-08-09 2023-06-20 Oracle International Corporation System and method to generate a labeled dataset for training an entity detection system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107527073A (en) * 2017-09-05 2017-12-29 中南大学 The recognition methods of entity is named in electronic health record
CN110083831A (en) * 2019-04-16 2019-08-02 武汉大学 A kind of Chinese name entity recognition method based on BERT-BiGRU-CRF
CN110188331A (en) * 2019-06-03 2019-08-30 腾讯科技(深圳)有限公司 Model training method, conversational system evaluation method, device, equipment and storage medium
CN110399616A (en) * 2019-07-31 2019-11-01 国信优易数据有限公司 Name entity detection method, device, electronic equipment and readable storage medium storing program for executing
CN114036950A (en) * 2021-11-10 2022-02-11 山东大学 Medical text named entity recognition method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant