CN114912456B - Medical entity relationship identification method and device and storage medium

Info

Publication number: CN114912456B
Application number: CN202210844619.5A
Authority: CN
Inventors: 凌鸿顺; 王实; 张奇
Original assignee: Beijing Huimeiyun Technology Co ltd
Current assignee: Beijing Huimeiyun Technology Co ltd
Priority date: 2022-07-19
Filing date: 2022-07-19
Publication date: 2022-09-23
Anticipated expiration: 2042-07-19
Also published as: CN114912456A

Abstract

The application provides a method, a device and a storage medium for identifying medical entity relationships. The method comprises: acquiring a target electronic medical record text; performing entity word recognition on the target electronic medical record text through a predetermined entity word recognition model, and recognizing the medical entity words included in the target electronic medical record text and the entity type of each medical entity word; adding an identity identifier and an entity recognition identifier to each recognized medical entity word; arranging and combining all medical entity words to which the identity identifier and the entity recognition identifier have been added, in the reading order of the target electronic medical record text, to generate a multi-tuple phrase; and inputting the multi-tuple phrase into a pre-trained entity relationship recognition model to determine the recognition result of the medical entity relationship of the target electronic medical record text. The recognition method provided by this scheme can thereby effectively improve the accuracy of medical entity relationship recognition.

Description

Medical entity relationship identification method and device and storage medium

Technical Field

The present application relates to the field of data processing, and in particular, to a method, an apparatus, and a storage medium for identifying medical entity relationships.

Background

With the rapid development of hospital informatization, more and more medical data are being accumulated, the most basic of which is the electronic medical record. However, most electronic medical records are written in natural language and contain partly unstructured data, so the useful information in them cannot be used directly by clinical decision systems that depend on structured data, and text data normalization is therefore required. Text data normalization plays an important role in applications such as clinical decision support systems, content quality control and differential diagnosis, and medical entity relationship recognition is an important step in the normalization process.

Medical entity relationship recognition refers to extracting entity words such as anatomical sites, diagnoses and tumor stages from an electronic medical record, and establishing relationships among the extracted entity words to form meaningful phrases. At present, entity relationships are generally obtained by using multiple CNN (Convolutional Neural Network) and LSTM (Long Short-Term Memory) deep learning networks to extract features of different dimensions, and then combining these networks to select the entity relationship corresponding to a sample. However, this approach does not consider the contextual semantic information among different entities, which results in inaccurate recognition.

Disclosure of Invention

In view of this, an object of the present application is to provide a method, an apparatus, and a storage medium for recognizing a medical entity relationship, which can effectively improve accuracy of recognizing the medical entity relationship.

The embodiment of the application provides a method for identifying medical entity relationships, which comprises the following steps:

acquiring a target electronic medical record text;

performing entity word recognition on the target electronic medical record text through a predetermined entity word recognition model, and recognizing medical entity words and entity types of each medical entity word included in the target electronic medical record text;

adding an identity identifier and an entity recognition identifier to each recognized medical entity word; the identity identifier comprises a character identity identifier of each character in each medical entity word and a type identity identifier of each medical entity word determined according to the entity type of that medical entity word, and the entity recognition identifier is used for indicating whether the medical entity word is a word to be recognized;

arranging and combining all medical entity words to which the identity identifier and the entity recognition identifier have been added, in the reading order of the target electronic medical record text, to generate a multi-tuple phrase;

inputting the multi-tuple phrase into a pre-trained entity relationship recognition model, and determining the recognition result of the medical entity relationship of the target electronic medical record text; the entity relationship recognition model is a student model trained through knowledge distillation for recognizing medical entity relationships.

Optionally, the entity relationship recognition model is constructed by the following steps:

acquiring a text sample set constructed based on electronic medical record texts to be trained; the text sample set comprises corresponding real label data;

fine-tuning a pre-trained language model based on the text sample set, and determining the fine-tuned pre-trained language model as a teacher model; the pre-trained language model is a BERT model;

inputting the text sample set into the teacher model and an initial student model respectively, to obtain soft label data output by the teacher model; the initial student model is a CNN model;

determining a distillation loss function based on the real label data and the soft label data;

and iteratively training the initial student model based on the distillation loss function until the initial student model converges, to obtain the entity relationship recognition model.

Optionally, determining the distillation loss function based on the real label data and the soft label data comprises:

determining a first loss function based on the soft label data;

determining a second loss function based on the real label data;

and performing weighted summation on the first loss function and the second loss function to determine the distillation loss function.
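The weighted summation above can be illustrated with a minimal sketch. The source does not state the form of either loss, so the use of cross-entropy for both terms, the weight `alpha`, and all function names here are assumptions made for illustration only.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, predicted_probs))

def distillation_loss(student_logits, teacher_probs, true_label, alpha=0.5):
    """Weighted sum of a soft-label loss and a real-label loss.

    alpha is a hypothetical weight; the source only states that the two
    losses are combined by weighted summation.
    """
    student_probs = softmax(student_logits)
    # First loss: student prediction against the teacher's soft labels.
    soft_loss = cross_entropy(teacher_probs, student_probs)
    # Second loss: student prediction against the one-hot real label.
    one_hot = [1.0 if i == true_label else 0.0 for i in range(len(student_logits))]
    hard_loss = cross_entropy(one_hot, student_probs)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Binary case: index 0 = "has a relationship", index 1 = "no relationship".
loss = distillation_loss([2.0, 0.0], [0.9, 0.1], true_label=0, alpha=0.5)
```

With `alpha=1.0` the student learns only from the teacher's soft labels, and with `alpha=0.0` only from the real labels; intermediate values trade off the two.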

Optionally, the text sample set is determined by:

acquiring a plurality of electronic medical record texts to be trained;

for each electronic medical record text to be trained, performing entity word recognition on the text through the entity word recognition model, and determining the medical entity words included in the text and the entity type of each medical entity word;

screening the medical entity words included in the electronic medical record text to be trained according to a preset entity type combination rule and the entity type of each medical entity word, and combining the screened medical entity words to generate a sample to be trained corresponding to the electronic medical record text to be trained; the entity type combination rule specifies the entity types of the medical entity words that the generated sample to be trained is required to include;

and forming the text sample set based on the to-be-trained samples corresponding to all the to-be-trained electronic medical record texts.

Optionally, the entity type combination rule is constructed according to preset medical items; wherein each medical item specifies the entity types of the medical entity words that must be included for a medical entity relationship to exist.

Optionally, the fine-tuning the pre-training language model based on the text sample set, and determining the fine-tuned pre-training language model as a teacher model includes:

for each sample to be trained in the text sample set, adding a corresponding type identifier to each medical entity word according to the entity type of each medical entity word in the sample to be trained, and adding a classification identifier at the start of the sample to be trained;

and taking each sample to be trained, with the type identifiers and the classification identifier added, as the input of the pre-trained language model, taking the real label data corresponding to each sample to be trained as the target output of the pre-trained language model, fine-tuning the pre-trained language model, and determining the fine-tuned pre-trained language model as the teacher model.

Optionally, the medical items include symptoms, medications, surgery, scoring tables, tests, and examinations.

The embodiment of the present application further provides an apparatus for identifying a medical entity relationship, where the apparatus includes:

the acquisition module is used for acquiring a target electronic medical record text;

the identification module is used for carrying out entity word identification on the target electronic medical record text through a predetermined entity word identification model, and identifying medical entity words and entity types of each medical entity word in the target electronic medical record text;

the adding module is used for adding an identity identifier and an entity recognition identifier to each recognized medical entity word; the identity identifier comprises a character identity identifier of each character in each medical entity word and a type identity identifier of each medical entity word determined according to the entity type of that medical entity word, and the entity recognition identifier is used for indicating whether the medical entity word is a word to be recognized;

the generating module is used for arranging and combining all medical entity words to which the identity identifier and the entity recognition identifier have been added, in the reading order of the target electronic medical record text, to generate a multi-tuple phrase;

the first determining module is used for inputting the multi-tuple phrase into a pre-trained entity relationship recognition model and determining the recognition result of the medical entity relationship of the target electronic medical record text; the entity relationship recognition model is a student model trained through knowledge distillation for recognizing medical entity relationships.

Optionally, the identification apparatus further includes a model building module, where the model building module is configured to:

Optionally, when the model building module is configured to determine the distillation loss function based on the real label data and the soft label data, the model building module is configured to:

determining a first loss function based on the soft label data;

determining a second loss function based on the real label data;

Optionally, the identification apparatus further includes a second determining module, where the second determining module is configured to:

acquiring a plurality of electronic medical record texts to be trained;

Optionally, when the model building module is configured to perform fine tuning on the pre-training language model based on the text sample set and determine the fine-tuned pre-training language model as the teacher model, the model building module is configured to:

Optionally, the medical items include symptoms, medications, surgery, scoring tables, tests, and examinations.

An embodiment of the present application further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the steps of the identification method as described above.

Embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the identification method as described above.

The embodiment of the application provides a method, a device and a storage medium for identifying medical entity relationships. The identification method comprises: acquiring a target electronic medical record text; performing entity word recognition on the target electronic medical record text through a predetermined entity word recognition model, and recognizing the medical entity words included in the target electronic medical record text and the entity type of each medical entity word; adding an identity identifier and an entity recognition identifier to each recognized medical entity word, where the identity identifier comprises a character identity identifier of each character in each medical entity word and a type identity identifier of each medical entity word determined according to the entity type of that medical entity word, and the entity recognition identifier is used for indicating whether the medical entity word is a word to be recognized; arranging and combining all medical entity words to which the identity identifier and the entity recognition identifier have been added, in the reading order of the target electronic medical record text, to generate a multi-tuple phrase; and inputting the multi-tuple phrase into a pre-trained entity relationship recognition model to determine the recognition result of the medical entity relationship of the target electronic medical record text, where the entity relationship recognition model is a student model trained through knowledge distillation for recognizing medical entity relationships.

In this way, the medical items and the entity relationships to be recognized under each medical item are defined based on prior knowledge of the medical field; restricting recognition to this specific range makes the task simpler, so the model can learn it better. The BERT pre-trained language model is fine-tuned on the downstream entity relationship recognition task and an entity enhancement method is adopted, which improves the recognition performance of the BERT model. By using knowledge distillation, the performance of the CNN approaches that of the BERT model, which addresses both the slow inference speed of the BERT model and the low recognition accuracy of the CNN.

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should therefore not be considered as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.

Fig. 1 is a flowchart of a method for identifying medical entity relationships according to an embodiment of the present application;

Fig. 2 is a schematic structural diagram of an input feature of an entity relationship recognition model provided herein;

Fig. 3 is a schematic structural diagram of input features of a pre-trained language model constructed in the present application;

Fig. 4 is a schematic structural diagram of an apparatus for identifying medical entity relationships according to an embodiment of the present application;

Fig. 5 is a second schematic structural diagram of an apparatus for identifying medical entity relationships according to an embodiment of the present application;

Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. Every other embodiment that one skilled in the art can obtain without inventive effort based on the embodiments of the present application falls within the scope of protection of the present application.

With the rapid development of hospital informatization, more and more medical data are being accumulated, the most basic of which is the electronic medical record. However, most electronic medical records are written in natural language and contain partly unstructured data, so the useful information in them cannot be used directly by clinical decision systems that depend on structured data, and text data normalization is required. Medical entity relationship recognition is an important step in the normalization process.

Medical entity relationship recognition refers to extracting entity words such as anatomical sites, diagnoses and tumor stages from an electronic medical record, and establishing relationships among the extracted entity words to form meaningful phrases. At present, entity relationships are generally obtained by using multiple CNN (Convolutional Neural Network) and LSTM (Long Short-Term Memory) deep learning networks to extract features of different dimensions, and then combining these networks to select the entity relationship corresponding to a sample. However, this approach suffers from inaccurate recognition.

Based on this, the embodiment of the application provides a method, a device and a storage medium for identifying a medical entity relationship, which can effectively improve the accuracy of identifying the medical entity relationship.

Referring to Fig. 1, Fig. 1 is a flowchart of a method for identifying medical entity relationships according to an embodiment of the present application. As shown in Fig. 1, the identification method provided in an embodiment of the present application includes:

S101: acquiring a target electronic medical record text.

Here, the target electronic medical record text is text recorded in an electronic medical record on which medical entity relationship recognition is to be performed. The target electronic medical record may be input by a user, stored in local storage, or stored on a cloud server.

The medical entity relationship identification refers to extracting entities such as anatomical parts, diagnoses, tumor stages and the like from the electronic medical records, and forming meaningful phrases based on the extracted entities.

For example, the acquired target electronic medical record text may be "bilateral neck, axilla, inguinal lymph node visible", and the subsequent steps determine whether the text contains a medical entity relationship.

S102: performing entity word recognition on the target electronic medical record text through a predetermined entity word recognition model, and recognizing the medical entity words included in the target electronic medical record text and the entity type of each medical entity word.

Here, the entity word recognition model is a pre-trained model that can perform medical entity word recognition and entity type recognition corresponding to the medical entity word. After the entity type of each medical entity word is identified, a corresponding entity type label may be added to the medical entity word.

The entity types are defined in advance, and the entity word recognition model determines the entity type of each medical entity word. The medical entity words are named entity words carrying medical information. The entity types may include various medically meaningful types such as anatomical site BDY, orientation POS, symptom SYM, observation object WAT, property ATT, time TIM and presence state EXT; 37 entity types are defined in total.

For example, the result of entity word recognition on the electronic medical record text "bilateral neck, axilla, inguinal lymph node visible" is "bilateral (POS) neck (BDY) axilla (BDY) inguinal region (BDY) lymph node (BDY) visible (EXT)".

S103: adding an identity identifier and an entity recognition identifier to each recognized medical entity word.

Here, the identity identifier includes a character identity identifier of each character in each medical entity word and a type identity identifier of each medical entity word determined according to the entity type of that medical entity word, and the entity recognition identifier is used to indicate whether the medical entity word is a word to be recognized.

Specifically, a medical single-character dictionary can be constructed in advance in which each character is assigned a character identifier (id). The ids are maintained in a vocabulary, and each character corresponds to exactly one id. The character identity identifier of each character in a medical entity word can thus be determined.

The type identity identifier of each medical entity word can be determined through a pre-constructed entity type dictionary. As with character ids, each entity type in the entity type dictionary corresponds to an id maintained in a vocabulary, and each entity type corresponds to exactly one id. The type identity identifier of each medical entity word in the electronic medical record text can therefore be determined from the entity type recognized by the entity word recognition model together with the entity type dictionary.

It should be noted that adding the identity identifier and the entity recognition identifier to each medical entity word makes it easier for the entity relationship recognition model to recognize the input features. The identity identifier and the entity recognition identifier can likewise be added to the training samples during training of the entity relationship recognition model.

For example, referring to Fig. 2, Fig. 2 is a schematic structural diagram of an input feature of the entity relationship recognition model provided in the present application. As shown in Fig. 2, in the third row (type identifiers), the entity types of the neck, axilla, inguinal region and lymph node are the same, so their type identifiers are the same. In the fourth row, the entity recognition identifiers are all marked as 1, because all medical entity words in "bilateral neck axilla inguinal lymph node visible" are words to be recognized; if one or more entity words did not need to be recognized, 0 would be marked at the corresponding positions of the entity recognition identifier.
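The feature rows described for Fig. 2 can be sketched as follows. This is only an illustration under stated assumptions: the rows are assumed to be aligned per character, English letters stand in for the characters of the original record, id 0 is assumed to be reserved for unknown items, and the helper names `build_vocab` and `build_features` are hypothetical.

```python
def build_vocab(items):
    """Assign each distinct item a unique integer id (0 is reserved for unknown)."""
    vocab = {}
    for item in items:
        if item not in vocab:
            vocab[item] = len(vocab) + 1
    return vocab

def build_features(entities, char_vocab, type_vocab):
    """Build three aligned per-character rows.

    entities: list of (word, entity_type, to_be_recognized) in reading order.
    Returns character ids, type ids, and entity recognition flags
    (1 = word to be recognized, 0 = not to be recognized).
    """
    char_ids, type_ids, flags = [], [], []
    for word, entity_type, to_be_recognized in entities:
        for ch in word:
            char_ids.append(char_vocab.get(ch, 0))
            type_ids.append(type_vocab.get(entity_type, 0))
            flags.append(1 if to_be_recognized else 0)
    return char_ids, type_ids, flags

entities = [
    ("bilateral", "POS", True),
    ("neck", "BDY", True),
    ("axilla", "BDY", True),
    ("inguinal region", "BDY", True),
    ("lymph node", "BDY", True),
    ("visible", "EXT", True),
]
char_vocab = build_vocab(ch for word, _, _ in entities for ch in word)
type_vocab = build_vocab(entity_type for _, entity_type, _ in entities)
char_ids, type_ids, flags = build_features(entities, char_vocab, type_vocab)
```

Every character of a word shares that word's type id and flag, so the four anatomical-site words all carry the same type id, as the description of the third row states.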

S104: arranging and combining all medical entity words to which the identity identifier and the entity recognition identifier have been added, in the reading order of the target electronic medical record text, to generate a multi-tuple phrase.

Here, the electronic medical record text needs to be rearranged and combined because it may contain characters, words, symbols and the like that do not help in determining the medical entity relationship. This information is therefore deleted and only the recognized, medically meaningful entity words are retained, generating the multi-tuple phrase on which medical entity relationship recognition is to be performed.

The multi-tuple phrase comprises medical entity words of multiple entity types, and may contain multiple medical entity words.
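Step S104 can be sketched as follows, assuming the recognition step yields each entity word together with its character offset in the text. The function name `build_multi_tuple_phrase` and the `(offset, word, type)` layout are hypothetical; the source does not specify a data format.

```python
def build_multi_tuple_phrase(recognized):
    """Keep only recognized entity words, ordered by their position in the text.

    recognized: iterable of (start_offset, word, entity_type).
    Characters that are not part of any recognized entity word (punctuation,
    filler words) are simply never included.
    """
    ordered = sorted(recognized, key=lambda item: item[0])
    return [(word, entity_type) for _, word, entity_type in ordered]

text = "bilateral neck, axilla, inguinal region lymph node visible"
# Recognition results listed out of order on purpose.
labels = {
    "visible": "EXT",
    "neck": "BDY",
    "bilateral": "POS",
    "lymph node": "BDY",
    "axilla": "BDY",
    "inguinal region": "BDY",
}
recognized = [(text.index(word), word, entity_type) for word, entity_type in labels.items()]
phrase = build_multi_tuple_phrase(recognized)
```

Sorting by offset restores the reading order regardless of the order in which the entity words were reported, and the commas in the text are dropped because they belong to no entity word.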

S105: inputting the multi-tuple phrase into a pre-trained entity relationship recognition model, and determining the recognition result of the medical entity relationship of the target electronic medical record text.

Here, the entity relationship recognition model is a student model trained through knowledge distillation for recognizing medical entity relationships. The model performs medical entity relationship recognition on the input multi-tuple phrase and determines the recognition result for that phrase, which is also the recognition result of the medical entity relationship of the target electronic medical record text. The recognition result is a binary classification result: either "has a relationship" or "does not have a relationship".

In one embodiment provided by the present application, the entity relationship recognition model is constructed by: acquiring a text sample set constructed based on electronic medical record texts to be trained, the text sample set comprising corresponding real label data; fine-tuning a pre-trained language model based on the text sample set, and determining the fine-tuned pre-trained language model as a teacher model, the pre-trained language model being a BERT model; inputting the text sample set into the teacher model and an initial student model respectively, to obtain soft label data output by the teacher model, the initial student model being a CNN model; determining a distillation loss function based on the real label data and the soft label data; and iteratively training the initial student model based on the distillation loss function until the initial student model converges, to obtain the entity relationship recognition model.

Here, the text sample set includes positive sample data (samples having a medical entity relationship) and negative sample data (samples having no medical entity relationship). The real label data corresponding to positive sample data is "has a relationship", and the real label data corresponding to negative sample data is "does not have a relationship".

In another embodiment provided herein, the text sample set is determined by: acquiring a plurality of electronic medical record texts to be trained; for each electronic medical record text to be trained, performing entity word recognition on the text through the entity word recognition model, and determining the medical entity words included in the text and the entity type of each medical entity word; screening the medical entity words included in the electronic medical record text to be trained according to a preset entity type combination rule and the entity type of each medical entity word, and combining the screened medical entity words to generate a sample to be trained corresponding to the electronic medical record text to be trained, where the entity type combination rule specifies the entity types of the medical entity words that the generated sample to be trained is required to include; and forming the text sample set from the samples to be trained corresponding to all the electronic medical record texts to be trained.

Here, the electronic medical record text to be trained is text recorded in an electronic medical record. The entity type combination rule specifies that medical entity words of particular entity types have a medical entity relationship when they occur together; that is, the rule defines combination forms of entity types, and it may contain several such combination forms.

The entity type combination rule is constructed according to preset medical items, where each medical item specifies the entity types of the medical entity words that must be included for a medical entity relationship to exist. The preset medical items include at least symptoms, medications, surgery, scoring tables, tests, and examinations.

The medical item categories are determined according to medical research content. Six medical items are defined, and the entity type combination form of each medical item is as follows:

the combination form of entity types corresponding to the symptom item is [symptom SYM + existing state EXA/MAY/NEG + property ATT + position or orientation POS/BDY + time TIM];

the combination form of entity types corresponding to the medication item is [drug name MED + route of administration ROU + dose DOS + specification SPE + frequency FRE + time TIM];

the combination form of entity types corresponding to the surgery item is [access OPR + operation name OPM + start time TIM + end time TIM + operation duration TIM + site POS + implant OPX];

the combination form of entity types corresponding to the scoring table item is [scoring table name WAT + numerical value VAU + unit UNT + time point TIM + time period TIM];

the entities to be recognized in the test item are [test detail entity name WAT + name of the test sheet to which it belongs WAT + time point TIM + time period TIM + numerical value VAU + unit UNT + existing state EXA/MAY/NEG];

the entities to be recognized in the examination item are [examination method EXA + examination site POS/BDY + numerical value VAU + unit UNT + examination conclusion SYM + report time TIM].

It should be noted that the entity types in the combination form corresponding to each medical item may be divided into necessary entity types and unnecessary entity types. For example, in the combination form corresponding to the symptom item, the symptom SYM, the time TIM and the position or orientation POS/BDY are necessary entity types, and the rest are unnecessary entity types. Samples that conform to a combination form in the entity type combination rule are positive sample data, and samples that do not conform are negative sample data. Positive sample data must include medical entity words of the necessary entity types, and may or may not include words of the unnecessary entity types.
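The distinction between necessary and unnecessary entity types can be sketched as a simple rule check. The rule layout, the name `matches_rule`, and the example words are hypothetical; treating a sample that contains a type outside the rule as non-conforming is also an assumption, since the source does not say how such samples are handled.

```python
# Symptom item: SYM, TIM and one of POS/BDY are necessary; the rest are optional.
SYMPTOM_RULE = {
    "required": [{"SYM"}, {"TIM"}, {"POS", "BDY"}],  # one type from each group is needed
    "optional": {"EXA", "MAY", "NEG", "ATT"},
}

def matches_rule(sample, rule):
    """Return True if the sample conforms to the entity type combination rule.

    sample: list of (word, entity_type).
    """
    types = {entity_type for _, entity_type in sample}
    allowed = set(rule["optional"]).union(*rule["required"])
    has_required = all(types & group for group in rule["required"])
    return has_required and types <= allowed

# Illustrative words only; they do not come from the source.
positive = [("headache", "SYM"), ("head", "BDY"), ("three days", "TIM")]
negative = [("headache", "SYM"), ("head", "BDY")]  # the necessary TIM is missing
is_positive = matches_rule(positive, SYMPTOM_RULE)
is_negative = not matches_rule(negative, SYMPTOM_RULE)
```

A sample that also contains an optional type such as ATT would still conform, which mirrors the statement that words of unnecessary entity types may or may not be present.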

As an example, the process of determining the text sample set is described for the electronic medical record text to be trained "bilateral neck axilla inguinal lymph node visible". According to the entity type combination rule and the entity words and entity types included in the text, the text is determined to belong to the examination category, and the relationship groups (training samples) recognized as having a medical entity relationship in the examination category are: {bilateral POS, neck BDY, lymph node BDY, visible EXT}, {axilla BDY, lymph node BDY, visible EXT}, {inguinal region BDY, lymph node BDY, visible EXT}. After the positive sample data is determined, a first entity type sequence corresponding to the positive sample data can also be determined. Entity words in the electronic medical record text to be trained are then randomly combined such that the second entity type sequence of each randomly generated sample differs from the first entity type sequence; such randomly generated samples are negative sample data. Examples of generated negative sample data are {bilateral POS, lymph node BDY} and {bilateral POS, neck BDY, visible EXT}, both of which are entity groups with no relationship or an incomplete relationship.
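The random generation of negative samples described above can be sketched as follows. The sampling strategy (random subset size, reading order preserved, a fixed seed and an attempt limit) and the names `type_sequence` and `generate_negatives` are assumptions; the source only requires that a negative sample's entity type sequence differ from that of the positive samples.

```python
import random

def type_sequence(sample):
    """Entity type sequence of a sample, e.g. ('POS', 'BDY', 'EXT')."""
    return tuple(entity_type for _, entity_type in sample)

def generate_negatives(entities, positives, count, seed=0, max_attempts=1000):
    """Randomly combine entity words into samples whose type sequence
    differs from every positive sample's type sequence."""
    rng = random.Random(seed)
    positive_sequences = {type_sequence(p) for p in positives}
    negatives = []
    for _ in range(max_attempts):
        if len(negatives) >= count:
            break
        size = rng.randint(2, len(entities))
        # Sort the sampled indices so the combination keeps reading order.
        indices = sorted(rng.sample(range(len(entities)), size))
        candidate = [entities[i] for i in indices]
        if type_sequence(candidate) not in positive_sequences and candidate not in negatives:
            negatives.append(candidate)
    return negatives

entities = [("bilateral", "POS"), ("neck", "BDY"), ("axilla", "BDY"),
            ("inguinal region", "BDY"), ("lymph node", "BDY"), ("visible", "EXT")]
positives = [
    [("bilateral", "POS"), ("neck", "BDY"), ("lymph node", "BDY"), ("visible", "EXT")],
    [("axilla", "BDY"), ("lymph node", "BDY"), ("visible", "EXT")],
    [("inguinal region", "BDY"), ("lymph node", "BDY"), ("visible", "EXT")],
]
negatives = generate_negatives(entities, positives, count=3)
```

Under this filter the two negative examples given in the text, with type sequences (POS, BDY) and (POS, BDY, EXT), would both be accepted, since neither sequence appears among the positive samples.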

It should be noted that each electronic medical record text to be trained yields at least one piece of positive sample data, and may yield multiple pieces of positive sample data and multiple pieces of negative sample data. A large amount of sample data is generated in this way and forms the text sample set. The real label corresponding to each piece of sample data is determined at the same time as the sample data is generated.

Therefore, medical entity relationship identification requires defining specific medical item categories and the entity types to be identified under each category. Because the identification range is well defined, the task is simpler than an open relationship identification task, and the model can learn it better. In addition, relationship identification in the medical field requires combining multiple non-adjacent entity words in a sentence into a phrase with medical meaning; other fields generally do not combine multiple non-adjacent entities in this way and instead perform triple relationship identification. Based on these characteristics of the medical field, multi-group identification can be designed, where a multi-group represents a combination of multiple entity words whose specific number is determined by the previously defined entity type combination rule.

Therefore, determining the positive and negative sample data through the entity type combination rule allows the sample data and the real label corresponding to each sample to be generated automatically, which reduces the manual labeling cost. Once the entity type combination rule is preset, ordinary personnel can complete the data labeling.

In another embodiment provided by the present application, the fine-tuning the pre-training language model based on the text sample set, and determining the fine-tuned pre-training language model as a teacher model includes: adding a corresponding type identifier for each medical entity word and adding a classification identifier for the starting end of each sample to be trained according to the entity type of each medical entity word in the sample to be trained aiming at each sample to be trained in the text sample set; and taking each sample to be trained added with the type identifier and the classification identifier as an input characteristic of the pre-training language model, taking real label data corresponding to each sample to be trained as an output characteristic of the pre-training language model, finely adjusting the pre-training language model, and determining the finely adjusted pre-training language model as a teacher model.

Before the pre-training language model (BERT model) is fine-tuned, the scheme provides a method for constructing the input features of the BERT model, which specifically comprises the following steps: for each sample to be trained in the text sample set (a sample to be trained is either positive sample data or negative sample data), identifying the entity type of each medical entity word included in the sample, and adding a type identifier to each medical entity word according to a predetermined mapping between entity types and type identifiers; adding a classification identifier at the starting end of the sample to be trained, where the classification identifier indicates the type of task executed by the pre-training language model, which here is a classification task; and finally, fine-tuning the BERT model based on the text sample set with the type identifiers and classification identifiers added, to obtain the teacher model.

When the BERT model is fine-tuned, a certain proportion of the texts in the text sample set can be selected as training samples and the remaining texts used as test samples, and the BERT model is trained iteratively. Fine-tuning stops when the preset number of iterations is reached, and the BERT model that achieved the highest F1 value on the test samples (the F1 value is the harmonic mean of precision and recall) is saved as the teacher model.
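The selection criterion above can be illustrated with a short sketch. The helper names `f1_score` and `select_best_checkpoint`, and the convention that label 1 marks a positive sample, are assumptions for illustration; the F1 values used in the usage note are made up and are not results from the present application.

```python
def f1_score(true_labels, predicted_labels):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_best_checkpoint(f1_per_iteration):
    """Return (iteration index, F1) of the iteration with the highest test F1."""
    best = max(range(len(f1_per_iteration)), key=lambda i: f1_per_iteration[i])
    return best, f1_per_iteration[best]
```

For instance, with hypothetical test F1 values of 0.81, 0.88 and 0.86 over three iterations, `select_best_checkpoint([0.81, 0.88, 0.86])` picks the second iteration (index 1), whose model would then be saved as the teacher model.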

When the type identifier is added to each medical entity word, the type identifier corresponding to the medical entity word may be added to the head and tail positions of the medical entity word. By adding the type identifier to each sample to be trained, the entity characteristic enhancement effect is achieved, and the model identification is facilitated.

For example, referring to fig. 3, fig. 3 is a schematic structural diagram of the input features of the pre-trained language model constructed according to the present application. Before the type identifiers and the classification identifier are added, the reserved tokens [unused0], [unused1] to [unusedN] in the BERT vocabulary are assigned one-to-one to the entity types as type identifiers, where the value of N is determined by the number of entity types. For example, the anatomical part BDY corresponds to [unused0], the orientation POS corresponds to [unused1], and the existing state EXI corresponds to [unused3], so that each entity type has a unique corresponding token in the BERT vocabulary. With this unique mapping from entity type to vocabulary token, when the input features of the BERT model are built, a pair of [unusedX] tokens (X is determined by the entity type) is placed immediately before and after each medical entity word to be recognized, with the medical entity word in the middle. In this way, when the multi-group relationship is predicted (that is, whether multiple candidate entities can be combined into a medical phrase), each medical entity word receives entity feature enhancement from its pair of [unusedX] tokens. As shown in fig. 3, CLS represents the classification identifier, U0 is unused0, U1 is unused1, and U3 is unused3.
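The construction of the input features can be sketched as follows. This is a simplified illustration under stated assumptions: the sample is represented only by its entity words, each entity word is kept as a single token (a real BERT tokenizer would split it further into characters or word pieces), and the function name `build_input_tokens` is hypothetical. Only the three type-to-token mappings named in the text are included.

```python
# Mapping from entity type to reserved BERT vocabulary token, as named in the text.
TYPE_TO_MARKER = {
    "BDY": "[unused0]",  # anatomical part
    "POS": "[unused1]",  # orientation
    "EXI": "[unused3]",  # existing state
}


def build_input_tokens(sample, type_to_marker=TYPE_TO_MARKER):
    """Prefix the sample with [CLS] and wrap each entity word in a pair of
    type identifier tokens determined by its entity type."""
    tokens = ["[CLS]"]  # classification identifier at the starting end
    for word, entity_type in sample:
        marker = type_to_marker[entity_type]
        tokens.extend([marker, word, marker])  # marker before and after the word
    return tokens


sample = [("bilateral", "POS"), ("neck", "BDY"), ("lymph node", "BDY")]
print(" ".join(build_input_tokens(sample)))
# [CLS] [unused1] bilateral [unused1] [unused0] neck [unused0] [unused0] lymph node [unused0]
```

Reusing reserved vocabulary tokens in this way means the vocabulary and the embedding matrix of the pre-trained model do not have to be resized; the embeddings of the reserved tokens are simply learned during fine-tuning.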

In addition, experiments showed that entity feature enhancement improved the F1 value of the model by 2% compared with the model without entity feature enhancement.

After the teacher model is determined, the teacher model predicts the samples of the text sample set, and the soft label data corresponding to each sample is determined. When a sample is predicted by the teacher model (the fine-tuned BERT model), the vector output of each Transformer encoder layer of the BERT model can be obtained. The BERT model may be a BERT-large model with 24 Transformer layers, and the two-dimensional vector output by the 24th layer, that is, the last Transformer layer, is taken as the soft label data.

After the soft label data and the real label data are determined, the overall loss function, that is, the distillation loss function, needs to be determined. Here, determining the distillation loss function based on the real label data and the soft label data includes: determining a first loss function based on the soft label data; determining a second loss function based on the real label data; and performing a weighted summation of the first loss function and the second loss function to determine the distillation loss function.
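The weighted summation can be illustrated with a minimal sketch. The text does not specify the form of the two loss functions or the weights, so the following choices are assumptions borrowed from common knowledge distillation practice: both losses are cross-entropies, the soft label data is a two-dimensional vector of teacher scores softened with a temperature, and a single weight `alpha` balances the two terms. All function and parameter names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy between a target and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(target_probs, predicted_probs) if t > 0)


def distillation_loss(student_logits, teacher_logits, true_label,
                      alpha=0.5, temperature=2.0):
    """Weighted sum of a soft-label loss and a real-label loss (assumed forms)."""
    # First loss: student output compared with the teacher's soft label data.
    soft_targets = softmax(teacher_logits, temperature)
    soft_predictions = softmax(student_logits, temperature)
    first_loss = cross_entropy(soft_targets, soft_predictions)
    # Second loss: student output compared with the real label (class index).
    second_loss = -math.log(softmax(student_logits)[true_label])
    return alpha * first_loss + (1 - alpha) * second_loss
```

With `alpha` close to 1 the student mainly imitates the teacher, and with `alpha` close to 0 it mainly fits the real labels; the value actually used would have to be tuned and is not given in the text.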

Based on the determined distillation loss function, the model parameters of the initial student model are iteratively updated using the backpropagation algorithm until the initial student model converges, yielding the entity relationship identification model. The initial student model is a CNN model, which is a lightweight model.

Therefore, by adopting the knowledge distillation technology, the effect of the CNN model approaches that of the BERT model, which solves the problem of the slow inference speed of the BERT model. While keeping the same inference performance as a directly trained CNN model, the distilled model achieves a better effect than the directly trained CNN model.

Therefore, the medical item categories and the entity relationships to be identified under each category are defined based on prior knowledge in the medical field, and restricting identification to this specific range makes it simpler, so that the model can learn better. The BERT pre-trained language model is fine-tuned on the downstream entity relationship recognition task with the entity enhancement method, which improves the recognition effect of the BERT model. By adopting the knowledge distillation technology, the effect of the CNN model approaches that of the BERT model, which addresses both the slow inference speed of the BERT model and the low recognition accuracy of the CNN model.

Referring to fig. 4 and 5, fig. 4 is a schematic structural diagram of an apparatus for identifying medical entity relationships according to an embodiment of the present application, and fig. 5 is a second schematic structural diagram of an apparatus for identifying medical entity relationships according to an embodiment of the present application. As shown in fig. 4, the recognition apparatus 400 includes:

the acquisition module 410 is used for acquiring a target electronic medical record text;

the identification module 420 is configured to perform entity word identification on the target electronic medical record text through a predetermined entity word identification model, and identify medical entity words and entity types of each medical entity word included in the target electronic medical record text;

an adding module 430, configured to add an identity and an entity identification to each identified medical entity word; the identity identification comprises a character identity identification of each character in each medical entity word and a type identity identification of each medical entity word determined according to the entity type of each medical entity word, and the entity identity identification is used for determining whether the medical entity word is a word to be identified;

the generating module 440 is configured to sequentially arrange and combine all medical entity words added with the identity identifiers and the entity identification identifiers according to the reading order of the target electronic medical record text, so as to generate a multi-group phrase;

the first determining module 450 is configured to input the multi-component phrase into a pre-trained entity relationship recognition model, and determine a recognition result of the medical entity relationship of the target electronic medical record text; the entity relationship recognition model is a student model which is trained through a knowledge distillation technology and used for medical entity relationship recognition.

Optionally, as shown in fig. 5, the recognition apparatus 400 further includes a model building module 460, where the model building module 460 is configured to:

Optionally, when the model building module 460 is configured to determine the distillation loss function based on the real label data and the soft label data, the model building module 460 is configured to:

determining a first loss function based on the soft tag data;

determining a second loss function based on the genuine tag data;

Optionally, the identifying apparatus 400 further includes a second determining module 470, where the second determining module 470 is configured to:

acquiring a plurality of electronic medical record texts to be trained;

Optionally, when the model building module 460 is configured to perform fine tuning on the pre-training language model based on the text sample set, and determine the fine-tuned pre-training language model as the teacher model, the model building module 460 is configured to:

adding a corresponding type identifier for each medical entity word and adding a classification identifier for the starting end of each sample to be trained according to the entity type of each medical entity word in the sample to be trained aiming at each sample to be trained in the text sample set;

Referring to fig. 6, fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 6, the electronic device 600 includes a processor 610, a memory 620, and a bus 630.

The memory 620 stores machine-readable instructions executable by the processor 610, when the electronic device 600 runs, the processor 610 communicates with the memory 620 through the bus 630, and when the machine-readable instructions are executed by the processor 610, the steps in the method embodiments shown in fig. 1 to fig. 3 may be performed.

An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps in the method embodiments shown in fig. 1 to fig. 3 may be executed.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present application, and are used for illustrating the technical solutions of the present application, but not limiting the same, and the scope of the present application is not limited thereto, and although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope disclosed in the present application; such modifications, changes or substitutions do not depart from the spirit and scope of the exemplary embodiments of the present application, and are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for identifying medical entity relationships, the method comprising:

acquiring a target electronic medical record text;

inputting the multi-element phrases into a pre-trained entity relationship recognition model, and determining the recognition result of the medical entity relationship of the target electronic medical record text; the entity relationship recognition model is a student model which is trained to be used for medical entity relationship recognition through a knowledge distillation technology;

constructing the entity relationship identification model by:

2. The identification method of claim 1, wherein determining a distillation loss function based on the real tag data and the soft tag data comprises:

determining a first loss function based on the soft tag data;

determining a second loss function based on the genuine tag data;

3. The recognition method of claim 1, wherein the set of text samples is determined by:

acquiring a plurality of electronic medical record texts to be trained;

4. The identification method according to claim 3, wherein the entity type combination rule is constructed according to a preset medical item; wherein the medical item specifies the entity type of the medical entity word required to be included when having the medical entity relationship.

5. The recognition method of claim 1, wherein the fine-tuning the pre-trained language model based on the text sample set, and determining the fine-tuned pre-trained language model as a teacher model comprises:

6. The method of claim 4, wherein the medical items include symptoms, medications, surgery, scoring sheets, tests, and examinations.

7. An apparatus for identifying medical entity relationships, the apparatus comprising:

the adding module is used for adding an identity identifier and an entity identifier to each identified medical entity word; the identity comprises a character identity of each character in each medical entity word and a type identity of each medical entity word determined according to the entity type of each medical entity word, wherein the entity identity is used for determining whether the medical entity word is a word needing to be identified;

the generating module is used for sequentially arranging and combining all medical entity words added with the identity marks and the entity identification marks according to the reading sequence of the target electronic medical record text to generate a multi-component phrase;

the first determining module is used for inputting the multi-element phrases into a pre-trained entity relationship recognition model and determining the recognition result of the medical entity relationship of the target electronic medical record text; the entity relationship recognition model is a student model which is trained to be used for medical entity relationship recognition through a knowledge distillation technology;

the recognition apparatus further comprises a model building module configured to:

inputting the text sample set into a teacher model and an initial student model respectively to obtain soft label data output by the teacher model; the initial student model is a CNN model;

determining a distillation loss function based on the real label data and the soft label data;

8. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operated, the machine-readable instructions being executable by the processor to perform the steps of the identification method according to any one of claims 1 to 6.

9. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, performs the steps of the identification method according to one of claims 1 to 6.