CN117708338B - Extraction method and model for Chinese electronic medical record entity identification and four-diagnosis classification

Extraction method and model for Chinese electronic medical record entity identification and four-diagnosis classification

Info

Publication number
CN117708338B
CN117708338B
Authority
CN
China
Prior art keywords
character
entity
electronic medical
layer
diagnosis classification
Prior art date
Legal status
Active
Application number
CN202410162773.3A
Other languages
Chinese (zh)
Other versions
CN117708338A (en)
Inventor
许强
曾小茼
刘微微
赵智慧
李炜弘
温川飙
高原
Current Assignee
Chengdu University of Traditional Chinese Medicine
Original Assignee
Chengdu University of Traditional Chinese Medicine
Priority date
Filing date
Publication date
Application filed by Chengdu University of Traditional Chinese Medicine
Priority to CN202410162773.3A
Publication of CN117708338A
Application granted
Publication of CN117708338B
Legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses an extraction method and model for Chinese electronic medical record entity identification and four-diagnosis classification. The method provided by the invention comprises at least the following steps: performing word segmentation on traditional Chinese medicine clinical electronic medical record data to obtain training data; inputting the training data into the extraction model for training; and inputting traditional Chinese medicine clinical electronic medical record data into the trained extraction model to obtain the entity identification result and the four-diagnosis classification result of the electronic medical record. The extraction model is a neural network model based on a cyclic interaction network configured with a self-attention mechanism, and comprises at least a mixed embedding layer that maps the word segmentation result of the electronic medical record data into character embedding feature vectors and contextual convolution feature vectors and mixes them to generate a vector matrix. The invention jointly performs the Chinese electronic medical record entity identification task and the four-diagnosis classification task, adopts character-level convolution feature extraction, and integrates a self-attention mechanism, effectively avoiding errors in the exact Chinese character matching task.

Description

Extraction method and model for Chinese electronic medical record entity identification and four-diagnosis classification
Technical Field
The invention relates to the technical field of medicine and information technology, in particular to an extraction method and model for Chinese electronic medical record entity identification and four-diagnosis classification based on a character-level convolution cyclic activation network, and belongs to the field of electric digital data processing.
Background
With the rapid development of medical information technology, electronic medical records have become an important research object of modern medicine, and the analysis and mining of traditional Chinese medicine electronic medical records has become a hotspot of academic research. Electronic medical records document the processes of patient admission, diagnosis and discharge, and contain rich medical knowledge such as specific symptoms and signs. The development of intelligent medical technology requires research to be completed with the aid of computers. However, computers can only process structured data, so structured key information must be obtained from massive numbers of electronic medical records. Accurate named entity recognition and relation extraction for electronic medical records is therefore important.
In clinical practice, the four diagnostic methods of traditional Chinese medicine, namely inspection, listening and smelling, inquiry and palpation (looking, smelling, asking and cutting), each play a unique role and are interrelated; only when they are combined organically, that is, when the four diagnoses are used in combination, can the disease condition be understood fully and systematically and a correct diagnosis be made. Therefore, extracting the medical entities contained in the four-diagnosis information and the relations among them is the key to information extraction from traditional Chinese medicine electronic medical records.
At present, Chinese electronic medical record entity identification methods are mainly based on Long Short-Term Memory (LSTM) networks, which can learn contextual dependency relationships in natural language, such as the Gated Recurrent Unit (GRU), the Bi-directional Long Short-Term Memory (BiLSTM) model, the LSTM-Attention model, the LSTM-CNN model, the LSTM-CRF model, the LSTM-Attention-CRF model, the multi-task self-attention BiLSTM-CRF model, and the like.
However, traditional Chinese medicine electronic medical records have several inherent characteristics: descriptions of Chinese medical entities vary in length and follow no fixed rules, the medical entities include classical Chinese expressions, and the four-diagnosis relations are not explicit. As a result, the named entity recognition process is often combined with Chinese word segmentation and shallow parsing.
Joint extraction of entities and relations, as a fundamental and important task in information extraction, has also made great progress. For example, the invention patent application with publication number CN110444259A discloses a method for extracting entity relations from traditional Chinese medicine electronic medical records based on an entity relation labeling strategy. The technical scheme adopted by that patent application comprises the following steps: first, the traditional Chinese medicine electronic medical record is desensitized and preprocessed, converting the original structure of the medical record into a data structure that a computer can process and removing patient privacy and other information irrelevant to entity relation extraction; an entity relation labeling strategy is used for joint entity-relation labeling to obtain the entity relation corpus required for training, and since the obtained entity relations are not necessarily complete, they are supplemented by a crawler; the labeled entity relations are processed with a Bi-LSTM model, the labeled corpus is input to train the model, and the extracted entity relations are then output; finally, the disease entities in the obtained entity relations are used as seeds for the crawler to obtain complete entity relations. That technical scheme uses only word vectors and labels as model input for iterative training and does not consider the semantic features of the Chinese character context, so the output text may not fit the specific context and the possibility of entity relation matching errors increases.
Disclosure of Invention
The invention aims to overcome the problems of existing traditional Chinese medicine electronic medical record entity identification methods: they extract only entity types, without considering the four-diagnosis dimensions and the relation between the four diagnoses and entity type extraction in traditional Chinese medicine, which makes syndrome-differentiation-based treatment inconvenient after symptom entities are extracted; and they do not consider the semantic features of the Chinese character context, so the output text may not fit the specific context. To this end, the invention provides an extraction method and model for Chinese electronic medical record entity identification and four-diagnosis classification. The invention jointly performs the entity identification task and the four-diagnosis classification task of the Chinese electronic medical record and takes character-level convolution feature embedding as the key, thereby effectively combining the relevant semantic features in the Chinese electronic medical record, enhancing the representation capability of the extraction model, further improving the overall performance of entity identification and relation extraction, and improving the accuracy of the entity identification and four-diagnosis classification tasks.
In order to achieve the above object, the present invention provides the following technical solutions:
A method for extracting Chinese electronic medical record entity identification and four-diagnosis classification. The method at least comprises the following steps:
Word segmentation is carried out on the traditional Chinese medicine clinical electronic medical record data so as to obtain training data;
inputting the training data into an extraction model for training;
inputting the traditional Chinese medical clinical electronic medical record data into the trained extraction model to obtain an entity identification result and a four-diagnosis classification result of the electronic medical record.
Preferably, the extraction model is a neural network model based on a cyclic interaction network configured with a self-attention mechanism, comprising at least a hybrid embedding layer. Preferably, the hybrid embedding layer maps the word segmentation result of the electronic medical record data into two different dense embedding vectors, and mixes the two dense embedding vectors to generate a vector matrix.
Preferably, the mixed embedding layer maps word segmentation results of the electronic medical record data into character embedding and character-level convolution feature embedding.
Preferably, the extraction model further comprises an input layer for word segmentation and labeling of the Chinese electronic medical record text according to a classical bi-directional maximum matching algorithm and an external traditional Chinese medicine dictionary, and the input layer represents each character as a one-hot vector.
Preferably, the extraction model further comprises a task layer, and the task layer processes the vector matrix through a cyclic activation network to obtain a prediction set of entity identification and four-diagnosis classification.
Preferably, the extraction model further includes an output layer, and the output layer extracts a result with the highest probability in the prediction set as an entity identification result and a four-diagnosis classification result.
Preferably, the task layer includes a memory module for converting the vector matrix into shared features.
Preferably, the task layer further comprises an entity identification module and a four-diagnosis classification module. And the entity identification module and the four-diagnosis classification module respectively calculate a prediction set of entity identification and four-diagnosis classification according to the shared characteristics.
Preferably, the entity identification module and the four-diagnosis classification module are both provided with a self-attention mechanism.
The invention also provides an extraction model for entity identification and four-diagnosis classification of Chinese electronic medical records. The extraction model comprises at least: an input layer, a mixed embedding layer, a task layer, and an output layer. The input layer is used for word segmentation and labeling of traditional Chinese medicine clinical text. The mixed embedding layer maps the word segmentation result of the input layer into character embedding and character-level convolution feature embedding, and mixes them to generate a vector matrix. The task layer processes the vector matrix to output prediction sets of entity identification and four-diagnosis classification. The output layer extracts the result with the highest probability in the prediction set as the entity identification result and the four-diagnosis classification result. Preferably, the task layer comprises at least an entity identification module and a four-diagnosis classification module, each configured with a self-attention mechanism.
The invention also provides a construction method of the extraction model. The construction method is used for constructing the extraction model provided by the invention. Specifically, the construction method at least comprises the following steps:
preprocessing the traditional Chinese medicine clinical electronic medical record data to obtain training data of the extraction model;
Constructing the extraction model based on character-level convolution characteristics and a self-attention mechanism;
and carrying out iterative training on the extraction model by utilizing the training data.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention jointly performs the Chinese electronic medical record entity identification task and the four-diagnosis classification task, adopts Chinese character feature extraction augmented with Chinese character-level convolution feature extraction, and integrates a self-attention mechanism, effectively avoiding errors in the exact Chinese character matching task;
2. The invention adopts character-level convolution feature extraction to effectively combine related semantic features in the Chinese electronic medical record, and enhances the representation capability of the model, thereby further improving the overall performance of entity identification and relation extraction;
3. The invention provides a multi-task joint extraction method for entity identification and four-diagnosis classification based on the four-diagnosis dimensions of inspection, listening and smelling, inquiry and palpation in traditional Chinese medicine, in which the examined body part is the head entity and the symptom is the tail entity.
Drawings
FIG. 1 is a schematic diagram of an extraction model according to the present invention;
FIG. 2 is a schematic diagram of a data processing flow of a hybrid embedded layer according to the present invention;
FIG. 3 is a schematic diagram of a cyclic network according to the present invention;
FIG. 4 is a schematic diagram of an entity identification module according to the present invention;
FIG. 5 is a schematic diagram of a four-diagnosis classification module according to the present invention;
FIG. 6 is an example of text entity recognition and four-diagnosis classification for Chinese electronic medical records according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to test examples and specific embodiments. The scope of the above subject matter of the present invention should not be construed as being limited to the following embodiments; all techniques realized based on the content of the present invention fall within the scope of the present invention.
Example 1
The embodiment provides a method for extracting Chinese electronic medical record entity identification and four-diagnosis classification. The method at least comprises the following steps:
Word segmentation is carried out on the traditional Chinese medicine clinical electronic medical record data so as to obtain training data;
Inputting training data into the extraction model for training;
and inputting the data of the electronic medical record of the traditional Chinese medicine clinic into the trained extraction model to obtain the entity identification result and four-diagnosis classification result of the electronic medical record.
Preferably, the extraction model is a neural network model based on a cyclic interaction network configured with a self-attention mechanism, comprising at least a hybrid embedding layer. Preferably, the hybrid embedding layer maps the word segmentation result of the electronic medical record data into two different dense embedding vectors, and mixes the two dense embedding vectors to generate a vector matrix.
Preferably, the hybrid embedding layer maps the word segmentation result of the electronic medical record data into character embedding and character-level convolution feature embedding.
Preferably, for the classical Chinese expressions that are difficult to understand in traditional Chinese medicine clinical cases, the invention provides a convolution network fused with the character context to capture local features among Chinese character contexts, and integrates a self-attention mechanism, effectively avoiding errors in the exact Chinese character matching task. Preferably, the invention takes the four-diagnosis dimensions of inspection, listening and smelling, inquiry and palpation in traditional Chinese medicine as relations to support intelligent traditional Chinese medicine diagnosis and treatment.
Example 2
This embodiment is a further improvement on Embodiment 1, and repeated content is not described again. This embodiment provides an extraction method for Chinese electronic medical record entity identification and four-diagnosis classification.
Preferably, the extraction model further comprises an input layer for word segmentation and labeling of the Chinese electronic medical record text according to a classical bi-directional maximum matching algorithm and an external traditional Chinese medicine dictionary, and the input layer represents each character as a one-hot vector.
Preferably, the extraction model further comprises a task layer, wherein the task layer processes the vector matrix through a cyclic activation network to obtain a prediction set of entity identification and four-diagnosis classification.
Preferably, the extraction model further comprises an output layer, and the output layer extracts a result with the highest probability in the prediction set as an entity identification result and a four-diagnosis classification result.
Preferably, the task layer includes a memory module for converting the vector matrix into shared features.
Preferably, the task layer further comprises an entity identification module and a four-diagnosis classification module. The entity identification module and the four-diagnosis classification module respectively calculate a prediction set of entity identification and four-diagnosis classification according to the shared characteristics.
Preferably, the entity identification module and the four-diagnosis classification module are both provided with a self-attention mechanism.
Example 3
The embodiment provides an extraction model for entity identification and four-diagnosis classification of Chinese electronic medical records. Referring to fig. 1, the extraction model preferably comprises at least: an input layer, a mixed embedding layer, a task layer, and an output layer. The input layer is used for word segmentation and labeling of traditional Chinese medicine clinical text. The mixed embedding layer maps the word segmentation result of the input layer into character embedding feature vectors and contextual convolution feature vectors, and mixes them to generate a vector matrix. The task layer processes the vector matrix to output prediction sets of entity identification and four-diagnosis classification. The output layer extracts the result with the highest probability in the prediction set as the entity identification result and the four-diagnosis classification result. Preferably, the task layer comprises at least an entity identification module and a four-diagnosis classification module, each configured with a self-attention mechanism.
Preferably, the user can input the real clinical Chinese text sentence into the input layer of the extraction model, and the extraction model processes the Chinese text sentence and then outputs the entity identification result and the four-diagnosis classification result from the output layer.
Preferably, the input layer performs word segmentation labeling on the text of the electronic medical record to obtain a word segmentation result X = [x1, x2, x3, …, xn].
Preferably, the input layer performs word segmentation labeling on the text of the electronic medical record according to a classical bi-directional maximum matching algorithm and an external traditional Chinese medicine dictionary D, so as to obtain the word segmentation result X = [x1, x2, x3, …, xn].
Preferably, the external traditional Chinese medicine dictionary D can be constructed from a medical glossary (ICD-11), the ancient traditional Chinese medicine classic Treatise on Cold Damage (Shanghan Lun), and a clinical entity medical dictionary related to the four diagnostic methods of traditional Chinese medicine.
Preferably, in the word segmentation labeling, for a traditional Chinese medicine clinical sentence S = [c1, c2, c3, …, cn] of the Chinese electronic medical record, the input layer represents each character ct as a one-hot vector xt whose dimension equals the vocabulary size of the extraction model's training data. Preferably, the sentence after word segmentation labeling is expressed as X = [x1, x2, x3, …, xn].
Preferably, the character ct is a Chinese character in a traditional Chinese medicine clinical sentence; the one-hot vector xt encodes N states with N bits of 0 or 1, where each state has its own representation and exactly one bit is 1 while the other bits are 0. Preferably, the dimension of the one-hot vector xt refers to the number of 0/1 bits in the vector. Preferably, the training data refers to the data used for extraction model training, consisting of clinical text from Chinese electronic medical records, and the vocabulary size of the training data refers to the number of distinct characters contained in the training data.
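To make the bi-directional maximum matching segmentation step concrete, a minimal Python sketch is given below. It is illustrative only: the dictionary format, function names, maximum word length and tie-breaking rule are assumptions and are not prescribed by the patent.

```python
def forward_max_match(sentence, dictionary, max_len=6):
    """Greedy left-to-right longest match against the external dictionary D."""
    tokens, i = [], 0
    while i < len(sentence):
        for j in range(min(len(sentence), i + max_len), i, -1):
            if sentence[i:j] in dictionary or j == i + 1:
                tokens.append(sentence[i:j])
                i = j
                break
    return tokens


def backward_max_match(sentence, dictionary, max_len=6):
    """Greedy right-to-left longest match against the external dictionary D."""
    tokens, j = [], len(sentence)
    while j > 0:
        for i in range(max(0, j - max_len), j):
            if sentence[i:j] in dictionary or i == j - 1:
                tokens.insert(0, sentence[i:j])
                j = i
                break
    return tokens


def bidirectional_max_match(sentence, dictionary):
    """Combine both passes: prefer the result with fewer tokens,
    then the one with fewer single-character tokens (a common heuristic)."""
    fwd = forward_max_match(sentence, dictionary)
    bwd = backward_max_match(sentence, dictionary)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    singles = lambda toks: sum(len(t) == 1 for t in toks)
    return fwd if singles(fwd) <= singles(bwd) else bwd
```

In this sketch, calling bidirectional_max_match(S, D) on a clinical sentence S and dictionary D would return the token list used as the segmentation result X.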
Preferably, the input layer inputs the word segmentation result to the hybrid embedding layer. Preferably, the hybrid embedding layer maps the word segmentation result into two different dense embedding vectors and mixes them to generate a vector matrix. Preferably, a dense embedding vector refers to a low-dimensional dense vector generated by embedding a high-dimensional sparse one-hot vector into a low-dimensional space.
Referring to fig. 1 and 2, the hybrid embedding layer preferably maps the word segmentation result X = [x1, x2, x3, …, xn] into character embedding Vc = [vc1, vc2, vc3, …, vcn] and character-level convolution feature embedding Vf = [vf1, vf2, vf3, …, vfn]. Preferably, the processing of the word segmentation result by the hybrid embedding layer may include:
Mapping the word segmentation result X = [x1, x2, x3, …, xn] into character embedding, i.e. embedding each character into a semantic vector. Preferably, character embedding refers to matching each character xt in the word segmentation result against the characters in a lookup table. Preferably, the semantic vector is a dense vector representation of a character in the clinical text of the Chinese electronic medical record.
Mapping the word segmentation result X = [x1, x2, x3, …, xn] into character-level convolution feature embedding, i.e. setting a sliding window of a specific size that traverses each character of the input text from left to right, extracting local context features within each window with an adaptive convolution filter, and concatenating the extracted features, so that the context of each character has its own convolution feature representation. Preferably, character-level convolution feature embedding means that the context features of each character are extracted together with the character features and then concatenated to generate a convolution feature representation of each character.
The two dense embedded vectors are concatenated to generate a vector matrix.
Preferably, mapping the word segmentation result X = [x1, x2, x3, …, xn] into character embedding may comprise:
Building a Chinese pre-trained character embedding vector lookup table with the word2vec tool from dozens of datasets trained on corpora in various fields (Baidu Encyclopedia, Wikipedia, People's Daily 1947-2017, Zhihu, microblogs, literature, finance, ancient Chinese, and the like); the lookup table contains 1,291,384 characters and the numeric vectors to which they belong. Preferably, a character and its numeric vector refer to a character together with the vector representing it, where the characters include not only Chinese characters but also English characters, punctuation marks, and the like; as a simplified example, the character "call" might be denoted by the vector "010".
Matching the sentence X = [x1, x2, …, xn] against the lookup table; if a match is found, the numeric vector corresponding to that character in the lookup table is assigned to the character in the word segmentation result, converting it into the corresponding character embedding. Otherwise, a random vector is assigned. Preferably, the random vector refers to a numeric vector not included in the lookup table.
In this way, the input sequence X = [x1, x2, x3, …, xn] is mapped to the character embedding vector Vc = [vc1, vc2, vc3, …, vcn].
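As an illustration of this lookup-with-fallback step, a small Python/NumPy sketch follows; the table format, the 300-dimensional vector size and the random initialisation scale are assumptions rather than values fixed by the patent.

```python
import numpy as np

def build_char_embeddings(characters, lookup_table, dim=300, seed=0):
    """Map each character to its pretrained word2vec vector; characters absent
    from the lookup table receive a random vector instead."""
    rng = np.random.default_rng(seed)
    vectors = []
    for ch in characters:
        vec = lookup_table.get(ch)            # lookup_table: {character: np.ndarray of shape (dim,)}
        if vec is None:
            vec = rng.normal(scale=0.1, size=dim)   # fallback for out-of-table characters
        vectors.append(vec)
    return np.stack(vectors)                  # Vc: shape (n, dim), one row per character
```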
Preferably, mapping the word segmentation result X = [x1, x2, x3, …, xn] into character-level convolution feature embedding may comprise:
Designing and constructing a sliding window to be applied to convolution operation;
The sliding window contains p characters: each character in the sentence is taken as the center of a sliding window, and k characters are taken before and after it, so that p = k × 2 + 1. For the first k characters and the last k characters at the sentence boundary, positions where no character exists before or after the boundary are filled with a <pad> identifier.
The sliding windows corresponding to the input sequence X = [x1, x2, x3, …, xn] may be expressed as Win = {win1, win2, …, winn}. Then, assuming k = 2, the sliding window win1 centered on x1 may be expressed as
win1 = {<pad>, <pad>, x1, x2, x3}
A convolution filter is utilized to generate a convolution feature.
Preferably, the sliding window size is p, and the convolution filter fn has a convolution window size q, where q < p.
The convolution filter fn is convolved over the sliding window winn centered on the nth character to produce a convolution feature convn,
convn = winn * fn
Preferably, a convolution feature map is generated based on the convolution filters. Preferably, assuming a total of m convolution filters are applied to the window winn, one convolution feature map CONVn is generated per window,
CONVn = [convn^1, convn^2, …, convn^m]
where CONVn has a data size of m × (p − q + 1).
A max-over-time pooling operation is then performed on the convolution feature maps. Preferably, for each convolution feature map the feature with the highest score, i.e. the highest value, is selected as the value retained by the pooling layer, preserving the position and rotation invariance of the feature as the input value of the next layer; the maximum vfn = max{CONVn} is taken as the feature corresponding to the specific filter. The pooled filter outputs are fully concatenated to obtain the character-level convolution feature embedding sequence Vf = [vf1, vf2, vf3, …, vfn].
Preferably, the embedding layer fully concatenates the character embedding Vc = [vc1, vc2, vc3, …, vcn] with the character-level convolution feature embedding sequence Vf = [vf1, vf2, vf3, …, vfn] to generate the output vector matrix E = [e1, e2, …, en].
Preferably, the embedding layer outputs the vector matrix E = [e1, e2, …, en], where each en is jointly composed of a character embedding vcn and its corresponding character-level convolution feature embedding vfn, and en is expressed as:
en = vcn ⊕ vfn
where E is the vector matrix, ⊕ represents the concatenation operator, dk = d + m, d represents the dimension of the character embedding vector, m represents the number of filters, and E has size n × dk.
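The following PyTorch sketch shows one way such a hybrid embedding layer could be assembled: per-character windows of p = 2k + 1 characters, m convolution filters of width q, max-over-time pooling, and concatenation with the character embedding. The hyper-parameter values and class name are illustrative assumptions, not the patent's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridEmbedding(nn.Module):
    """Mixed embedding layer: character embeddings concatenated with
    per-character window CNN features (en = vcn ⊕ vfn)."""

    def __init__(self, vocab_size, d=300, m=50, k=2, q=3, pad_idx=0):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, d, padding_idx=pad_idx)
        self.k = k                                  # k characters on each side, p = 2k + 1
        self.conv = nn.Conv1d(d, m, kernel_size=q)  # m filters of width q < p

    def forward(self, char_ids):                    # char_ids: (batch, n)
        vc = self.char_emb(char_ids)                # character embedding Vc: (batch, n, d)
        # Pad both sentence ends with zero (<pad>) vectors, then gather a window
        # of p = 2k + 1 character vectors centred on every position.
        padded = F.pad(vc, (0, 0, self.k, self.k))             # (batch, n + 2k, d)
        windows = padded.unfold(1, 2 * self.k + 1, 1)           # (batch, n, d, p)
        b, n, d, p = windows.shape
        conv_maps = torch.relu(self.conv(windows.reshape(b * n, d, p)))  # (b*n, m, p - q + 1)
        vf = conv_maps.max(dim=-1).values.view(b, n, -1)        # max-over-time pooling -> Vf: (batch, n, m)
        return torch.cat([vc, vf], dim=-1)                      # E: (batch, n, d + m)
```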
Preferably, the task layer processes the vector matrix E through a cyclic activation network to obtain a prediction set of entity identification and a prediction set of four-diagnosis classification.
Preferably, the task layer may include a memory module, an entity identification module, and a four-diagnosis classification module.
Preferably, the memory module may be configured as a bidirectional long short-term memory network (BiLSTM).
Preferably, the cyclic activation network refers to a network model in which task activation in the task layer is driven by loop feedback information built on the bidirectional long short-term memory network BiLSTM. Preferably, the task layer builds a cyclic network based on the entity identification module and the four-diagnosis classification module. Preferably, the entity identification module is used for identifying and extracting entities in the text, i.e. the entity identification task; the four-diagnosis classification module is used for classifying the relations of the extracted entities, i.e. the four-diagnosis classification task.
Preferably, the vector matrix E output by the hybrid embedding layer is input to the memory module, and the memory module processes the vector matrix E and outputs the shared feature H.
Preferably, the processing of the vector matrix E by the memory module may include: the memory module receives the input vector matrix E from the mixed embedding layer; two LSTMs simultaneously process the input vectors in a forward time step and a reverse time step respectively, taking into account the context information of each character in the sentence, and return a sequence vector h for each time step. Preferably, the forward time step refers to processing the input vectors in front-to-back order, and the reverse time step refers to processing them in back-to-front order. Preferably, the sequence vector refers to the vector generated by processing in the order of the time steps. The sequence vectors h of all time steps are fully connected to obtain the final BiLSTM output, i.e. the shared feature H = {h1, h2, …, hn}, and the shared feature H is input into the cyclic network. Preferably, the cyclic network processes the shared feature H to output the prediction set of entity identification and the prediction set of four-diagnosis classification.
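A minimal PyTorch sketch of such a memory module follows; the hidden size is an illustrative assumption.

```python
import torch.nn as nn

class MemoryModule(nn.Module):
    """BiLSTM over the mixed embedding matrix E; forward and backward passes
    are concatenated per time step to form the shared feature H."""

    def __init__(self, input_dim, hidden_dim=128):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, E):          # E: (batch, n, d + m)
        H, _ = self.bilstm(E)      # H: (batch, n, 2 * hidden_dim)
        return H
```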
Referring to fig. 3, the cyclic network preferably adopts a GRU cyclic network. The GRU cyclic network has o working layers, and each working layer is an interaction layer.
Preferably, each working layer in the GRU cyclic network comprises an entity identification module Ce and a four-diagnosis classification module Cr, and two adjacent working layers are connected by two gated recurrent units (Gated Recurrent Unit, GRU). Preferably, the two gated recurrent units between two adjacent working layers are the entity identification gate control unit GRUe and the four-diagnosis classification gate control unit GRUr, respectively. Preferably, the entity identification gate control unit GRUe connects the entity identification modules Ce of two adjacent working layers, and the four-diagnosis classification gate control unit GRUr connects the four-diagnosis classification modules Cr of two adjacent working layers. Preferably, the output results of the entity identification module Ce and the four-diagnosis classification module Cr of the last working layer of the GRU cyclic network are the output results of the task layer, namely the prediction set of entity identification and the prediction set of four-diagnosis classification.
Preferably, the input and output process of the task layer may be as follows: first, the embedded vector matrix E is input into the cyclic activation network to obtain the sequence vectors h, and the sequence vectors h of all time steps are fully connected to obtain the shared feature H; the shared feature H is passed to the entity identification module Ce and the four-diagnosis classification module Cr respectively to obtain the output feature representations of the two modules; at the same time, based on explicit interaction between the entity identification gate control unit GRUe and the four-diagnosis classification gate control unit GRUr, the shared feature is continuously fed back to obtain the entity identification task feature vector and the four-diagnosis classification task feature vector; finally, the shared feature is passed to the entity identification module Ce and the four-diagnosis classification module Cr of the last layer to obtain the prediction set of entity identification and the prediction set of four-diagnosis classification.
Preferably, explicit interaction based on learning by the GRU network requires that the network is provided with at least two layers, i.e. o=2, 3, …. Preferably, the entity recognition module Ce and the four-diagnosis classification module Cr are both provided with a self-attention mechanism.
Referring to fig. 3, the shared feature H^1 is preferably fed to the entity identification module Ce and the four-diagnosis classification module Cr, respectively, to generate the output feature representations Ye^1 and Yr^1. Preferably, the shared feature H^1 = H.
The output feature representations Ye^1 and Yr^1 are passed, together with H^1, to the entity identification gate control unit GRUe and the four-diagnosis classification gate control unit GRUr of the working layer; through the feedback and learning of GRUe and GRUr, the respective task feature vectors, namely the entity identification task feature vector He^2 and the four-diagnosis classification task feature vector Hr^2, are generated. He^2 and Hr^2 are passed to the entity identification module Ce and the four-diagnosis classification module Cr, respectively, to generate the next output feature representations, and at the same time He^2, Hr^2 and H^1 are all connected to form H^2, which continues to be passed downward.
Preferably, the last output feature representations Ye^(o-1) and Yr^(o-1) are delivered, together with the shared feature H^(o-1), to the working-layer units GRUe and GRUr, generating the task-specific feature vectors He^o and Hr^o of the o-th working layer through the feedback and learning of GRUe and GRUr.
Preferably, the shared feature refers to the common feature vector of the entity identification task and the four-diagnosis classification task in the task layer. Preferably, the output feature representation refers to the output vector generated after the shared feature H is input to the entity identification module Ce and the four-diagnosis classification module Cr, respectively. Preferably, the task feature vector refers to the output vector generated by the entity identification gate control unit GRUe and the four-diagnosis classification gate control unit GRUr from the output feature representation and the shared feature. Preferably, the task-specific feature of the o-th layer refers to the task-specific feature vector generated at GRUe and GRUr of the o-th working layer of the cyclic activation network.
Preferably, the processing of the shared feature H by the cyclic network may include the following steps:
Step 1: the entity identification module Ce and the four-diagnosis classification module Cr are located within the overall cyclic network, have the properties of loop feedback and information learning, and are in a serial state, so information can be shared between them. For the entity identification task and the four-diagnosis classification task, the shared feature H is passed to the entity identification module Ce and the four-diagnosis classification module Cr respectively to generate the output feature representations Ye and Yr.
Preferably, step 1 may include:
Step 1.1: the entity identification gate control unit GRUe and the four-diagnosis classification gate control unit GRUr of each working layer of the GRU network generate the task-specific feature vectors He and Hr based on the shared feature H = {h1, h2, …, hn}; the task-specific features of the o-th working layer of the GRU network are He^o and Hr^o, and the specific formulas are:
Hr^o = GRUr(Yr^(o-1), H^(o-1); θGRUr)
He^o = GRUe(Ye^(o-1), H^(o-1); θGRUe)
where θGRUr and θGRUe are the network parameters of GRUr and GRUe, respectively.
Step 1.2: the shared feature H o of the o-layer work layer of the GRU network is the sum of the o-layer task-specific features H o e and H o r and the previous shared feature H o-1, expressed as:
Ho=Hoe+Hor+Ho-1
Step 1.3: explicit interaction based on learning in a GRU network requires that the network is allowed to have at least two layers, i.e. o=2, 3, …. The output formulas of the output characteristic representations Y o e and Y o r of the entity identification module Ce and the four diagnosis classification module Cr of the o layer are as follows:
Yoe=Ce(Ho)
Yor=Cr(Ho)
Yoe=:{(yoe)n|hoen∈Ho}
Yor=:{yor(1,n)|hor1,horn∈Ho}
H o en is a feature vector of any specific entity identification task in the o layer of the GRU network; (y o e) n is the output characteristic representation obtained by the entity identification module Ce of the o layer of the GRU network according to h o en, and the output characteristic representation represents the probability distribution of characters in the entity identification task represented by h o en in the o layer of the GRU network; h o r1 and h o rn are feature vectors of any two four-diagnosis classification tasks in the o-th layer of the GRU network; y o r (1, n) is the output characteristic representation obtained by the four-diagnosis classification module Cr of the o-th layer of the GRU network according to h o r1 and h o rn, and represents the probability distribution of character pairs formed by characters in two four-diagnosis classification tasks represented by h o r1 and h o rn in the o-th layer of the GRU network.
Thus, the entity recognition module output feature representation Ye and the four-diagnosis classification module output feature representation Yr generated by all the working layers can be expressed as:
Ye=Ce(H)
Yr=Cr(H)
Step 2: the output feature representations Ye and Yr are supplied as inputs to the entity recognition gate control unit GRUe and the four-diagnosis classification gate control unit GRUr together with the shared feature H to generate an entity recognition task feature vector He and a four-diagnosis classification task feature vector Hr.
Preferably, step 2 may include:
Step 2.1: GRUe takes the output feature representation y ∈ Ye and the shared feature H as input, and calculates the entity identification task feature vector hen ∈ He, which can be expressed as:
z = δ(Wz(H ⊕ y))
u = δ(Wu(H ⊕ y))
hen = (1 − z) * H + z * tanh(Wo((u * H) ⊕ y))
where ⊕ is the concatenation operator, Wz, Wu and Wo are learnable parameters of the GRU network, δ(·) is the sigmoid activation function, and tanh(·) is the hyperbolic tangent activation function.
Step 2.2: GRUr uses the same processing as GRUe, taking the output feature representation y ∈ Yr and the shared feature H as input, and calculates the four-diagnosis classification task feature vector hrn ∈ Hr.
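The gated units GRUe and GRUr described in steps 2.1 and 2.2 can be sketched as below. The grouping of the tanh(·) term, the per-position treatment of y, and all dimension choices are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TaskGate(nn.Module):
    """Gate linking adjacent interaction layers:
    z = sigmoid(Wz[H ⊕ y]), u = sigmoid(Wu[H ⊕ y]),
    h = (1 - z) * H + z * tanh(Wo[(u * H) ⊕ y])."""

    def __init__(self, h_dim, y_dim):
        super().__init__()
        self.Wz = nn.Linear(h_dim + y_dim, h_dim)
        self.Wu = nn.Linear(h_dim + y_dim, h_dim)
        self.Wo = nn.Linear(h_dim + y_dim, h_dim)

    def forward(self, H, y):                       # H: (batch, n, h_dim), y: (batch, n, y_dim)
        hy = torch.cat([H, y], dim=-1)
        z = torch.sigmoid(self.Wz(hy))             # update gate
        u = torch.sigmoid(self.Wu(hy))             # reset-style gate applied to H
        cand = torch.tanh(self.Wo(torch.cat([u * H, y], dim=-1)))
        return (1 - z) * H + z * cand              # task feature vector He (or Hr)
```

In this reading, the same module would be instantiated twice per pair of adjacent working layers, once as GRUe and once as GRUr.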
Step 3: the entity recognition module Ce performs key feature extraction by using a self-attention mechanism, and calculates probability distribution yn of characters cn in the entity recognition task.
Referring to fig. 4, preferably, the processing steps of the entity recognition module Ce may include:
Step 3.1: the specific entity recognition task feature vector hen is taken as input and a weight vector aen is output, which can be expressed by the formula:
aen=softmax(We2tanh(We1hTen))
Where We1, we2 are trainable vectors. softmax () is an activation function for a multi-class classification problem, which ensures that all weights add to 1, tanh () is an activation function of hyperbolic tangent, h T en denotes vector transposition of hen.
Step 3.2: according to the weight vector aen, the feature vector Men after the key feature extraction is obtained,
Men=aen*hen
Step 3.3: after being extracted by the self-attention mechanism, the finally generated global feature vector of the entity recognition task can be expressed as Me,
Me={Me1,Me1,…,Men}
Step 3.4: based on the entity recognition task global feature vector Me, calculating probability distribution yn of the character cn, wherein the specific formula is as follows:
yn=softmax(WeMe+be)
where We, be are trainable vector weights.
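A compact PyTorch sketch of the self-attention entity recognition module Ce (steps 3.1 to 3.4) is shown below; the attention size, the number of labels and the dimension over which softmax normalises are assumptions.

```python
import torch
import torch.nn as nn

class EntityRecognitionModule(nn.Module):
    """Self-attention entity recogniser Ce:
    a = softmax(We2 · tanh(We1 · hᵀ)), M = a * h, y = softmax(We · M + be)."""

    def __init__(self, h_dim, attn_dim=64, num_labels=7):
        super().__init__()
        self.We1 = nn.Linear(h_dim, attn_dim, bias=False)
        self.We2 = nn.Linear(attn_dim, 1, bias=False)
        self.out = nn.Linear(h_dim, num_labels)             # We, be

    def forward(self, He):                                   # He: (batch, n, h_dim)
        scores = self.We2(torch.tanh(self.We1(He)))          # (batch, n, 1)
        a = torch.softmax(scores, dim=1)                     # attention weights over the sentence
        Me = a * He                                          # key-feature vectors Men
        return torch.softmax(self.out(Me), dim=-1)           # yn: label probability per character
```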
Step 4: the four-diagnosis classification task can be interpreted as a binary classification problem, by classifying the character pairs (c 1, cn) to determine the values of the relation triples < c1, t, cn >, wherein t refers to the relation among entities, namely, four relations of 'look, smell, ask and cut', and the task can be regarded as learning the probability distribution y (1, n) of each character pair (c 1, cn). The four-diagnosis classification module Cr adopts a self-attention mechanism to extract key features, and calculates probability distribution y (1, n) of character pairs (c 1, cn) in the entity recognition task. The character pair (c 1, cn) refers to a character pair composed of any two characters in the sentence s= [ c1, c2, c3, …, cn ].
Referring to fig. 5, preferably, the processing steps of the four-diagnosis classification module Cr may include:
Step 4.1: the four-diagnosis classification module Cr adopts a self-attention mechanism to carry out attention weight distribution on the task feature vector hrn according to the features of Hr, and outputs a weight vector arn.
Step 4.2: calculating the feature vector q (1, n) of the character pair (c 1, cn):
q(1,n)=φ(Wq(hr1⊕hrn))
Wherein hr1 is a four-diagnosis classification task feature vector of character c1, hrn is a four-diagnosis classification task feature vector of character cn, φ () is a ReLU (RECTIFIED LINEAR Unit) activation function, and # -is a join operator representing a join operation, wq is a trainable parameter.
Step 4.3: the self-attention weight value of the character pair (c 1, cn) is obtained by adopting a self-attention mechanism, the characteristic vector q (1, n) of the character pair (c 1, cn) is taken as input, and the weight vector ar (1, n) is output, and the self-attention weight value can be expressed as follows by a formula:
ar(1,n)=softmax(Wr2tanh(Wr1qT(1,n)))
Where Wr1, wr2 is a trainable vector, and q T (1, n) represents the vector transpose of q (1, n).
Step 4.4: obtaining a feature vector Mr (1, n) after key feature extraction according to the weight vector ar (1, n),
Mr(1,n)=ar(1,n)*hr(1,n)
Step 4.5: calculating the probability distribution y (1, n) of the character pairs (c 1, cn),
y(1,n)=δ(Wr3*Mr(1,n)+br)
Where Wr3 is a trainable vector, br is a bias coefficient, mr (1, n) is a global feature vector of the character pair (c 1, cn) processed by the self-attention mechanism, and δ () is a sigmoid activation function.
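Analogously, a sketch of the four-diagnosis classification module Cr (steps 4.1 to 4.5) follows; it assumes the attended vector is the pair feature q(1, n) and that the four relation types are scored jointly with a sigmoid, which are interpretation choices rather than details stated verbatim in the patent.

```python
import torch
import torch.nn as nn

class FourDiagnosisModule(nn.Module):
    """Pairwise relation classifier Cr over (head, tail) character pairs."""

    def __init__(self, h_dim, q_dim=128, attn_dim=64, num_relations=4):
        super().__init__()
        self.Wq = nn.Linear(2 * h_dim, q_dim)
        self.Wr1 = nn.Linear(q_dim, attn_dim, bias=False)
        self.Wr2 = nn.Linear(attn_dim, 1, bias=False)
        self.Wr3 = nn.Linear(q_dim, num_relations)     # one score per look/listen-smell/ask/palpate relation

    def forward(self, hr_head, hr_tail):               # each: (batch, pairs, h_dim)
        q = torch.relu(self.Wq(torch.cat([hr_head, hr_tail], dim=-1)))   # q(1, n)
        a = torch.softmax(self.Wr2(torch.tanh(self.Wr1(q))), dim=1)      # ar(1, n)
        Mr = a * q                                                         # key features of the pair
        return torch.sigmoid(self.Wr3(Mr))                                # y(1, n): relation probabilities
```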
Preferably, the task layer generates probability distributions yn for a number of characters cn and probability distributions y(1, n) for a number of character pairs (c1, cn), thereby obtaining the prediction set Yn of the entity recognition task and the prediction set Y(1, n) of the four-diagnosis classification task. Preferably, a probability distribution yn ∈ Yn is an output result of the entity recognition task, and a probability distribution y(1, n) ∈ Y(1, n) is an output result of the four-diagnosis classification task.
Preferably, the entity recognition module Ce of the last working layer of the GRU cyclic network outputs the probability distributions yn of the characters cn, forming the prediction set Yn of the entity recognition task, and the four-diagnosis classification module Cr of the last (o-th) working layer outputs the probability distributions y(1, n) of the character pairs (c1, cn), forming the prediction set Y(1, n) of the four-diagnosis classification task.
Preferably, the output layer screens out the result with the highest probability from the prediction set Yn of the entity recognition task as the entity recognition result, screens out the result with the highest probability from the prediction set Y(1, n) of the four-diagnosis classification task as the four-diagnosis classification result, and generates the structured electronic medical record.
Preferably, the output layer screens out the result with the highest probability in each prediction set as the entity recognition result and the four-diagnosis classification result, respectively. Referring to fig. 6, the output layer preferably outputs structured electronic medical record text sentences, such as "spirit present, breathing even, tongue dark, coating white, pulse thin".
Preferably, the clinical Chinese text sentences "the patient is alert, of moderate build, with a clear and strong voice, even breathing, a dark tongue with white coating, and a wiry pulse" and "the patient is alert, with pain in the lower abdomen, a pale red tongue with thin white coating, and a thin, slow pulse" are input into the extraction model provided in this embodiment and compared with the baseline model RIN (Recurrent Interaction Network), where the baseline model RIN is not configured with character convolution feature embedding or a self-attention mechanism.
Preferably, the comparison experiment step comprises:
step 1: the true clinical Chinese text sentences of the same set are input into the input layers of the extraction model and the RIN model, and word segmentation labeling is performed on the sentences of the clinical Chinese texts of the Chinese medicine at the Chinese character layer.
Step 2: the input layer inputs the data after word segmentation and labeling to the embedding layer, maps the data into an embedding vector, and generates an output vector matrix.
Step 3: the embedded layer output vector matrix E is input to the BiLSTM layers and the shared feature H is output.
Step 4: and performing entity extraction and four-diagnosis classification task processing by using the shared characteristic H, and outputting entity identification and four-diagnosis classification prediction sets.
Step 5: and extracting a result with the highest probability in the prediction set as a text output result, and outputting the structured electronic medical record data.
Preferably, the results of the comparative experiments are shown in Table 1.
TABLE 1
In Table 1, Sentence is the clinical traditional Chinese medicine text sentence, Golden Result is the correct output result, and Result is the output result of the RIN model and of the extraction model (CCRIN) provided in this embodiment.
According to the results of the comparison experiment, compared with prior art such as the RIN model, the extraction model provided by this embodiment can effectively extract the Chinese characters of the Chinese electronic medical record and their context through character-level convolution feature extraction, which benefits key feature identification in both the entity identification task and the four-diagnosis relation classification task. In addition, the extraction model provided by this embodiment introduces a self-attention mechanism into the entity identification module and the four-diagnosis classification module respectively, which effectively strengthens the association between the task feature vectors captured by the cyclic network and the final tasks, supports mutual gain between symptom entities and four-diagnosis feature relations, and realizes efficient and accurate entity identification and four-diagnosis classification for Chinese electronic medical records.
Example 4
The present embodiment provides a method for constructing an extraction model, which is used for constructing the extraction model related in embodiment 3. The construction method can comprise the following steps:
Preprocessing the traditional Chinese medicine clinical electronic medical record data to obtain training data of an extraction model;
constructing an extraction model based on the character-level convolution characteristics and the self-attention mechanism;
And carrying out iterative training on the extraction model by using training data.
Preferably, the preprocessing of the electronic medical record data in the medical clinic according to the embodiment may include:
The external traditional Chinese medicine dictionary D is constructed from a medical glossary (ICD-11), the ancient traditional Chinese medicine classic Treatise on Cold Damage (Shanghan Lun), and a clinical entity medical dictionary related to the four diagnostic methods of traditional Chinese medicine.
Given a Chinese medicine clinical text sentence S, a classical bidirectional maximum matching algorithm is used for word segmentation and labeling of the sentence S according to an external Chinese medicine dictionary D.
After automatic labeling with this algorithm, the labeling results are verified and corrected by personnel with a medical background. The symptom dimension and the symptom body part in the text are standardized according to GB/T 20348-2006 (basic theory nomenclature of traditional Chinese medicine) and GB/T 16751.2-1997 (clinical terminology of traditional Chinese medical diagnosis and treatment, syndromes part). Preferably, standardization means that symptoms with similar meanings use the same standard symptom name, symptom body parts use the same standard name, and the same symptom dimension is processed with the same sentence-splitting convention.
During labeling, the four-diagnosis dimensions of inspection, listening and smelling, inquiry and palpation are regarded as four relation categories; the examined body part is the head entity and the sentence describing that part is the tail entity. Combined with a BIOES labeling scheme, the related symptom and symptom attribute entities are classified into the four-diagnosis relation to which they belong. After standardization and structuring, the labeling result of the traditional Chinese medicine clinical sentence S is shown in Table 2, yielding the training data of the extraction model. Referring to Table 2, the training data preferably includes the Chinese electronic medical record clinical text together with its entity recognition results and four-diagnosis classification results.
TABLE 2
Preferably, S-T represents a single tag tail entity, S-H represents a single tag head entity, B-H represents a multi-tag start head entity, E-H represents a multi-tag end head entity, B-T represents a multi-tag start tail entity, and E-T represents a multi-tag end tail entity.
Here a single-character token refers to a word represented by one character, and multi-character tokens refer to words represented by two or more characters.
"Gas uniformity" may be denoted as S-H B-T E-T, and "weak gas" may be denoted as B-H E-H S-T. When a head entity has more than two characters, it is denoted by several B-H tags and one E-H tag, e.g., B-H B-H E-H or B-H B-H B-H E-H. When a tail entity has more than two characters, it is denoted by several B-T tags and one E-T tag, e.g., B-T B-T E-T or B-T B-T B-T E-T.
Preferably, the architecture of the extraction model is the same as that provided in embodiment 2.
Preferably, the training data is input into the extraction model for iterative training. Preferably, parameter optimization is performed during the training process using an Adam optimizer.
Preferably, the performance of the extraction model is evaluated with the loss function during training; when the loss function no longer decreases, or the performance on the training data set no longer improves, the extraction model can be regarded as having completed training.
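For illustration, a simplified joint-training loop under the above criterion could look like the sketch below. The model interface, loss weighting, learning rate and patience value are assumptions; the patent itself only specifies the Adam optimizer and a loss-based stopping condition.

```python
import torch
from torch.optim import Adam

def train_extraction_model(model, train_loader, epochs=50, patience=3, lr=1e-3):
    """Joint training of the entity recognition and four-diagnosis classification tasks,
    stopping once the total loss stops decreasing.
    Assumes model(char_ids) returns (unnormalised entity label scores, pair relation probabilities)."""
    optimizer = Adam(model.parameters(), lr=lr)
    ner_loss_fn = torch.nn.CrossEntropyLoss()   # per-character entity labels
    rel_loss_fn = torch.nn.BCELoss()            # per-pair four-diagnosis relations
    best_loss, stale = float("inf"), 0

    for epoch in range(epochs):
        total = 0.0
        for batch in train_loader:              # batch: {"char_ids", "ner_labels", "rel_labels"}
            optimizer.zero_grad()
            y_ner, y_rel = model(batch["char_ids"])
            loss = (ner_loss_fn(y_ner.transpose(1, 2), batch["ner_labels"])
                    + rel_loss_fn(y_rel, batch["rel_labels"]))
            loss.backward()
            optimizer.step()
            total += loss.item()
        if total < best_loss - 1e-4:
            best_loss, stale = total, 0
        else:
            stale += 1
            if stale >= patience:               # loss no longer falling -> training complete
                break
    return model
```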
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (8)

1. A Chinese electronic medical record entity identification and four-diagnosis classification extraction method is characterized by at least comprising the following steps:
Word segmentation is carried out on the traditional Chinese medicine clinical electronic medical record data so as to obtain training data;
inputting the training data into an extraction model for training;
inputting the traditional Chinese medical clinical electronic medical record data into the trained extraction model to obtain an entity identification result and a four-diagnosis classification result of the electronic medical record;
wherein the extraction model is a neural network model of a cyclic interaction network configured with self-attention mechanisms, comprising at least a hybrid embedding layer,
The mixed embedding layer maps word segmentation results of the electronic medical record data into two different dense embedding vectors, and mixes the two dense embedding vectors to generate a vector matrix;
the mixed embedding layer maps word segmentation results of the electronic medical record data into character embedding and character-level convolution feature embedding; the character-level convolution feature embedding means that the context feature of each character and the character feature are extracted together and then connected to generate a convolution feature representation of each character;
The extraction model further comprises a task layer, wherein the task layer processes the vector matrix through a cyclic activation network to obtain a prediction set of entity identification and four-diagnosis classification;
The task layer comprises an entity identification module and a four-diagnosis classification module, wherein the entity identification module is used for identifying and extracting an entity in a text, namely an entity identification task; the four-diagnosis classification module is used for classifying the relation of the extraction entity, namely four-diagnosis classification tasks;
The four-diagnosis classification task is a binary classification problem, and the value of the relation triple <c1, t, cn> is determined by classifying the character pair (c1, cn), wherein c1 refers to the head entity, cn refers to the tail entity, and t refers to the relation between the entities, namely the four relations of looking, smelling, asking and cutting.
2. The extraction method for Chinese electronic medical record entity identification and four-diagnosis classification according to claim 1, wherein the extraction model further comprises an input layer for word segmentation and labeling of the Chinese electronic medical record text according to a classical bi-directional maximum matching algorithm and an external traditional Chinese medicine dictionary, and each character is represented as a one-hot vector by the input layer.
3. The method according to claim 1, wherein the extraction model further comprises an output layer, and the output layer extracts the result with the highest probability in the prediction set as the entity recognition result and the four-diagnosis classification result.
4. The method of claim 1, wherein the task layer further comprises a memory module for converting the vector matrix into shared features.
5. The method according to claim 4, wherein the entity recognition module and the four-diagnosis classification module respectively calculate the prediction sets of entity recognition and four-diagnosis classification from the shared features.
6. The method of claim 5, wherein the entity recognition module and the four-diagnosis classification module are each provided with a self-attention mechanism.
7. An extraction device, comprising an extraction model, which is used for carrying out entity identification and four-diagnosis classification on a Chinese electronic medical record, and is characterized in that the extraction model at least comprises:
the input layer is used for word segmentation and labeling of traditional Chinese medicine clinical texts;
the mixed embedding layer is used for mapping word segmentation results of the input layer into character embedding and character-level convolution feature embedding and mixing the two to generate a vector matrix; the character-level convolution feature embedding means that the context feature of each character and the character feature itself are extracted together and then concatenated to generate a convolution feature representation of each character;
the task layer is used for processing the vector matrix through a cyclic interaction network to output a prediction set of entity identification and four-diagnosis classification;
the output layer is used for extracting a result with the highest probability in the prediction set as an entity identification result and a four-diagnosis classification result;
the task layer at least comprises an entity identification module and a four-diagnosis classification module, wherein the entity identification module is configured with a self-attention mechanism;
the entity recognition module is used for recognizing and extracting entities in the text, namely the entity recognition task; the four-diagnosis classification module is used for classifying the relations of the extracted entities, namely the four-diagnosis classification task;
the four-diagnosis classification task is a binary classification problem, and the value of the relation triple <c1, t, cn> is determined by classifying the character pair (c1, cn), wherein c1 denotes a head entity, cn denotes a tail entity, and t denotes the relation between the entities, namely one of the four relations of looking, smelling, asking and cutting.
8. A method for constructing an extraction model according to claim 7, wherein the method comprises at least:
preprocessing the traditional Chinese medicine clinical electronic medical record data to obtain training data of the extraction model;
constructing the extraction model based on character-level convolution features and a self-attention mechanism;
and carrying out iterative training on the extraction model by utilizing the training data.
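To make the claimed structure more tangible (the mixed embedding layer of claims 1 and 7, and the character-pair formulation of the four-diagnosis classification task), the following is a non-authoritative PyTorch sketch; the class names, dimensions, and the independent per-relation sigmoid scoring are illustrative assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

class MixedEmbedding(nn.Module):
    """Mixes a per-character embedding with a character-level convolution
    feature that captures each character together with its context."""
    def __init__(self, vocab_size, char_dim=64, conv_dim=64, kernel_size=3):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, conv_dim, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, char_ids):                              # (batch, seq_len)
        e = self.char_emb(char_ids)                           # (batch, seq_len, char_dim)
        c = self.conv(e.transpose(1, 2)).transpose(1, 2)      # (batch, seq_len, conv_dim)
        # Concatenate each character embedding with its convolution feature.
        return torch.cat([e, torch.relu(c)], dim=-1)          # (batch, seq_len, char_dim + conv_dim)

class PairRelationScorer(nn.Module):
    """Scores every character pair (c1, cn) for each of the four relations,
    treating four-diagnosis classification as binary decisions per pair."""
    def __init__(self, hidden_dim, num_relations=4):
        super().__init__()
        self.scorer = nn.Linear(2 * hidden_dim, num_relations)

    def forward(self, h):                                     # (batch, seq_len, hidden_dim)
        b, n, d = h.shape
        head = h.unsqueeze(2).expand(b, n, n, d)              # candidate head characters c1
        tail = h.unsqueeze(1).expand(b, n, n, d)              # candidate tail characters cn
        pairs = torch.cat([head, tail], dim=-1)               # (batch, n, n, 2 * hidden_dim)
        return torch.sigmoid(self.scorer(pairs))              # (batch, n, n, num_relations)

if __name__ == "__main__":
    emb = MixedEmbedding(vocab_size=5000)
    scorer = PairRelationScorer(hidden_dim=128)               # 128 = char_dim + conv_dim
    ids = torch.randint(1, 5000, (2, 30))                     # two toy sequences of 30 characters
    print(scorer(emb(ids)).shape)                             # (2, 30, 30, 4)
```

In this sketch each character pair (c1, cn) receives an independent score for each of the four relations (looking, smelling, asking, cutting), matching the binary-classification framing of the claims; a memory module, the cyclic interaction between the two task heads, and the self-attention layers of claims 4 to 6 would sit between the embedding and the pair scorer.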
CN202410162773.3A 2024-02-05 2024-02-05 Extraction method and model for Chinese electronic medical record entity identification and four-diagnosis classification Active CN117708338B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410162773.3A CN117708338B (en) 2024-02-05 2024-02-05 Extraction method and model for Chinese electronic medical record entity identification and four-diagnosis classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410162773.3A CN117708338B (en) 2024-02-05 2024-02-05 Extraction method and model for Chinese electronic medical record entity identification and four-diagnosis classification

Publications (2)

Publication Number Publication Date
CN117708338A (en) 2024-03-15
CN117708338B (en) 2024-04-26

Family

ID=90157406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410162773.3A Active CN117708338B (en) 2024-02-05 2024-02-05 Extraction method and model for Chinese electronic medical record entity identification and four-diagnosis classification

Country Status (1)

Country Link
CN (1) CN117708338B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1759804A (en) * 2005-11-02 2006-04-19 浙江大学 Intelligent analyzing and differentiating method of herbalist doctor through integrated references form four parts of diagnosis
CN106933994A (en) * 2017-02-27 2017-07-07 广东省中医院 A kind of core disease card relation construction method based on knowledge of TCM collection of illustrative plates
CN108549639A (en) * 2018-04-20 2018-09-18 山东管理学院 Based on the modified Chinese medicine case name recognition methods of multiple features template and system
CN110838368A (en) * 2019-11-19 2020-02-25 广州西思数字科技有限公司 Robot active inquiry method based on traditional Chinese medicine clinical knowledge graph
CN113077873A (en) * 2021-05-06 2021-07-06 井颐医疗信息技术(杭州)有限公司 Traditional Chinese medicine clinical decision support system and method
CN113409938A (en) * 2021-06-30 2021-09-17 海南医学院 Modeling method and system of traditional Chinese medicine syndrome type prediction model of systemic lupus erythematosus
CN114512228A (en) * 2022-02-08 2022-05-17 吾征智能技术(北京)有限公司 Traditional Chinese medicine disease auxiliary diagnosis system, equipment and storage medium
CN114783618A (en) * 2022-05-17 2022-07-22 平安科技(深圳)有限公司 Traditional Chinese medicine model training method and device based on knowledge structure and four-diagnosis character
CN114996477A (en) * 2022-06-15 2022-09-02 湖南中医药大学 Knowledge graph construction method and device based on typhoid theory
CN115687642A (en) * 2022-10-18 2023-02-03 广东省中医院(广州中医药大学第二附属医院、广州中医药大学第二临床医学院、广东省中医药科学院) Traditional Chinese medicine diagnosis and treatment knowledge discovery method based on clinical knowledge graph representation learning
CN116680412A (en) * 2023-06-12 2023-09-01 河南科技大学 Traditional Chinese medicine prescription recommendation method based on knowledge graph
CN116805013A (en) * 2023-06-27 2023-09-26 广州中医药大学(广州中医药研究院) Traditional Chinese medicine video retrieval model based on knowledge graph

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Combining the External Medical Knowledge Graph Embedding to Improve the Performance of Syndrome Differentiation Model; Qing Ye et al.; https://doi.org/10.1155/2023/2088698; 2023-02-01; pp. 1-11 *
Named entity recognition of TCM electronic medical records based on the ALBERT-BiLSTM-CRF model; P. Xuefeng et al.; 2022 12th International Conference on Information Technology in Medicine and Education (ITME); 2023-04-04; pp. 575-582 *
Recurrent Interaction Network for Jointly Extracting Entities and Classifying Relations; Kai Sun et al.; https://arxiv.org/abs/2005.00162v2; 2020-09-17; pp. 1-11 *
Methodological research on classification algorithms in TCM syndrome studies; 周忞 et al.; 中西医结合学报; 2010-10-15; No. 10; pp. 911-916 *
Research on entity extraction of four-diagnosis information from lung cancer medical records based on character-vector BiGRU-CRF; 屈丹丹 et al.; 世界科学技术-中医药现代化; 2021-09-20; Vol. 23, No. 09; pp. 3118-3125 *

Also Published As

Publication number Publication date
CN117708338A (en) 2024-03-15

Similar Documents

Publication Publication Date Title
Liu et al. Medical-vlbert: Medical visual language bert for covid-19 ct report generation with alternate learning
CN105404632B (en) System and method for carrying out serialized annotation on biomedical text based on deep neural network
CN109192300A (en) Intelligent way of inquisition, system, computer equipment and storage medium
CN109871538A (en) A kind of Chinese electronic health record name entity recognition method
CN109670177A (en) One kind realizing the semantic normalized control method of medicine and control device based on LSTM
CN112487820B (en) Chinese medical named entity recognition method
Shi et al. Deep adaptively-enhanced hashing with discriminative similarity guidance for unsupervised cross-modal retrieval
CN111243699A (en) Chinese electronic medical record entity extraction method based on word information fusion
CN111640471A (en) Method and system for predicting activity of drug micromolecules based on two-way long-short memory model
CN112420191A (en) Traditional Chinese medicine auxiliary decision making system and method
CN114781382A (en) Medical named entity recognition system and method based on RWLSTM model fusion
Peng et al. BG-SAC: Entity relationship classification model based on Self-Attention supported Capsule Networks
CN115879546A (en) Method and system for constructing composite neural network psychology medicine knowledge map
Liu et al. Data-driven regular expressions evolution for medical text classification using genetic programming
Steur et al. Next-generation neural networks: Capsule networks with routing-by-agreement for text classification
Zhang et al. Using a pre-trained language model for medical named entity extraction in Chinese clinic text
Yadav et al. Analysis of facial sentiments: a deep-learning way
Chen et al. MISS: A Generative Pretraining and Finetuning Approach for Med-VQA
CN113035303A (en) Method and system for labeling named entity category of Chinese electronic medical record
CN117708338B (en) Extraction method and model for Chinese electronic medical record entity identification and four-diagnosis classification
CN115659986B (en) Entity relation extraction method for diabetes text
CN111523320A (en) Chinese medical record word segmentation method based on deep learning
Lin et al. Skin Medical Image Captioning Using Multi-Label Classification and Siamese Network
CN114582449A (en) Electronic medical record named entity standardization method and system based on XLNet-BiGRU-CRF model
CN114328813A (en) Word standardization method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant