CN115545021A

CN115545021A - Clinical term identification method and device based on deep learning

Info

Publication number: CN115545021A
Application number: CN202210802174.4A
Authority: CN
Inventors: 杨鹏; 谢亮亮; 李文军; 胡皓楠; 解然
Original assignee: Zhejiang Huaxun Technology Co ltd
Current assignee: Zhejiang Huaxun Technology Co ltd
Priority date: 2022-07-08
Filing date: 2022-07-08
Publication date: 2022-12-30

Abstract

The invention discloses a clinical term recognition method and device based on deep learning. The method comprises four steps: fine-tuning of a pre-trained model, construction of a clinical entity library, a context-aware network, and term recognition. First, a pre-trained model is fine-tuned on a clinical data set to learn text representations for the clinical domain. Next, a crawler program collects clinical entity words from a professional clinical medical entity dictionary and the online medical database PubMed and stores them in a clinical entity library. The clinical text is then matched and marked against the clinical entity library to obtain a clinical entity set; the clinical text and the clinical entity set are vectorized with the fine-tuned pre-trained model; and a context-aware network uses an attention mechanism to model the association between terms and context entities, producing a feature vector for term recognition. Finally, a conditional random field (CRF) learns the dependencies among labels to improve recognition accuracy, and the recognition result for the clinical terms is output.

Description

Clinical term identification method and device based on deep learning

Technical Field

The invention relates to a clinical term identification method and device based on deep learning, and belongs to the technical field of Internet and artificial intelligence.

Background

Clinical term recognition is in essence a named entity recognition (NER) task. NER underlies many natural language processing tasks, such as disease reasoning, comorbidity detection, and clinical diagnosis, and accurate recognition and marking of the relevant entities helps those downstream tasks run efficiently. Identifying target entities entirely by hand is inefficient and expensive and requires many professionals, so automatic recognition of target entities by means of artificial intelligence is of considerable research and practical interest. Current NER methods are usually based on complex multi-layer neural networks from deep learning and must be trained on large labeled data sets to yield a usable model. However, annotated resources in the clinical domain are quite limited, and labeled corpora for term entities are especially scarce. When a traditional neural network model is trained on small-scale data, overfitting or underfitting occurs easily, and the semantic information and feature associations of the context are difficult to capture. In addition, the clinical records on which NER is performed are usually unstructured free text, and existing models usually perform context vector representation and feature extraction on the labeled entities. Most existing labeled data are annotated at the sentence level, where context information is relatively limited; a model trained on such data has difficulty representing words accurately, and the interaction between entities in the context is ignored.

The clinical term entities studied in the present invention include, but are not limited to, three classes of term entity: disease (problem), treatment (treatment), and examination item (test). Consider the sentence "Clopidogrel is a drug effective in heart disease and can be taken for cardiac examination." The sentence contains all three types of entity, and the information interaction among the treatment entity (clopidogrel), the disease entity (heart disease), and the test entity (cardiac examination) provides useful support for recognizing each of them. Existing named entity recognition models, however, pay little attention to the information interaction between entities, even though modeling the interaction and association among entities helps a model understand the data better. In recent years, as medical data have continued to accumulate, new entities have kept appearing, and professional dictionaries that explain and summarize terms have become available. In an original clinical record, the context of a term carries only limited information about that term. Supplementing the semantic attributes of the term with knowledge from an entity dictionary can help the model understand the meaning of the term and provides information for recognizing the related entities. In summary, existing named entity recognition models based on hierarchical neural networks need large labeled data sets, but labeled data for this specific entity recognition task are very limited; and machine learning methods based on rules and dictionaries rely on manual feature extraction, so targeted large-scale feature engineering is time-consuming and labor-intensive. These circumstances limit the performance of traditional models on the clinical term recognition task.
Meanwhile, the information association between terms and context entities can help recognize the terms, and mining the implicit features of terms and context entities can help the model understand the data better and thus perform better. Traditional term recognition models pay little attention to the information association between terms and context entities, which limits further improvement of their accuracy on the term recognition task.

Disclosure of Invention

Existing term recognition methods have difficulty extracting context information when annotated resources are limited, which holds back recognition performance. To address this problem, the invention combines a pre-trained model with an attention mechanism to learn the interaction between terms and the implicit features and information of the clinical text context, improving the efficiency of model training and learning. The input to the model comprises a clinical record and a set of clinical entities. The clinical record is processed by a pre-trained model fine-tuned on the clinical domain to obtain word-level vector representations, and a bidirectional GRU then fuses context information to obtain the corresponding context feature vectors. For the term entity set, an entity library is matched against the clinical text to obtain the corresponding entity set, and the word set of each entity in the entity set is passed through the pre-trained model and the bidirectional GRU to obtain a vector representation of the entity set. An attention mechanism then performs attention-weighted fusion of the context feature vectors of the clinical text with the word vectors of the entity set; the weighted fusion information is concatenated to each context feature vector, associating the context vector information of the clinical text with the information of the context entities and yielding the feature vector representation used for term recognition. Finally, this feature vector representation is input into a conditional random field (CRF), which learns the sequential dependencies among output labels to improve the accuracy of the model output, and the term recognition result is output.

In order to achieve the purpose, the invention provides the following technical scheme:

A clinical term recognition method based on deep learning comprises the following steps:

Step 1: Fine-tuning of the pre-trained model

A pre-trained model is further trained and fine-tuned on a clinical resource database to learn the text structure knowledge of the clinical domain, so that the pre-trained model produces more accurate vectorized representations of clinical text;

Step 2: Construction of the clinical entity library

A crawler program collects clinical entity words from a professional clinical medical entity dictionary and the online medical database PubMed and stores them in a clinical entity library. Each clinical text is then matched against the entity library to obtain a corresponding entity set, and the processed data are split into a training set, a validation set, and a test set;

Step 3: Context-aware network

First, for the data in the training set, the pre-trained model fine-tuned in step 1 vectorizes each clinical text and the entity set obtained in step 2, giving word vector representations of the clinical text and of the entity set. The word vector representation of the clinical text is input into a bidirectional GRU neural network for feature embedding training to obtain the contextual semantic representation of the clinical text. The entity set is processed in the same way to obtain the contextual semantic representation of its words. An attention mechanism then performs a weighted calculation over the contextual semantic representation of the clinical text and that of the entities, modeling the association between the clinical information and the context entity information and attending to the entity information that has a key influence on term recognition. Finally, a matrix transformation and an activation function produce a preliminary prediction result for term recognition;

Step 4: Term recognition

The preliminary prediction result obtained in step 3 is input into a conditional random field (CRF), which learns the sequential dependencies among output labels and outputs the most probable term label prediction.

The invention also provides a deep learning-based clinical term recognition device, which comprises a memory, a processor, and a computer program stored in the memory and executable on the processor; when loaded into the processor, the computer program implements the deep learning-based clinical term recognition method described above.

Compared with the prior art, the invention has the following advantages and beneficial effects:

1. The invention trains a word vector representation model for the clinical domain on abundant unlabeled clinical data. Learning from in-domain data compensates for the inability of traditional word vector models to adapt to a specialized, resource-poor domain, and the enriched word vector features support the accuracy and interpretability of the clinical term recognition method.

2. The method departs from earlier approaches that attend only to the source clinical text. It designs a clinical entity library for identifying registered entities in the clinical text and uses an attention mechanism to model the information interaction between context entities, effectively attending to and extracting the feature associations between terms and context entities and thereby further improving the accuracy of clinical term recognition.

Drawings

FIG. 1 is a flow chart of a deep learning-based clinical term recognition method provided by the invention.

FIG. 2 is an overall model diagram of the present invention.

FIG. 3 is a context awareness module based on an attention mechanism.

Detailed Description

The technical solutions provided by the present invention are described in detail below with reference to specific examples. It should be understood that the following specific embodiments are only illustrative and do not limit the scope of the present invention.

Example 1: Referring to FIGS. 1 to 3, the invention provides a deep learning-based clinical term recognition method whose flow is shown in FIG. 1. First, a pre-trained model is fine-tuned on a clinical data set to learn text representations for the clinical domain. Next, a crawler program collects clinical entity words from a professional clinical medical entity dictionary and the online medical database PubMed and stores them in a clinical entity library. The clinical text is then matched and marked against the clinical entity library to obtain a clinical entity set; the clinical text and the clinical entity set are vectorized with the pre-trained model; and a context-aware network built on an attention mechanism models the information association between terms and entities to obtain predicted output labels. Finally, a conditional random field (CRF) learns the dependencies among the labels to improve the accuracy of term recognition. Referring to FIGS. 2 and 3, the detailed implementation steps of the method are as follows:

Step 1: Fine-tuning of the pre-trained model. The BERT pre-trained model is trained on a general corpus and learns general knowledge of text structure, yielding a conventional word vector representation, but it lacks knowledge of text representation in specific domains. The pre-trained model can be further fine-tuned on domain-specific resources to obtain a pre-trained model that incorporates domain knowledge, improving its vector representation capability in that domain; this approach has proved effective in many natural language processing tasks. A pre-trained model trained on a clinical corpus gives the model's initial parameters knowledge of the domain, which benefits fine-tuning and convergence of the target model on the training data set. The invention uses the MIMIC-III clinical database, taking the large number of unstructured clinical records it contains as an unlabeled clinical record data set for fine-tuning the pre-trained model, and obtains a language representation model that incorporates clinical domain knowledge (CRBERT). In subsequent word vectorization, CRBERT is mainly used in place of the original BERT as the initial word vectorization tool.
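The source does not state which training objective is used when fine-tuning on unlabeled records. As a minimal sketch, assuming the usual BERT masked-language-model objective (select about 15% of tokens; of those, 80% become `[MASK]`, 10% become a random token, 10% stay unchanged), the input corruption step could look as follows. The function name `mask_tokens`, the toy vocabulary, and the example sentence are hypothetical and not taken from the source.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Corrupt a token list for masked-language-model training.

    Returns (masked_tokens, labels). labels[i] is the original token at
    positions selected for prediction and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                    # model must predict this token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                # 10%: keep the token unchanged
        else:
            labels.append(None)                   # position is not predicted
            masked.append(tok)
    return masked, labels

# Hypothetical clinical sentence used only for illustration.
tokens = "patient was started on clopidogrel for coronary artery disease".split()
vocab = sorted(set(tokens))
masked, labels = mask_tokens(tokens, vocab, seed=1)
```

In an actual run, the corrupted sequences would be fed to the pre-trained model and the loss computed only at positions where `labels` is not `None`.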

Step 2: Construction of the clinical entity library. Clinical records contain many medical entities, and the need to catalogue and standardize medical entities has been recognized. Resources such as the English-Chinese abbreviation dictionary written by zhang xi chen et al. and online medical knowledge bases standardize and record commonly used professional vocabulary and clinical terms. A clinical entity library for recognizing target terms can be constructed by using a crawler program to collect the registered medical entities in such dictionaries and knowledge bases. These dictionaries and knowledge bases also provide information such as the attributes and explanations of words, which gives implicit support for the model's understanding of the data and for the recognition of clinical terms. Each clinical text is then matched against the entity library to obtain a corresponding entity set, and finally all the resulting data are split in an 8:1:1 ratio into a training set, a validation set, and a test set.
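The matching and splitting described in this step can be sketched as below. This is an illustration under stated assumptions, not the patent's implementation: the source does not specify the matching strategy, so greedy longest-match over lowercased tokens is assumed, and the names `match_entities`, `split_811`, and the three-entry library are hypothetical.

```python
import random

def match_entities(tokens, entity_library, max_len=5):
    """Greedy longest-match lookup of library entities in a token list.

    entity_library is a set of lowercase word tuples. Returns a list of
    (start, end, entity_words) triples with `end` exclusive.
    """
    matches, i = [], 0
    while i < len(tokens):
        found = None
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            cand = tuple(t.lower() for t in tokens[i:i + n])
            if cand in entity_library:
                found = (i, i + n, list(cand))
                break
        if found:
            matches.append(found)
            i = found[1]          # continue after the matched entity
        else:
            i += 1
    return matches

def split_811(samples, seed=0):
    """Shuffle samples and split them into train/validation/test at 8:1:1."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_train = len(samples) * 8 // 10
    n_val = len(samples) // 10
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

# Toy entity library and sentence, for illustration only.
library = {("clopidogrel",), ("heart", "disease"), ("cardiac", "examination")}
text = "Clopidogrel is effective in heart disease and cardiac examination".split()
entities = match_entities(text, library)
train, val, test = split_811(range(100))
```

Here `entities` holds one entity set for the sentence, matching the "one entity set per clinical text" arrangement described in the text.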

Step 3: Context-aware network. First, the CRBERT model obtained in step 1 converts the words of the clinical text into a sequence of vectors $H_{\text{CRBERT}} = \langle h_1, h_2, \ldots, h_m \rangle$. This sequence is then input into a bidirectional GRU neural network, which learns the context information of the text and extracts the implicit features of the context. The output of the bidirectional GRU is the contextual sequence representation of the clinical record, $H_{\text{gru}}$. The calculation by which the bidirectional GRU fuses the context information of the clinical text is shown in formulas (1) to (3):
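Formulas (1) to (3) do not appear in this text. A conventional bidirectional GRU formulation that is consistent with the surrounding description is sketched here; the hidden-state symbols $\overrightarrow{g}_t$ and $\overleftarrow{g}_t$ are assumed names, and the exact form used in the source may differ:

$$\overrightarrow{g}_t = \overrightarrow{\mathrm{GRU}}\left(h_t, \overrightarrow{g}_{t-1}\right)$$

$$\overleftarrow{g}_t = \overleftarrow{\mathrm{GRU}}\left(h_t, \overleftarrow{g}_{t+1}\right)$$

$$H_{\text{gru}} = \left\langle [\overrightarrow{g}_1 ; \overleftarrow{g}_1], [\overrightarrow{g}_2 ; \overleftarrow{g}_2], \ldots, [\overrightarrow{g}_m ; \overleftarrow{g}_m] \right\rangle$$

Under this reading, each word's contextual vector is the concatenation of a forward state that has seen $h_1, \ldots, h_t$ and a backward state that has seen $h_m, \ldots, h_t$.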

Each input text $x_i = \langle x_{i1}, x_{i2}, \ldots, x_{in} \rangle$ is matched against the entity library from step 2. A matched entity is denoted $a_i = \langle a_{i1}, a_{i2}, \ldots, a_{iL} \rangle$, where $L$ is the number of words in the entity $a_i$; each clinical text has one entity set. To fuse the information in this entity set into the context of the clinical record, the entity set corresponding to the input $x_i$ must also be vectorized. Because it has fewer model parameters and trains faster, a bidirectional GRU is used for entity set vectorization. The module passes the word set $a_i$ of each entity in the entity set through a bidirectional GRU network to learn hidden word vector representations. The bidirectional GRU network learns and embeds the context of the entity words in both directions, and the word embeddings from the two directions are finally combined to obtain the entity set vectorized representation $H_{\text{attr}}$, where $p$ is the total number of terms in $x_i$ matched by the entity library. The calculation is shown in formulas (4) to (7):

$$H_{\text{attr}} = [H_1, H_2, H_3, \ldots, H_p] \tag{7}$$

Given the contextual vectorized representation of the clinical text, $H_{\text{gru}}$, and the vectorized representation of the entity set, $H_{\text{attr}}$, the invention fuses the information of the two by weighting, following the idea of the attention mechanism. Based on the information association between the clinical text context and the entity set, an attention-based context-aware module is proposed. It attends to the information interaction between context entities and extracts the feature associations between them, yielding the feature vector representation used for term recognition, $H_{\text{text-dict}}$. The attention calculation in the module is shown in formulas (8) to (9):

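$$H_{\text{dict}} = \mathrm{softmax}\left[ W_q H_{\text{gru}} \left( W_k H_{\text{attr}} \right)^{T} \right] W_v H_{\text{attr}} \tag{8}$$

$$H_{\text{text-dict}} = \mathrm{concat}\left(H_{\text{gru}}, H_{\text{dict}}\right) \tag{9}$$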
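Formulas (8) and (9) can be illustrated with a small dependency-free sketch. Several points are assumptions rather than statements from the source: rows are treated as word vectors (so the projections are written `H @ W` rather than `W H`), the softmax is taken over the entity axis, the concatenation is along the feature axis, and the toy matrices and identity projection weights are hypothetical. No scaling factor is applied because formula (8) does not include one.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in a:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def context_aware_fusion(h_gru, h_attr, w_q, w_k, w_v):
    """Attend from text positions (h_gru, m x d) to entity vectors (h_attr, p x d)."""
    q = matmul(h_gru, w_q)                              # queries from the text
    k = matmul(h_attr, w_k)                             # keys from the entity set
    v = matmul(h_attr, w_v)                             # values from the entity set
    weights = softmax_rows(matmul(q, transpose(k)))     # (m, p) attention weights
    h_dict = matmul(weights, v)                         # formula (8)
    h_text_dict = [g + d for g, d in zip(h_gru, h_dict)]  # formula (9): per-row concat
    return h_text_dict, weights

# Toy inputs: three text positions, two matched entity vectors, feature size 2.
h_gru = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_attr = [[1.0, 0.0], [0.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
h_text_dict, weights = context_aware_fusion(h_gru, h_attr, identity, identity, identity)
```

Each row of `weights` shows how strongly one text position attends to each matched entity, and each row of `h_text_dict` is the original contextual vector followed by its entity-aware summary.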

Step 4: Term recognition. The feature vector representation $H_{\text{text-dict}}$ obtained in step 3 is passed through a pooling operation, a matrix shape transformation, and an activation function to obtain the predicted output sequence, which gives, for each word in the clinical record text, a prediction score for each category label. The conventional approach directly takes the highest-scoring category as the predicted category of the word. However, the predicted output label sequence is obtained only from the mapping between the input text sequence and the result labels; the relationships among the output labels themselves are not learned, which can produce some incorrect results. This problem is addressed by means of a conditional random field (CRF), which expresses the constraints among label sequences through a state transition matrix. The parameters are optimized continuously to maximize the score of the correct state transition sequence, further improving term recognition accuracy. The calculation is shown in formulas (10) to (11):
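Formulas (10) and (11) do not appear in this text. A common linear-chain CRF formulation that is consistent with the symbol descriptions here is sketched below. The symbols $A$ (tag transition matrix) and $P$ (per-word label scores produced in step 3) are assumed names, and the source's exact notation and indexing may differ:

$$\mathrm{score}(X, y) = \sum_{i=1}^{n-1} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

$$p(y \mid X) = \frac{\exp\left(\mathrm{score}(X, y)\right)}{\sum_{\tilde{y} \in Y} \exp\left(\mathrm{score}(X, \tilde{y})\right)}$$

Training would then maximize $\log p(y \mid X)$ for the annotated sequence $y$, and prediction would select $\arg\max_{\tilde{y} \in Y} \mathrm{score}(X, \tilde{y})$.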

Here $X$ denotes a training sample and $\mathrm{score}(X, y)$ denotes the score of the correctly annotated sequence; the transition term indicates the probability of a transition between tags, and the emission term indicates the probability that the $i$-th tag takes the label $y_i$. The scores of all possible tag sequences are then normalized with softmax, where $Y$ denotes the set of all possible tag sequences and $y$ denotes the annotated sequence. The CRF outputs the predicted tag sequence with the highest score, which is the final term recognition result.
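The difference between taking the highest-scoring label per word and decoding with label transitions can be shown with a small Viterbi sketch. The source does not name the decoding algorithm, so Viterbi is an assumption (it is the standard choice for a linear-chain CRF); the tag set (0 = O, 1 = B, 2 = I), the score values, and the function names are hypothetical.

```python
def sequence_score(emissions, transitions, tags):
    """Score of one tag sequence: per-word label scores plus transition scores."""
    score = emissions[0][tags[0]]
    for i in range(1, len(tags)):
        score += transitions[tags[i - 1]][tags[i]] + emissions[i][tags[i]]
    return score

def viterbi(emissions, transitions):
    """Return (best_path, best_score) over all tag sequences."""
    n_tags = len(transitions)
    best = list(emissions[0])      # best score of any path ending in each tag
    backptr = []                   # backptr[t][j]: best previous tag for tag j
    for em in emissions[1:]:
        prev_best, step_ptr, best = best, [], []
        for j in range(n_tags):
            cands = [prev_best[i] + transitions[i][j] for i in range(n_tags)]
            i_star = max(range(n_tags), key=cands.__getitem__)
            step_ptr.append(i_star)
            best.append(cands[i_star] + em[j])
        backptr.append(step_ptr)
    last = max(range(n_tags), key=best.__getitem__)
    path = [last]
    for step_ptr in reversed(backptr):   # follow back-pointers to the start
        path.append(step_ptr[path[-1]])
    path.reverse()
    return path, best[last]

# Per-word label scores for three words (rows) and tags O, B, I (columns).
emissions = [[2.0, 1.0, 0.0], [0.5, 1.0, 1.2], [1.0, 0.0, 0.3]]
# transitions[i][j] scores moving from tag i to tag j; O -> I is penalized.
transitions = [[0.0, 0.0, -10.0], [0.0, 0.0, 0.5], [0.0, 0.0, 0.0]]

greedy = [max(range(3), key=row.__getitem__) for row in emissions]
path, score = viterbi(emissions, transitions)
```

In this toy case the per-word argmax yields O, I, O, which places an inside tag directly after an outside tag, while the transition-aware decoding yields O, B, O.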

Example 2: Based on the same inventive concept, the invention provides a deep learning-based clinical term recognition device, which comprises a memory, a processor, and a computer program stored in the memory and executable on the processor; when loaded into the processor, the computer program implements the deep learning-based clinical term recognition method described above.

The technical means disclosed in the invention scheme are not limited to the technical means disclosed in the above embodiments, but also include the technical scheme formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.

Claims

1. A clinical term recognition method based on deep learning, characterized by comprising the following steps:

step 1, fine-tuning of a pre-trained model;

step 2, construction of a clinical entity library;

step 3, context-aware network;

step 4, term recognition.

2. The deep learning-based clinical term recognition method according to claim 1, wherein step 1, fine-tuning of the pre-trained model, is specifically as follows: the MIMIC-III clinical database is used, and the large number of unstructured clinical records it contains serve as an unlabeled clinical record data set for fine-tuning the pre-trained model, to obtain a language representation model incorporating clinical domain knowledge (CRBERT), wherein CRBERT is mainly used in place of the original BERT as the initial word vectorization tool in subsequent word vectorization.

3. The deep learning-based clinical term recognition method according to claim 1, wherein step 2, construction of the clinical entity library, is specifically as follows: a crawler program collects clinical entity words from a professional clinical medical entity dictionary and the online medical database PubMed and stores them in the clinical entity library; each clinical text is then matched against the entity library to obtain a corresponding entity set; and the processed data are split into a training set, a validation set, and a test set.

4. The deep learning-based clinical term recognition method according to claim 1, wherein step 3, the context-aware network, is specifically as follows: first, for the data in the training set, the pre-trained model fine-tuned in step 1 vectorizes each clinical text and the entity set obtained in step 2 to obtain word vector representations of the clinical text and of the entity set; the word vector representation of the clinical text is input into a bidirectional GRU neural network for feature embedding training to obtain the contextual semantic representation of the clinical text, and the entity set is processed in the same way to obtain the contextual semantic representation of its words; an attention mechanism then performs a weighted calculation over the contextual semantic representation of the clinical text and that of the entities, modeling the association between the clinical information and the context entity information and attending to the entity information that has a key influence on term recognition; and finally a matrix transformation and an activation function produce a prediction result for term recognition.

5. The deep learning-based clinical term recognition method according to claim 1, wherein step 4, term recognition, is specifically as follows: the preliminary prediction result obtained in step 3 is input into a conditional random field (CRF), which learns the sequential dependencies among output labels and outputs the most probable term label prediction.

6. The deep learning-based clinical term recognition method according to claim 3, wherein step 3, the context-aware network, is specifically as follows: the CRBERT model obtained in step 1 first converts the words of the clinical text into a sequence of vectors $H_{\text{CRBERT}} = \langle h_1, h_2, \ldots, h_m \rangle$, which is then input into a bidirectional GRU neural network that learns the context information of the text and extracts the implicit features of the context; the output of the bidirectional GRU is the contextual sequence representation of the clinical record, $H_{\text{gru}}$; the calculation by which the bidirectional GRU fuses the context information of the clinical text is shown in formulas (1) to (3):

each input text $x_i = \langle x_{i1}, x_{i2}, \ldots, x_{in} \rangle$ is matched against the entity library from step 2; a matched entity is denoted $a_i = \langle a_{i1}, a_{i2}, \ldots, a_{iL} \rangle$, where $L$ is the number of words in the entity $a_i$, and each clinical text has one entity set; the entity set corresponding to the input $x_i$ must also be vectorized, and a bidirectional GRU (gated recurrent unit) network is used for this entity set vectorization; the module passes the word set $a_i$ of each entity in the entity set through the bidirectional GRU network to learn hidden word vector representations, the bidirectional GRU network learns and embeds the context of the entity words in both directions, and the word embeddings from the two directions are finally combined to obtain the entity set vectorized representation $H_{\text{attr}}$, where $p$ is the total number of terms in $x_i$ matched by the entity library; the calculation is shown in formulas (4) to (7):

$$H_{\text{attr}} = [H_1, H_2, H_3, \ldots, H_p] \tag{7}$$

given the contextual vectorized representation of the clinical text, $H_{\text{gru}}$, and the vectorized representation of the entity set, $H_{\text{attr}}$, the information of the two is fused by weighting, following the idea of the attention mechanism; based on the information association between the clinical text context and the entity set, an attention-based context-aware module is provided, which attends to the information interaction between context entities and extracts the feature associations between them, yielding the feature vector representation used for term recognition, $H_{\text{text-dict}}$; the attention calculation in the module is shown in formulas (8) to (9):

$$H_{\text{dict}} = \mathrm{softmax}\left[ W_q H_{\text{gru}} \left( W_k H_{\text{attr}} \right)^{T} \right] W_v H_{\text{attr}} \tag{8}$$

$$H_{\text{text-dict}} = \mathrm{concat}\left(H_{\text{gru}}, H_{\text{dict}}\right) \tag{9}$$

7. The deep learning-based clinical term recognition method according to claim 5, wherein step 4, term recognition, is specifically as follows: the feature vector representation $H_{\text{text-dict}}$ obtained in step 3 is passed through a pooling operation, a matrix shape transformation, and an activation function to obtain the predicted output sequence, which gives, for each word in the clinical record text, a prediction score for each category label; the conventional approach directly takes the highest-scoring category as the predicted category of the word, and this problem is addressed by means of a conditional random field (CRF), which expresses the constraints among label sequences through a state transition matrix, the parameters being optimized continuously to maximize the score of the correct state transition sequence and further improve term recognition accuracy, wherein the calculation is shown in formulas (10) to (11):

the transition term indicates the probability of a transition between labels, and the emission term indicates the probability that the $i$-th label takes the value $y_i$; the scores of all possible label sequences are normalized with softmax, where $Y$ denotes the set of all possible label sequences and $y$ denotes the annotated sequence; and the CRF (conditional random field) outputs the predicted label sequence with the highest score, thereby outputting the final term recognition result.

8. A recognition device for implementing the deep learning-based clinical term recognition method according to any one of claims 1 to 7, wherein the recognition device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the computer program is loaded into the processor, the recognition device implements the deep learning-based clinical term recognition method, which comprises the following steps: step 1, fine-tuning of a pre-trained model; step 2, construction of a clinical entity library; step 3, context-aware network; step 4, term recognition.