CN114638214A - Method for identifying Chinese named entities in medical field - Google Patents

Method for identifying Chinese named entities in medical field

Info

Publication number
CN114638214A
CN114638214A
Authority
CN
China
Prior art keywords
embedding
layer
pos
bert
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210268640.5A
Other languages
Chinese (zh)
Inventor
陈洪辉
江苗
王梦如
蔡飞
舒振
宋城宇
张鑫
陈翀昊
邵太华
郑建明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202210268640.5A priority Critical patent/CN114638214A/en
Publication of CN114638214A publication Critical patent/CN114638214A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for identifying Chinese named entities in the medical field, which uses a BBCPR model. The BBCPR model consists of a word embedding layer, a BERT embedding layer, a POS fusion layer, a BiLSTM layer and a CRF layer. The word embedding layer converts a given sentence into a word embedding and inputs it into the BERT embedding layer; the BERT embedding layer adopts an MCBERT encoder to obtain the BERT output embedding; the POS fusion layer connects the BERT output embedding and the POS embedding to obtain the fused embedding; the BiLSTM layer encodes the fused embedding to obtain the final implicit representation of the input sequence; and the CRF layer decodes the output of the BiLSTM layer to obtain a tag sequence and outputs it. The method can explicitly learn word boundary information, alleviate the overfitting problem, and enhance the robustness of the model on small data.

Description

Method for identifying Chinese named entities in medical field
Technical Field
The invention belongs to the technical field of artificial intelligence and digital healthcare, and particularly relates to a Chinese named entity recognition method in the medical field.
Background
Named Entity Recognition (NER) is a core task of Natural Language Processing (NLP) that aims to identify potential entities and their classes in unstructured text. As an important component of many NLP downstream tasks (such as relation extraction and information retrieval), NER has long been a hot problem in the NLP community and attracts wide attention. In general, previous work has mostly targeted English NER tasks and has achieved good performance by integrating character-level features.
East Asian languages (such as Chinese) typically lack explicit word boundaries and have complex word formation, which makes NER more challenging than in English. For example, the performance of the best-performing models (SOTAs) on the Chinese NER task is far lower than that of the SOTAs on the English NER task, with a gap of nearly 10% on the F1 evaluation metric. Furthermore, recent research has focused more on domain-specific NER, such as medicine, where the text is complex and requires external domain expertise.
Chinese Named Entity Recognition (CNER) in the medical field is treated as a character-level sequence labeling problem, whereas English NER is word-level. Recently, deep learning methods have been widely used for CNER tasks because of their excellent ability to automatically extract features from massive data. For example, previous work typically utilized a bidirectional long short-term memory (BiLSTM) network to capture sequence information and achieved comparable results. Furthermore, because of the superior ability of language models to extract context information, Transformer-based models (such as BERT) are becoming a new paradigm for CNER.
Especially in the medical field, external expertise helps the model understand technical terms and the boundaries of recognized words, which has prompted recent research to add dictionary knowledge on top of the traditional BiLSTM-CRF or BERT architecture. However, constructing a high-quality dictionary typically requires a lot of time and expertise, which is very expensive and labor intensive. Furthermore, these dictionary-based approaches may reduce the versatility and robustness of the NER model. At the same time, annotated Chinese medical NER data are difficult to obtain and are generally small in scale due to privacy, ethical, and high-specialization restrictions; this small scale easily causes model overfitting.
Disclosure of Invention
In order to solve the above problems, the invention provides a method that improves Chinese named entity recognition in the medical field using part-of-speech information and a new regularization method. The method uses the BBCPR (BERT-BiLSTM-CRF with POS and Regularization) model to identify named entities in the medical field; the BBCPR model utilizes a POS (part-of-speech) fusion layer to incorporate external syntactic knowledge, and introduces a novel READ (REgularization with Adversarial training and Dropout) method to improve the robustness of the model. The BBCPR model consists of a word embedding layer, a BERT embedding layer, a POS fusion layer, a BiLSTM layer and a CRF layer.
The word embedding layer converts a given sentence into a word embedding E and inputs it into the BERT embedding layer.
The BERT embedding layer adopts an MCBERT encoder to obtain the BERT output embedding, which serves as the input to the POS fusion layer.
The POS fusion layer connects the BERT output embedding with the POS embedding to obtain the fused embedding, the POS embedding being obtained from the POS tags of the given sentence through the POS embedding layer.
The BiLSTM layer encodes the fused embedding to obtain the final implicit representation of the input sequence.
The CRF layer decodes the output of the BiLSTM layer to obtain a tag sequence and outputs it.
The BBCPR model uses a regularization module with adversarial training and dropout. The module generates the adversarial word embedding E' by applying the adversarial perturbation generated by FGSM to the word embedding E; the word embedding E and the adversarial word embedding E' are passed through a Dropout mechanism to form two different sub-models that output two different model prediction distributions; the bidirectional KL divergence between the two prediction distributions is then minimized to reduce the prediction difference of the two sub-models.
Minimizing the bidirectional KL divergence between the two prediction distributions reduces the prediction difference of the two sub-models; the training objective is to minimize the loss function $\mathcal{L}(X, Y)$ over the data (X, Y):

$$\mathcal{L}_{KL} = \frac{1}{2}\left[ D_{KL}\big(P(Y|X) \,\|\, P'(Y|X)\big) + D_{KL}\big(P'(Y|X) \,\|\, P(Y|X)\big) \right]$$

$$\mathcal{L}_{NER} = -\log P(Y|X) - \log P'(Y|X)$$

$$\mathcal{L}(X, Y) = \mathcal{L}_{NER} + \lambda \cdot \mathcal{L}_{KL}$$

where λ is the coefficient weight, P(Y|X) is the prediction distribution from the word embedding E, P'(Y|X) is the prediction distribution from the adversarial word embedding E', $D_{KL}$ is the KL divergence, X is the input sentence, and Y is the tag sequence output by the BBCPR model.
The POS fusion layer connects the BERT output embedding and the POS embedding, specifically:

$$v_i = [h_i; p_i]$$

where $v_i$ is the concatenation of the BERT output embedding and the POS embedding, $h_i$ is the BERT output embedding, and $p_i$ is the POS embedding of the i-th token in the sentence.
Further, the given sentence is passed through the LAC (Lexical Analysis of Chinese) tool to obtain POS tags, and the POS tags are fed into the POS embedding layer to obtain the POS embedding.
Further, BIO tags (Begin, Inside, Outside) are used to predict each token in the sentence.
Further, the MCBERT comprises a stack of L identical layers, each layer comprising two sublayers, wherein the first sublayer is a multi-headed self-attention mechanism, the second sublayer is a fully-connected feedforward neural network, and the two sublayers are connected in sequence by residual connection and layer normalization.
Further, the computation of the BiLSTM is:

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(v_i, \overrightarrow{h}_{i-1})$$
$$\overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(v_i, \overleftarrow{h}_{i+1})$$
$$h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$$

where $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$ denote the forward and backward LSTM, respectively.
The invention designs a POS fusion module based on MCBERT combined with BiLSTM-CRF, which brings POS information into the model and can guide the model to explicitly learn word boundary information for the CNER task in the medical field. Meanwhile, a new regularization method is proposed to regularize the model output on adversarial and dropout-perturbed samples, so as to alleviate the overfitting problem and enhance the robustness of the model on small data. The invention conducts comprehensive experiments on the cMedQANER and cEHRNER datasets. The results show that the BBCPR model outperforms several competitive baselines, and ablation studies demonstrate the effectiveness of the proposed method.
Drawings
In order to more clearly illustrate the embodiments or technical solutions of the present invention, the drawings used in the embodiments or technical solutions of the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the structures shown in the drawings without creative efforts.
Fig. 1 is an overall architecture diagram of a BBCPR model.
FIG. 2 is a block diagram of a READ regularization method.
FIG. 3 is a schematic diagram of word embedding.
FIG. 4 is a graph of the effect of BBCPR model on the cMedQANER and cEHRNER test sets.
FIG. 5 is a graph of the effect of POS embedding size on the cMedQANER and cEHRNER test set.
FIG. 6 is a graph of the effect of the regularization loss function weight λ on the cMedQANER and cEHRNER test sets.
FIG. 7 is a graph of the effect of the Dropout rate on the cMedQANER and cEHRNER test sets.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
As shown in FIG. 1, the BBCPR model consists of a word embedding layer, a BERT embedding layer, a POS fusion layer, a BiLSTM layer and a CRF layer.
The word embedding layer converts a given sentence into a word embedding E and inputs it into the BERT embedding layer.
The BERT embedding layer adopts an MCBERT encoder to obtain the BERT output embedding, which serves as the input to the POS fusion layer.
The POS fusion layer connects the BERT output embedding with the POS embedding to obtain the fused embedding, the POS embedding being obtained from the POS tags of the given sentence through the POS embedding layer.
The BiLSTM layer encodes the fused embedding to obtain the final implicit representation of the input sequence.
The CRF layer decodes the output of the BiLSTM layer to obtain a tag sequence and outputs it.
The CNER task in the medical field aims to identify and predict entities in text (e.g., diseases, symptoms, drugs). The present invention treats the CNER task in the medical field as a sequence tagging problem. Given a sentence $X = \{x_1, x_2, \ldots, x_n\}$, the goal is to predict a BIO tag (Begin, Inside, Outside) for each token $x_i$ in sentence X and obtain the tag sequence $Y = \{y_1, y_2, \ldots, y_n\}$ as output. Examples of tagged entities in sentences are shown in Table 1, where B-s denotes the beginning of a symptom entity, I-s the inside of a symptom entity, B-d the beginning of a disease entity, I-d the inside of a disease entity, B-p the beginning of a person entity, I-p the inside of a person entity, and O a token outside any entity.
Table 1 entity labels in sentences
(table image not reproduced)
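For illustration, a minimal sketch of how entity spans map to character-level BIO tags is given below; the sentence and spans are invented examples, not taken from Table 1, and the short type suffix "s" (symptom) follows the convention described above.

```python
# Hypothetical example: convert labeled entity spans to character-level BIO tags.
def spans_to_bio(tokens, spans):
    """spans: list of (start, end, type) tuples with end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

tokens = list("头痛伴发热")           # character-level tokens, as in Chinese CNER
spans = [(0, 2, "s"), (3, 5, "s")]   # "头痛" and "发热" tagged as symptoms
print(list(zip(tokens, spans_to_bio(tokens, spans))))
# [('头', 'B-s'), ('痛', 'I-s'), ('伴', 'O'), ('发', 'B-s'), ('热', 'I-s')]
```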
For clarity, the main symbols used are summarized in table 2.
Table 2 Main notation used herein
(table image not reproduced)
BERT is a pre-trained model composed of multi-layer bidirectional Transformer encoders, which obtains deep language representations by jointly conditioning on context across all layers. MCBERT has the same structure as BERT and is further trained on a Chinese medical corpus. It comprises a stack of N identical layers. Each layer comprises two sublayers, where the first sublayer is a multi-head self-attention mechanism and the second sublayer is a fully-connected feed-forward neural network. The two sublayers are connected in sequence by residual connection and layer normalization.
The word embedding layer first converts a given sentence X into a sequence of word embeddings $E = \{e_1, e_2, \ldots, e_n\}$, which is simply the sum of the token embedding, segment embedding, and position embedding, as shown in FIG. 3.
The BERT embedding layer adopts the MCBERT encoder, whose input is E. For convenience, the input of the first layer is denoted $H_0$ and the output of the $l$-th layer $H_l$. The output representation of the previous layer, $H_{l-1}$, is fed into the multi-head self-attention (MSA) sublayer to obtain a context-level representation, which is then passed through the feed-forward network (FFN) sublayer. These operations are expressed as:

$$H'_l = \mathrm{LayerNorm}\big(\mathrm{MSA}(H_{l-1}) + H_{l-1}\big)$$
$$H_l = \mathrm{LayerNorm}\big(\mathrm{FFN}(H'_l) + H'_l\big)$$
The final BERT output embedding $\{h_1, h_2, \ldots, h_n\}$ is sent to the POS fusion layer.
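The following PyTorch sketch illustrates one such encoder layer under the post-norm arrangement described above (residual connection followed by layer normalization around the MSA and FFN sublayers). The hidden size of 768 and FFN dimension of 1024 follow the experimental configuration reported later; the GELU activation is an assumption, since the text does not name the activation function.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ffn=1024):
        super().__init__()
        self.msa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn),
            nn.GELU(),                 # activation assumed; the text does not name it
            nn.Linear(d_ffn, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, h):
        # H'_l = LayerNorm(MSA(H_{l-1}) + H_{l-1})
        attn_out, _ = self.msa(h, h, h)
        h = self.norm1(attn_out + h)
        # H_l = LayerNorm(FFN(H'_l) + H'_l)
        return self.norm2(self.ffn(h) + h)
```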
The POS fusion layer incorporates part-of-speech features into the model: it connects the BERT output embedding with the POS embedding, which is derived from the POS tags of the given sentence through the POS embedding layer, to obtain the fused embedding. BERT treats text as a sequence of tokens and generates token-level embeddings; in Chinese, however, words are usually the smallest unit for expressing semantics. While a token-based model avoids segmentation errors, it loses part of the semantics and also increases the difficulty of entity boundary extraction. POS refers to the word features that form the basis of word classification, including nouns, verbs, adjectives, modal particles, and so on. POS can greatly improve NER annotation.
For a given sentence $X = \{x_1, x_2, \ldots, x_n\}$, the POS tags $T = \{t_1, t_2, \ldots, t_n\}$ are obtained with the Baidu LAC tool. Then T is fed into the POS embedding layer to obtain the POS embedding $P = \{p_1, p_2, \ldots, p_n\}$, where $p_i \in \mathbb{R}^{d_p}$ and $d_p$ denotes the size of the POS embedding. The POS fusion layer connects the BERT output embedding $h_i$ and the POS embedding $p_i$:

$$v_i = [h_i; p_i], \quad v_i \in \mathbb{R}^{768 + d_p}$$
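A minimal sketch of the POS fusion step follows. The LAC calls shown in the comments reflect my understanding of the open-source Baidu LAC toolkit and should be verified against its documentation; note also that LAC tags whole words, so the word-level tags must be expanded to character level before fusion.

```python
import torch
import torch.nn as nn

# from LAC import LAC                          # Baidu LAC toolkit (API assumed)
# lac = LAC(mode='lac')
# words, word_tags = lac.run("患者头痛伴发热")    # word-level POS tags, e.g. ['n', ...]
# (repeat each word's tag per character to align with BERT's character tokens)

class POSFusion(nn.Module):
    def __init__(self, n_pos_tags, d_pos=512):
        super().__init__()
        # nn.Embedding initializes from a standard normal distribution by default,
        # matching the random initialization described in the experiments
        self.pos_emb = nn.Embedding(n_pos_tags, d_pos)

    def forward(self, bert_out, pos_ids):
        # bert_out: (batch, seq, 768); pos_ids: (batch, seq) POS tag indices
        p = self.pos_emb(pos_ids)
        return torch.cat([bert_out, p], dim=-1)   # v_i = [h_i ; p_i]
```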
The BiLSTM layer can capture more comprehensive context information; this application introduces a BiLSTM, consisting of a forward LSTM and a backward LSTM, to encode the fused embedding. LSTM captures long-distance dependencies more accurately and avoids the gradient vanishing or explosion problems of the standard RNN. Thus, BiLSTM can better capture long-range and bidirectional semantic dependencies by learning context information in both directions. An LSTM unit at time t is defined as follows:
$$i_t = \sigma(W_i[h_{t-1}, v_t] + b_i)$$
$$f_t = \sigma(W_f[h_{t-1}, v_t] + b_f)$$
$$\tilde{C}_t = \tanh(W_c[h_{t-1}, v_t] + b_c)$$
$$o_t = \sigma(W_o[h_{t-1}, v_t] + b_o)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
$$h_t = o_t \odot \tanh(C_t)$$

where $v_t$ and $h_t$ denote the input vector and the hidden state at time t; $W_i$, $W_f$, $W_c$, $W_o$ and $b_i$, $b_f$, $b_c$, $b_o$ are learnable parameters of the LSTM; $\sigma$ is the sigmoid function; and $\odot$ denotes the element-wise product. $i_t$ denotes the input gate, which determines what information is stored at time t; $f_t$ denotes the forget gate, which determines what information from the previous time step is discarded; $o_t$ denotes the output gate, which determines what information is output. $\tilde{C}_t$ and $C_t$ denote the candidate cell state and the cell state at time t, respectively. The computation of the BiLSTM is represented as follows:
$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(v_i, \overrightarrow{h}_{i-1})$$
$$\overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(v_i, \overleftarrow{h}_{i+1})$$
$$h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$$

where $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$ denote the forward and backward LSTM, respectively. The final implicit representation of the input sequence is $H = \{h_1, h_2, \ldots, h_n\} \in \mathbb{R}^{n \times 2d_{LSTM}}$, where $d_{LSTM}$ denotes the hidden size of the LSTM.
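A sketch of the BiLSTM encoder over the fused embeddings, assuming the dimensions from the experimental setup (768-dimensional BERT output, 512-dimensional POS embedding, hidden size 256 per direction):

```python
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, d_in=768 + 512, d_lstm=256):
        super().__init__()
        self.lstm = nn.LSTM(d_in, d_lstm, batch_first=True, bidirectional=True)

    def forward(self, v):
        # v: (batch, seq, d_in) fused embeddings -> h: (batch, seq, 2 * d_lstm),
        # the concatenation of forward and backward hidden states at each step
        h, _ = self.lstm(v)
        return h
```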
In the named entity recognition task, there are order relationships and constraint rules between adjacent tags; for example, an I-symptom tag must appear after a B-symptom tag. BiLSTM focuses on long-range context information rather than on dependencies between tags, whereas CRF can model the sequential relationships between tags by learning neighboring relationships. The present invention utilizes a standard CRF layer to decode the final sequence tags.
The output H of the BiLSTM is first converted into the input matrix P of the CRF by a linear function:

$$P = W_p H + b_p$$

where $P \in \mathbb{R}^{n \times k}$, $W_p$ and $b_p$ are learnable parameters, and k is the number of tag types. The purpose of the CRF is to compute the conditional probability distribution P(Y|X) of the random variable sequence $Y = \{y_1, y_2, \ldots, y_n\}$. For a given sentence $X = \{x_1, x_2, \ldots, x_n\}$, the score of a candidate tag sequence $\tilde{Y}$ and the probability of the final optimal tag sequence are calculated as:

$$s(X, \tilde{Y}) = \sum_{i} \left( A_{\tilde{y}_i, \tilde{y}_{i+1}} + P_{i, \tilde{y}_i} \right)$$

$$P(Y|X) = \frac{\exp\big(s(X, Y)\big)}{\sum_{\tilde{Y} \in \mathcal{Y}_X} \exp\big(s(X, \tilde{Y})\big)}$$

where Y denotes the ground-truth tag sequence, $\mathcal{Y}_X$ denotes the set of all possible tag sequences, $A_{y_i, y_{i+1}}$ denotes the transition score from tag $y_i$ to tag $y_{i+1}$ (the transition matrix A is a learnable model parameter), and $P_{i, y_i}$ denotes the unnormalized score of mapping the i-th token to the named entity tag $y_i$. During model training, the loss function is defined as:

$$\mathcal{L}_{CRF} = -\log P(Y|X)$$

After decoding, the output tag sequence $Y^*$ with the highest score is computed as:

$$Y^* = \arg\max_{\tilde{Y} \in \mathcal{Y}_X} s(X, \tilde{Y})$$
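The decoding step can be sketched as follows with the third-party pytorch-crf package (an assumption; the patent does not name an implementation): a linear layer produces the emission matrix P, and the CRF layer holds the transition matrix A, computes the negative log-likelihood loss, and performs Viterbi decoding of Y*.

```python
import torch.nn as nn
from torchcrf import CRF   # third-party "pytorch-crf" package (assumed available)

class CRFHead(nn.Module):
    def __init__(self, d_lstm=256, n_tags=23):   # n_tags depends on the tag set
        super().__init__()
        self.emit = nn.Linear(2 * d_lstm, n_tags)  # P = W_p H + b_p
        self.crf = CRF(n_tags, batch_first=True)   # holds transition matrix A

    def loss(self, h, tags, mask):
        # negative log-likelihood: -log P(Y|X)
        return -self.crf(self.emit(h), tags, mask=mask)

    def decode(self, h, mask):
        # Viterbi decoding of the highest-scoring tag sequence Y*
        return self.crf.decode(self.emit(h), mask=mask)
```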
Annotated Chinese medical NER data are difficult to obtain and are generally small in scale due to privacy, ethical, and high-specialization restrictions. To this end, the BBCPR model introduces a new REgularization with Adversarial training and Dropout (READ) method to improve the robustness of the model. The READ method uses a regularization module with adversarial training and dropout: the module generates the adversarial word embedding E' by applying the adversarial perturbation generated by FGSM to the word embedding E; the word embedding E and the adversarial word embedding E' pass through a Dropout mechanism to form two different sub-models that output two different model prediction distributions; the bidirectional KL divergence between the two prediction distributions is then minimized to reduce the prediction difference of the two sub-models. The structure of READ is shown in FIG. 2.
For a given input sentence $X = \{x_1, x_2, \ldots, x_n\}$ and output sequence $Y = \{y_1, y_2, \ldots, y_n\}$, FGSM is first used to generate an adversarial perturbation that is added to the original word embedding E to obtain the adversarial word embedding E', and the dropout mechanism is used to obtain two different sub-models. Then, the original word embedding E and the adversarial word embedding E' are fed into the two sub-models respectively to output two different model prediction distributions, denoted P(Y|X) and P'(Y|X).
Due to the adversarial perturbation and dropout noise, these representations may drift far from the representation of the original sentence. In this training step, the READ method focuses on reducing the prediction difference of the two sub-models by minimizing the bidirectional KL divergence between the two output distributions. Formally, this process is represented as:
$$\mathcal{L}_{KL} = \frac{1}{2}\left[ D_{KL}\big(P(Y|X) \,\|\, P'(Y|X)\big) + D_{KL}\big(P'(Y|X) \,\|\, P(Y|X)\big) \right]$$

The basic learning objective function of the two forward passes, $\mathcal{L}_{NER}$, can be expressed as:

$$\mathcal{L}_{NER} = -\log P(Y|X) - \log P'(Y|X)$$

The final training goal is to minimize the loss function $\mathcal{L}(X, Y)$ over the data (X, Y):

$$\mathcal{L}(X, Y) = \mathcal{L}_{NER} + \lambda \cdot \mathcal{L}_{KL}$$

where λ is the coefficient weight used to balance the two training losses and $D_{KL}$ is the KL divergence.
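A sketch of this combined loss is given below. It assumes the two sub-model outputs are available as per-token log-probability tensors (how the distributions are obtained from the CRF layer is an implementation detail not fixed by the text), and uses λ = 2.0 as reported in the experiments.

```python
import torch.nn.functional as F

def read_loss(log_p, log_p_adv, nll, nll_adv, lam=2.0):
    """log_p, log_p_adv: (batch, seq, n_tags) log-probabilities of the two
    sub-models; nll, nll_adv: their base losses -log P(Y|X) and -log P'(Y|X)."""
    # L_KL = 1/2 [ D_KL(P || P') + D_KL(P' || P) ]
    kl = 0.5 * (
        F.kl_div(log_p_adv, log_p, log_target=True, reduction="batchmean")   # D_KL(P || P')
        + F.kl_div(log_p, log_p_adv, log_target=True, reduction="batchmean") # D_KL(P' || P)
    )
    # L = L_NER + lambda * L_KL, with L_NER = -log P(Y|X) - log P'(Y|X)
    return nll + nll_adv + lam * kl
```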
When a neural network is trained on a small training set, it generally performs poorly on test data. Dropout is widely used to regularize fully-connected neural network layers because of its simplicity and effectiveness.
In terms of model robustness, the fast gradient sign method (FGSM) is a popular method for generating adversarial samples to make neural network models robust to perturbation. Its basic principle is to add interference during model training to construct adversarial samples, thereby improving the robustness of the model when it encounters adversarial samples.
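A minimal FGSM sketch on word embeddings is shown below; the step size epsilon is a hypothetical value, as the patent does not specify the perturbation magnitude.

```python
import torch

def fgsm_perturb(word_emb, loss, epsilon=1e-3):
    """word_emb must have requires_grad=True before the forward pass that
    produced `loss`; epsilon is a hypothetical step size."""
    grad, = torch.autograd.grad(loss, word_emb, retain_graph=True)
    return word_emb + epsilon * grad.sign()   # E' = E + eps * sign(grad_E L)
```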
The invention designs a new regularization module that combines R-Drop and FGSM to handle the overfitting problem and improve the robustness of the model. On the basis of generating adversarial samples, two dropout passes are performed, and the distance between the two sub-models is shortened by minimizing the KL divergence.
At the same time, much work has introduced external knowledge in order to better identify named entities. The invention dynamically generates word vector sequences with a BERT model pre-trained on a large-scale Chinese medical corpus, and designs a POS fusion layer to incorporate the POS tags of words, which can serve as a supervision signal to resolve the boundary problem of entity annotation. The BiLSTM layer is then used to obtain the positional features of each word. Finally, the CRF is used as the decoder to obtain the final predicted tags.
In order to demonstrate the advantages of the proposed model for CNER in the medical field, comprehensive experiments were conducted on the cMedQANER and cEHRNER datasets.
The cMedQANER and cEHRNER datasets are from Chinese community question answering and annotated Chinese electronic health records, respectively. The cMedQANER dataset contains 2,063 annotated examples covering 11 types of medical named entities: body, population, department, disease, drug, feature, physiology, symptom, test, time, and treatment. The annotated instances are divided into 1,673 training instances, 175 validation instances, and 215 test instances. The cEHRNER dataset contains 999 annotated samples with seven types of medical named entities: disease and diagnosis, operation, anatomical site, drug, symptom, imaging examination, and laboratory test. The annotated samples are divided into 914 training samples, 44 validation samples, and 41 test samples. Table 3 lists the statistics of the cMedQANER and cEHRNER datasets. Tables 4 and 5 list the statistics of the different entity types on the cMedQANER and cEHRNER datasets, respectively. To measure CNER performance, precision (P), recall (R) and F1 score are used, which are widely adopted metrics for sequence tagging tasks. The formulas are as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times P \times R}{P + R}$$

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.
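These metrics can be computed at the entity level; the sketch below assumes the common exact-match convention (a predicted entity counts as a true positive only if its span and type both match a gold entity), which the text does not spell out.

```python
# Entity-level P/R/F1 under an assumed exact span-and-type match convention.
def prf1(pred_entities, gold_entities):
    """Each entity is a hashable (start, end, type) triple."""
    pred, gold = set(pred_entities), set(gold_entities)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```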
TABLE 3 Statistics of the cMedQANER and cEHRNER datasets
(table image not reproduced)
TABLE 4 Statistics of different types of entities on the cMedQANER dataset
(table image not reproduced)
TABLE 5 Statistics of different types of entities on the cEHRNER dataset
(table image not reproduced)
The Baidu LAC tool was used to obtain POS tags for the cMedQANER and cEHRNER datasets. In the experiments, the pre-trained language model MCBERT was selected as the context embedding layer, with 12 layers, 12 self-attention heads, and a hidden size of 768 dimensions. The feed-forward network dimension and the BiLSTM hidden size were set to 1024 and 256, respectively. POS embeddings were randomly initialized from a standard normal distribution, with the embedding size set to 512. All datasets were trained with a batch size of 32 and a maximum sequence length of 256. During training, the AdamW optimizer was used with β₁ = 0.9, β₂ = 0.998, a linear learning-rate decay schedule, and a weight decay of 0.01. The learning rate for MCBERT and the BiLSTM was set to 7e-5, and the learning rate of the CRF was set to 5e-3. The Dropout rate and the regularization loss weight λ were set to 0.2 and 2.0, respectively. The model was trained for 50 epochs, and the best model was used to predict on the test set. The hyper-parameter configuration is shown in Table 6.

TABLE 6 Hyper-parameter configuration of the method of the present invention
(table image not reproduced)
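As a usage illustration, the optimizer configuration above can be sketched as follows; the module names are those of the earlier sketches, and the plain linear decay schedule is an assumption rather than a detail fixed by the patent.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(bert, bilstm, crf_head, total_steps):
    # separate learning rates: 7e-5 for MCBERT/BiLSTM, 5e-3 for the CRF head
    optimizer = AdamW(
        [
            {"params": bert.parameters(), "lr": 7e-5},
            {"params": bilstm.parameters(), "lr": 7e-5},
            {"params": crf_head.parameters(), "lr": 5e-3},
        ],
        betas=(0.9, 0.998),
        weight_decay=0.01,
    )
    # linear learning-rate decay over the course of training
    scheduler = LambdaLR(optimizer, lambda step: max(0.0, 1 - step / total_steps))
    return optimizer, scheduler
```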
The proposed model was compared with several CNER methods: HMM, BiLSTM without a CRF layer, BiLSTM-CRF (neural architectures for named entity recognition), MCBERT, and MCBERT-CRF. In addition, word2vec embeddings were used for the BiLSTM and BiLSTM-CRF models. MCBERT is a pre-trained language model further trained on a Chinese biomedical corpus. Table 7 shows the performance of these models in terms of precision, recall, and F1 score on the cMedQANER and cEHRNER datasets. The traditional statistical learning method HMM achieves the worst results. The additional CRF layer significantly improves model performance, since the CRF can model the dependencies between tag sequences. The results also show that MCBERT captures better context representations than word2vec. Moreover, our method achieves the best results, with the F1 scores on the two datasets improved by 2.08 and 2.35, respectively, which demonstrates its effectiveness. Our method improves more on the smaller dataset cEHRNER than on the larger dataset cMedQANER, which shows that it can improve the robustness of the Chinese medical named entity recognition model.

TABLE 7 Model performance comparison on the cMedQANER and cEHRNER datasets
(table image not reproduced)
In order to evaluate the effect of two important components of the model, namely the POS fusion layer and the regularization method, ablation studies were performed on the cMedQANER and cEHRNER test sets. The results are shown in Table 8, where READ[AP] and READ[DP] denote the adversarial perturbation and the dropout perturbation in READ, respectively. In general, when a module is removed, model performance degrades noticeably on all metrics, which verifies the effectiveness of the proposed method. For example, on the cMedQANER and cEHRNER datasets, removing the READ method causes the largest performance drop, with the absolute F1 score decreasing by 1.10% and 1.50%, the P score by 1.72% and 3.10%, and the R score by 0.47%. This is because READ improves model robustness by regularizing two distributions of the same sample, i.e., the original distribution and the distribution intervened by the adversarial perturbation (AP) and the dropout perturbation (DP). Beyond the overall effectiveness of READ, the individual contributions of AP and DP were also investigated. On the cMedQANER dataset, removing AP and DP decreased the absolute F1 score by 0.61% and 0.72%, respectively; similar results hold on the cEHRNER dataset, where the model exhibits absolute reductions of 0.52% and 0.78%. With respect to P and R, removing AP results in absolute reductions of 0.22%-0.80% and 0.24%-1.00%, while removing DP results in absolute reductions of 1.11%-1.51% and 0.31%. The reason DP contributes more may be that it applies to the entire model, while AP acts only on BERT. Moreover, the combination of adversarial and dropout perturbations is superior to either alone: since each perturbation is relatively simple, a single method causes only a small degree of perturbation, whereas diversified perturbations increase the dissimilarity of the representations of the same sample. As for the POS fusion layer, after removing this module the absolute F1 score decreased by 0.70% and 0.66%, the P score by 0.12% and 1.22%, and the R score by 1.28% and 0.07% on the cMedQANER and cEHRNER datasets, respectively. This shows that adding POS tag features to the deep neural network can significantly improve the performance of the Chinese medical NER model, as POS tag features add extra latent entity boundary information.

TABLE 8 Performance in P, R, and F1 without different modules on the cMedQANER and cEHRNER datasets
(table image not reproduced)
We further performed comparative experiments to analyze the impact of different pre-trained models. BERT-wwm employs a whole-word masking strategy for Chinese text. RoBERTa is a robustly optimized BERT pre-training approach. MacBERT masks words with similar words in Chinese text. MCBERT is a pre-trained language model further trained on a Chinese biomedical corpus. As shown in FIG. 4, where (a) is the cMedQANER set and (b) is the cEHRNER set, MCBERT outperforms the other pre-trained models. Because MCBERT adopts whole-entity and whole-span masking strategies, it injects medical domain knowledge and obtains better context representations of Chinese biomedical text. We therefore selected the MCBERT model as the context embedding layer in the following experiments.
We further performed experiments on the cMedQANER and cEHRNER test sets to explore the relationship between the performance of our model and the size of the POS embedding. In our experiments, the POS embedding size was set to 128, 256, 384, 512, 640, and 768. The results are shown in FIG. 5, where (a) is the cMedQANER set and (b) is the cEHRNER set. The performance of our model initially grows as the POS embedding size increases, since enlarging the neural network increases model capacity and yields more powerful representations. However, as the size increases further, performance deteriorates due to model overfitting.
We further investigated the effect of the regularization loss weight λ by varying λ over {1, 2, 3, 4, 5, 10}. As shown in FIG. 6, where (a) is the cMedQANER set and (b) is the cEHRNER set, both too small and too large values of λ make our model perform poorly. The model achieves the best performance when λ = 2, so we chose 2 as the regularization loss weight in the experiments.
During training, the same Dropout value (0.1) was initially set for both distributions. We therefore further investigated the influence of the Dropout rate by varying it over {0.05, 0.1, 0.2, 0.3, 0.4, 0.5}. As shown in FIG. 7, where (a) is the cMedQANER set and (b) is the cEHRNER set, setting the same Dropout rate of 0.3 for both distributions is the best choice.
While embodiments in accordance with the invention have been described above, these embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments described. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. The invention is limited only by the claims and their full scope and equivalents.

Claims (7)

1. A method for identifying Chinese named entities in the medical field, characterized in that the method uses a BBCPR model to identify Chinese named entities in the medical field, the BBCPR model consisting of a word embedding layer, a BERT embedding layer, a POS fusion layer, a BiLSTM layer and a CRF layer, wherein:
the word embedding layer converts a given sentence into a word embedding E and inputs it into the BERT embedding layer;
the BERT embedding layer adopts an MCBERT encoder to obtain the BERT output embedding, which serves as the input to the POS fusion layer;
the POS fusion layer connects the BERT output embedding with the POS embedding to obtain the fused embedding, the POS embedding being obtained from the POS tags of the given sentence through the POS embedding layer;
the BiLSTM layer encodes the fused embedding to obtain the final implicit representation of the input sequence;
the CRF layer decodes the output of the BiLSTM layer to obtain a tag sequence and outputs it;
the BBCPR model uses a regularization module with adversarial training and dropout, which generates the adversarial word embedding E' by applying the adversarial perturbation generated by FGSM to the word embedding E; the word embedding E and the adversarial word embedding E' are passed through a Dropout mechanism to form two different sub-models that output two different model prediction distributions, and the bidirectional KL divergence between the two prediction distributions is then minimized to reduce the prediction difference of the two sub-models.
2. The identification method according to claim 1, wherein minimizing the bidirectional KL divergence between the two prediction distributions reduces the prediction difference of the two sub-models, specifically:
the training objective is to minimize the loss function $\mathcal{L}(X, Y)$ over the data (X, Y):

$$\mathcal{L}_{KL} = \frac{1}{2}\left[ D_{KL}\big(P(Y|X) \,\|\, P'(Y|X)\big) + D_{KL}\big(P'(Y|X) \,\|\, P(Y|X)\big) \right]$$

$$\mathcal{L}_{NER} = -\log P(Y|X) - \log P'(Y|X)$$

$$\mathcal{L}(X, Y) = \mathcal{L}_{NER} + \lambda \cdot \mathcal{L}_{KL}$$

where λ is the coefficient weight, P(Y|X) is the prediction distribution from the word embedding E, P'(Y|X) is the prediction distribution from the adversarial word embedding E', $D_{KL}$ is the KL divergence, X is the input sentence, and Y is the tag sequence output by the BBCPR model.
3. The identification method according to claim 1, wherein the POS fusion layer connects the BERT output embedding and the POS embedding, specifically:

$$v_i = [h_i; p_i]$$

where $v_i$ is the concatenation of the BERT output embedding and the POS embedding, $h_i$ is the BERT output embedding, and $p_i$ is the POS embedding of the i-th token in the sentence.
4. The identification method according to claim 3, wherein the given sentence is passed through the LAC tool to obtain POS tags, and the POS tags are fed into the POS embedding layer to obtain the POS embedding.
5. The identification method according to any one of claims 1 to 4, wherein each token in the sentence is predicted using a BIO tag.
6. The identification method according to any one of claims 1 to 4, wherein the MCBERT comprises a stack of L identical layers, each layer comprising two sublayers, wherein the first sublayer is a multi-head self-attention mechanism and the second sublayer is a fully-connected feed-forward neural network, the two sublayers being connected in sequence by residual connection and layer normalization.
7. The identification method according to any one of claims 1 to 4, wherein the computation of the BiLSTM is:

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(v_i, \overrightarrow{h}_{i-1})$$
$$\overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(v_i, \overleftarrow{h}_{i+1})$$
$$h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$$

where $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$ denote the forward and backward LSTM, respectively.
CN202210268640.5A 2022-03-18 2022-03-18 Method for identifying Chinese named entities in medical field Pending CN114638214A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210268640.5A CN114638214A (en) 2022-03-18 2022-03-18 Method for identifying Chinese named entities in medical field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210268640.5A CN114638214A (en) 2022-03-18 2022-03-18 Method for identifying Chinese named entities in medical field

Publications (1)

Publication Number Publication Date
CN114638214A true CN114638214A (en) 2022-06-17

Family

ID=81949722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210268640.5A Pending CN114638214A (en) 2022-03-18 2022-03-18 Method for identifying Chinese named entities in medical field

Country Status (1)

Country Link
CN (1) CN114638214A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115630649A (en) * 2022-11-23 2023-01-20 南京邮电大学 Medical Chinese named entity recognition method based on generative model
CN116341556A (en) * 2023-05-29 2023-06-27 浙江工业大学 Small sample rehabilitation medical named entity identification method and device based on data enhancement


Similar Documents

Publication Publication Date Title
US6601049B1 (en) Self-adjusting multi-layer neural network architectures and methods therefor
CN111881260A (en) Neural network emotion analysis method and device based on aspect attention and convolutional memory
CN114638214A (en) Method for identifying Chinese named entities in medical field
CN111414481A (en) Chinese semantic matching method based on pinyin and BERT embedding
CN112784532B (en) Multi-head attention memory system for short text sentiment classification
Wan et al. A self-attention based neural architecture for Chinese medical named entity recognition
CN113657123A (en) Mongolian aspect level emotion analysis method based on target template guidance and relation head coding
CN112818118A (en) Reverse translation-based Chinese humor classification model
Liu et al. Deep neural network-based recognition of entities in Chinese online medical inquiry texts
Li et al. Biomedical named entity recognition based on the two channels and sentence-level reading control conditioned LSTM-CRF
Tan et al. Chinese medical named entity recognition based on Chinese character radical features and pre-trained language models
He et al. Neural unsupervised reconstruction of protolanguage word forms
Zhang et al. Medical assertion classification in Chinese EMRs using attention enhanced neural network
CN115964475A (en) Dialogue abstract generation method for medical inquiry
Xu et al. A Data‐Driven Model for Automated Chinese Word Segmentation and POS Tagging
CN114492444A (en) Chinese electronic medical case medical entity part-of-speech tagging method
Hu et al. Contextual-aware information extractor with adaptive objective for chinese medical dialogues
Cui et al. Learning effective word embedding using morphological word similarity
Shivakumar et al. Behavior gated language models
Qiu et al. Question answering based clinical text structuring using pre-trained language model
Yelisetti et al. Aspect-based text classification for sentimental analysis using attention mechanism with RU-BiLSTM
Jiang et al. APIE: An information extraction module designed based on the pipeline method
Worke INFORMATION EXTRACTION MODEL FROM GE’EZ TEXTS
Noriega-Atala et al. Neural-Guided Program Synthesis of Information Extraction Rules Using Self-Supervision
Wang et al. Clinical named entity recognition for percutaneous coronary intervention surgical information with hybrid neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination