CN114638214A - Method for identifying Chinese named entities in medical field - Google Patents

Method for identifying Chinese named entities in medical field

Info

Publication number
CN114638214A
CN114638214A
Authority
CN
China
Prior art keywords
embedding
layer
pos
bert
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210268640.5A
Other languages
Chinese (zh)
Inventor
陈洪辉
江苗
王梦如
蔡飞
舒振
宋城宇
张鑫
陈翀昊
邵太华
郑建明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202210268640.5A priority Critical patent/CN114638214A/en
Publication of CN114638214A publication Critical patent/CN114638214A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for identifying Chinese named entities in the medical field, which uses a BBCPR model. The BBCPR model consists of a word embedding layer, a BERT embedding layer, a POS fusion layer, a BiLSTM layer and a CRF layer. The word embedding layer converts a given sentence into a word embedding and inputs it into the BERT embedding layer; the BERT embedding layer adopts an MCBERT encoder to obtain the BERT output embedding; the POS fusion layer connects the BERT output embedding and the POS embedding to obtain the fused embedding; the BiLSTM layer encodes the fused embedding to obtain the final implicit representation of the input sequence; and the CRF layer decodes the output of the BiLSTM layer to obtain a tag sequence and outputs it. The method can explicitly learn word boundary information, alleviate the overfitting problem, and enhance the robustness of the model on small data.

Description

Method for identifying Chinese named entities in medical field
Technical Field
The invention belongs to the technical field of artificial intelligence and digital healthcare, and particularly relates to a Chinese named entity recognition method in the medical field.
Background
Named Entity Recognition (NER) is a core task of Natural Language Processing (NLP) that aims to identify potential entities and their classes in unstructured text. As an important component of many NLP downstream tasks (such as relation extraction and information retrieval), NER has long been a hot problem in the NLP community and attracts wide attention. In general, previous work has mostly targeted English NER tasks and has achieved good performance by integrating character-level features.
East Asian languages (such as Chinese) typically lack explicit word boundaries and have complex word formation, which makes NER more challenging than in English. For example, the performance of the best-performing models (SOTAs) on the Chinese NER task is far lower than that of the SOTAs on the English NER task, with a gap of nearly 10% on the F1 evaluation metric. Furthermore, recent research has focused more on domain-specific NER, such as medicine, where the text is complex and requires external domain expertise.
Chinese Named Entity Recognition (CNER) in the medical field is treated as a character-level sequence labeling problem, whereas English NER is word-level. Recently, deep learning methods have been widely used for CNER tasks because of their excellent ability to automatically extract features from massive data. For example, previous work typically utilized a bidirectional long short-term memory (BiLSTM) network to capture sequence information and achieved comparable results. Furthermore, because of the superior ability of language models to extract context information, Transformer-based models (such as BERT) are becoming a new paradigm for CNER.
Especially in the medical field, external expertise helps the model understand technical terms and the boundaries of recognized words, which has prompted recent research to add dictionary knowledge on top of the traditional BiLSTM-CRF or BERT architecture. However, constructing a high-quality dictionary typically requires a lot of time and expertise, which is very expensive and labor intensive. Furthermore, these dictionary-based approaches may reduce the versatility and robustness of the NER model. At the same time, annotated Chinese medical NER data are difficult to obtain and are generally small in scale due to privacy, ethical, and high-specialization restrictions; this small scale easily causes model overfitting.
Disclosure of Invention
In order to solve the above problems, the invention provides a method that improves Chinese named entity recognition in the medical field using part-of-speech information and a new regularization method. The method uses the BBCPR (BERT-BiLSTM-CRF with POS and Regularization) model to identify named entities in the medical field; the BBCPR model utilizes a POS (part-of-speech) fusion layer to incorporate external syntactic knowledge, and introduces a novel READ (REgularization with Adversarial training and Dropout) method to improve the robustness of the model. The BBCPR model consists of a word embedding layer, a BERT embedding layer, a POS fusion layer, a BiLSTM layer and a CRF layer.
The word embedding layer converts a given sentence into a word embedding E and inputs it into the BERT embedding layer.
The BERT embedding layer adopts an MCBERT encoder to obtain the BERT output embedding, which serves as the input to the POS fusion layer.
The POS fusion layer connects the BERT output embedding with the POS embedding to obtain the fused embedding, the POS embedding being obtained from the POS tags of the given sentence through the POS embedding layer.
The BiLSTM layer encodes the fused embedding to obtain the final implicit representation of the input sequence.
The CRF layer decodes the output of the BiLSTM layer to obtain a tag sequence and outputs it.
The BBCPR model uses a regularization module with adversarial training and dropout. The module generates the adversarial word embedding E' by applying the adversarial perturbation generated by FGSM to the word embedding E; the word embedding E and the adversarial word embedding E' are passed through a Dropout mechanism to form two different sub-models that output two different model prediction distributions; the bidirectional KL divergence between the two prediction distributions is then minimized to reduce the prediction difference of the two sub-models.
Minimizing the bidirectional KL divergence between the two prediction distributions reduces the prediction difference of the two sub-models; the training objective is to minimize the loss function $\mathcal{L}(X, Y)$ over the data (X, Y):

$$\mathcal{L}_{KL} = \frac{1}{2}\left[ D_{KL}\big(P(Y|X) \,\|\, P'(Y|X)\big) + D_{KL}\big(P'(Y|X) \,\|\, P(Y|X)\big) \right]$$

$$\mathcal{L}_{NER} = -\log P(Y|X) - \log P'(Y|X)$$

$$\mathcal{L}(X, Y) = \mathcal{L}_{NER} + \lambda \cdot \mathcal{L}_{KL}$$

where λ is the coefficient weight, P(Y|X) is the prediction distribution from the word embedding E, P'(Y|X) is the prediction distribution from the adversarial word embedding E', $D_{KL}$ is the KL divergence, X is the input sentence, and Y is the tag sequence output by the BBCPR model.
The POS fusion layer connects the BERT output embedding and the POS embedding, specifically:

$$v_i = [h_i; p_i]$$

where $v_i$ is the concatenation of the BERT output embedding and the POS embedding, $h_i$ is the BERT output embedding, and $p_i$ is the POS embedding of the i-th token in the sentence.
Further, the given sentence is passed through the LAC (Lexical Analysis of Chinese) tool to obtain POS tags, and the POS tags are fed into the POS embedding layer to obtain the POS embedding.
Further, BIO tags (Begin, Inside, Outside) are used to predict each token in the sentence.
Further, the MCBERT comprises a stack of L identical layers, each layer comprising two sublayers, wherein the first sublayer is a multi-headed self-attention mechanism, the second sublayer is a fully-connected feedforward neural network, and the two sublayers are connected in sequence by residual connection and layer normalization.
Further, the computation of the BiLSTM is:

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(v_i, \overrightarrow{h}_{i-1})$$
$$\overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(v_i, \overleftarrow{h}_{i+1})$$
$$h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$$

where $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$ denote the forward and backward LSTM, respectively.
The invention designs a POS fusion module based on MCBERT combined with BiLSTM-CRF, which brings POS information into the model and can guide the model to explicitly learn word boundary information for the CNER task in the medical field. Meanwhile, a new regularization method is proposed to regularize the model output on adversarial and dropout-perturbed samples, so as to alleviate the overfitting problem and enhance the robustness of the model on small data. The invention conducts comprehensive experiments on the cMedQANER and cEHRNER datasets. The results show that the BBCPR model outperforms several competitive baselines, and ablation studies demonstrate the effectiveness of the proposed method.
Drawings
In order to more clearly illustrate the embodiments or technical solutions of the present invention, the drawings used in the embodiments or technical solutions of the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the structures shown in the drawings without creative efforts.
Fig. 1 is an overall architecture diagram of a BBCPR model.
FIG. 2 is a block diagram of a READ regularization method.
FIG. 3 is a schematic diagram of word embedding.
FIG. 4 is a graph of the effect of BBCPR model on the cMedQANER and cEHRNER test sets.
FIG. 5 is a graph of the effect of POS embedding size on the cMedQANER and cEHRNER test set.
FIG. 6 is a graph of the effect of the regularization loss function weight λ on the cMedQANER and cEHRNER test sets.
FIG. 7 is a graph of the effect of the Dropout rate on the cMedQANER and cEHRNER test sets.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
As shown in FIG. 1, the BBCPR model consists of a word embedding layer, a BERT embedding layer, a POS fusion layer, a BiLSTM layer and a CRF layer.
The word embedding layer converts a given sentence into a word embedding E and inputs it into the BERT embedding layer.
The BERT embedding layer adopts an MCBERT encoder to obtain the BERT output embedding, which serves as the input to the POS fusion layer.
The POS fusion layer connects the BERT output embedding with the POS embedding to obtain the fused embedding, the POS embedding being obtained from the POS tags of the given sentence through the POS embedding layer.
The BiLSTM layer encodes the fused embedding to obtain the final implicit representation of the input sequence.
The CRF layer decodes the output of the BiLSTM layer to obtain a tag sequence and outputs it.
The CNER task in the medical field aims to identify and predict entities in text (e.g., diseases, symptoms, drugs). The present invention treats the CNER task in the medical field as a sequence tagging problem. Given a sentence $X = \{x_1, x_2, \ldots, x_n\}$, the goal is to predict a BIO tag (Begin, Inside, Outside) for each token $x_i$ in sentence X and obtain the tag sequence $Y = \{y_1, y_2, \ldots, y_n\}$ as output. Examples of tagged entities in sentences are shown in Table 1, where B-s denotes the beginning of a symptom entity, I-s the inside of a symptom entity, B-d the beginning of a disease entity, I-d the inside of a disease entity, B-p the beginning of a person entity, I-p the inside of a person entity, and O a token outside any entity.
Table 1 entity labels in sentences
(table image not reproduced)
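For illustration, a minimal sketch of how entity spans map to character-level BIO tags is given below; the sentence and spans are invented examples, not taken from Table 1, and the short type suffix "s" (symptom) follows the convention described above.

```python
# Hypothetical example: convert labeled entity spans to character-level BIO tags.
def spans_to_bio(tokens, spans):
    """spans: list of (start, end, type) tuples with end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

tokens = list("头痛伴发热")           # character-level tokens, as in Chinese CNER
spans = [(0, 2, "s"), (3, 5, "s")]   # "头痛" and "发热" tagged as symptoms
print(list(zip(tokens, spans_to_bio(tokens, spans))))
# [('头', 'B-s'), ('痛', 'I-s'), ('伴', 'O'), ('发', 'B-s'), ('热', 'I-s')]
```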
For clarity, the main symbols used are summarized in table 2.
Table 2 Main notation used herein
(table image not reproduced)
BERT is a pre-trained model composed of multi-layer bidirectional Transformer encoders, which obtains deep language representations by jointly conditioning on context across all layers. MCBERT has the same structure as BERT and is further trained on a Chinese medical corpus. It comprises a stack of N identical layers. Each layer comprises two sublayers, where the first sublayer is a multi-head self-attention mechanism and the second sublayer is a fully-connected feed-forward neural network. The two sublayers are connected in sequence by residual connection and layer normalization.
The word embedding layer first converts a given sentence X into a sequence of word embeddings $E = \{e_1, e_2, \ldots, e_n\}$, which is simply the sum of the token embedding, segment embedding, and position embedding, as shown in FIG. 3.
The BERT embedding layer adopts the MCBERT encoder, whose input is E. For convenience, the input of the first layer is denoted $H_0$ and the output of the $l$-th layer $H_l$. The output representation of the previous layer, $H_{l-1}$, is fed into the multi-head self-attention (MSA) sublayer to obtain a context-level representation, which is then passed through the feed-forward network (FFN) sublayer. These operations are expressed as:

$$H'_l = \mathrm{LayerNorm}\big(\mathrm{MSA}(H_{l-1}) + H_{l-1}\big)$$
$$H_l = \mathrm{LayerNorm}\big(\mathrm{FFN}(H'_l) + H'_l\big)$$
The final BERT output embedding $\{h_1, h_2, \ldots, h_n\}$ is sent to the POS fusion layer.
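The following PyTorch sketch illustrates one such encoder layer under the post-norm arrangement described above (residual connection followed by layer normalization around the MSA and FFN sublayers). The hidden size of 768 and FFN dimension of 1024 follow the experimental configuration reported later; the GELU activation is an assumption, since the text does not name the activation function.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ffn=1024):
        super().__init__()
        self.msa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn),
            nn.GELU(),                 # activation assumed; the text does not name it
            nn.Linear(d_ffn, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, h):
        # H'_l = LayerNorm(MSA(H_{l-1}) + H_{l-1})
        attn_out, _ = self.msa(h, h, h)
        h = self.norm1(attn_out + h)
        # H_l = LayerNorm(FFN(H'_l) + H'_l)
        return self.norm2(self.ffn(h) + h)
```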
The POS fusion layer incorporates part-of-speech features into the model: it connects the BERT output embedding with the POS embedding, which is derived from the POS tags of the given sentence through the POS embedding layer, to obtain the fused embedding. BERT treats text as a sequence of tokens and generates token-level embeddings; in Chinese, however, words are usually the smallest unit for expressing semantics. While a token-based model avoids segmentation errors, it loses part of the semantics and also increases the difficulty of entity boundary extraction. POS refers to the word features that form the basis of word classification, including nouns, verbs, adjectives, modal particles, and so on. POS can greatly improve NER annotation.
For a given sentence $X = \{x_1, x_2, \ldots, x_n\}$, the POS tags $T = \{t_1, t_2, \ldots, t_n\}$ are obtained with the Baidu LAC tool. Then T is fed into the POS embedding layer to obtain the POS embedding $P = \{p_1, p_2, \ldots, p_n\}$, where $p_i \in \mathbb{R}^{d_p}$ and $d_p$ denotes the size of the POS embedding. The POS fusion layer connects the BERT output embedding $h_i$ and the POS embedding $p_i$:

$$v_i = [h_i; p_i], \quad v_i \in \mathbb{R}^{768 + d_p}$$
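A minimal sketch of the POS fusion step follows. The LAC calls shown in the comments reflect my understanding of the open-source Baidu LAC toolkit and should be verified against its documentation; note also that LAC tags whole words, so the word-level tags must be expanded to character level before fusion.

```python
import torch
import torch.nn as nn

# from LAC import LAC                          # Baidu LAC toolkit (API assumed)
# lac = LAC(mode='lac')
# words, word_tags = lac.run("患者头痛伴发热")    # word-level POS tags, e.g. ['n', ...]
# (repeat each word's tag per character to align with BERT's character tokens)

class POSFusion(nn.Module):
    def __init__(self, n_pos_tags, d_pos=512):
        super().__init__()
        # nn.Embedding initializes from a standard normal distribution by default,
        # matching the random initialization described in the experiments
        self.pos_emb = nn.Embedding(n_pos_tags, d_pos)

    def forward(self, bert_out, pos_ids):
        # bert_out: (batch, seq, 768); pos_ids: (batch, seq) POS tag indices
        p = self.pos_emb(pos_ids)
        return torch.cat([bert_out, p], dim=-1)   # v_i = [h_i ; p_i]
```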
The BiLSTM layer can capture more comprehensive context information; this application introduces a BiLSTM, consisting of a forward LSTM and a backward LSTM, to encode the fused embedding. LSTM captures long-distance dependencies more accurately and avoids the gradient vanishing or explosion problems of the standard RNN. Thus, BiLSTM can better capture long-range and bidirectional semantic dependencies by learning context information in both directions. An LSTM unit at time t is defined as follows:
$$i_t = \sigma(W_i[h_{t-1}, v_t] + b_i)$$
$$f_t = \sigma(W_f[h_{t-1}, v_t] + b_f)$$
$$\tilde{C}_t = \tanh(W_c[h_{t-1}, v_t] + b_c)$$
$$o_t = \sigma(W_o[h_{t-1}, v_t] + b_o)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
$$h_t = o_t \odot \tanh(C_t)$$

where $v_t$ and $h_t$ denote the input vector and the hidden state at time t; $W_i$, $W_f$, $W_c$, $W_o$ and $b_i$, $b_f$, $b_c$, $b_o$ are learnable parameters of the LSTM; $\sigma$ is the sigmoid function; and $\odot$ denotes the element-wise product. $i_t$ denotes the input gate, which determines what information is stored at time t; $f_t$ denotes the forget gate, which determines what information from the previous time step is discarded; $o_t$ denotes the output gate, which determines what information is output. $\tilde{C}_t$ and $C_t$ denote the candidate cell state and the cell state at time t, respectively. The computation of the BiLSTM is represented as follows:
$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(v_i, \overrightarrow{h}_{i-1})$$
$$\overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(v_i, \overleftarrow{h}_{i+1})$$
$$h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$$

where $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$ denote the forward and backward LSTM, respectively. The final implicit representation of the input sequence is $H = \{h_1, h_2, \ldots, h_n\} \in \mathbb{R}^{n \times 2d_{LSTM}}$, where $d_{LSTM}$ denotes the hidden size of the LSTM.
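A sketch of the BiLSTM encoder over the fused embeddings, assuming the dimensions from the experimental setup (768-dimensional BERT output, 512-dimensional POS embedding, hidden size 256 per direction):

```python
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, d_in=768 + 512, d_lstm=256):
        super().__init__()
        self.lstm = nn.LSTM(d_in, d_lstm, batch_first=True, bidirectional=True)

    def forward(self, v):
        # v: (batch, seq, d_in) fused embeddings -> h: (batch, seq, 2 * d_lstm),
        # the concatenation of forward and backward hidden states at each step
        h, _ = self.lstm(v)
        return h
```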
In the named entity recognition task, there are order relationships and constraint rules between adjacent tags; for example, an I-symptom tag must appear after a B-symptom tag. BiLSTM focuses on long-range context information rather than on dependencies between tags, whereas CRF can model the sequential relationships between tags by learning neighboring relationships. The present invention utilizes a standard CRF layer to decode the final sequence tags.
The output H of the BiLSTM is first converted into the input matrix P of the CRF by a linear function:

$$P = W_p H + b_p$$

where $P \in \mathbb{R}^{n \times k}$, $W_p$ and $b_p$ are learnable parameters, and k is the number of tag types. The purpose of the CRF is to compute the conditional probability distribution P(Y|X) of the random variable sequence $Y = \{y_1, y_2, \ldots, y_n\}$. For a given sentence $X = \{x_1, x_2, \ldots, x_n\}$, the score of a candidate tag sequence $\tilde{Y}$ and the probability of the final optimal tag sequence are calculated as:

$$s(X, \tilde{Y}) = \sum_{i} \left( A_{\tilde{y}_i, \tilde{y}_{i+1}} + P_{i, \tilde{y}_i} \right)$$

$$P(Y|X) = \frac{\exp\big(s(X, Y)\big)}{\sum_{\tilde{Y} \in \mathcal{Y}_X} \exp\big(s(X, \tilde{Y})\big)}$$

where Y denotes the ground-truth tag sequence, $\mathcal{Y}_X$ denotes the set of all possible tag sequences, $A_{y_i, y_{i+1}}$ denotes the transition score from tag $y_i$ to tag $y_{i+1}$ (the transition matrix A is a learnable model parameter), and $P_{i, y_i}$ denotes the unnormalized score of mapping the i-th token to the named entity tag $y_i$. During model training, the loss function is defined as:

$$\mathcal{L}_{CRF} = -\log P(Y|X)$$

After decoding, the output tag sequence $Y^*$ with the highest score is computed as:

$$Y^* = \arg\max_{\tilde{Y} \in \mathcal{Y}_X} s(X, \tilde{Y})$$
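The decoding step can be sketched as follows with the third-party pytorch-crf package (an assumption; the patent does not name an implementation): a linear layer produces the emission matrix P, and the CRF layer holds the transition matrix A, computes the negative log-likelihood loss, and performs Viterbi decoding of Y*.

```python
import torch.nn as nn
from torchcrf import CRF   # third-party "pytorch-crf" package (assumed available)

class CRFHead(nn.Module):
    def __init__(self, d_lstm=256, n_tags=23):   # n_tags depends on the tag set
        super().__init__()
        self.emit = nn.Linear(2 * d_lstm, n_tags)  # P = W_p H + b_p
        self.crf = CRF(n_tags, batch_first=True)   # holds transition matrix A

    def loss(self, h, tags, mask):
        # negative log-likelihood: -log P(Y|X)
        return -self.crf(self.emit(h), tags, mask=mask)

    def decode(self, h, mask):
        # Viterbi decoding of the highest-scoring tag sequence Y*
        return self.crf.decode(self.emit(h), mask=mask)
```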
Annotated Chinese medical NER data are difficult to obtain and are generally small in scale due to privacy, ethical, and high-specialization restrictions. To this end, the BBCPR model introduces a new REgularization with Adversarial training and Dropout (READ) method to improve the robustness of the model. The READ method uses a regularization module with adversarial training and dropout: the module generates the adversarial word embedding E' by applying the adversarial perturbation generated by FGSM to the word embedding E; the word embedding E and the adversarial word embedding E' pass through a Dropout mechanism to form two different sub-models that output two different model prediction distributions; the bidirectional KL divergence between the two prediction distributions is then minimized to reduce the prediction difference of the two sub-models. The structure of READ is shown in FIG. 2.
For a given input sentence $X = \{x_1, x_2, \ldots, x_n\}$ and output sequence $Y = \{y_1, y_2, \ldots, y_n\}$, FGSM is first used to generate an adversarial perturbation that is added to the original word embedding E to obtain the adversarial word embedding E', and the dropout mechanism is used to obtain two different sub-models. Then, the original word embedding E and the adversarial word embedding E' are fed into the two sub-models respectively to output two different model prediction distributions, denoted P(Y|X) and P'(Y|X).
Due to the adversarial perturbation and dropout noise, these representations may drift far from the representation of the original sentence. In this training step, the READ method focuses on reducing the prediction difference of the two sub-models by minimizing the bidirectional KL divergence between the two output distributions. Formally, this process is represented as:
$$\mathcal{L}_{KL} = \frac{1}{2}\left[ D_{KL}\big(P(Y|X) \,\|\, P'(Y|X)\big) + D_{KL}\big(P'(Y|X) \,\|\, P(Y|X)\big) \right]$$

The basic learning objective function of the two forward passes, $\mathcal{L}_{NER}$, can be expressed as:

$$\mathcal{L}_{NER} = -\log P(Y|X) - \log P'(Y|X)$$

The final training goal is to minimize the loss function $\mathcal{L}(X, Y)$ over the data (X, Y):

$$\mathcal{L}(X, Y) = \mathcal{L}_{NER} + \lambda \cdot \mathcal{L}_{KL}$$

where λ is the coefficient weight used to balance the two training losses and $D_{KL}$ is the KL divergence.
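A sketch of this combined loss is given below. It assumes the two sub-model outputs are available as per-token log-probability tensors (how the distributions are obtained from the CRF layer is an implementation detail not fixed by the text), and uses λ = 2.0 as reported in the experiments.

```python
import torch.nn.functional as F

def read_loss(log_p, log_p_adv, nll, nll_adv, lam=2.0):
    """log_p, log_p_adv: (batch, seq, n_tags) log-probabilities of the two
    sub-models; nll, nll_adv: their base losses -log P(Y|X) and -log P'(Y|X)."""
    # L_KL = 1/2 [ D_KL(P || P') + D_KL(P' || P) ]
    kl = 0.5 * (
        F.kl_div(log_p_adv, log_p, log_target=True, reduction="batchmean")   # D_KL(P || P')
        + F.kl_div(log_p, log_p_adv, log_target=True, reduction="batchmean") # D_KL(P' || P)
    )
    # L = L_NER + lambda * L_KL, with L_NER = -log P(Y|X) - log P'(Y|X)
    return nll + nll_adv + lam * kl
```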
When a neural network is trained on a small training set, it generally performs poorly on test data. Dropout is widely used to regularize fully-connected neural network layers because of its simplicity and effectiveness.
In terms of model robustness, the fast gradient sign method (FGSM) is a popular method for generating adversarial samples to make neural network models robust to perturbation. Its basic principle is to add interference during model training to construct adversarial samples, thereby improving the robustness of the model when it encounters adversarial samples.
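A minimal FGSM sketch on word embeddings is shown below; the step size epsilon is a hypothetical value, as the patent does not specify the perturbation magnitude.

```python
import torch

def fgsm_perturb(word_emb, loss, epsilon=1e-3):
    """word_emb must have requires_grad=True before the forward pass that
    produced `loss`; epsilon is a hypothetical step size."""
    grad, = torch.autograd.grad(loss, word_emb, retain_graph=True)
    return word_emb + epsilon * grad.sign()   # E' = E + eps * sign(grad_E L)
```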
The invention designs a new regularization module that combines R-Drop and FGSM to handle the overfitting problem and improve the robustness of the model. On the basis of generating adversarial samples, two dropout passes are performed, and the distance between the two sub-models is shortened by minimizing the KL divergence.
At the same time, much work has introduced external knowledge in order to better identify named entities. The invention dynamically generates word vector sequences with a BERT model pre-trained on a large-scale Chinese medical corpus, and designs a POS fusion layer to incorporate the POS tags of words, which can serve as a supervision signal to resolve the boundary problem of entity annotation. The BiLSTM layer is then used to obtain the positional features of each word. Finally, the CRF is used as the decoder to obtain the final predicted tags.
In order to demonstrate the advantages of the proposed model for CNER in the medical field, comprehensive experiments were conducted on the cMedQANER and cEHRNER datasets.
The cMedQANER and cEHRNER datasets are from Chinese community question answering and annotated Chinese electronic health records, respectively. The cMedQANER dataset contains 2,063 annotated examples covering 11 types of medical named entities: body, population, department, disease, drug, feature, physiology, symptom, test, time, and treatment. The annotated instances are divided into 1,673 training instances, 175 validation instances, and 215 test instances. The cEHRNER dataset contains 999 annotated samples with seven types of medical named entities: disease and diagnosis, operation, anatomical site, drug, symptom, imaging examination, and laboratory test. The annotated samples are divided into 914 training samples, 44 validation samples, and 41 test samples. Table 3 lists the statistics of the cMedQANER and cEHRNER datasets. Tables 4 and 5 list the statistics of the different entity types on the cMedQANER and cEHRNER datasets, respectively. To measure CNER performance, precision (P), recall (R) and F1 score are used, which are widely adopted metrics for sequence tagging tasks. The formulas are as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times P \times R}{P + R}$$

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.
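These metrics can be computed at the entity level; the sketch below assumes the common exact-match convention (a predicted entity counts as a true positive only if its span and type both match a gold entity), which the text does not spell out.

```python
# Entity-level P/R/F1 under an assumed exact span-and-type match convention.
def prf1(pred_entities, gold_entities):
    """Each entity is a hashable (start, end, type) triple."""
    pred, gold = set(pred_entities), set(gold_entities)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```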
TABLE 3 Statistics of the cMedQANER and cEHRNER datasets
(table image not reproduced)
TABLE 4 Statistics of different types of entities on the cMedQANER dataset
(table image not reproduced)
TABLE 5 Statistics of different types of entities on the cEHRNER dataset
(table image not reproduced)
The Baidu LAC tool was used to obtain POS tags for the cMedQANER and cEHRNER datasets. In the experiments, the pre-trained language model MCBERT was selected as the context embedding layer, with 12 layers, 12 self-attention heads, and a hidden size of 768 dimensions. The feed-forward network dimension and the BiLSTM hidden size were set to 1024 and 256, respectively. POS embeddings were randomly initialized from a standard normal distribution, with the embedding size set to 512. All datasets were trained with a batch size of 32 and a maximum sequence length of 256. During training, the AdamW optimizer was used with β₁ = 0.9, β₂ = 0.998, a linear learning-rate decay schedule, and a weight decay of 0.01. The learning rate for MCBERT and the BiLSTM was set to 7e-5, and the learning rate of the CRF was set to 5e-3. The Dropout rate and the regularization loss weight λ were set to 0.2 and 2.0, respectively. The model was trained for 50 epochs, and the best model was used to predict on the test set. The hyper-parameter configuration is shown in Table 6.

TABLE 6 Hyper-parameter configuration of the method of the present invention
(table image not reproduced)
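As a usage illustration, the optimizer configuration above can be sketched as follows; the module names are those of the earlier sketches, and the plain linear decay schedule is an assumption rather than a detail fixed by the patent.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(bert, bilstm, crf_head, total_steps):
    # separate learning rates: 7e-5 for MCBERT/BiLSTM, 5e-3 for the CRF head
    optimizer = AdamW(
        [
            {"params": bert.parameters(), "lr": 7e-5},
            {"params": bilstm.parameters(), "lr": 7e-5},
            {"params": crf_head.parameters(), "lr": 5e-3},
        ],
        betas=(0.9, 0.998),
        weight_decay=0.01,
    )
    # linear learning-rate decay over the course of training
    scheduler = LambdaLR(optimizer, lambda step: max(0.0, 1 - step / total_steps))
    return optimizer, scheduler
```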
The proposed model was compared with several CNER methods: HMM, BiLSTM without a CRF layer, BiLSTM-CRF (neural architectures for named entity recognition), MCBERT, and MCBERT-CRF. In addition, word2vec embeddings were used for the BiLSTM and BiLSTM-CRF models. MCBERT is a pre-trained language model further trained on a Chinese biomedical corpus. Table 7 shows the performance of these models in terms of precision, recall, and F1 score on the cMedQANER and cEHRNER datasets. The traditional statistical learning method HMM achieves the worst results. The additional CRF layer significantly improves model performance, since the CRF can model the dependencies between tag sequences. The results also show that MCBERT captures better context representations than word2vec. Moreover, our method achieves the best results, with the F1 scores on the two datasets improved by 2.08 and 2.35, respectively, which demonstrates its effectiveness. Our method improves more on the smaller dataset cEHRNER than on the larger dataset cMedQANER, which shows that it can improve the robustness of the Chinese medical named entity recognition model.

TABLE 7 Model performance comparison on the cMedQANER and cEHRNER datasets
(table image not reproduced)
In order to evaluate the effect of two important components of the model, namely the POS fusion layer and the regularization method, ablation studies were performed on the cMedQANER and cEHRNER test sets. The results are shown in Table 8, where READ[AP] and READ[DP] denote the adversarial perturbation and the dropout perturbation in READ, respectively. In general, when a module is removed, model performance degrades noticeably on all metrics, which verifies the effectiveness of the proposed method. For example, on the cMedQANER and cEHRNER datasets, removing the READ method causes the largest performance drop, with the absolute F1 score decreasing by 1.10% and 1.50%, the P score by 1.72% and 3.10%, and the R score by 0.47%. This is because READ improves model robustness by regularizing two distributions of the same sample, i.e., the original distribution and the distribution intervened by the adversarial perturbation (AP) and the dropout perturbation (DP). Beyond the overall effectiveness of READ, the individual contributions of AP and DP were also investigated. On the cMedQANER dataset, removing AP and DP decreased the absolute F1 score by 0.61% and 0.72%, respectively; similar results hold on the cEHRNER dataset, where the model exhibits absolute reductions of 0.52% and 0.78%. With respect to P and R, removing AP results in absolute reductions of 0.22%-0.80% and 0.24%-1.00%, while removing DP results in absolute reductions of 1.11%-1.51% and 0.31%. The reason DP contributes more may be that it applies to the entire model, while AP acts only on BERT. Moreover, the combination of adversarial and dropout perturbations is superior to either alone: since each perturbation is relatively simple, a single method causes only a small degree of perturbation, whereas diversified perturbations increase the dissimilarity of the representations of the same sample. As for the POS fusion layer, after removing this module the absolute F1 score decreased by 0.70% and 0.66%, the P score by 0.12% and 1.22%, and the R score by 1.28% and 0.07% on the cMedQANER and cEHRNER datasets, respectively. This shows that adding POS tag features to the deep neural network can significantly improve the performance of the Chinese medical NER model, as POS tag features add extra latent entity boundary information.

TABLE 8 Performance in P, R, and F1 without different modules on the cMedQANER and cEHRNER datasets
(table image not reproduced)
We further performed comparative experiments to analyze the impact of different pre-trained models. BERT-wwm employs a whole-word masking strategy for Chinese text. RoBERTa is a robustly optimized BERT pre-training approach. MacBERT masks words with similar words in Chinese text. MCBERT is a pre-trained language model further trained on a Chinese biomedical corpus. As shown in FIG. 4, where (a) is the cMedQANER set and (b) is the cEHRNER set, MCBERT outperforms the other pre-trained models. Because MCBERT adopts whole-entity and whole-span masking strategies, it injects medical domain knowledge and obtains better context representations of Chinese biomedical text. We therefore selected the MCBERT model as the context embedding layer in the following experiments.
We further performed experiments on the cMedQANER and cEHRNER test sets to explore the relationship between the performance of our model and the size of the POS embedding. In our experiments, the POS embedding size was set to 128, 256, 384, 512, 640, and 768. The results are shown in FIG. 5, where (a) is the cMedQANER set and (b) is the cEHRNER set. The performance of our model initially grows as the POS embedding size increases, since enlarging the neural network increases model capacity and yields more powerful representations. However, as the size increases further, performance deteriorates due to model overfitting.
We further investigated the effect of the regularization loss weight λ by varying λ over {1, 2, 3, 4, 5, 10}. As shown in FIG. 6, where (a) is the cMedQANER set and (b) is the cEHRNER set, both too small and too large values of λ make our model perform poorly. The model achieves the best performance when λ = 2, so we chose 2 as the regularization loss weight in the experiments.
During training, the same Dropout value (0.1) was initially set for both distributions. We therefore further investigated the influence of the Dropout rate by varying it over {0.05, 0.1, 0.2, 0.3, 0.4, 0.5}. As shown in FIG. 7, where (a) is the cMedQANER set and (b) is the cEHRNER set, setting the same Dropout rate of 0.3 for both distributions is the best choice.
While embodiments in accordance with the invention have been described above, these embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments described. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. The invention is limited only by the claims and their full scope and equivalents.

Claims (7)

1. A method for identifying Chinese named entities in the medical field, characterized in that the method uses a BBCPR model to identify Chinese named entities in the medical field, the BBCPR model consisting of a word embedding layer, a BERT embedding layer, a POS fusion layer, a BiLSTM layer and a CRF layer, wherein:
the word embedding layer converts a given sentence into a word embedding E and inputs it into the BERT embedding layer;
the BERT embedding layer adopts an MCBERT encoder to obtain the BERT output embedding, which serves as the input to the POS fusion layer;
the POS fusion layer connects the BERT output embedding with the POS embedding to obtain the fused embedding, the POS embedding being obtained from the POS tags of the given sentence through the POS embedding layer;
the BiLSTM layer encodes the fused embedding to obtain the final implicit representation of the input sequence;
the CRF layer decodes the output of the BiLSTM layer to obtain a tag sequence and outputs it;
the BBCPR model uses a regularization module with adversarial training and dropout, which generates the adversarial word embedding E' by applying the adversarial perturbation generated by FGSM to the word embedding E; the word embedding E and the adversarial word embedding E' are passed through a Dropout mechanism to form two different sub-models that output two different model prediction distributions, and the bidirectional KL divergence between the two prediction distributions is then minimized to reduce the prediction difference of the two sub-models.
2. The identification method according to claim 1, wherein minimizing the bidirectional KL divergence between the two prediction distributions reduces the prediction difference of the two sub-models, specifically:
the training objective is to minimize the loss function $\mathcal{L}(X, Y)$ over the data (X, Y):

$$\mathcal{L}_{KL} = \frac{1}{2}\left[ D_{KL}\big(P(Y|X) \,\|\, P'(Y|X)\big) + D_{KL}\big(P'(Y|X) \,\|\, P(Y|X)\big) \right]$$

$$\mathcal{L}_{NER} = -\log P(Y|X) - \log P'(Y|X)$$

$$\mathcal{L}(X, Y) = \mathcal{L}_{NER} + \lambda \cdot \mathcal{L}_{KL}$$

where λ is the coefficient weight, P(Y|X) is the prediction distribution from the word embedding E, P'(Y|X) is the prediction distribution from the adversarial word embedding E', $D_{KL}$ is the KL divergence, X is the input sentence, and Y is the tag sequence output by the BBCPR model.
3. The identification method according to claim 1, wherein the POS fusion layer connects the BERT output embedding and the POS embedding, specifically:

$$v_i = [h_i; p_i]$$

where $v_i$ is the concatenation of the BERT output embedding and the POS embedding, $h_i$ is the BERT output embedding, and $p_i$ is the POS embedding of the i-th token in the sentence.
4. The identification method according to claim 3, wherein the given sentence is passed through the LAC tool to obtain POS tags, and the POS tags are fed into the POS embedding layer to obtain the POS embedding.
5. The identification method according to any one of claims 1 to 4, wherein each token in the sentence is predicted using a BIO tag.
6. The identification method according to any one of claims 1 to 4, wherein the MCBERT comprises a stack of L identical layers, each layer comprising two sublayers, wherein the first sublayer is a multi-head self-attention mechanism and the second sublayer is a fully-connected feed-forward neural network, the two sublayers being connected in sequence by residual connection and layer normalization.
7. The identification method according to any one of claims 1 to 4, wherein the computation of the BiLSTM is:

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(v_i, \overrightarrow{h}_{i-1})$$
$$\overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(v_i, \overleftarrow{h}_{i+1})$$
$$h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$$

where $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$ denote the forward and backward LSTM, respectively.
CN202210268640.5A 2022-03-18 2022-03-18 Method for identifying Chinese named entities in medical field Pending CN114638214A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210268640.5A CN114638214A (en) 2022-03-18 2022-03-18 Method for identifying Chinese named entities in medical field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210268640.5A CN114638214A (en) 2022-03-18 2022-03-18 Method for identifying Chinese named entities in medical field

Publications (1)

Publication Number Publication Date
CN114638214A true CN114638214A (en) 2022-06-17

Family

ID=81949722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210268640.5A Pending CN114638214A (en) 2022-03-18 2022-03-18 Method for identifying Chinese named entities in medical field

Country Status (1)

Country Link
CN (1) CN114638214A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115630649A (en) * 2022-11-23 2023-01-20 南京邮电大学 Medical Chinese named entity recognition method based on generative model
CN116341556A (en) * 2023-05-29 2023-06-27 浙江工业大学 Small sample rehabilitation medical named entity identification method and device based on data enhancement


Similar Documents

Publication Publication Date Title
US6601049B1 (en) Self-adjusting multi-layer neural network architectures and methods therefor
CN111881260A (en) Neural network emotion analysis method and device based on aspect attention and convolutional memory
CN114638214A (en) Method for identifying Chinese named entities in medical field
CN111414481A (en) Chinese semantic matching method based on pinyin and BERT embedding
CN112784532B (en) Multi-head attention memory system for short text sentiment classification
Wan et al. A self-attention based neural architecture for Chinese medical named entity recognition
CN113657123A (en) Mongolian aspect level emotion analysis method based on target template guidance and relation head coding
CN112818118A (en) Reverse translation-based Chinese humor classification model
Liu et al. Deep neural network-based recognition of entities in Chinese online medical inquiry texts
Li et al. Biomedical named entity recognition based on the two channels and sentence-level reading control conditioned LSTM-CRF
Tan et al. Chinese medical named entity recognition based on Chinese character radical features and pre-trained language models
He et al. Neural unsupervised reconstruction of protolanguage word forms
Zhang et al. Medical assertion classification in Chinese EMRs using attention enhanced neural network
CN115964475A (en) Dialogue abstract generation method for medical inquiry
Xu et al. A Data‐Driven Model for Automated Chinese Word Segmentation and POS Tagging
CN114492444A (en) Chinese electronic medical case medical entity part-of-speech tagging method
Hu et al. Contextual-aware information extractor with adaptive objective for chinese medical dialogues
Cui et al. Learning effective word embedding using morphological word similarity
Shivakumar et al. Behavior gated language models
Qiu et al. Question answering based clinical text structuring using pre-trained language model
Yelisetti et al. Aspect-based text classification for sentimental analysis using attention mechanism with RU-BiLSTM
Jiang et al. APIE: An information extraction module designed based on the pipeline method
Worke INFORMATION EXTRACTION MODEL FROM GE’EZ TEXTS
Noriega-Atala et al. Neural-Guided Program Synthesis of Information Extraction Rules Using Self-Supervision
Wang et al. Clinical named entity recognition for percutaneous coronary intervention surgical information with hybrid neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination