CN115859978A - Named entity recognition model and method based on Roberta radical enhanced adapter - Google Patents

Named entity recognition model and method based on Roberta radical enhanced adapter

Info

Publication number: CN115859978A
Application number: CN202211389670.8A
Authority: CN (China)
Prior art keywords: radical, roberta, sequence, character, vector
Inventors: 张蕾, 戴司宇, 张丽娟, 高蕾, 万健, 陈芳妮, 王海江, 黄杰
Current and original assignee: Zhejiang Lover Health Science and Technology Development Co Ltd (the listed assignee may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Filing date: 2022-11-08
Publication date: 2023-03-28
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)

Landscapes

  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of computer applications and discloses a named entity recognition model and method based on a Roberta radical-enhanced adapter. The named entity recognition model comprises a radical adapter, a radical-enhanced Roberta model, and a conditional random field. The radical adapter feeds radical features into the bottom layers of Roberta so that the information is fully fused; the radical-enhanced Roberta model extracts semantic features using a Roberta model with the whole-word masking scheme; the conditional random field models the conditional probability distribution of a set of output random variables given a set of input random variables. Addressing the lack of context information in short text, and considering that radicals carry deep semantic information, the invention combines radical features into the bottom layers of Roberta so that the features are fully fused. Multiple groups of comparison experiments on two datasets demonstrate that the model performs well and that fusing radical features at the bottom layers offers a clear advantage.

Description

Named entity recognition model and method based on Roberta radical enhanced adapter
Technical Field
The invention belongs to the technical field of computer application, and particularly relates to a named entity recognition model and method based on a Roberta radical enhanced adapter.
Background
Named entity recognition is the basis of many natural language processing tasks; it aims to correctly recognize entity mentions in text, and the accuracy of the model has an important influence on subsequent research. Early named entity recognition techniques were based on rules and dictionaries, which are poorly portable and consume a great deal of manual effort. Methods based on machine learning then gradually came into view, but they still require manually constructed features. With the development of natural language processing technology, deep learning methods have become the mainstream models for named entity recognition. In recent years especially, pre-trained models have shown excellent performance on many tasks, and many researchers extract features with deep learning models and combine them with a pre-trained model, improving the effect of named entity recognition. However, current feature enhancement still fuses features at the module level, so the features cannot interact deeply.
The invention deepens feature fusion to alleviate the lack of semantic information in short text and thereby improve named entity recognition performance. To this end, a radical adapter is designed to blend radical information into the bottom layers of Roberta, so that the features undergo deep knowledge interaction in the lower layers of the Roberta model. In addition, considering the dependencies between adjacent tags, a Conditional Random Field (CRF) is used for sequence labeling to obtain the optimal sequence of tags.
Disclosure of Invention
The invention aims to provide a named entity recognition model and a named entity recognition method based on a Roberta radical enhanced adapter, so as to solve the technical problems.
In order to solve the technical problems, the specific technical scheme of the named entity recognition model and method based on the Roberta radical enhanced adapter is as follows:
a named entity recognition model based on a Roberta radical-enhanced adapter comprises a radical adapter, a radical-enhanced Roberta model, and a conditional random field; the radical adapter feeds radical features into the bottom layers of Roberta so that the information is fully fused; the radical-enhanced Roberta model extracts semantic features using a Roberta model with the whole-word masking scheme; the conditional random field models the conditional probability distribution of a set of output random variables given a set of input random variables.
Further, the radical adapter comprises means for performing the following steps:
the radical adapter input is divided into two parts, characters and radicals; the radical vector is aligned with the character vector using bilinear attention, the aligned radical vector is then combined with the character vector to obtain the character-radical pair representation, and the combined vector representation finally passes through a normalization layer to output the final result;
for a text of n characters, the character sequence is represented by the output vectors of the Roberta encoding layer, E^c = {e_1^c, e_2^c, …, e_n^c}, and the radical information corresponding to the character sequence is encoded as the vector sequence E^r = {e_1^r, e_2^r, …, e_n^r}; to align these two vector representations, the radical vector is non-linearly transformed, with the i-th element:

ẽ_i^r = W_2 σ(W_1 e_i^r + b_1) + b_2 #(3-1)

wherein σ(·) is a nonlinear activation, W_1 is a matrix of dimension d_c × d_r, W_2 is a matrix of dimension d_c × d_c, b_1 and b_2 are bias terms, d_r is the dimension of the radical embedding, and d_c is the dimension of the Roberta hidden layer;
the transformed radical vector and the character vector are then added to obtain the character-radical vector representation:

f_i = e_i^c + ẽ_i^r #(3-2)

finally, the result is output through a dropout layer and a normalization layer, and the character sequence and the radical sequence are fused into the vector sequence F = {f_1, f_2, …, f_n}.
Further, the radical enhanced Roberta model includes means for performing the following steps:
a special identifier [CLS] is added to the beginning of the input, and sentences are separated by the [SEP] separator; the input sequence then passes through three-part embedding to obtain the sequence representation, each input character being formed by adding its token embedding, segment embedding, and position embedding, so a character E_t in the sequence is formed as:

E_t = E_token_emb + E_seg_emb + E_pos_emb #(3-3)

the core of the Roberta model consists of 12 layers of Transformer encoders, in which the output vector corresponding to [CLS] serves as the semantic representation of the whole text;
the radical-adapter-enhanced Roberta injects the radical adapter into a certain layer of Roberta, connecting the radical adapter between Transformer layers inside Roberta and thereby injecting external radical knowledge into Roberta;
for a given text of n characters, the character sequence C = {c_1, c_2, …, c_n} is matched against the radical dictionary to obtain the corresponding radical sequence R = {r_1, r_2, …, r_n}; the character sequence is then input to the embedding layer of Roberta, and the resulting embedded representation is input to the Transformer encoder; to inject dictionary information between the k-th and (k+1)-th Transformer layers, the output of the first k Transformer layers is obtained first, H^k = {h_1^k, h_2^k, …, h_n^k}; each character-radical pair then passes through the radical adapter to obtain the character-radical representation, the i-th character h_i^k and the i-th radical embedding e_i^r passing through the radical adapter as:

f_i = RA(h_i^k, e_i^r) #(3-4)

the sequence F = {f_1, f_2, …, f_n} obtained from the radical adapter is then input to the remaining 12-k Transformer layers, finally obtaining the output T = {t_1, t_2, …, t_n}.
Further, the conditional random field includes means for performing the steps of:
given the output of the last layer of Roberta, T = {t_1, t_2, …, t_n}, the score of the predicted sequence is first calculated as:

O = W_o T + b_o #(3-5)

then, for a tag sequence y = {y_1, y_2, …, y_n}, the probability is defined as:

p(y|T) = exp(Σ_i (O_{i,y_i} + Q_{y_{i-1},y_i})) / Σ_{ỹ∈Ỹ} exp(Σ_i (O_{i,ỹ_i} + Q_{ỹ_{i-1},ỹ_i})) #(3-6)

wherein Q is the transition matrix, Q_{y_{i-1},y_i} is the transition score from label y_{i-1} to label y_i, O_{i,y_i} is the score of character t_i being predicted as label y_i, and Ỹ is the set of all possible tag sequences; the numerator is the score of the current tag sequence being the correct sequence, and the denominator sums the scores of all sequences;
given N labeled examples {(T_j, y_j)}_{j=1}^{N}, the model is trained by minimizing the sentence-level negative log-likelihood loss:

L = -Σ_{j=1}^{N} log p(y_j | T_j) #(3-7)

finally, in the decoding process, the Viterbi algorithm is adopted to find the tag sequence with the highest score:

y* = argmax_{ỹ∈Ỹ} score(T, ỹ) #(3-8)

wherein y* is the sequence that maximizes the score function among all tag sequences.
The invention also discloses a named entity identification method based on the Roberta radical-enhanced adapter, which comprises the following steps:
Step 1: use the radical adapter to feed radical features into the bottom layers of Roberta so that the feature information is fully fused; the adapter input is divided into two parts, characters and radicals; the radical vector representation is aligned with the character vector through a nonlinear transformation, the aligned radical vector is then combined with the character vector to obtain the character-radical pair representation, and the combined vector representation finally passes through a normalization layer to output the final result;
Step 2: radical-enhanced Roberta: connect the radical adapter between Transformer layers inside Roberta, thereby injecting external radical knowledge into Roberta;
Step 3: use the conditional random field to find the tag sequence path with the maximum probability for the input sequence.
Further, the step 1 comprises the following specific steps:
Step 1.1: first, for a text of n characters, the character sequence is represented by the output vectors of the Roberta encoding layer, E^c = {e_1^c, e_2^c, …, e_n^c}, and the radical information corresponding to the character sequence is encoded as E^r = {e_1^r, e_2^r, …, e_n^r}; to align these two vector representations, the radical vector is non-linearly transformed, with the i-th element:

ẽ_i^r = W_2 σ(W_1 e_i^r + b_1) + b_2 #(3-1)

wherein σ(·) is a nonlinear activation, W_1 is a matrix of dimension d_c × d_r, W_2 is a matrix of dimension d_c × d_c, b_1 and b_2 are bias terms, d_r is the dimension of the radical embedding, and d_c is the dimension of the Roberta hidden layer;
Step 1.2: the transformed radical vector and the character vector are then added to obtain the character-radical vector representation:

f_i = e_i^c + ẽ_i^r #(3-2)

Step 1.3: finally, the result is output through a dropout layer and a normalization layer, and the character sequence and the radical sequence are fused into the vector sequence F = {f_1, f_2, …, f_n}.
Further, the step 2 comprises the following specific steps:
for a given text of n characters, the character sequence C = {c_1, c_2, …, c_n} is matched against the radical dictionary to obtain the corresponding radical sequence R = {r_1, r_2, …, r_n}; the character sequence is then input to the embedding layer of Roberta, and the resulting embedded representation is input to the Transformer encoder; to inject dictionary information between the k-th and (k+1)-th Transformer layers, the output of the first k Transformer layers is obtained first, H^k = {h_1^k, h_2^k, …, h_n^k}; each character-radical pair then passes through the radical adapter to obtain the character-radical representation, the i-th character h_i^k and the i-th radical embedding e_i^r passing through the radical adapter as:

f_i = RA(h_i^k, e_i^r) #(3-4)

the sequence F = {f_1, f_2, …, f_n} obtained from the radical adapter is then input to the remaining 12-k Transformer layers, finally obtaining the output T = {t_1, t_2, …, t_n}.
Further, the step 3 comprises the following specific steps:
given the output of the last layer of Roberta, T = {t_1, t_2, …, t_n}, the score of the predicted sequence is first calculated as:

O = W_o T + b_o #(3-5)

then, for a tag sequence y = {y_1, y_2, …, y_n}, the probability is defined as:

p(y|T) = exp(Σ_i (O_{i,y_i} + Q_{y_{i-1},y_i})) / Σ_{ỹ∈Ỹ} exp(Σ_i (O_{i,ỹ_i} + Q_{ỹ_{i-1},ỹ_i})) #(3-6)

wherein Q is the transition matrix, Q_{y_{i-1},y_i} is the transition score from label y_{i-1} to label y_i, O_{i,y_i} is the score of character t_i being predicted as label y_i, and Ỹ is the set of all possible tag sequences; the numerator is the score of the current tag sequence being the correct sequence, and the denominator sums the scores of all sequences;
given N labeled examples {(T_j, y_j)}_{j=1}^{N}, the model is trained by minimizing the sentence-level negative log-likelihood loss:

L = -Σ_{j=1}^{N} log p(y_j | T_j) #(3-7)

finally, in the decoding process, the Viterbi algorithm is adopted to find the tag sequence with the highest score:

y* = argmax_{ỹ∈Ỹ} score(T, ỹ) #(3-8)

wherein y* is the sequence that maximizes the score function among all tag sequences.
The named entity recognition model and method based on the Roberta radical-enhanced adapter have the following advantages: addressing the lack of context information in short text and considering that radicals carry deep semantic information, the model provides a new feature-fusion scheme that combines radical features into the bottom layers of Roberta so that the features are fully fused. Multiple groups of comparison experiments on two datasets demonstrate that the model performs well and that fusing radical features at the bottom layers offers a clear advantage.
Drawings
FIG. 1 is a block diagram of the named entity recognition model based on the Roberta radical-enhanced adapter.
FIG. 2 is a structural diagram of the radical adapter of the present invention.
FIG. 3 is a diagram of the original Roberta model architecture.
FIG. 4 is a diagram of the radical-enhanced Roberta of the present invention.
Detailed Description
In order to better understand the purpose, structure and function of the present invention, the named entity recognition model and method based on Roberta radical enhanced adapter of the present invention are described in further detail below with reference to the accompanying drawings.
As shown in FIG. 1, in the named entity recognition model based on the Roberta radical-enhanced adapter, semantic information is first extracted with the Roberta model while radical features are merged into the Transformer encoder layers of Roberta through the radical adapter, so that radical information and semantic information interact fully in the bottom layers of Roberta; the fused semantic information is finally input to the CRF layer and decoded to obtain the final tag sequence.
The radical text (Radical_txt) is sourced from the Xinhua Dictionary and the Baidu Chinese dictionary; characters and radicals are used to build a dictionary of key-value pairs, providing the basis for the subsequent fusion of radical features. For an input sentence containing n characters, the original character sequence C = {c_1, c_2, c_3, …, c_n} is input to Roberta to extract semantic information, and each character in the sentence is matched against the radical dictionary to obtain the corresponding radical sequence R = {r_1, r_2, r_3, …, r_n}, which is fused into Roberta for deep knowledge interaction; the fused semantic information is finally sent to the CRF decoding layer to obtain the tag sequence. The model proposed by the invention aims to improve the accuracy of named entity recognition by enhancing feature interaction; its structure is described in detail below.
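As a minimal sketch of this dictionary-matching step (the character-to-radical pairs and the helper name below are illustrative inventions, not taken from the patent's actual Radical_txt files):

# Sketch of radical-dictionary construction and lookup; the key-value
# pairs are illustrative examples, not the patent's full dictionary.
RADICAL_DICT = {
    "疼": "疒",  # "ache" carries the illness radical
    "痛": "疒",  # "pain" carries the illness radical
    "肝": "月",  # "liver" carries the flesh radical
    "河": "氵",  # "river" carries the water radical
}

def match_radicals(sentence: str, unk: str = "[UNK]") -> list[str]:
    """Map each character c_i to its radical r_i, as in C -> R above."""
    return [RADICAL_DICT.get(ch, unk) for ch in sentence]

print(match_radicals("肝疼"))  # ['月', '疒']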
The named entity recognition model based on the Roberta radical enhanced adapter comprises the radical adapter, the radical enhanced Roberta model and a conditional random field.
Radical adapter:
Each element in a sentence carries two types of information: character features and radical features. To realize deeper feature interaction, a radical adapter is designed that feeds radical features into the bottom layers of Roberta so the information is fully fused. The radical adapter structure is shown in FIG. 2: its input is divided into two parts, characters and radicals; the radical vector is aligned with the character vector using bilinear attention, the aligned radical vector is then combined with the character vector to obtain the character-radical pair representation, and the combined vector representation is finally output through a normalization layer.
For a text of n characters, the character sequence is represented by the output vectors of the Roberta encoding layer, E^c = {e_1^c, e_2^c, …, e_n^c}, and the radical information corresponding to the character sequence is encoded as the vector sequence E^r = {e_1^r, e_2^r, …, e_n^r}. To align these two vector representations, the radical vector is non-linearly transformed, taking the i-th element as an example:

ẽ_i^r = W_2 σ(W_1 e_i^r + b_1) + b_2 #(3-1)

wherein σ(·) is a nonlinear activation, W_1 is a matrix of dimension d_c × d_r, W_2 is a matrix of dimension d_c × d_c, b_1 and b_2 are bias terms, d_r is the dimension of the radical embedding, and d_c is the dimension of the Roberta hidden layer.
The transformed radical vector and the character vector are then added to obtain the character-radical vector representation:

f_i = e_i^c + ẽ_i^r #(3-2)

Finally, the result is output through a dropout layer and a normalization layer, and the character sequence and the radical sequence are fused into the vector sequence F = {f_1, f_2, …, f_n}.
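A minimal PyTorch sketch of this adapter follows; tanh as the unspecified nonlinearity, the dropout rate, and d_r = 50 are assumptions for illustration, since the patent fixes only the matrix dimensions:

import torch
import torch.nn as nn

class RadicalAdapter(nn.Module):
    """Radical adapter sketch: eq. (3-1) aligns the radical embedding,
    eq. (3-2) adds it to the character vector, then dropout + LayerNorm."""
    def __init__(self, d_c: int = 768, d_r: int = 50, dropout: float = 0.1):
        super().__init__()
        self.w1 = nn.Linear(d_r, d_c)  # W_1 (d_c x d_r) with bias b_1
        self.w2 = nn.Linear(d_c, d_c)  # W_2 (d_c x d_c) with bias b_2
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(d_c)

    def forward(self, char_vecs, radical_vecs):
        # (3-1): nonlinear transform aligning radicals with the character space
        aligned = self.w2(torch.tanh(self.w1(radical_vecs)))
        # (3-2): add the aligned radical vector to the character vector
        fused = char_vecs + aligned
        # dropout and normalization layers produce the fused sequence F
        return self.norm(self.dropout(fused))

adapter = RadicalAdapter()
chars = torch.randn(2, 8, 768)    # e^c: [batch, n, d_c]
radicals = torch.randn(2, 8, 50)  # e^r: [batch, n, d_r]
print(adapter(chars, radicals).shape)  # torch.Size([2, 8, 768])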
Radical enhanced Roberta model:
the original Roberta model is an improved model based on BERT, not only inherits the advantages of the BERT model, utilizes a transform encoder as an intermediate layer to extract semantic information, but also improves some aspects of the BERT model in order to capture semantic features of more layers, the first improvement is to use a dynamic mask, the BERT model uses a static mask, namely, a masked token is not changed in the training process, and the Roberta adopts the dynamic mask, the masked position is continuously updated in each training, the randomness of model input data is improved, and the learning capability of the model is improved. The second improvement is to remove the Next Sequence Prediction (NSP) task, the NSP task in BERT is used to judge whether the two input sentences are continuous, and Roberta removes the NSP task, and instead uses continuous full-sense and doc-sense as input, so as to increase the length of the input Sentence to 512 characters at most, which is much higher than the maximum input 256 characters of BERT model. Subsequently, a Roberta model based on a full-word mask technology is provided by the Hagong-Daiffei combined laboratory, and is called as a Roberta-wwm-ext model, the model improves the original single-character mask and provides a full-word mask scheme, chinese word segmentation operation in natural language processing is fully considered, word is used as granularity for shielding, and the full-word mask scheme can help to capture semantic features at the Chinese word level, so that the performance of the Roberta model is further improved. An example of a comparison between the single-word mask and full-word mask schemes is shown in table 1.
Table 1. Examples of different masking schemes
[table reproduced as an image in the original; contents not recoverable]
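Since the table survives only as an image, a small sketch can stand in for it; the segmented words and the masking rate below are invented for illustration:

import random

def whole_word_mask(words, mask_rate=0.15):
    """Illustrative whole-word masking: when a segmented word is chosen,
    every character in it becomes [MASK], unlike the single-character
    scheme, which masks isolated characters independently."""
    out = []
    for word in words:
        if random.random() < mask_rate:
            out.extend(["[MASK]"] * len(word))  # mask the whole word
        else:
            out.extend(list(word))
    return out

# With the segmentation ["命名", "实体", "识别"], whole-word masking may
# yield ['命','名','[MASK]','[MASK]','识','别'] but never a lone
# '[MASK]' in the middle of a word.
print(whole_word_mask(["命名", "实体", "识别"]))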
Therefore, the invention uses the whole-word-masking Roberta model to extract semantic features; the model structure is shown in FIG. 3. A special identifier [CLS] is added to the beginning of the input, and sentences are separated by the [SEP] separator. The input sequence then passes through three-part embedding to obtain the sequence representation: each input character is formed by adding its token embedding, segment embedding, and position embedding, so a character E_t in the sequence is formed as:

E_t = E_token_emb + E_seg_emb + E_pos_emb #(3-3)
the most core part of the Roberta model is composed of 12 layers of transform encoders, semantic features can be fully extracted, character dependency is captured, and finally each character vector representation fused with full-text semantic information is output, wherein an output vector corresponding to [ CLS ] serves as semantic representation of the whole text.
The Radical Adapter (RA)-enhanced Roberta injects the radical adapter into a layer of Roberta; its structure is shown in FIG. 4. Specifically, the radical adapter is connected between Transformer layers inside Roberta, thereby injecting external radical knowledge into Roberta.
For a given text of n characters, the character sequence C = {c_1, c_2, …, c_n} is matched against the radical dictionary to obtain the corresponding radical sequence R = {r_1, r_2, …, r_n}. The character sequence is then input to the Roberta embedding layer, and the resulting embedded representation is input to the Transformer encoder. To inject dictionary information between the k-th and (k+1)-th Transformer layers, the output of the first k Transformer layers is obtained first, H^k = {h_1^k, h_2^k, …, h_n^k}. Each character-radical pair then passes through the radical adapter to obtain the character-radical representation; for example, the i-th character h_i^k and the i-th radical embedding e_i^r pass through the radical adapter as:

f_i = RA(h_i^k, e_i^r) #(3-4)

The sequence F = {f_1, f_2, …, f_n} obtained from the radical adapter is then input to the remaining 12-k Transformer layers, finally obtaining the output T = {t_1, t_2, …, t_n}.
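A schematic sketch of this injection follows; the wrapper class, the assumption that each Transformer block maps a [batch, n, d_c] tensor to a tensor of the same shape, and the choice k = 1 are simplifications, since extracting the 12 blocks from a concrete Roberta implementation is not shown here:

import torch.nn as nn

class RadicalEnhancedEncoder(nn.Module):
    """Run the first k Transformer layers, apply the radical adapter
    (eq. (3-4)), then run the remaining 12-k layers."""
    def __init__(self, layers: nn.ModuleList, adapter: nn.Module, k: int = 1):
        super().__init__()
        self.layers, self.adapter, self.k = layers, adapter, k

    def forward(self, char_embeds, radical_embeds):
        hidden = char_embeds
        for layer in self.layers[:self.k]:             # first k layers -> H^k
            hidden = layer(hidden)
        hidden = self.adapter(hidden, radical_embeds)  # f_i = RA(h_i^k, e_i^r)
        for layer in self.layers[self.k:]:             # remaining 12-k layers
            hidden = layer(hidden)
        return hidden                                  # T = {t_1, ..., t_n}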
Conditional random field:
Conditional Random Fields (CRFs) are conditional probability distribution models of a set of output random variables given a set of input random variables, and they are widely used in sequence labeling tasks. During labeling, the CRF model can make full use of rich internal and contextual feature information to find the tag sequence path with the maximum probability for the input sequence.
Given the output of the last layer of Roberta, T = {t_1, t_2, …, t_n}, the score of the predicted sequence is first calculated as:

O = W_o T + b_o #(3-5)

Then, for a tag sequence y = {y_1, y_2, …, y_n}, the probability is defined as:

p(y|T) = exp(Σ_i (O_{i,y_i} + Q_{y_{i-1},y_i})) / Σ_{ỹ∈Ỹ} exp(Σ_i (O_{i,ỹ_i} + Q_{ỹ_{i-1},ỹ_i})) #(3-6)

wherein Q is the transition matrix, Q_{y_{i-1},y_i} is the transition score from label y_{i-1} to label y_i, O_{i,y_i} is the score of character t_i being predicted as label y_i, and Ỹ is the set of all possible tag sequences; the numerator is the score of the current tag sequence being the correct sequence, and the denominator sums the scores of all sequences.
Given N labeled examples {(T_j, y_j)}_{j=1}^{N}, the model is trained by minimizing the sentence-level negative log-likelihood loss:

L = -Σ_{j=1}^{N} log p(y_j | T_j) #(3-7)

Finally, in the decoding process, the Viterbi algorithm is adopted to find the tag sequence with the highest score:

y* = argmax_{ỹ∈Ỹ} score(T, ỹ) #(3-8)

wherein y* is the sequence that maximizes the score function among all tag sequences.
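A minimal Viterbi decoder corresponding to eq. (3-8), written in plain PyTorch; the tensor shapes are assumptions consistent with the formulas above, and start/end transition scores are omitted for brevity:

import torch

def viterbi_decode(emissions, transitions):
    """Find y* maximizing sum_i (O[i, y_i] + Q[y_{i-1}, y_i]).
    emissions: O with shape [n, num_tags]; transitions: Q with shape
    [num_tags, num_tags]. Returns the best tag index sequence."""
    n, num_tags = emissions.shape
    score = emissions[0].clone()  # best score ending at each tag so far
    backpointers = []
    for i in range(1, n):
        # total[prev, curr] = score[prev] + Q[prev, curr] + O[i, curr]
        total = score.unsqueeze(1) + transitions + emissions[i].unsqueeze(0)
        score, best_prev = total.max(dim=0)  # best predecessor for each tag
        backpointers.append(best_prev)
    best_tag = int(score.argmax())
    path = [best_tag]
    for best_prev in reversed(backpointers):  # trace the path backwards
        best_tag = int(best_prev[best_tag])
        path.append(best_tag)
    return path[::-1]

print(viterbi_decode(torch.randn(5, 4), torch.randn(4, 4)))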
The invention discloses a named entity identification method based on a Roberta radical-enhanced adapter, which comprises the following steps:
Step 1: use the radical adapter designed by the invention to feed radical features into the bottom layers of Roberta so that the feature information is fully fused. The adapter input is divided into two parts, characters and radicals; the radical vector representation is aligned with the character vector through a nonlinear transformation, the aligned radical vector is then combined with the character vector to obtain the character-radical pair representation, and the combined vector representation finally passes through a normalization layer to output the final result. The specific steps are as follows:
(1) First, for a text of n characters, the character sequence is represented by the output vectors of the Roberta encoding layer, E^c = {e_1^c, e_2^c, …, e_n^c}, and the radical information corresponding to the character sequence is encoded as E^r = {e_1^r, e_2^r, …, e_n^r}. To align these two vector representations, the radical vector is non-linearly transformed, taking the i-th element as an example:

ẽ_i^r = W_2 σ(W_1 e_i^r + b_1) + b_2 #(3-1)

wherein σ(·) is a nonlinear activation, W_1 is a matrix of dimension d_c × d_r, W_2 is a matrix of dimension d_c × d_c, b_1 and b_2 are bias terms, d_r is the dimension of the radical embedding, and d_c is the dimension of the Roberta hidden layer.
(2) The transformed radical vector and the character vector are then added to obtain the character-radical vector representation:

f_i = e_i^c + ẽ_i^r #(3-2)

(3) Finally, the result is output through a dropout layer and a normalization layer, and the character sequence and the radical sequence are fused into the vector sequence F = {f_1, f_2, …, f_n}.
Step 2: radical-enhanced Roberta. The Radical Adapter (RA)-enhanced Roberta connects the radical adapter between Transformer layers inside Roberta, thereby injecting external radical knowledge into Roberta, as follows:
for a given text of n characters, the character sequence C = {c_1, c_2, …, c_n} is matched against the radical dictionary to obtain the corresponding radical sequence R = {r_1, r_2, …, r_n}. The character sequence is then input to the Roberta embedding layer, and the resulting embedded representation is input to the Transformer encoder. To inject dictionary information between the k-th and (k+1)-th Transformer layers, the output of the first k Transformer layers is obtained first, H^k = {h_1^k, h_2^k, …, h_n^k}. Each character-radical pair then passes through the radical adapter to obtain the character-radical representation; for example, the i-th character h_i^k and the i-th radical embedding e_i^r pass through the radical adapter as:

f_i = RA(h_i^k, e_i^r) #(3-4)

The sequence F = {f_1, f_2, …, f_n} obtained from the radical adapter is then input to the remaining 12-k Transformer layers, finally obtaining the output T = {t_1, t_2, …, t_n}.
And 3, finding a label sequence path with the maximum probability for the input sequence by using a Conditional Random Field (CRF). The method comprises the following specific steps:
given the output T = { T) of the last layer of Roberta 1 ,t 2 ,…,t n First, the score of the predicted sequence is calculated as follows:
O=W o T+b o #(3-5)
then y = { y) for the tag sequence 1 ,y 2 ,…,y n The probability is defined as shown below:
Figure BDA00039314856300000911
wherein Q is a transfer matrix, and Q is a transfer matrix,
Figure BDA00039314856300000912
indicating slave label y i-1 To the label y i Is selected, is selected>
Figure BDA00039314856300000913
Representing a character t i Is predicted as label y i In the score of (c), in the score of (c)>
Figure BDA00039314856300000914
Is all possible tag sequences, the numerator represents the score for the current tag sequence as the correct sequence, and the denominator represents the score for each sequence.
Given N tag data
Figure BDA0003931485630000101
The model is trained by minimizing sentence-level negative log-likelihood loss as follows:
Figure BDA0003931485630000102
finally, in the decoding process, a Viterbi algorithm is adopted to find out the label sequence with the highest score, and the calculation formula is as follows:
Figure BDA0003931485630000103
wherein y is * The sequence that maximizes the score function is taken among all the tag sequences.
Experimental procedure
1 Experimental datasets
The experiments use the Chinese medical dataset CCKS2017 and the Chinese resume dataset Resume. The CCKS2017 dataset is labeled with 5 entity types (examination, symptom sign, disease diagnosis, treatment, and body part) and is divided into training and test sets in the ratio of 5. The Resume dataset is labeled with 8 entity types (nationality, educational institution, address, name, organization name, specialty, ethnicity, and job title) and is divided into training, validation, and test sets in the proportion of 8.
Table 2. CCKS2017 dataset entity types and counts
[table reproduced as an image in the original; contents not recoverable]

Table 3. Resume dataset entity types and counts
[table reproduced as an image in the original; contents not recoverable]
2 Evaluation metrics
The experiments adopt precision (P), recall (R), and the F1 value as evaluation metrics to measure the effect of the named entity recognition model, calculated as follows:

P = TP / (TP + FP)
R = TP / (TP + FN)
F1 = 2 × P × R / (P + R)

where TP is the number of positive samples judged positive, FP is the number of negative samples judged positive, and FN is the number of positive samples judged negative. P is the proportion of correct predictions among all predicted results, R is the proportion of correctly predicted entities among all entities in the data, and the F1 value is the harmonic mean of P and R.
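These formulas translate directly into a small helper; the example counts are invented:

def prf1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from the counts defined above."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# e.g. 90 entities predicted correctly, 10 spurious, 20 missed:
print(prf1(90, 10, 20))  # (0.9, 0.8181..., 0.8571...)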
3 Experimental environment and parameter settings
The named entity recognition model in the experiments is based on the PyTorch framework; the specific experimental environment settings are shown in Table 4.
Table 4. Experimental environment
[table reproduced as an image in the original; contents not recoverable]
The detailed parameter settings of the experiments are as follows: the whole-word-masking Roberta model contains 12 Transformer layers, and the radical adapter is added between its first and second layers. The Roberta hidden layer dimension is 768, the maximum sequence length is 256, the initial learning rate is 1e-5 with the Adam optimizer, the batch size is 30, and the number of training epochs on all datasets is 30.
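The stated hyperparameters can be gathered in one place; in the sketch below the placeholder module merely stands in for the full Roberta-adapter-CRF network, which is not reproduced:

import torch

config = {
    "hidden_dim": 768,      # Roberta hidden layer dimension
    "max_seq_len": 256,     # maximum sequence length
    "learning_rate": 1e-5,  # initial learning rate, Adam optimizer
    "batch_size": 30,
    "epochs": 30,
    "adapter_layer_k": 1,   # adapter between the 1st and 2nd layers
}

model = torch.nn.Linear(config["hidden_dim"], 1)  # placeholder for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])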
4 Experimental comparison and result analysis
(1) Pre-training model comparison
First, to demonstrate the effectiveness of the Roberta model, Roberta was compared with BERT on the two datasets; the experimental comparison results are shown in Table 5.
Table 5. Pre-training model comparison
[table reproduced as an image in the original; contents not recoverable]
As can be seen from Table 5, Roberta extracts entities better than BERT on both datasets; specifically, the F1 values of the Roberta model on CCKS2017 and Resume are 0.77% and 0.49% higher than those of the BERT model, respectively. The stronger Roberta model is therefore used as the baseline model for named entity recognition.
(2) Comparison with existing research methods
Many researchers have conducted named entity recognition studies on the CCKS2017 dataset. Li et al. use the BiLSTM-CRF model combined with professional-domain word embeddings, improving named entity recognition through additional lexical features. Wang et al. construct a domain-related dictionary and integrate dictionary features into the BiLSTM-CRF model to improve recognition. Qiu et al. input character and dictionary features into a residual dilated convolutional neural network (RDCNN) to capture context features, and then capture the dependencies between adjacent tags with a CRF. Tang et al. propose a deep learning method integrating a language model and an attention mechanism: semantic features are extracted with a bidirectional gated recurrent network (BiGRU) and a pre-trained language model, and then further captured with another BiGRU layer and an attention module. Subsequently, because the radical information in Chinese characters also carries semantic information, Yin et al. extract radical features with a convolutional neural network (CNN), combine them with the character representation, model the result with a BiLSTM, and then capture long-range dependencies between individual characters and their context through an attention mechanism. Wu et al. [22] capture radical features with a BiLSTM model, extract character representations rich in contextual semantic information with Roberta, concatenate the radical and character features, and obtain the optimal tag sequence through a CRF.
Table 6. Comparison of existing methods on the CCKS2017 dataset
[table reproduced as an image in the original; contents not recoverable]
The specific results in Table 6 show that the named entity recognition model proposed by the invention achieves the best result among all models. The model of Li et al. performs poorly because, at the chosen word level, the segmentation method inevitably produces segmentation errors, leading to subsequent incorrect recognition of word boundaries. Wang and Qiu et al. enhance entity extraction by adding dictionary features, and Tang et al. further capture the dependencies between tags by stacking multiple model layers. Yin et al. achieve a higher F1 value by incorporating radical information, demonstrating the effectiveness of radical features. Wu et al. combine radical features with the features extracted by the Roberta model, showing that a pre-trained language model helps further improve entity recognition performance. However, these models rely on module-level fusion, whereas the present model feeds radical features into the bottom layers of Roberta so the features are fully fused; the experimental results confirm that bottom-layer feature fusion can further improve model performance.
Many researchers have also studied named entity recognition on the Resume dataset. Considering that word-granularity models cannot exploit lexical information, Zhang et al. proposed Lattice-LSTM, a named entity recognition method that fuses word and character information. Addressing the problem that the lattice structure is too complex to exploit efficiently, Li et al. proposed the FLAT model, which converts the lattice structure into a flat structure composed of spans and adds positional encoding to improve parallelism. Li et al. also combined FLAT with the BERT pre-trained model to further improve recognition. Wei et al. integrate external dictionary knowledge into the BERT layers; the deep fusion of features at the bottom layers enhances model performance.
Table 7. Comparison of existing methods on the Resume dataset
[table reproduced as an image in the original; contents not recoverable]
The comparison results are shown in Table 7, from which the following can be observed. Combining features with a pre-trained model strengthens recognition: for example, Li et al. improve the F1 value by 0.93% after combining BERT with FLAT. Wei et al. fuse features into the bottom layers of the BERT pre-trained model, and their results show that bottom-layer feature fusion can further enhance entity recognition performance. The present invention, considering that Chinese radicals also carry deep semantic information, merges radical information into Roberta and achieves the best performance in the comparison experiments, demonstrating the effectiveness of radical features and the superiority of bottom-layer fusion.
(3) Ablation experiment
Using Roberta-CRF as the baseline model, results with the radical adapter added were compared against Roberta-CRF. The results on the CCKS2017 dataset are shown in Table 8, including a score for each entity category. Recognition of the "examination" and "symptom sign" categories is best: their baseline F1 values are 95.78% and 96.30%, while the method of the invention reaches 96.90% and 98.17%, respectively. Entity recognition for the "treatment" category is weaker, with F1 values of 63.63% at baseline and 71.17% for the method of the invention, possibly because differences in per-category sample sizes affect the learning ability of the neural network. The F1 values of the model of the invention are significantly higher than the baseline in the "examination", "symptom sign", and "treatment" categories, and results similar to the baseline are achieved in the "disease diagnosis" and "body part" categories. Overall, the model of the invention achieves better results across all categories, with an F1 value 2.20% higher than the baseline model, demonstrating the effectiveness of bottom-layer fused radical features.
Table 8. Comparison by entity category and overall on the CCKS2017 dataset
[table reproduced as an image in the original; contents not recoverable]
The invention further compares the baseline model with the proposed model on the Resume dataset, as shown in Table 9; since the entities of the categories in this dataset are unevenly distributed, only the overall scores are compared. The F1 value of the model of the invention is 0.62% higher than the baseline model, further verifying the effectiveness of bottom-layer fused radical features.
Table 9. Comparison of the baseline and proposed models on the Resume dataset
[table reproduced as an image in the original; contents not recoverable]
The invention provides a named entity recognition model and method based on a Roberta radical adapter. Addressing the lack of context information in short text and considering that radicals carry deep semantic information, the model provides a new feature-fusion scheme that combines radical features into the bottom layers of Roberta so the features are fully fused. Multiple groups of comparison experiments on the two datasets demonstrate the performance of the model and the superiority of bottom-layer radical-feature fusion.
It is to be understood that the present invention has been described with reference to certain embodiments, and that various changes in the features and embodiments, or equivalent substitutions may be made therein by those skilled in the art without departing from the spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims (8)

1. A named entity recognition model based on a Roberta radical-enhanced adapter, characterized by comprising a radical adapter, a radical-enhanced Roberta model, and a conditional random field; the radical adapter feeds radical features into the bottom layers of Roberta so that the information is fully fused; the radical-enhanced Roberta model extracts semantic features using a Roberta model with the whole-word masking scheme; the conditional random field models the conditional probability distribution of a set of output random variables given a set of input random variables.
2. The Roberta radical-enhanced adapter based named entity recognition model according to claim 1, characterized in that the radical adapter comprises means to perform the following steps:
the radical adapter input is divided into two parts, characters and radicals; the radical vector is aligned with the character vector using bilinear attention, the aligned radical vector is then combined with the character vector to obtain the character-radical pair representation, and the combined vector representation finally passes through a normalization layer to output the final result;
for a text of n characters, the character sequence is represented by the output vectors of the Roberta encoding layer, E^c = {e_1^c, e_2^c, …, e_n^c}, and the radical information corresponding to the character sequence is encoded as the vector sequence E^r = {e_1^r, e_2^r, …, e_n^r}; to align these two vector representations, the radical vector is non-linearly transformed, with the i-th element:

ẽ_i^r = W_2 σ(W_1 e_i^r + b_1) + b_2 #(3-1)

wherein σ(·) is a nonlinear activation, W_1 is a matrix of dimension d_c × d_r, W_2 is a matrix of dimension d_c × d_c, b_1 and b_2 are bias terms, d_r is the dimension of the radical embedding, and d_c is the dimension of the Roberta hidden layer;
the transformed radical vector and the character vector are then added to obtain the character-radical vector representation:

f_i = e_i^c + ẽ_i^r #(3-2)

finally, the result is output through a dropout layer and a normalization layer, and the character sequence and the radical sequence are fused into the vector sequence F = {f_1, f_2, …, f_n}.
3. The Roberta-radical-enhanced adapter-based named entity recognition model according to claim 1, wherein the radical-enhanced Roberta model comprises means for performing the following steps:
a special identifier [CLS] is added to the beginning of the input, and sentences are separated by the [SEP] separator; the input sequence then passes through three-part embedding to obtain the sequence representation, each input character being formed by adding its token embedding, segment embedding, and position embedding, so a character E_t in the sequence is formed as:

E_t = E_token_emb + E_seg_emb + E_pos_emb #(3-3)

the core of the Roberta model consists of 12 layers of Transformer encoders, in which the output vector corresponding to [CLS] serves as the semantic representation of the whole text;
the radical-adapter-enhanced Roberta injects the radical adapter into a layer of Roberta, connecting the radical adapter between Transformer layers inside Roberta and thereby injecting external radical knowledge into Roberta;
for a given text of n characters, the character sequence C = {c_1, c_2, …, c_n} is matched against the radical dictionary to obtain the corresponding radical sequence R = {r_1, r_2, …, r_n}; the character sequence is then input to the embedding layer of Roberta, and the resulting embedded representation is input to the Transformer encoder; to inject dictionary information between the k-th and (k+1)-th Transformer layers, the output of the first k Transformer layers is obtained first, H^k = {h_1^k, h_2^k, …, h_n^k}; each character-radical pair then passes through the radical adapter to obtain the character-radical representation, the i-th character h_i^k and the i-th radical embedding e_i^r passing through the radical adapter as:

f_i = RA(h_i^k, e_i^r) #(3-4)

the sequence F = {f_1, f_2, …, f_n} obtained from the radical adapter is then input to the remaining 12-k Transformer layers, finally obtaining the output T = {t_1, t_2, …, t_n}.
4. The Roberta-radical-enhanced-adapter-based named entity recognition model of claim 1, wherein the conditional random field comprises means for performing the following steps:
given the output of the last layer of Roberta, T = {t_1, t_2, …, t_n}, the score of the predicted sequence is first calculated as:

O = W_o T + b_o #(3-5)

then, for a tag sequence y = {y_1, y_2, …, y_n}, the probability is defined as:

p(y|T) = exp(Σ_i (O_{i,y_i} + Q_{y_{i-1},y_i})) / Σ_{ỹ∈Ỹ} exp(Σ_i (O_{i,ỹ_i} + Q_{ỹ_{i-1},ỹ_i})) #(3-6)

wherein Q is the transition matrix, Q_{y_{i-1},y_i} is the transition score from label y_{i-1} to label y_i, O_{i,y_i} is the score of character t_i being predicted as label y_i, and Ỹ is the set of all possible tag sequences; the numerator is the score of the current tag sequence being the correct sequence, and the denominator sums the scores of all sequences;
given N labeled examples {(T_j, y_j)}_{j=1}^{N}, the model is trained by minimizing the sentence-level negative log-likelihood loss:

L = -Σ_{j=1}^{N} log p(y_j | T_j) #(3-7)

finally, in the decoding process, the Viterbi algorithm is adopted to find the tag sequence with the highest score:

y* = argmax_{ỹ∈Ỹ} score(T, ỹ) #(3-8)

wherein y* is the sequence that maximizes the score function among all tag sequences.
5. A method for named entity recognition using the Roberta radical-enhanced adapter based named entity recognition model according to any of claims 1-4, comprising the steps of:
Step 1: use the radical adapter to feed radical features into the bottom layers of Roberta so that the feature information is fully fused; the adapter input is divided into two parts, characters and radicals; the radical vector representation is aligned with the character vector through a nonlinear transformation, the aligned radical vector is then combined with the character vector to obtain the character-radical pair representation, and the combined vector representation finally passes through a normalization layer to output the final result;
Step 2: radical-enhanced Roberta: connect the radical adapter between Transformer layers inside Roberta, thereby injecting external radical knowledge into Roberta;
Step 3: use the conditional random field to find the tag sequence path with the maximum probability for the input sequence.
6. The method according to claim 5, wherein the step 1 comprises the following specific steps:
Step 1.1: first, for a text of n characters, the character sequence is represented by the output vectors of the Roberta encoding layer, E^c = {e_1^c, e_2^c, …, e_n^c}, and the radical information corresponding to the character sequence is encoded as E^r = {e_1^r, e_2^r, …, e_n^r}; to align these two vector representations, the radical vector is non-linearly transformed, with the i-th element:

ẽ_i^r = W_2 σ(W_1 e_i^r + b_1) + b_2 #(3-1)

wherein σ(·) is a nonlinear activation, W_1 is a matrix of dimension d_c × d_r, W_2 is a matrix of dimension d_c × d_c, b_1 and b_2 are bias terms, d_r is the dimension of the radical embedding, and d_c is the dimension of the Roberta hidden layer;
Step 1.2: the transformed radical vector and the character vector are then added to obtain the character-radical vector representation:

f_i = e_i^c + ẽ_i^r #(3-2)

Step 1.3: finally, the result is output through a dropout layer and a normalization layer, and the character sequence and the radical sequence are fused into the vector sequence F = {f_1, f_2, …, f_n}.
7. The method according to claim 5, wherein the step 2 comprises the following specific steps:
for a given text of n characters, the character sequence C = {c_1, c_2, …, c_n} is matched against the radical dictionary to obtain the corresponding radical sequence R = {r_1, r_2, …, r_n}; the character sequence is then input to the embedding layer of Roberta, and the resulting embedded representation is input to the Transformer encoder; to inject dictionary information between the k-th and (k+1)-th Transformer layers, the output of the first k Transformer layers is obtained first, H^k = {h_1^k, h_2^k, …, h_n^k}; each character-radical pair then passes through the radical adapter to obtain the character-radical representation, the i-th character h_i^k and the i-th radical embedding e_i^r passing through the radical adapter as:

f_i = RA(h_i^k, e_i^r) #(3-4)

the sequence F = {f_1, f_2, …, f_n} obtained from the radical adapter is then input to the remaining 12-k Transformer layers, finally obtaining the output T = {t_1, t_2, …, t_n}.
8. The method according to claim 5, wherein the step 3 comprises the following specific steps:
given the output of the last layer of Roberta, T = {t_1, t_2, …, t_n}, the score of the predicted sequence is first calculated as:

O = W_o T + b_o #(3-5)

then, for a tag sequence y = {y_1, y_2, …, y_n}, the probability is defined as:

p(y|T) = exp(Σ_i (O_{i,y_i} + Q_{y_{i-1},y_i})) / Σ_{ỹ∈Ỹ} exp(Σ_i (O_{i,ỹ_i} + Q_{ỹ_{i-1},ỹ_i})) #(3-6)

wherein Q is the transition matrix, Q_{y_{i-1},y_i} is the transition score from label y_{i-1} to label y_i, O_{i,y_i} is the score of character t_i being predicted as label y_i, and Ỹ is the set of all possible tag sequences; the numerator is the score of the current tag sequence being the correct sequence, and the denominator sums the scores of all sequences;
given N labeled examples {(T_j, y_j)}_{j=1}^{N}, the model is trained by minimizing the sentence-level negative log-likelihood loss:

L = -Σ_{j=1}^{N} log p(y_j | T_j) #(3-7)

finally, in the decoding process, the Viterbi algorithm is adopted to find the tag sequence with the highest score:

y* = argmax_{ỹ∈Ỹ} score(T, ỹ) #(3-8)

wherein y* is the sequence that maximizes the score function among all tag sequences.
CN202211389670.8A 2022-11-08 2022-11-08 Named entity recognition model and method based on Roberta radical enhanced adapter Pending CN115859978A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211389670.8A CN115859978A (en) 2022-11-08 2022-11-08 Named entity recognition model and method based on Roberta radical enhanced adapter


Publications (1)

Publication Number Publication Date
CN115859978A true CN115859978A (en) 2023-03-28

Family

ID=85662712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211389670.8A Pending CN115859978A (en) 2022-11-08 2022-11-08 Named entity recognition model and method based on Roberta radical enhanced adapter

Country Status (1)

Country Link
CN (1) CN115859978A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116341557A (en) * 2023-05-29 2023-06-27 华北理工大学 Diabetes medical text named entity recognition method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination