CN111191453A

CN111191453A - Named entity recognition method based on adversarial training

Info

Publication number: CN111191453A
Application number: CN201911358738.4A
Authority: CN
Inventors: 袁超逸; 刘忠麟; 王立才; 张起闻; 罗琪彬; 郝韫宏; 李孟书
Original assignee: CETC 15 Research Institute
Current assignee: CETC 15 Research Institute
Priority date: 2019-12-25
Filing date: 2019-12-25
Publication date: 2020-05-22

Abstract

The invention discloses a named entity recognition method based on adversarial training. Relevance features among characters in judicial-domain text are obtained by training a RoBERTa model, and a second set of relevance features is obtained by training a Bi-LSTM. The two sets of features are then concatenated, and a conditional random field model predicts labels for the training samples to give the prediction result. The method combines external word vectors and character vectors of different dimensions with mixed word and character vectors of different dimensions from judicial-domain text, and applies adversarial perturbation to the mixed vectors of the judicial-domain text, thereby improving the recognition accuracy of the model.

Description

Named entity recognition method based on adversarial training

Technical Field

The invention belongs to the technical field of named entity recognition, and particularly relates to a named entity recognition method based on adversarial training.

Background

Named entity recognition has been widely applied, and different fields have optimized it to different degrees. Traditional named entity recognition requires a large amount of manual effort to extract features for a specific field and then uses a probabilistic graphical model for recognition. With the rise of deep learning in recent years, many fields have explored deep learning methods for named entity recognition; extensive exploration and practice in the financial, medical and legal fields has reduced labor costs and improved accuracy. How to use this information is particularly critical. Entities with specific meanings in a given field, such as suspects, defendants and plaintiffs in judicial texts, can be recognized and extracted, which lays an important foundation for later tasks such as information extraction, question-answering systems, syntactic analysis, knowledge reasoning and knowledge graph construction.

Currently, the main methods of named entity recognition in the judicial field fall into three main categories:

The first category is based on probabilistic graphical models. It mainly uses a conditional random field (CRF) model, which models the conditional probability distribution of a set of output sequences given a set of input sequences. Labeled domain-specific data is used as input, the corresponding features are extracted manually and the corresponding rules are set, so that unlabeled text can then be recognized.

The second category is based on deep learning. It mainly uses a bidirectional long short-term memory network (Bi-LSTM) model together with word-embedding information. Feeding labeled domain-specific data into the Bi-LSTM greatly reduces manual work and achieves higher accuracy.

The third category combines deep learning with traditional methods. Given domain-specific text and a domain-specific vocabulary, it uses a word vector training technique such as Word2Vec or GloVe. A language model is built from the domain text by factoring the joint probability into a product of conditional probabilities, $P(w_1, w_2, \ldots, w_n) = P(w_1)\,P(w_2 \mid w_1) \cdots P(w_n \mid w_1, \ldots, w_{n-1})$, and the Markov assumption is used to greatly reduce the number of parameters. Each word in the vocabulary corresponds to a word vector; the probabilistic model takes a sequence of word vectors as input and outputs the joint probability of the text, and the word vector weights are learned in the process. A simple neural network $f(w_{t-n+1}, \ldots, w_t)$ is constructed to fit the conditional probability $P(w_t \mid w_1, \ldots, w_{t-1})$. The words are fed into a linear embedding layer with a trainable parameter matrix C, and word vectors for the domain text are obtained by sliding windows of different sizes over the whole domain text. Two training methods are available: the skip-gram model and the continuous bag-of-words (CBOW) model. Once the word vectors are obtained, they are fed into a Bi-LSTM layer, whose hidden state at each time step gives a representation of the context. Finally, a CRF layer uses the surrounding information to assign the corresponding labels effectively. The model is shown in FIG. 1.
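The chain-rule factorization and the Markov assumption described above can be illustrated with a minimal count-based sketch. It assumes a first-order (bigram) Markov model with maximum-likelihood estimates; the function names and the toy sentence are hypothetical and are not part of the prior-art methods described here, which learn the probabilities with a neural network instead of counts.

```python
from collections import Counter


def train_bigram(tokens):
    """Count unigrams and adjacent-pair bigrams in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def sequence_probability(seq, unigrams, bigrams):
    """Approximate P(w_1..w_n) as P(w_1) * prod_i P(w_i | w_{i-1}).

    The full chain rule conditions on every earlier word; the first-order
    Markov assumption keeps only the previous word, so the model needs
    pair counts instead of counts for every possible history.
    """
    total = sum(unigrams.values())
    prob = unigrams[seq[0]] / total
    for prev, cur in zip(seq, seq[1:]):
        if unigrams[prev] == 0:
            return 0.0  # unseen history: no estimate is available
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob


# Hypothetical toy corpus, used only for illustration.
corpus = ["the", "court", "heard", "the", "plaintiff"]
uni, bi = train_bigram(corpus)
p = sequence_probability(["the", "court"], uni, bi)
```

Here P("the") is 2/5 and P("court" | "the") is 1/2, so the sequence probability is their product.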

Existing domain-specific named entity models combine Bi-LSTM and CRF models, but their feature extraction ability is not strong enough. The Bi-LSTM simply models the sequence from left to right and from right to left and concatenates the hidden states, so each direction can use only the preceding or the following context, and the two cannot be used simultaneously. In addition, the amount of text in a specific field is limited, and there is not enough data to improve model performance.

With the appearance of the BERT model, applications have gradually emerged in various fields, but not yet in specific fields such as this one. Moreover, the tokens predicted by BERT and its successor RoBERTa are assumed to be mutually independent, which leads to drawbacks such as a loss of model performance during fine-tuning; the required data scale is large, and the accuracy of the model can hardly be improved.

Disclosure of Invention

In view of the above, the invention provides a named entity recognition method based on adversarial training. It combines external word vectors and character vectors of different dimensions with mixed word and character vectors of different dimensions from judicial-domain text, and applies adversarial perturbation to the mixed vectors of the judicial-domain text, thereby increasing the accuracy of model recognition.

The technical scheme for realizing the invention is as follows:

a named entity recognition method based on adversarial training comprises the following steps:

step one, segmenting judgment documents in the judicial field into single characters serving as training samples, and training a RoBERTa model on them to obtain relevance features among the characters in the judicial field;

segmenting the judgment documents in the judicial field into single characters and words, converting the words into word vectors by using the Word2Vec method, and converting the single characters into character-based vectors by using the fastText method; introducing word vectors obtained by using the Word2Vec method outside the judicial field, and character-based vectors obtained by using the fastText method outside the judicial field; and mixing all of these vectors;

step two, perturbing the mixed vector matrix, finding the worst-case perturbation by maximizing the loss function, and obtaining the optimal robust parameters of the model through the outer empirical risk minimization, so as to obtain vectors optimized by adversarial perturbation;

step three, inputting the vectors obtained in step two into the Bi-LSTM by using a sliding window of length a, and obtaining relevance features among the words in the judicial field through training of the Bi-LSTM;

and step four, concatenating the two sets of relevance features obtained in step one and step three, and then predicting the training samples by using a conditional random field model to obtain the prediction result.

Further, 1000< a < 2000.
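The windowing in step three can be sketched as follows. The text fixes only the window length a; the stride is not stated, so this sketch assumes non-overlapping windows by default (stride equal to a) and keeps a shorter final window. The function name and both choices are assumptions made for illustration.

```python
def sliding_windows(vectors, a, stride=None):
    """Split a sequence of vectors into windows of length a.

    stride defaults to a (non-overlapping windows), which is an
    assumption; the last window may be shorter than a.
    """
    if a <= 0:
        raise ValueError("window length a must be positive")
    stride = a if stride is None else stride
    if stride <= 0:
        raise ValueError("stride must be positive")
    return [vectors[i:i + a] for i in range(0, len(vectors), stride)]


# Toy example with a = 2; the method itself uses 1000 < a < 2000.
windows = sliding_windows(list(range(5)), 2)
```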

Advantageous effects:

1. The method combines external word vectors and character vectors of different dimensions with mixed word and character vectors of different dimensions from judicial-domain text, which enriches the training samples for named entity recognition in the judicial field.

2. The method extracts features from judicial-domain text by using RoBERTa, fuses them with word vectors of different dimensions, combines the result with the Bi-LSTM features to obtain the corresponding features, and obtains the result by using a CRF.

3. The method applies adversarial perturbation to the mixed vectors of the judicial-domain text, which increases the generalization ability and robustness of the model.

Drawings

FIG. 1 is a diagram of the Bi-LSTM architecture.

FIG. 2 is a schematic diagram of the RoBERTa model architecture.

FIG. 3 is a diagram of the word vector model according to the present invention.

FIG. 4 is a diagram of the named entity recognition model architecture of the present invention.

Detailed Description

The invention is described in detail below by way of example with reference to the accompanying drawings.

The invention provides a named entity recognition method based on adversarial training; the specific process, shown in FIG. 4, is as follows:

Step one: the invention introduces a RoBERTa model into the judicial field. Each judicial-domain text is first segmented and fed into RoBERTa character by character. A self-attention mechanism assigns different weights to different characters: assuming the input matrix is X, with a maximum of 512 embedded tokens, different weight matrices $W_q$, $W_k$ and $W_v$ are applied, and a self-attention matrix Z is obtained through softmax. A multi-head mechanism gives the attention layer multiple representation subspaces, and the resulting matrices Z are concatenated. Finally, the corresponding features C are extracted with dynamic masking of some of the characters, as shown in FIG. 2.
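A single attention head of the kind described above can be sketched in plain Python. The text names only X, the weight matrices W_q, W_k, W_v, the softmax and the output Z; the division of the scores by the square root of the key dimension follows the usual Transformer formulation and is an assumption here, as are the helper names.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Return (Z, attention weights) for one attention head."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]          # K transposed
    scores = matmul(q, k_t)                       # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy input: two tokens with two-dimensional embeddings, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
z, attn = self_attention([[1.0, 0.0], [0.0, 1.0]], identity, identity, identity)
```

Each row of the attention weights sums to 1, and each row of Z is the corresponding weighted average of the value vectors. A multi-head layer would run several such heads with different weight matrices and concatenate their Z matrices.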

In the judicial field, the corresponding body of text is not very large and only limited data exists. The invention therefore introduces character-based fastText and word-based Word2Vec and constructs the embedding layer as follows. The judicial-domain text is converted into N-1 one-hot vectors, and all of the one-hot vectors are passed through an N×V matrix, where N is a user-set dimension and V is the size of the dictionary; the resulting vectors are summed, averaged and multiplied by the output weight matrix to obtain the corresponding probability distribution. The N×V matrices are the character and word vector matrices $W_1$ and $W_2$, for which different dimensions are specified. The character-based matrix compensates for the scarcity of professional vocabulary in the judicial field, while the word-based matrix consists of judicial-domain words and can therefore provide more accurate prior knowledge. Larger external general-purpose word and character vector matrices $W_3$ and $W_4$ are also introduced, and all four are concatenated as $[W_1, W_2, W_3, W_4]$ to obtain an information-rich feature vector. This overcomes the problem that the small amount of judicial-domain text cannot yield good results on its own. The model is shown in FIG. 3.
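The concatenation of the four vector tables can be sketched as a per-character lookup. Several details are assumptions made only for illustration: each character is paired with the word that contains it, out-of-vocabulary entries fall back to a zero vector, and the order of the four tables in the concatenation is arbitrary. The table contents are hypothetical toy values.

```python
def lookup(table, key, dim):
    """Return the vector for key, or a zero vector if key is unknown."""
    return table.get(key, [0.0] * dim)


def mixed_vector(char, word, tables, dims):
    """Concatenate four lookups for one character and its containing word.

    tables = (domain_char, domain_word, external_word, external_char);
    dims gives the (possibly different) dimension of each table.
    """
    dom_char, dom_word, ext_word, ext_char = tables
    d_dc, d_dw, d_ew, d_ec = dims
    return (lookup(dom_char, char, d_dc) + lookup(dom_word, word, d_dw)
            + lookup(ext_word, word, d_ew) + lookup(ext_char, char, d_ec))


# Hypothetical toy tables with different dimensions per source.
tables = (
    {"原": [0.1, 0.2]},          # judicial-domain character vectors
    {"原告": [0.3]},             # judicial-domain word vectors
    {"原告": [0.4, 0.5, 0.6]},   # external word vectors
    {},                          # external character vectors (empty here)
)
vec = mixed_vector("原", "原告", tables, dims=(2, 1, 3, 2))
```

The resulting vector has dimension 2 + 1 + 3 + 2 = 8, with zeros in the last two positions because the character is missing from the external character table.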

Step two: adversarial training on the mixed vector matrix is used to perturb the vectors. Let the mixed vector matrix $[v_1, v_2, \ldots, v_T]$ be x. The perturbation applied to it is

$$\gamma_{adv} = \epsilon \cdot \frac{g}{\lVert g \rVert_2},$$

and the function to be optimized is

$$\min_{\theta} \; E_{(x, y) \sim \mathcal{D}} \left[ \max_{\gamma \in S} L\big(f_\theta(x + \gamma),\, y\big) \right].$$

The inner max finds the perturbation and the outer min finds the optimal robust parameters. L is the loss function, and the non-convex constrained optimization problem of the inner max is solved by the Fast Gradient Method to obtain the corresponding result. Here $\gamma_{adv}$ is the value of the perturbation, $\epsilon$ is the perturbation coefficient, g is the gradient with respect to x, $\mathcal{D}$ is the range of the samples, y is the predicted value, $\theta$ denotes the parameters of the classifier, E is the empirical risk function, S is the range of the perturbation, and $f_\theta$ is the mapping function of the language model encoder.
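The Fast Gradient Method perturbation can be illustrated on a toy model. This sketch assumes a logistic-regression classifier whose input gradient is known in closed form; the real method computes g by backpropagation through the full network, and the function names and numbers below are hypothetical.

```python
import math


def loss_and_grad_x(x, theta, y):
    """Logistic loss L and its gradient g with respect to the input x."""
    z = sum(t * v for t, v in zip(theta, x))
    p = 1.0 / (1.0 + math.exp(-z))
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    grad = [(p - y) * t for t in theta]
    return loss, grad


def fgm_perturbation(grad, epsilon):
    """gamma_adv = epsilon * g / ||g||_2 (zero if the gradient vanishes)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]


# One adversarial step on a toy embedding x with label y = 1.
x, theta, y = [0.5, -1.0], [1.0, 2.0], 1
clean_loss, g = loss_and_grad_x(x, theta, y)
gamma = fgm_perturbation(g, epsilon=0.1)
x_adv = [a + b for a, b in zip(x, gamma)]
adv_loss, _ = loss_and_grad_x(x_adv, theta, y)
```

The perturbation has L2 norm equal to epsilon and moves x in the direction that increases the loss, so the adversarial loss exceeds the clean loss. In training, the parameters would then be updated on the perturbed input, which corresponds to the outer minimization.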

Step three: the Bi-LSTM model can increase the number of features in the contextual hidden vectors by using different windows. The concatenated vectors are input into the Bi-LSTM. The forget gate

$$f_t = \sigma(W_f * [h_{t-1}, x_t] + b_f)$$

decides whether to forget the old information. The input gate

$$i_t = \sigma(W_i * [h_{t-1}, x_t] + b_i)$$

uses the sigmoid function to decide which values to update, and a new candidate value is constructed:

$$\tilde{C}_t = \tanh(W_C * [h_{t-1}, x_t] + b_C).$$

The update gate

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$

then decides whether to update the state. Finally, the output gate

$$o_t = \sigma(W_o * [h_{t-1}, x_t] + b_o), \qquad h_t = o_t * \tanh(C_t)$$

gives the corresponding probability distribution. In these formulas, $x_t$ is the matrix of character or word vectors input in sequence, $f_t$ is the value obtained through the forget gate, $i_t$ is the value obtained through the input gate, $\tilde{C}_t$ is the updated candidate value, $C_t$ is the state value after the update gate, that is, the current memory state, $o_t$ is the output value, $h_t$ is the current hidden state, $\sigma$ is the sigmoid function, which maps values to between 0 and 1, tanh compresses values to between -1 and 1, $h_{t-1}$ is the hidden state at the previous time step, b is a bias term, W is a weight matrix, and $C_{t-1}$ is the previous memory state. Language models are constructed from left to right and from right to left to obtain the hidden states $h_{t1}$ and $h_{t2}$, and the concatenation of the two is the hidden state $H_t$.
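The gate computations and the bidirectional concatenation can be sketched with plain lists. This is a minimal illustration that assumes the standard LSTM cell; the parameter dictionary layout and the helper names are hypothetical, and a practical implementation would use a tensor library with trained weights.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(w, b, v):
    """Return W v + b for a matrix W given as a list of rows."""
    return [sum(wi * vi for wi, vi in zip(row, v)) + bi
            for row, bi in zip(w, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p holds W_f, W_i, W_c, W_o and matching biases."""
    concat = h_prev + x_t                                   # [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(p["W_f"], p["b_f"], concat)]
    i_t = [sigmoid(a) for a in affine(p["W_i"], p["b_i"], concat)]
    c_hat = [math.tanh(a) for a in affine(p["W_c"], p["b_c"], concat)]
    c_t = [f * c + i * ch for f, c, i, ch in zip(f_t, c_prev, i_t, c_hat)]
    o_t = [sigmoid(a) for a in affine(p["W_o"], p["b_o"], concat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def run_direction(xs, p, hidden):
    """Run the cell over xs from first to last, collecting hidden states."""
    h, c, out = [0.0] * hidden, [0.0] * hidden, []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, p)
        out.append(h)
    return out


def bilstm(xs, p_fwd, p_bwd, hidden):
    """Concatenate left-to-right and right-to-left states into H_t."""
    fwd = run_direction(xs, p_fwd, hidden)
    bwd = run_direction(xs[::-1], p_bwd, hidden)[::-1]
    return [a + b for a, b in zip(fwd, bwd)]
```

With a hidden size of d in each direction, every H_t returned by `bilstm` has dimension 2d.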

Step four: the feature C extracted by the RoBERTa model and the hidden state obtained by the Bi-LSTM are concatenated to obtain the feature matrix $[C, H_t]$, where C is the feature extracted between characters and $H_t$ is the concatenated hidden state of the Bi-LSTM model. This addresses the independence assumption introduced by the RoBERTa model and compensates for the feature loss caused by the input length limit of the RoBERTa model. The CRF takes the constraint relationships between labels into account and obtains the named entity recognition result by using the Viterbi algorithm.
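The Viterbi decoding used by the CRF layer can be sketched as follows. The sketch assumes that per-position emission scores and a tag-to-tag transition score matrix are already available; the three-tag scheme (O, B, I) and all scores are hypothetical toy values.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag path and its score.

    emissions[t][j] is the score of tag j at position t;
    transitions[i][j] is the score of moving from tag i to tag j.
    """
    n_tags = len(emissions[0])
    score = list(emissions[0])
    backptr = []
    for em in emissions[1:]:
        step, new_score = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags),
                         key=lambda i: score[i] + transitions[i][j])
            step.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + em[j])
        backptr.append(step)
        score = new_score
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for step in reversed(backptr):      # follow back-pointers to the start
        path.append(step[path[-1]])
    path.reverse()
    return path, score[best]


# Tags: 0 = O, 1 = B, 2 = I. A large penalty forbids the O -> I transition.
emissions = [[2, 1, 0], [0, 0, 3], [1, 0, 0]]
transitions = [[0, 0, -100], [0, 0, 0], [0, 0, 0]]
path, total = viterbi(emissions, transitions)
```

Taking the best tag at each position independently would give O, I, O, which violates the label constraint; with the transition penalty the decoder returns B, I, O instead.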

In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A named entity recognition method based on adversarial training, characterized by comprising the following steps:

2. The method of claim 1, wherein 1000< a < 2000.