CN114510943B - Incremental named entity recognition method based on pseudo sample replay


Info

Publication number
CN114510943B
CN114510943B
Authority
CN
China
Prior art keywords
old
word
model
knowledge
new
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210150846.8A
Other languages
Chinese (zh)
Other versions
CN114510943A (en)
Inventor
Xia Yu (夏宇)
Li Sujian (李素建)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN202210150846.8A
Publication of CN114510943A
Application granted
Publication of CN114510943B
Active legal-status Current
Anticipated expiration legal-status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G06F 40/242 Dictionaries
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/08 Learning methods


Abstract

The invention discloses an incremental named entity recognition method based on pseudo-sample replay, which underpins knowledge graph construction and belongs to the technical field of information extraction in natural language processing. In the learning stage, given a training set that labels only the new entity types, the old model serves as a teacher and a knowledge distillation loss is added to the conventional cross-entropy loss when training the new student model. In the review stage, pseudo samples are generated as review material for the old types; the old-type knowledge is reawakened by further distillation on this review material and integrated with the new knowledge. On the review material, the student model obtained in the learning stage provides supervision signals for the new types and the teacher provides supervision signals for the old types; with both new- and old-type supervision signals available, they are used to constrain the output of the new model on the review material.

Description

Incremental named entity recognition method based on pseudo sample replay
Technical Field
The invention provides an incremental named entity recognition technique, in particular a named entity recognition method based on pseudo-sample replay; it underpins knowledge graph construction and belongs to the technical field of information extraction in natural language processing.
Background
Traditional named entity recognition refers to extracting entities of specified categories (e.g., person names, place names, organization names) from unstructured text, and is one of the important steps of information extraction. Conventional approaches are limited to extracting entities of predefined categories; in reality, however, the categories of entities to be extracted tend to expand dynamically with demand. For example, new intents are sometimes encountered in dialogue systems, introducing new entity types, which requires the model to be able to identify a dynamically expanding set of entity types. To adapt to this scenario, a simple approach is to label a dataset with all entity types seen so far and use it to train a new model; however, this approach is too demanding in annotation effort, consumes too many computational resources, and may even be infeasible when there are particularly many entity types. Monaikul et al. therefore propose a setting with low annotation and computation requirements: at each step only one dataset labeled with the new entity types is provided, and a new model is trained with the help of the old-type entity knowledge stored in the old model.
This learning paradigm is also referred to as continual learning (lifelong learning, incremental learning) and, more specifically, belongs to class-incremental continual learning. However, continual learning techniques still have a certain gap from practical application, and the biggest challenge is the problem of catastrophic forgetting, i.e., the dramatic drop in a model's performance on old tasks when it learns new tasks. The cause of catastrophic forgetting is that, unlike humans, neural networks store task knowledge in their parameters; as the network learns new entity types, it inevitably updates parameters associated with old tasks, which degrades performance on those tasks. In addition to catastrophic forgetting, class-incremental continual learning also faces the class confusion problem, meaning that the model cannot distinguish different classes well. The problem arises because samples of different categories appear in different tasks: the model sees only part of the categories each time it is trained, and never models all categories at the same time.
Because there is no unified benchmark dataset for measuring named entity recognition in the continual learning scenario, the settings of related works are inconsistent; the setting proposed by Monaikul et al. fits practical application scenarios best. Monaikul et al. convert existing traditional named entity recognition datasets into a class-incremental setting: suppose that at step k the goal is to learn a new set of entity types E_k; the provided training dataset D_k labels only entities belonging to E_k, and entities of other, older types need not be labeled. To learn the new types without forgetting the old ones, Monaikul et al. regard the old model as a teacher and add a knowledge distillation loss to the conventional cross-entropy loss when training the new student model; the purpose of the knowledge distillation loss is to constrain the output of the student model on the old types with the output of the teacher model, so as to prevent the student model from forgetting the old knowledge. Although the above method has achieved initial success, it has the following drawback: the distillation-based approach relies on the number of old-type entities in the training dataset D_k; if D_k contains no old-type entities, it is difficult for the teacher model to distill the old knowledge into the student model.
Disclosure of Invention
In order to solve the problems of catastrophic forgetting and class confusion, the invention proposes a two-stage training framework, Learn-and-Review (L&R), which is inspired by the human learning process and introduces a "review stage" after the conventional "learning stage".
The technical scheme provided by the invention is as follows:
Referring to FIG. 1, the named entity recognition method based on pseudo-sample replay provided by the invention comprises a learning stage and a review stage. In the learning stage, given a training set that labels only the new entity types, the old model serves as a teacher and a knowledge distillation loss is added to the conventional cross-entropy loss when training the new student model. In the review stage, pseudo samples are generated as review material for the old types; the old-type knowledge is reawakened by further distillation on the review material and integrated with the new knowledge. The method specifically comprises the following steps:
1) In the learning stage, at step k, the current dataset D_k and the models M_{k-1} and G_{1:k-1} obtained in previous steps are available;
2) M_{k-1} is regarded as the teacher and the current model M̃_k as the student, and knowledge of the old entity types in M_{k-1} is distilled into M̃_k by knowledge distillation;
3) In the review stage, for each old task i ∈ {1, 2, …, k-1}, unlabeled text containing the old types E_i is generated;
4) The unlabeled text is fed to M_{k-1} and to the student M̃_k obtained in the first stage, yielding the output probability distributions P(x_i | θ_{k-1}, T) and P(x_i | θ̃_k, T) over all seen entity types;
5) The dimensions corresponding to the old seen types are taken from the output distribution of M_{k-1}, the dimensions corresponding to the new types are taken from the output distribution of M̃_k, and they are concatenated to obtain the pseudo target distribution q(x_i);
6) After the review stage, a model M_k is obtained that identifies all seen entity types E_{1:k}; the KL divergence between q(x_i) and the output distribution of M_k is calculated as the distillation loss:
L_pseudo = Σ_i KL( q(x_i) ∥ P(x_i | θ_k, T) )
7) Each word in dataset D_k is divided into two categories: words with an entity tag and words without one. For words with entity tags, the cross-entropy loss between the output of M_k and the entity tag is calculated:
L_CE = -Σ_{i: y_i ≠ O} log p(y_i | x_i; θ_k)
For words tagged O, the KL divergence between the output distribution of M_{k-1} and the output distribution of M_k is calculated:
L_KD = Σ_{i: y_i = O} KL( P(x_i | θ_{k-1}, T) ∥ P(x_i | θ_k, T) )
where P(x_i | θ_{k-1}, T) and P(x_i | θ_k, T) denote the output distributions of M_{k-1} and M_k respectively, and T denotes the temperature used in distillation to obtain smoother probability distributions;
8) The weighted sum of the three loss functions gives the total loss function in the review stage:
L_total = α·L_CE + β·L_KD + γ·L_pseudo
The invention generates old-type unlabeled text as review material, uses the student model obtained in the learning stage to provide new-type supervision signals on this material and the teacher to provide old-type supervision signals; with both new- and old-type supervision signals available, they are used to constrain the output of the new model on the review material.
Drawings
FIG. 1 is the overall framework of the invention;
FIG. 2 shows the dataset statistics;
FIG. 3 shows the main experimental results.
Detailed Description
The invention comprises a main model (M) for named entity recognition and a generator (G) for generating pseudo samples.
The main model: named entity recognition is typically modeled as a sequence labeling task, i.e., each word is assigned a tag. The main model of the invention consists of a feature extractor and a classification layer: the feature extractor adopts the pre-trained language model BERT-base, and the classification layer is a linear layer with softmax. Given a word sequence [x_1, x_2, ..., x_L] of length L and the tag [y_1, y_2, ..., y_L] of each word, the hidden vector [h_1, h_2, ..., h_L] of each word is first obtained through the feature extractor, the hidden vectors are then mapped to the label space [z_1, z_2, ..., z_L] through the linear layer, and the probability [p_1, p_2, ..., p_L] of each word over all types is obtained through softmax:
z_i = W h_i + b
p_i = softmax(z_i)
where W ∈ R^{m×d}, d is the hidden vector size of the pre-trained language model (d = 768), b ∈ R^m, and m is the size of the tag set, which depends on the tag scheme employed; the invention adopts the BIO scheme, so m = 2n+1, where n is the number of entity types and increases dynamically at each step.
The training objective of the main model is the cross-entropy loss, which encourages the model to correctly predict the tag of each word:
L_CE(θ) = -Σ_{i=1}^{L} log p(y_i | x_i; θ)
where p(y_i | x_i; θ) is the probability that word x_i belongs to tag y_i, and θ denotes all trainable parameters.
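For concreteness, the following is a minimal PyTorch sketch of such a tagging model (BERT-base feature extractor, linear classification layer with softmax, and the cross-entropy objective), assuming the Hugging Face transformers library; the class and function names are illustrative and not part of the invention.

    import torch.nn as nn
    from transformers import BertModel

    class NERTagger(nn.Module):
        def __init__(self, num_labels, bert_name="bert-base-cased"):
            super().__init__()
            self.encoder = BertModel.from_pretrained(bert_name)   # feature extractor
            hidden = self.encoder.config.hidden_size              # d = 768 for BERT-base
            self.classifier = nn.Linear(hidden, num_labels)       # z_i = W h_i + b

        def forward(self, input_ids, attention_mask):
            h = self.encoder(input_ids=input_ids,
                             attention_mask=attention_mask).last_hidden_state
            return self.classifier(h)                              # logits over the tag set

    def cross_entropy_loss(logits, labels, ignore_index=-100):
        # Encourages the model to predict the correct tag of every word.
        return nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=ignore_index)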
The generator is a language model consisting of an embedding layer, an LSTM layer and a classifier. Given a word sequence [x_1, x_2, ..., x_L] of length L, the word vector of each word is first obtained through the embedding layer; the invention adopts the FastText word vectors disclosed in Joulin A, Grave E, Bojanowski P, et al. Fasttext.zip: Compressing text classification models [J] (arXiv preprint arXiv:1612.03651, 2016). A hidden vector [h_1, h_2, ..., h_L] incorporating context information is then obtained through the LSTM layer, and finally the probability of the next word is obtained through a linear layer with softmax:
z_i = W h_i + b
P(x_{i+1} | x_1, ..., x_i) = [softmax(z_i)]_{index(x_{i+1})}
where z_i ∈ R^V, V is the size of the dictionary, determined by the dataset, and index(x_i) denotes the position of word x_i in the dictionary.
The training objective of the generator is a language modeling loss that minimizes the negative log-likelihood of predicting the next word:
L_LM = -Σ_i log P(x_{i+1} | x_1, ..., x_i)
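The following is a minimal sketch of such an LSTM language-model generator together with a simple sampling routine for producing unlabeled pseudo text; the FastText initialization of the embedding layer is omitted, and all dimensions and names are illustrative assumptions.

    import torch
    import torch.nn as nn

    class LMGenerator(nn.Module):
        def __init__(self, vocab_size, emb_dim=300, hidden_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)   # may be initialized from FastText
            self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, vocab_size)    # z_i = W h_i + b

        def forward(self, input_ids):
            h, _ = self.lstm(self.embed(input_ids))
            return self.head(h)                              # logits over the dictionary

        @torch.no_grad()
        def sample(self, bos_id, max_len=30, temperature=1.0):
            # Stochastic generation of one unlabeled pseudo sentence.
            ids = [bos_id]
            for _ in range(max_len):
                logits = self.forward(torch.tensor([ids]))[0, -1] / temperature
                ids.append(torch.multinomial(torch.softmax(logits, dim=-1), 1).item())
            return ids

    def lm_loss(logits, input_ids):
        # Negative log-likelihood of predicting each next word.
        return nn.functional.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)), input_ids[:, 1:].reshape(-1))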
Learning stage of the invention
Assume that at step k the available resources include the current dataset D_k and the models M_{k-1} and G_{1:k-1} obtained in previous steps. The goal of the learning stage is to obtain a model M̃_k that can identify all entity types seen so far, E_{1:k} = E_1 ∪ E_2 ∪ … ∪ E_k.
First, the current model M̃_k is initialized with the parameters of M_{k-1}, and its linear layer is extended to accommodate the new number of entity types. Specifically, it is extended from d × (2n+1) to d × (2n+2m+1), where n = |E_{1:k-1}| and m = |E_k| denote the number of old types and the number of new types, respectively.
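A sketch of this extension of the classification layer, copying the weights of the old label dimensions and leaving the 2m new dimensions randomly initialized, could look as follows (names are illustrative):

    import torch
    import torch.nn as nn

    def extend_classifier(old_linear, num_new_types):
        d = old_linear.in_features
        old_out = old_linear.out_features          # 2n + 1 under the BIO scheme
        new_out = old_out + 2 * num_new_types      # 2n + 2m + 1
        new_linear = nn.Linear(d, new_out)
        with torch.no_grad():
            # Copy old weights and biases into the first 2n+1 output dimensions.
            new_linear.weight[:old_out] = old_linear.weight
            new_linear.bias[:old_out] = old_linear.bias
        return new_linear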
Secondly, the invention regards M_{k-1} as the teacher and M̃_k as the student, and distills the knowledge of the old entity types in M_{k-1} into M̃_k by knowledge distillation. Specifically, each word in the dataset can be divided into two categories: one with an entity tag and one without (tag O). For words with entity tags, the invention calculates the cross-entropy loss between the output of M̃_k and the entity tag:
L_CE = -Σ_{i: y_i ≠ O} log p(y_i | x_i; θ̃_k)
Words tagged O may in fact be entities of an old type, but under the present setting this information is not labeled; for these words the invention calculates the KL divergence between the output distribution of M_{k-1} and the output distribution of M̃_k:
L_KD = Σ_{i: y_i = O} KL( P(x_i | θ_{k-1}, T) ∥ P(x_i | θ̃_k, T) )
where P(x_i | θ_{k-1}, T) and P(x_i | θ̃_k, T) denote the output distributions of M_{k-1} and M̃_k respectively, and T is the temperature used in distillation to obtain smoother probability distributions, set to 2 in the invention. To make the dimensions of the two output distributions identical, the invention pads the class dimension of the output of M_{k-1} with a small constant and then renormalizes it.
In summary, the total loss function of the learning stage is a weighted sum of the two loss functions:
L_learn = α·L_CE + β·L_KD
where the values of α and β are both set to 1.
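A sketch of the learning-stage loss computed per sentence is given below: cross-entropy on entity-tagged words plus a temperature-scaled KL term on O-tagged words, with the teacher distribution padded by a small constant over the new label dimensions and renormalized. The tensor layout, the O-tag id and the padding constant are illustrative assumptions.

    import torch.nn.functional as F

    def learning_stage_loss(student_logits, teacher_logits, labels,
                            o_tag_id=0, T=2.0, alpha=1.0, beta=1.0, eps=1e-8):
        # student_logits: [N, 2n+2m+1]; teacher_logits: [N, 2n+1]; labels: [N]
        entity_mask = labels != o_tag_id
        o_mask = ~entity_mask

        # Cross-entropy on words carrying an entity tag.
        ce = (F.cross_entropy(student_logits[entity_mask], labels[entity_mask])
              if entity_mask.any() else student_logits.new_zeros(()))

        # Pad the teacher distribution to the student's label dimension and renormalize.
        p_teacher = F.softmax(teacher_logits / T, dim=-1)
        pad = student_logits.size(-1) - teacher_logits.size(-1)
        p_teacher = F.pad(p_teacher, (0, pad), value=eps)
        p_teacher = p_teacher / p_teacher.sum(dim=-1, keepdim=True)

        # KL(teacher || student) on O-tagged words.
        log_p_student = F.log_softmax(student_logits / T, dim=-1)
        kd = (F.kl_div(log_p_student[o_mask], p_teacher[o_mask], reduction="batchmean")
              if o_mask.any() else student_logits.new_zeros(()))

        return alpha * ce + beta * kd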
Review stage of the invention
The purpose of the review stage is to reawaken the old-type knowledge and integrate it with the new-type knowledge by further distillation on old-type pseudo samples, yielding the final model M_k of step k.
First, for each old task i ∈ {1, 2, …, k-1}, the invention uses G_i to generate unlabeled text containing the old types E_i.
Secondly, the invention feeds this unlabeled text into M_{k-1} and into the student M̃_k obtained in the first stage, obtaining their output probability distributions P(x_i | θ_{k-1}, T) and P(x_i | θ̃_k, T) over all seen entity types.
The invention then takes the dimensions corresponding to the old types from the output distribution of M_{k-1} and the dimensions corresponding to the new types from the output distribution of M̃_k, and concatenates them to obtain the pseudo target distribution q(x_i).
Then, the KL divergence between q(x_i) and the output distribution of M_k is calculated as the distillation loss:
L_pseudo = Σ_i KL( q(x_i) ∥ P(x_i | θ_k, T) )
The losses of the learning stage are still calculated on D_k, now with the model being trained in the review stage:
L_CE and L_KD as defined above.
In summary, the total loss function of the review stage is a weighted sum of three loss functions:
L_review = α·L_CE + β·L_KD + γ·L_pseudo
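A sketch of the review-stage distillation on the generated review material follows: the teacher's probabilities over the old label dimensions and the learning-stage student's probabilities over the new label dimensions are concatenated into the target q, which constrains the final model M_k via a KL term. The renormalization of q and the tensor shapes are implementation assumptions.

    import torch
    import torch.nn.functional as F

    def review_distill_loss(final_logits, teacher_logits, student_logits,
                            num_old_dims, T=2.0):
        # final_logits, student_logits: [N, 2n+2m+1]; teacher_logits: [N, 2n+1]
        p_teacher = F.softmax(teacher_logits / T, dim=-1)   # supervision for the old types
        p_student = F.softmax(student_logits / T, dim=-1)   # supervision for the new types

        # Concatenate old-type dims from the teacher with new-type dims from the student.
        q = torch.cat([p_teacher[:, :num_old_dims],
                       p_student[:, num_old_dims:]], dim=-1)
        q = q / q.sum(dim=-1, keepdim=True)                 # renormalize the spliced target

        # KL(q || P(x | theta_k, T)) on the review material.
        return F.kl_div(F.log_softmax(final_logits / T, dim=-1), q,
                        reduction="batchmean")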
The invention is implemented with reference to the details provided by Monaikul et al., using BERT-base as the feature extractor and Hugging Face's PyTorch library as the programming framework. The program is run on a single GeForce RTX 3090 graphics card with a batch size of 32, a maximum sentence length of 128, a maximum of 20 training epochs, and early stopping with a patience of 3 epochs; Adam is used as the optimizer with a learning rate of 5e-5, the weights of the loss functions are set to 1, the generator in L&R generates 3000 samples by default, and 6 and 8 task sequences are sampled for CoNLL-03 and OntoNotes-5.0, respectively.
Preliminary experiments found that a significant improvement can be achieved even with a single-layer LSTM model as the generator, with an average running time of 10 min per task and a model size of about 50 MB per task.
The invention uses two common named entity recognition datasets: CoNLL-03, described in Sang E F, De Meulder F. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition [J] (arXiv preprint cs/0306050, 2003), and OntoNotes-5.0, described in Hovy E, Marcus M, Palmer M, et al. OntoNotes: the 90% solution [C] (Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, 2006: 57-60). CoNLL-03 comprises four entity types: person (PER), location (LOC), organization (ORG) and miscellaneous (MISC). Following Monaikul et al., the invention selects six representative entity types of OntoNotes-5.0, including person (PER), geo-political entity (GPE), organization (ORG), cardinal (CARD), and nationalities and religious political groups (NORP).
The invention adopts the following setting to simulate the data accumulation process in reality, performing the following operations on samples of the original dataset to construct the training/validation set of the kth task: for a sentence [x_1, x_2, …, x_L] of the original training/validation set and its tags [y_1, y_2, …, y_L], y_i is replaced with O if its entity type does not belong to E_k; the replaced tag sequence is denoted [ŷ_1, ŷ_2, …, ŷ_L], and if the replaced tags are not all O, the sentence is added to the training/validation set of the kth task. When constructing the test set of the kth task, E_k in the above procedure is replaced with E_{1:k}.
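As an illustration, a sketch of this construction under the BIO tag format ("B-TYPE"/"I-TYPE"/"O") is shown below; the data representation as (words, tags) pairs is an assumption.

    def build_task_split(sentences, allowed_types):
        # sentences: list of (words, tags) pairs; allowed_types: E_k (or E_{1:k} for the test set).
        task_data = []
        for words, tags in sentences:
            new_tags = [t if t != "O" and t.split("-", 1)[1] in allowed_types else "O"
                        for t in tags]
            if any(t != "O" for t in new_tags):   # keep only sentences that still contain entities
                task_data.append((words, new_tags))
        return task_data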
After the above operations, the statistics of the training/validation/test sets of each task are as shown in FIG. 2.
Following Monaikul et al., to evaluate the average performance of the model over all seen types, the macro-average F1 is used and the results over the sampled task sequences are averaged, defined as follows:
Macro-F1_k = (1/R) Σ_{r=1}^{R} (1/|E_{1:k}^r|) Σ_{e ∈ E_{1:k}^r} F1_{k,e}^r
where E_{1:k}^r denotes all entity types seen up to step k under the rth task sequence, F1_{k,e}^r denotes the F1 value of entity type e at step k in the rth task sequence, and R is the number of sampled task sequences.
To evaluate the model more comprehensively, the invention also measures the robustness of the model to the task order; the metric adopted is the error bound (EB), defined as follows:
EB = z · σ / √n
where z is the confidence coefficient at a given confidence level and σ is the standard deviation calculated over n different task orders; a lower error bound indicates lower sensitivity to the task order.
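For reference, a sketch of computing the two metrics is given below; the per-type F1 values are assumed to be available, and the confidence coefficient z = 1.96 (95% confidence) is an assumption.

    import math
    import statistics

    def macro_f1_at_step(per_sequence_f1):
        # per_sequence_f1[r]: dict mapping each entity type seen up to step k
        # in task sequence r to its F1 value.
        per_seq = [sum(f1s.values()) / len(f1s) for f1s in per_sequence_f1]
        return sum(per_seq) / len(per_seq)

    def error_bound(per_sequence_scores, z=1.96):
        sigma = statistics.stdev(per_sequence_scores)
        return z * sigma / math.sqrt(len(per_sequence_scores))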
The invention compares the proposed method against ExtendNER proposed by Monaikul et al. as the baseline, and also selects the "multi-task" setting, in which a single model is trained jointly on all seen types, to measure the upper-bound performance.
As shown in FIG. 3, the first and third rows show that the L&R framework proposed by the invention exceeds ExtendNER at every step on both datasets, and the improvement grows as the number of steps increases, because the method of the invention improves the result of each step and thereby alleviates the error propagation caused by distillation. In addition to the cumulative improvement, the invention also brings an immediate improvement right after each step completes the "review stage": the fifth row shows the performance of the model before the "review stage", the fourth row shows the performance after the "review stage", and the difference between them is the immediate improvement brought by the "review stage". The second and fourth rows of FIG. 3 also give the error bound of the models; the error bound of L&R is lower, indicating that the model of the invention is less sensitive to the task order.

Claims (1)

1. The incremental named entity recognition method is characterized by comprising a learning stage and a review stage, wherein in the learning stage, given a training set that labels only new entity types, the old model serves as a teacher and a knowledge distillation loss is added to the conventional cross-entropy loss when training the new student model; in the review stage, pseudo samples are generated as review material for the old types, and the old-type knowledge is reawakened by further distillation on the review material and integrated with the new knowledge; the method comprises the following specific steps:
1) In the learning stage, at step k, the current dataset D_k and the models M_{k-1} and G_{1:k-1} obtained in previous steps are available, where M is the main model and G is the generator for generating pseudo samples;
The main model consists of a feature extractor and a classification layer, wherein the feature extractor adopts the pre-trained language model BERT-base and the classification layer is a linear layer with softmax; given a word sequence [x_1, x_2, ..., x_L] of length L and the tag [y_1, y_2, ..., y_L] of each word, the hidden vector [h_1, h_2, ..., h_L] of each word is first obtained through the feature extractor, the hidden vectors are then mapped to the label space [z_1, z_2, ..., z_L] through the linear layer, and the probability [p_1, p_2, ..., p_L] of each word over all types is obtained through softmax:
z_i = W h_i + b
p_i = softmax(z_i)
where W ∈ R^{m×d}, d is the hidden vector size of the pre-trained language model (d = 768), b ∈ R^m, and m is the size of the tag set, depending on the tag scheme employed;
2) M_{k-1} is regarded as the teacher and the current model M̃_k as the student, and knowledge of the old entity types in M_{k-1} is distilled into M̃_k by knowledge distillation;
3) In the review stage, for each old task i ∈ {1, 2, …, k-1}, the generator G generates unlabeled text containing the old types E_i; the generator is a language model consisting of an embedding layer, an LSTM layer and a classifier; given a word sequence [x_1, x_2, ..., x_L] of length L, the word vector of each word is obtained through the embedding layer, the hidden vector [h_1, h_2, ..., h_L] incorporating context information is then obtained through the LSTM layer, and finally the probability of the next word is obtained through a linear layer with softmax:
z_i = W h_i + b
P(x_{i+1} | x_1, ..., x_i) = [softmax(z_i)]_{index(x_{i+1})}
where z_i ∈ R^V, V is the size of the dictionary, determined by the dataset, and index(x_i) denotes the position of word x_i in the dictionary;
the training objective of the generator is a language modeling loss that minimizes the negative log-likelihood of predicting the next word:
L_LM = -Σ_i log P(x_{i+1} | x_1, ..., x_i);
4) The unlabeled text is fed to M_{k-1} and to the student M̃_k obtained in the learning stage, yielding the output probability distributions P(x_i | θ_{k-1}, T) and P(x_i | θ̃_k, T) over all seen entity types, where θ denotes all trainable parameters and T is the temperature used in distillation to obtain smoother probability distributions;
5) The dimensions corresponding to the old types are taken from the output distribution of M_{k-1}, the dimensions corresponding to the new types are taken from the output distribution of M̃_k, and they are concatenated to obtain the pseudo target distribution q(x_i);
6) After the review stage, a model M_k is obtained that identifies all seen entity types E_{1:k}; the KL divergence between q(x_i) and the output distribution of M_k is calculated as the distillation loss:
L_pseudo = Σ_i KL( q(x_i) ∥ P(x_i | θ_k, T) )
7) Each word in dataset D_k is divided into two categories: one with an entity tag and one without; for words with entity tags, the cross-entropy loss between the output of M_k and the entity tag is calculated:
L_CE = -Σ_{i: y_i ≠ O} log p(y_i | x_i; θ_k)
where p(y_i | x_i; θ_k) is the probability that word x_i belongs to tag y_i;
for words without entity tags, the KL divergence between the output distribution of M_{k-1} and the output distribution of M_k is calculated:
L_KD = Σ_{i: y_i = O} KL( P(x_i | θ_{k-1}, T) ∥ P(x_i | θ_k, T) )
where P(x_i | θ_{k-1}, T) and P(x_i | θ_k, T) denote the output distributions of M_{k-1} and M_k respectively;
8) The weighted sum of the three loss functions gives the total loss function in the review stage:
L_total = α·L_CE + β·L_KD + γ·L_pseudo.
CN202210150846.8A 2022-02-18 2022-02-18 Incremental named entity recognition method based on pseudo sample replay Active CN114510943B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210150846.8A CN114510943B (en) 2022-02-18 2022-02-18 Incremental named entity recognition method based on pseudo sample replay

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210150846.8A CN114510943B (en) 2022-02-18 2022-02-18 Incremental named entity recognition method based on pseudo sample replay

Publications (2)

Publication Number Publication Date
CN114510943A CN114510943A (en) 2022-05-17
CN114510943B true CN114510943B (en) 2024-05-28

Family

ID=81552221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210150846.8A Active CN114510943B (en) 2022-02-18 2022-02-18 Incremental named entity recognition method based on pseudo sample replay

Country Status (1)

Country Link
CN (1) CN114510943B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103853710A (en) * 2013-11-21 2014-06-11 北京理工大学 Coordinated training-based dual-language named entity identification method
CN107203511A (en) * 2017-05-27 2017-09-26 中国矿业大学 A kind of network text named entity recognition method based on neural network probability disambiguation
CN111783462A (en) * 2020-06-30 2020-10-16 大连民族大学 Chinese named entity recognition model and method based on dual neural network fusion
CN112257447A (en) * 2020-10-22 2021-01-22 北京众标智能科技有限公司 Named entity recognition system and recognition method based on deep network AS-LSTM
CN112633002A (en) * 2020-12-29 2021-04-09 上海明略人工智能(集团)有限公司 Sample labeling method, model training method, named entity recognition method and device
CN113408288A (en) * 2021-06-29 2021-09-17 广东工业大学 Named entity identification method based on BERT and BiGRU-CRF

Also Published As

Publication number Publication date
CN114510943A (en) 2022-05-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant