CN111738003A - Named entity recognition model training method, named entity recognition method, and medium - Google Patents

Named entity recognition model training method, named entity recognition method, and medium

Info

Publication number
CN111738003A
CN111738003A
Authority
CN
China
Prior art keywords
training
module
model
data set
domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010541415.5A
Other languages
Chinese (zh)
Other versions
CN111738003B (en)
Inventor
程学旗
郭嘉丰
范意兴
张儒清
刘艺菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202010541415.5A priority Critical patent/CN111738003B/en
Publication of CN111738003A publication Critical patent/CN111738003A/en
Application granted granted Critical
Publication of CN111738003B publication Critical patent/CN111738003B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

An embodiment of the invention provides a named entity recognition model training method, a named entity recognition method, and a medium. The training method first trains a first training model using a source-domain labeled data set and a target-domain unlabeled data set, then initializes a second training model from the parameters of the first training model, and finally fine-tunes the second training model using a target-domain labeled data set to obtain the final named entity recognition model. This removes the need for a large number of labeled target-domain samples for training.

Description

Named entity recognition model training method, named entity recognition method, and medium
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to the field of named entity recognition technologies, and more particularly, to a named entity recognition model training method, a named entity recognition method, and a medium.
Background
Natural language processing aims to let computers understand human language so that humans and computers can interact more effectively (for example, through voice assistants, automatic message replies, and translation software). Natural language processing typically includes word segmentation, part-of-speech tagging, named entity recognition, and parsing. Named entity recognition (NER) is an important component of natural language processing (NLP): it is the process of recognizing, in text, the names or symbols of things with specific meanings, and named entities mainly include names of people, places, and organizations, dates, and proper nouns. Many downstream NLP tasks and applications rely on NER for information extraction, such as question answering, relation extraction, event extraction, and entity linking. The more accurately the named entities in a text can be recognized, the better a computer can understand the semantics of the language and carry out its tasks, which improves the human-computer interaction experience.
Deep-neural-network-based named entity recognition methods generally treat named entity recognition as a multi-class classification or sequence labeling task and can be divided into three stages: input distributed representation, semantic encoding, and label decoding. The input distributed representation can be character-level, word-level, or a mixture of the two, depending on the encoding object, and yields a vector representation of each word. Semantic encoding usually applies a deep neural network, such as a bidirectional long short-term memory network, Bidirectional Encoder Representations from Transformers (BERT), or a transfer learning network, and uses the word vectors of the words in a text to obtain a vector representation of the text. Label decoding is done by a classifier, typically a fully connected neural network with a Softmax layer or a conditional random field with the Viterbi algorithm, which derives the label of each word.
Named entity recognition is currently not a hot research direction, because part of the academic community considers it a solved problem. However, many researchers believe the problem has not been solved well: named entity recognition only works well for limited text types (mainly news corpora) and entity categories (mainly names of people, places, and organizations). In other natural language processing domains, named entity evaluation corpora are small, overfitting occurs easily, and systems that aim to recognize many types of named entities generically perform poorly.
Named entity recognition based on deep learning has achieved good results on English news corpora (F1 values above 90%), but deep learning methods generally require a large amount of labeled data, and in the real world many languages and domains have little labeled data, which gives rise to the low-resource named entity recognition problem. Transfer learning is a common approach to low-resource named entity recognition, but when applied to this problem it suffers from imbalanced data volumes and label resources: joint learning is biased toward the high-resource data (the data set with the larger volume), so the recognition performance of the resulting named entity recognition model is poor. The prior art therefore needs to be improved.
Disclosure of Invention
It is therefore an object of the present invention to overcome the above-mentioned drawbacks of the prior art and to provide a named entity recognition model training method, a named entity recognition method and a medium.
The object of the invention is achieved by the following technical solution:
according to a first aspect of the present invention, there is provided a named entity recognition model training method, the method comprising:
A1, constructing a first training model, wherein the first training model comprises a feature extraction module, a recognition module and a domain distinguishing module;
A2, performing multiple rounds of training on the first training model, wherein in each round the recognition module is trained with a first data set, and the feature extraction module and the domain distinguishing module are adversarially trained with the first data set and a second data set; after each round the parameters of the feature extraction module are adjusted at least according to the loss function of the recognition module and the loss function of the domain distinguishing module, the first and second data sets are updated, and the updated first and second data sets are used for the next round; the first data set is a source-domain labeled data set with entity labels, represented as word vectors, and the second data set is a target-domain unlabeled data set without entity labels, represented as word vectors;
A3, constructing a second training model comprising a feature extraction module and a recognition module, wherein the initial parameters of the feature extraction module of the second training model are set from the parameters of the feature extraction module of the first training model trained in step A2, and the initial parameters of the recognition module are set by random initialization; and
A4, fine-tuning the parameters of the feature extraction module and the recognition module of the second training model constructed in step A3 with a third data set in a supervised manner, and taking the fine-tuned second training model as the named entity recognition model, wherein the third data set is a target-domain labeled data set with entity labels, represented as word vectors.
Preferably, the size of the source-domain labeled data set is the same as or approximately the same as the size of the target-domain unlabeled data set, and the size of the target-domain labeled data set is smaller than the size of the target-domain unlabeled data set.
Preferably, "the same or approximately the same scale" means that the ratio of the data volume of the source-domain labeled data set to that of the target-domain unlabeled data set lies between 10:14 and 10:9.
In some embodiments of the present invention, the feature extraction module in the first training model includes a preprocessing layer, a CNN model, a Word2Vec model, and a BiLSTM model consisting of a forward LSTM and a backward LSTM, each of which contains a plurality of sequentially connected LSTM units. The feature extraction module processes the source-domain labeled data set, the target-domain unlabeled data set, and the target-domain labeled data set (not yet represented as word vectors) to obtain the first, second, and third data sets as follows: the words of the data set are preprocessed by the preprocessing layer, including case normalization and stop-word removal; the character-level embedding features of each word in the data set are extracted with the CNN model; the word embedding features of each word in the data set are extracted with the Word2Vec model; the character-level embedding features and word embedding features of each word are concatenated to obtain the vector representation of each word; and the vector representation of each word is fed into the BiLSTM model of the feature extraction module to obtain the data set represented as word vectors containing context information.
In some embodiments of the present invention, the recognition modules of the first and second training models both include a BiLSTM-CRF model, wherein the entity labels of the source-domain labeled data set are used to set the label value space of the CRF layer of the BiLSTM-CRF model in the recognition module of the first training model, and the entity labels of the target-domain labeled data set are used to set the labels of the CRF layer of the BiLSTM-CRF model in the recognition module of the second training model.
In some embodiments of the present invention, the first training model further includes a gradient reversal layer. During the adversarial training of the feature extraction module and the domain distinguishing module, standard stochastic gradient descent is applied to the feature extraction module and the domain distinguishing module of the first training model through the gradient reversal layer in the forward pass, and during back-propagation the gradient reversal layer automatically negates the gradient before returning the loss of the domain distinguishing module to the feature extraction module, so that the feature extraction module extracts features common to the words of the source-domain labeled data set and the target-domain unlabeled data set.
In some embodiments of the present invention, the first training model further includes an automatic encoding module, the automatic encoding module is trained by using the second data set, and after each round of training, the parameters of the feature extraction module are updated according to the loss function of the automatic encoding module, the loss function of the recognition module, and the loss function of the domain distinguishing module.
In some embodiments of the present invention, the automatic encoding module comprises an encoder and a decoder. In each training round, the encoder takes the hidden states of the last LSTM unit of the forward LSTM and the last LSTM unit of the backward LSTM, produced by the BiLSTM model of the feature extraction module from the words of the target-domain unlabeled data set, and combines them into the initial state feature of the decoder; the initial state feature and the embedding of the previous word are used as the decoder input, and the automatic encoding module is trained to extract the private features of the target domain.
In some embodiments of the invention, the parameters of the feature extraction module of the first training model are adjusted in the following manner:
θ_f ← θ'_f − μ·(α·∂L_task/∂θ'_f − ω·γ·∂L_type/∂θ'_f + β·∂L_target/∂θ'_f)
where θ_f denotes the parameters of the feature extraction module after the adjustment, θ'_f the parameters of the feature extraction module before the adjustment, μ the learning rate, L_task the loss function of the recognition module, L_type the loss function of the domain distinguishing module, L_target the loss function of the automatic encoding module, −ω the gradient reversal parameter, and α, β, γ weights set by the user.
In some embodiments of the present invention, step A2 further comprises: after each round of training, adjusting the parameters of the recognition module, the domain distinguishing module, and the automatic encoding module of the first training model as follows:
The parameters of the recognition module are adjusted as
θ_y ← θ'_y − μ·α·∂L_task/∂θ'_y
The parameters of the domain distinguishing module are adjusted as
θ_d ← θ'_d − μ·γ·∂L_type/∂θ'_d
The parameters of the automatic encoding module are adjusted as
θ_r ← θ'_r − μ·β·∂L_target/∂θ'_r
where θ_y and θ'_y denote the parameters of the recognition module after and before the adjustment, θ_d and θ'_d the parameters of the domain distinguishing module after and before the adjustment, θ_r and θ'_r the parameters of the automatic encoding module after and before the adjustment, μ the learning rate, and α, β, γ weights set by the user.
According to a second aspect of the present invention, there is provided a named entity recognition method based on a named entity recognition model trained with the named entity recognition model training method of the first aspect, wherein the named entity recognition model includes a feature extraction module and a recognition module, and the named entity recognition method includes: B1, obtaining the character-level embedding features and word embedding features of the text to be recognized through the feature extraction module of the named entity recognition model and concatenating them to obtain the word vectors of the words of the text to be recognized; and B2, inputting the text to be recognized, represented as word vectors, into the recognition module of the named entity recognition model to obtain the named entity recognition result of the text to be recognized.
According to a third aspect of the present invention, there is provided an electronic apparatus comprising: one or more processors; and a memory, wherein the memory is to store one or more executable instructions; the one or more processors are configured to implement the steps of the method of the first aspect or the second aspect via execution of the one or more executable instructions.
Compared with the prior art, the invention has the advantages that:
the method comprises the steps of training a first training model by using source field labeled data and a target field unlabeled data set, setting a second training model based on parameters of the first training model, and finely adjusting the second training model by using the target field labeled data set, so as to obtain a final named entity recognition model. Therefore, the problem that a large number of samples for marking the target field are needed to be used for training is solved, and the recognition effect of the trained named entity recognition model in recognizing the named entities of the words in the unmarked data set of the target field is improved.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
FIG. 1 is a simplified schematic diagram of a named entity recognition model training method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a structural principle of a named entity recognition model according to an embodiment of the present invention;
FIG. 3 is a schematic view of a saddle point according to an embodiment of the present invention;
FIG. 4 is a schematic illustration of a prior named entity recognition model used as the baseline experiment of the present invention;
FIG. 5 is a schematic representation of two prior methods as comparative experiments with the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail by embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As mentioned in the Background section, current named entity recognition models are obtained by supervised training on a labeled data set of a specific domain (i.e., a data set in which the words that are named entities are marked). Such a model can achieve high recognition accuracy in that domain, but if applied directly to other domains it generalizes poorly and its recognition accuracy is low. In many real-world languages and domains, however, labeled data is scarce, and the named entity recognition model required for such a domain is difficult to obtain by supervised training. Labeling the data of these domains manually requires annotators who clearly understand the named entities of each domain and who can accurately label the various named entities in massive amounts of data, which is laborious and costly. On the other hand, directly performing transfer learning with a relatively large source-domain labeled data set and a relatively small target-domain labeled data set suffers from imbalanced label resources, so joint learning is biased toward the high-resource data and the trained model performs poorly in the target domain.
The method of the present invention therefore trains a first training model with the source-domain labeled data and a target-domain unlabeled data set, initializes a second training model from the parameters of the first training model, and fine-tunes the second training model with a target-domain labeled data set to obtain the final named entity recognition model. The target-domain data can thus be divided into a target-domain unlabeled data set and a target-domain labeled data set for training: the target-domain unlabeled data set can be made approximately the same size as the source-domain labeled data set so that the trained model parameters are not biased toward the data set with the larger volume, and after that training the second training model is fine-tuned with the target-domain labeled data set, which is smaller than the target-domain unlabeled data set, to obtain the final named entity recognition model without requiring a large number of labeled target-domain samples.
Before describing embodiments of the present invention in detail, some of the terms used therein will be explained as follows:
Adversarial training, also called adversarial learning, was proposed by Goodfellow et al. The basic idea rests on two models: a generative model and a discriminative model. The task of the discriminative model is to determine whether a given picture is real or artificially generated, and the task of the generative model is to produce synthetic pictures that resemble the pictures in a picture set. Through repeated confrontation during training, the abilities of the generative model and the discriminative model are continuously strengthened until a balance is reached. This process can be viewed as a zero-sum game. Adversarial learning has been used successfully for image generation, semi-supervised learning, and domain adaptation. The key idea of a domain-adaptive adversarial learning network is to construct general, domain-invariant features by pitting the feature extraction module against a domain distinguishing module during optimization.
Transfer learning (migration learning) transfers knowledge learned in one domain to another, i.e., from the source domain to the target domain.
The source domain tagged data set refers to a data set of a source domain tagged with an entity through entity tagging. In other words, the entity objects in the source domain markup dataset carry entity tags of their corresponding types.
The target domain unlabeled dataset refers to a data set of the target domain that is not labeled with entities. Unlabeled means that the data does not need to be labeled before the training process of the present invention. Even if part of the collected data originally carries entity labels added by others, it is still treated as a data set without entity labels, because those labels are neither considered nor used during the adversarial training.
The target domain tagged data set refers to a data set of a target domain tagged with an entity by an entity tag.
CNN (Convolutional Neural Network) denotes a convolutional neural network, a class of feed-forward neural networks that involve convolution computations and have a deep structure.
The Word2Vec (Word to Vector) model is a natural language processing model that vectorizes words. The working principle of the Word2Vec model is to learn the semantic information of words from a large amount of text corpora in an unsupervised manner and output Word vectors to represent the semantic information of the words.
LSTM (Long Short-Term Memory) denotes a long short-term memory network, a type of recurrent neural network. LSTM was mainly proposed to solve the problems of vanishing and exploding gradients when training on long sequences. Compared with an ordinary recurrent neural network (RNN), LSTM performs better at learning long-term dependencies in long sequences.
BiLSTM (Bi-directional Long Short-Term Memory) denotes a bidirectional long short-term memory network.
CRF (Conditional Random Field) denotes a conditional random field, a probabilistic undirected graph model for the conditional probability P(y|x) of an output random variable y given an input random variable x. A conditional random field models the conditional probability distribution between input and output variables. Conditional random fields are commonly used to label or analyze sequence data, such as natural language text or biological sequences. When used for sequence labeling, the input and output random variables are two sequences of equal length.
MLP (Multilayer Perceptron) denotes a multilayer perceptron, a feed-forward artificial neural network model that performs multiple layers of linear or non-linear transformations.
Stop words are words or phrases that carry little distinguishing meaning and are automatically filtered out before or after processing natural language data (or text), such as modal particles, adverbs, prepositions, and conjunctions. Removing stop words saves storage space and improves processing efficiency.
Referring to FIG. 1, the training process of the named entity recognition model training method of the present invention mainly includes the following stages: a first training model is first trained with a first data set and a second data set of the same or approximately the same size; after that training, the knowledge of the first training model is transferred to a second training model by transfer learning; the second training model is then fine-tuned with a third data set smaller than the second data set to obtain the named entity recognition model.
According to an embodiment of the present invention, a method for training a named entity recognition model is provided, which includes steps A1, A2, A3, and A4. For a better understanding of the present invention, each step is described in detail below with reference to specific examples.
Step A1: a first training model is constructed, comprising a feature extraction module 11, a recognition module 12, and a domain distinguishing module 13.
According to one embodiment of the invention, the feature extraction module 11 in the first training model comprises a preprocessing layer, a CNN model, a Word2Vec model, and a BiLSTM model consisting of a forward LSTM and a backward LSTM, each of which contains a plurality of sequentially connected LSTM units. The feature extraction module 11 processes the source-domain labeled data set, the target-domain unlabeled data set, and the target-domain labeled data set (not yet represented as word vectors) to obtain the first, second, and third data sets as follows: the words of the data set are preprocessed by the preprocessing layer, including case normalization and stop-word removal; the character-level embedding features of each word are extracted with the CNN model; the word embedding features of each word are extracted with the Word2Vec model; the character-level embedding features and word embedding features of each word are concatenated to obtain the vector representation of each word; and the vector representation of each word is fed into the BiLSTM model of the feature extraction module 11 to obtain the data set represented as word vectors containing context information. In brief, the feature extraction module 11 extracts character-level embedding features and word embedding features common to the source and target domains, as well as word vectors containing context information. Referring to FIG. 2, samples (sentences) from the source domain and the target domain are input to the feature extraction module 11. The feature extraction module 11 extracts the character-level embedding feature e_i^char of each word with the CNN, which effectively alleviates the problem of words that do not appear in the dictionary (OOV). The word embedding feature e_i^word is then concatenated with the character-level embedding feature e_i^char and fed into the next layer, the BiLSTM, with which the feature extraction module 11 models the sentence and captures context information. Denote an input word sequence (sample) by x and its i-th word by x_i; x_i ∈ S(x) and x_i ∈ T(x) indicate that the input sample comes from the source domain and the target domain, respectively. For convenience in the following, the parameters of the feature extraction module 11 are denoted θ_f, the context-aware word vector extracted by the feature extraction module 11 is denoted F(x_i), and a word sequence represented as word vectors is denoted F(x).
According to one embodiment of the invention, the recognition module 12 of the first training model comprises a BiLSTM-CRF model, where the label value space of the CRF layer of the BiLSTM-CRF model of the recognition module 12 in the first training model is set from the entity labels of the source-domain labeled data set. In the exemplary result shown for the recognition module 12 in FIG. 2, B-GPE is an example entity label (an entity of the country/city/state type) and O is the label of a non-entity; the recognition module 12 produces the named entity recognition labels. The recognition module 12 takes F(x) as input and computes its loss with the CRF layer, maximum likelihood estimation, and the Viterbi algorithm, mapping each word vector F(x_i) in F(x) to an entity label of the CRF layer. The CRF algorithm of the CRF layer uses feature functions to express features more abstractly, with the objective function
P(y | x) = (1/Z(x)) · exp( Σ_{i=1..M} Σ_j θ_yj · f_j(x, i, y_i, y_{i−1}) )
where x is the input word sequence, y is the output entity label sequence, θ_y is the weight vector of the feature functions, Z(x) is a normalization factor, i is the position of the current word, M is the length of the input word sequence, j indexes the feature functions, θ_yj is the weight of the j-th feature function, f_j(x, i, y_i, y_{i−1}) is a feature function, y_i is the output entity label at the current position, and y_{i−1} is the output entity label at the previous position. The objective function means: given an input word sequence x and feature-function weights θ_y, output the conditional probability of the label sequence y, and take the entity label with the highest probability as the entity label y_i of the corresponding word in the word sequence x.
The normalization factor Z(x) above is expressed as
Z(x) = Σ_{y′∈Y} exp( Σ_{i=1..M} Σ_j θ_yj · f_j(x, i, y′_i, y′_{i−1}) )
where Y denotes the set of all possible output entity label sequences.
The parameters of the recognition module 12 are the feature-function weights θ_y, which are solved by maximum likelihood estimation. Assume the source-domain training set is D_S = {(x^(k), y^(k))}, k = 1, …, N_S, where N_S is the number of source-domain samples, x^(k) is the k-th source-domain sample, and y^(k) is the output entity label sequence of the k-th source-domain sample. The recognition module 12 is trained with the log-likelihood as its loss function:
L_task = − Σ_{k=1..N_S} log P(y^(k) | x^(k); θ_y)
where k is the index of the current sample.
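A self-contained sketch of the linear-chain CRF negative log-likelihood that such a BiLSTM-CRF recognition head minimizes is shown below, written in PyTorch for illustration (a generic CRF layer, not the patent's code; the emission scores would come from the BiLSTM):

```python
import torch
import torch.nn as nn

class ChainCRF(nn.Module):
    """Linear-chain CRF: score(x, y) = sum_i emissions[i, y_i] + sum_i trans[y_{i-1}, y_i]."""

    def __init__(self, num_tags):
        super().__init__()
        self.transitions = nn.Parameter(torch.randn(num_tags, num_tags))  # trans[i, j]: score of tag i -> tag j

    def neg_log_likelihood(self, emissions, tags):
        # emissions: (seq_len, num_tags) scores from the BiLSTM; tags: (seq_len,) gold labels
        seq_len, num_tags = emissions.shape
        # Score of the gold label path.
        gold = emissions[0, tags[0]]
        for i in range(1, seq_len):
            gold = gold + emissions[i, tags[i]] + self.transitions[tags[i - 1], tags[i]]
        # log Z(x) via the forward algorithm.
        alpha = emissions[0]                                   # (num_tags,)
        for i in range(1, seq_len):
            # alpha_new[j] = logsumexp_k(alpha[k] + trans[k, j]) + emissions[i, j]
            alpha = torch.logsumexp(alpha.unsqueeze(1) + self.transitions, dim=0) + emissions[i]
        log_z = torch.logsumexp(alpha, dim=0)
        return log_z - gold                                    # -log P(y | x)
```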
According to one embodiment of the invention, the domain distinguishing module 13 of the first training model comprises a multilayer perceptron (MLP) with a Softmax layer. The domain distinguishing module 13 takes F(x) as input and is a standard feed-forward network. The training goal for the domain distinguishing module 13 is to make it as hard as possible to distinguish whether a sample comes from the source domain or the target domain. The domain distinguishing module 13 maps the shared hidden state h to a domain label; the parameters of this mapping are denoted θ_d. The domain distinguishing module 13 identifies domain labels with the following loss function:
L_type = − Σ_{k=1..N_S+N_t} [ d_k · log d̂_k + (1 − d_k) · log(1 − d̂_k) ]
where d_k is the ground-truth domain label of sample k, d̂_k is the output of the domain distinguishing module 13 (denoted Q) on sample k, and N_t is the number of samples from the target domain. By maximizing this loss over the parameters θ_f of the feature extraction module 11 (denoted F) while minimizing it over the parameters θ_d of the domain distinguishing module 13, training converges to a saddle point of the loss function. A saddle point is a point that slopes upward in one dimension and downward in another. As shown in FIG. 3, a saddle point is usually surrounded by a plateau of equal error values, which makes it hard for the algorithm to escape because the gradient is close to zero in all dimensions. Optimizing the parameters θ_f of the feature extraction module 11 so that the domain distinguishing module 13 cannot tell the domains apart ensures that the feature extraction module 11 finds features common to the source and target domains. During training, after the parameters θ_d of the domain distinguishing module 13 and θ_y of the recognition module 12 are updated, the parameters θ_f of the feature extraction module 11 are optimized according to the updated θ_d and θ_y so as to minimize the classification loss L_task, which ensures that P(F(x_i)) makes accurate predictions on the source domain.
According to an embodiment of the invention, the first training model further comprises a gradient reversal layer. During the adversarial training of the feature extraction module 11 and the domain distinguishing module 13, the gradient reversal layer acts as an identity in the forward pass, so standard stochastic gradient descent is performed on the feature extraction module 11 and the domain distinguishing module 13 of the first training model; during back-propagation, the gradient reversal layer automatically negates the gradient before returning the loss of the domain distinguishing module 13 to the feature extraction module 11, so that the feature extraction module 11 extracts features common to the words of the source-domain labeled data set and the target-domain unlabeled data set.
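A gradient reversal layer of this kind is commonly realized as follows in PyTorch (identity in the forward pass, gradient multiplied by −ω in the backward pass). This sketch follows the standard recipe for domain-adversarial training and is an assumption about the implementation style, not code from the patent:

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; multiplies the gradient by -omega in the backward pass."""

    @staticmethod
    def forward(ctx, x, omega):
        ctx.omega = omega
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip (and scale) the gradient flowing back to the feature extractor.
        return -ctx.omega * grad_output, None

def grad_reverse(x, omega=1.0):
    return GradReverse.apply(x, omega)

# Typical use: features pass through the reversal layer before the domain distinguishing module,
# so minimizing the domain loss still pushes the feature extractor toward domain-invariant features:
# domain_logits = domain_discriminator(grad_reverse(features, omega))
```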
According to an embodiment of the present invention, referring again to FIG. 2, the first training model further comprises an automatic encoding module 14. The automatic encoding module 14 is trained with the second data set, and after each round of training the parameters of the feature extraction module 11 are updated according to the loss function of the automatic encoding module 14, the loss function of the recognition module 12, and the loss function of the domain distinguishing module 13. Preferably, the automatic encoding module 14 comprises an encoder and a decoder. In each training round, the encoder takes the hidden states of the last LSTM unit of the forward LSTM and of the backward LSTM, produced by the BiLSTM model of the feature extraction module 11 from the words of the target-domain unlabeled data set, combines them into the initial state feature of the decoder, and uses that initial state feature together with the embedding of the previous word as the decoder input, training the automatic encoding module 14 to extract the private features of the target domain. Adversarial learning tries to optimize the hidden representation toward a general representation h_common, and the second training model is initialized from parameters obtained through this adversarial optimization. The target-domain automatic encoding module 14 adjusts the general representation so that it contains both a part of the features shared by the source and target domains and a part of the domain-private features of the target-domain data, yielding a domain feature representation that carries target-domain information; this representation is used by the feature extraction module of the final model to counteract the tendency of the adversarial learning network to erase target-domain features. In other words, the automatic encoding module 14 performs feature learning on the target domain and retains its domain characteristics. Training the adversarial learning network formed by the feature extraction module 11 and the domain distinguishing module 13 yields the features h_common common to the source and target domains, but it weakens some domain-specific features that are useful for named entity recognition, so obtaining only domain-general features limits the classification ability. The present invention therefore addresses this deficiency by introducing the automatic encoding module 14 of the target domain, which tries to reconstruct the target-domain data. The encoder of the automatic encoding module 14 combines the final hidden states of the forward LSTM and the backward LSTM of the BiLSTM model in the feature extraction module 11 into the initial state h_0(dec) of the decoder LSTM. The invention therefore does not need to reverse the word order of the input sentence (word sequence), and the model avoids the difficulty of establishing the correspondence between input and output. h_0(dec) and the embedding of the previous word are used as the decoder input. Suppose x̂ is the output word sequence and z_i is the representation of the i-th word, z_i = MLP(h_i), where MLP is a multilayer perceptron. The hidden state is h_i = LSTM([h_0(dec) : z_{i−1}], h_{i−1}), where [· : ·] denotes concatenation, i.e., h_0(dec) is concatenated with the previous word embedding feature z_{i−1} as the input, the hidden state h_{i−1} of the previous position is the recurrent state, and the output is the hidden state of the current position. The conditional probability of the output word sequence x̂ given h_0(dec) is then
P(x̂ | h_0(dec)) = Π_{i=1..M} p(x̂_i | h_0(dec), z_{<i})
where each p(x̂_i | ·) is a Softmax probability over all words in the dictionary.
The aim is to minimize, over the parameters θ_r of the automatic encoding module 14, the loss function
L_target = − Σ_{k=1..N_t} Σ_{i=1..M} x_i^(k) · log p(x̂_i^(k))
where x_i^(k) is the one-hot vector of the i-th word of sample k. This drives h_0(dec) to learn a compact yet maximally informative sentence representation on the target-domain data. The adversarial learning network tries to optimize the hidden representation toward the general representation h_common; the target-domain automatic encoding module 14 counteracts the tendency of the adversarial learning network to erase the target-domain private features by optimizing the general representation to additionally contain the private features of the target-domain data.
Step A2: the first training model is trained for multiple rounds. In each round, the recognition module 12 is trained with the first data set, and the feature extraction module 11 and the domain distinguishing module 13 are adversarially trained with the first data set and the second data set. After each round the parameters of the feature extraction module 11 are adjusted at least according to the loss function of the recognition module 12 and the loss function of the domain distinguishing module 13, the first and second data sets are updated, and the updated first and second data sets are used for the next round. The first data set is a source-domain labeled data set with entity labels, represented as word vectors, and the second data set is a target-domain unlabeled data set without entity labels, represented as word vectors.
The size of the source-domain labeled data set is the same as or approximately the same as the size of the target-domain unlabeled data set, and the size of the target-domain labeled data set is smaller than the size of the target-domain unlabeled data set. Preferably, "the same or approximately the same" means that the ratio of the data volume of the source-domain labeled data set to that of the target-domain unlabeled data set lies between 10:14 and 10:9. Making the source-domain labeled data set and the target-domain unlabeled data set approximately the same size prevents the parameters trained during the adversarial training from being biased, due to resource imbalance, toward the domain with the larger data volume, so the final model achieves better named entity recognition in the target domain.
Preferably, training the recognition module 12 with the first data set comprises training it with the word sequences represented as word vectors and the entity labels of the words in those sequences, so that it can identify the entity label of a word from its word vector.
Preferably, the adversarial training of the feature extraction module 11 and the domain distinguishing module 13 with the first and second data sets proceeds as follows: in a round of training, the domain distinguishing module 13 takes the context-aware word sequence F(x) generated by the feature extraction module 11 as input and is trained to output a classification of whether the word sequence comes from the source domain or the target domain; during back-propagation, the parameters of the feature extraction module 11 are adjusted at least according to the loss function of the domain distinguishing module 13 obtained in that round, so that the feature extraction module 11 with its new parameters generates a new context-aware word sequence F(x), and the above steps are repeated with the new F(x) in the next round.
Preferably, the parameters of the feature extraction module 11 of the first training model are adjusted after each round of training as
θ_f ← θ'_f − μ·(α·∂L_task/∂θ'_f − ω·γ·∂L_type/∂θ'_f + β·∂L_target/∂θ'_f)
where θ_f denotes the parameters of the feature extraction module 11 after the adjustment, θ'_f the parameters of the feature extraction module 11 before the adjustment, μ the learning rate, L_task the loss function of the recognition module 12, L_type the loss function of the domain distinguishing module 13, L_target the loss function of the automatic encoding module 14, −ω the gradient reversal parameter, and α, β, γ weights set by the user.
Preferably, after each round of training, the parameters of the recognition module 12, the domain distinguishing module 13, and the automatic encoding module 14 of the first training model are adjusted as follows.
The parameters of the recognition module 12 are adjusted as
θ_y ← θ'_y − μ·α·∂L_task/∂θ'_y
The parameters of the domain distinguishing module 13 are adjusted as
θ_d ← θ'_d − μ·γ·∂L_type/∂θ'_d
The parameters of the automatic encoding module 14 are adjusted as
θ_r ← θ'_r − μ·β·∂L_target/∂θ'_r
where θ_y and θ'_y denote the parameters of the recognition module 12 after and before the adjustment, θ_d and θ'_d the parameters of the domain distinguishing module 13 after and before the adjustment, θ_r and θ'_r the parameters of the automatic encoding module 14 after and before the adjustment, μ the learning rate, and α, β, γ weights set by the user.
Preferably, the goal of training the first training model is to train it to convergence, and one criterion is to minimize the weighted sum of the loss functions of the recognition module 12, the automatic encoding module 14, and the domain distinguishing module 13 of the first training model, i.e., to minimize the total loss
L_total = α·L_task + β·L_target + γ·L_type
where α, β, γ are weights set by the user to balance the influence of the loss functions of the recognition module 12, the automatic encoding module 14, and the domain distinguishing module 13 of the first training model.
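One adversarial training step can then be sketched as a single update on this weighted total loss; the function below is illustrative (it assumes the three losses have already been computed on the current batches and that the domain loss reaches the feature extractor through the gradient reversal layer):

```python
def adversarial_training_step(loss_task, loss_target, loss_type,
                              optimizer, alpha=1.0, beta=1.0, gamma=1.0):
    """One update on L_total = alpha*L_task + beta*L_target + gamma*L_type (sketch).

    loss_task:   CRF loss of the recognition module on the source-domain labeled batch
    loss_target: reconstruction loss of the target-domain automatic encoding module
    loss_type:   loss of the domain distinguishing module; its gradient reaches the
                 feature extractor through the gradient reversal layer, so a single
                 backward pass realizes the adversarial update of the extractor
    optimizer:   e.g. AdaGrad over the parameters of all four modules
    """
    total = alpha * loss_task + beta * loss_target + gamma * loss_type
    optimizer.zero_grad()
    total.backward()    # produces gradients for all modules in one pass
    optimizer.step()    # theta <- theta - mu * gradient, weighted per module as above
    return total.detach()
```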
Step A3: a second training model is constructed, comprising a feature extraction module 21 and a recognition module 22. The initial parameters of the feature extraction module 21 of the second training model are set from the parameters of the feature extraction module 11 of the first training model after the training of step A2, and the initial parameters of the recognition module 22 are set by random initialization. Random initialization of the recognition module 22 produces uniformly distributed parameters, which reduces the time needed to train the model to convergence and also makes it easier to train the model to the optimal rather than a suboptimal result.
The structure of the feature extraction module 21 of the second training model is the same as that of the feature extraction module 11 of the first training model, i.e., it includes a preprocessing layer, a CNN model, a Word2Vec model, and a BiLSTM model consisting of a forward LSTM and a backward LSTM. After the training of the first training model is completed, the parameters obtained by training the feature extraction module 11 of the first training model are used to set the feature extraction module 21 of the second training model by transfer learning. The recognition module 22 of the second training model also comprises a BiLSTM-CRF model; unlike the recognition module 12 of the first training model, the labels of the CRF layer of the BiLSTM-CRF model of the recognition module 22 are set from the entity labels of the target-domain labeled data set, so that named entity recognition on target-domain data is performed according to the entity labels of the target domain.
Step A4: the feature extraction module 21 and the recognition module 22 of the second training model constructed in step A3 are fine-tuned with the third data set in a supervised manner, and the fine-tuned second training model is used as the named entity recognition model, where the third data set is a target-domain labeled data set with entity labels, represented as word vectors. Because the third data set carries the entity labels of the target domain, it can be used for supervised training of the second training model to adjust the parameters of its feature extraction module 21 and recognition module 22, which further improves the accuracy of the final named entity recognition model on target-domain data. It should be understood that the named entity recognition model includes a feature extraction module 31 and a recognition module 32: the feature extraction module 31 is the feature extraction module 21 of the second training model after parameter fine-tuning, and the recognition module 32 is the recognition module 22 of the second training model after parameter fine-tuning.
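A minimal sketch of steps A3 and A4 follows, assuming the module interfaces from the earlier sketches (the extractor returns context features, the recognition head returns a CRF loss); the helper names are chosen here for illustration:

```python
import copy
import torch
import torch.nn as nn

def build_and_finetune(first_extractor: nn.Module, fresh_recognizer: nn.Module,
                       target_labeled_loader, lr=1e-4, epochs=3):
    """Transfer the trained feature extractor, keep a randomly initialized recognition
    head, and fine-tune both on the target-domain labeled set (illustrative sketch)."""
    extractor = copy.deepcopy(first_extractor)   # initial parameters from the first training model
    recognizer = fresh_recognizer                # random initialization (e.g. default PyTorch init)
    params = list(extractor.parameters()) + list(recognizer.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)  # low learning rate for fine-tuning

    for _ in range(epochs):
        for word_ids, char_ids, tags in target_labeled_loader:
            features, _ = extractor(word_ids, char_ids)   # context-aware word vectors
            loss = recognizer(features, tags)             # e.g. CRF negative log-likelihood
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return extractor, recognizer                          # together they form the final NER model
```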
The invention is illustrated below by means of a specific experimental example.
A first part: data set setup
Source-domain labeled data: to train the adversarial learning network (the feature extraction module and the domain distinguishing module of the first training model) for named entity recognition, the OntoNotes 5.0 English data set was used.
Target domain marking data: for training and evaluation of the proposed model, a Ritter11 dataset was used.
Target-domain unlabeled data: to train the adversarial learning network to retain common features, a data set of large-scale unlabeled tweets is required; therefore a large-scale unlabeled Twitter-domain data set was constructed from Twitter, via the Twitter interface, as the target-domain unlabeled data.
Statistics of the OntoNotes 5.0 and Ritter11 data sets are shown in Table 1: the training set of OntoNotes 5.0 contains 848,220 words (tokens) and the training set of Ritter11 contains 37,098 words. The validation word count of the constructed unlabeled Twitter-domain data is 1,177,746.
TABLE 1  Statistics of the OntoNotes 5.0 and Ritter11 data sets

                                   OntoNotes 5.0    Ritter11
  Training set word count          848,220          37,098
  Validation set word count        144,319          4,461
  Test set word count              49,235           4,730
  Training set sentence count      33,908           1,915
  Validation set sentence count    5,771            239
  Test set sentence count          1,898            240
  Named entity categories          18               10
In the art, once a data set is obtained it is generally divided into the three parts shown in Table 1: a training data set (training set), a validation data set (validation set), and a test data set (test set). The training set is used to train the models; during training, its samples are used to train each model or module for multiple rounds until convergence. A model is considered trained to convergence if any one of the following criteria is met. First criterion: the number of training rounds reaches a user-defined upper limit. Second criterion: the F1 value of the named entity recognition model changes by no more than a preset threshold between one round of training and the previous round. Third criterion: the number of training rounds has reached a user-defined lower limit, and after some round the recognition accuracy of the named entity recognition model on the validation set has not improved compared with the previous round. For example, the lower limit may be set to 2 rounds, the upper limit to 30 rounds, and the change threshold to ±0.5%. The validation set is used to compute evaluation metrics, tune parameters, and select algorithms. The test set is used to evaluate the overall performance of the final model.
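The three convergence criteria can be expressed as a simple check such as the following sketch (thresholds follow the example values given above; the function name and signature are illustrative):

```python
def should_stop(round_idx, f1_history, acc_history,
                min_rounds=2, max_rounds=30, f1_delta=0.005):
    """Convergence check combining the three stopping rules described above (sketch).

    f1_history, acc_history: per-round F1 values / validation accuracy, most recent last.
    """
    # Rule 1: reached the user-defined upper limit on training rounds.
    if round_idx >= max_rounds:
        return True
    # Rule 2: F1 changed by no more than the threshold compared with the previous round.
    if len(f1_history) >= 2 and abs(f1_history[-1] - f1_history[-2]) <= f1_delta:
        return True
    # Rule 3: past the lower limit and validation accuracy did not improve this round.
    if round_idx >= min_rounds and len(acc_history) >= 2 and acc_history[-1] <= acc_history[-2]:
        return True
    return False
```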
The entity labels corresponding to the 18 named entity categories of the OntoNotes 5.0 data set and the 10 named entity categories of the Ritter11 data set in Table 1 are listed in Tables 2 and 3 below, respectively. Non-entities are represented by the label O (Outside).
Table 2  OntoNotes 5.0 data set entity labels (label table shown as an image in the original document)
Table 3  Ritter11 data set entity labels (label table shown as an image in the original document)
A second part: experimental setup
Because the labeled OntoNotes 5.0 data set is more than 20 times the size of the labeled Ritter11 data set, training the model directly on the merged data set would bias the final result toward the OntoNotes 5.0 data set and worsen the training outcome. The invention therefore first performs adversarial training on the OntoNotes 5.0 data set and the unlabeled Twitter-domain data set, and then uses the Ritter11 data set to fine-tune the parameters of the second training model at a low learning rate. Fine-tuning (fine-tune) means retraining on new data using the already-trained model parameters as the starting point.
In the experiments of this example, the hyper-parameters of the model are as follows. The mainstream optimizers are AdaGrad, RMSProp, Adam, and AdaDelta; after comparing their effects experimentally, the AdaGrad optimizer is selected for the adversarial learning process, with a learning-rate range of (0, 1) and a default learning rate of 0.1 in this example. The Adam optimizer is selected for the fine-tuning process, with a learning-rate range of (0, 1) and a default learning rate of 0.0001 in this example. The number of training rounds for the early-stopping mechanism can be set in the range (0, 100) and is set to 100 in this example. The word embedding features are trained with the Word2Vec technique released by Google, which converts words into multi-dimensional vectors; the Word2Vec dimension is set to the default of 200. The character-level embedding features are trained with the CNN, with a dimension range of (0, 300), set to 25 in this example. Encoding uses the BiLSTM, with the number of hidden neurons per layer in the range (0, 300); each layer contains 250 hidden neurons in this example. Decoding uses a three-layer standard LSTM, with the number of hidden neurons per LSTM layer in the range (0, 1000); each layer consists of 500 hidden neurons in this example.
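For reference, the hyper-parameters reported above can be collected into a single configuration block such as the following (the key names are chosen here for illustration):

```python
# Hyper-parameters reported for this experiment, gathered in one place (illustrative key names).
HYPERPARAMS = {
    "adversarial_optimizer": "AdaGrad",
    "adversarial_learning_rate": 0.1,
    "finetune_optimizer": "Adam",
    "finetune_learning_rate": 1e-4,
    "early_stopping_rounds": 100,
    "word_embedding_dim": 200,       # Word2Vec word vectors
    "char_embedding_dim": 25,        # CNN character-level features
    "bilstm_hidden_per_layer": 250,  # encoder BiLSTM
    "decoder_lstm_layers": 3,
    "decoder_hidden_per_layer": 500,
}
```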
The baseline experiment, denoted In-domain, is the result of training the basic BiLSTM-CRF model on the Ritter11 data set using the feature extraction module proposed herein, as shown in FIG. 4. That is, the character-level embedding features and word embedding features of a sample such as "Bob Dylan visited Sweden" are concatenated and fed into the BiLSTM layer and CRF layer of the BiLSTM-CRF model to obtain the named entity recognition result. For example, the named entity recognition results for the sample "Bob Dylan visited Sweden" are: B-PER I-PER O B-GPE, where B-PER indicates that "Bob" begins a person-name entity (Begin), I-PER indicates that "Dylan" is inside a person-name entity (Inside), O indicates that "visited" is a non-entity, and B-GPE indicates that "Sweden" begins a country/city/state entity (Begin).
In addition, the existing parameter initialization method (INIT) and multi-task learning method (MULT), shown in FIG. 5, are used as comparison experiments. These two existing methods are described below:
Parameter initialization method (INIT): first, the source model M_S is trained with the source-domain training data D_S. Next, a target model M_T is constructed, and its final CRF layer is rebuilt to account for the difference in output (label) space. The learned M_S is used to initialize M_T (excluding the CRF layer). Finally, M_T continues to be trained with the target-domain training data D_T.
The multi-task learning method (MULT) comprises the following steps: M_S and M_T are trained simultaneously with D_S and D_T in a multi-task fashion, and their parameters (excluding the CRF layers) are shared during training. In some prior art schemes, a hyper-parameter λ is used as the probability of selecting training instances from D_S rather than D_T when optimizing the model parameters; choosing λ appropriately makes the multi-task learning process perform better in the target domain. The drawback of this method is that when the source domain is large and the target domain is small, training is biased toward the source domain and the result is poor.
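The role of the hyper-parameter λ can be illustrated with the sketch below, which draws each training batch from the source set with probability λ and from the target set otherwise (the loader names are illustrative placeholders):

```python
import random

def next_batch(source_batches, target_batches, lam):
    """A small lam favors the (small) target domain and counters the source-domain bias."""
    if random.random() < lam:
        return next(source_batches)   # instance from D_S
    return next(target_batches)       # instance from D_T
```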
Part three: Evaluation method and metrics
The evaluation adopts the exact-match criterion specified by the CoNLL-2003 shared task: a prediction is counted as correct only when both the boundary and the type of the entity match the gold annotation.
The evaluation metrics are Precision, Recall and the F1 value (F1-score), calculated as follows:
Precision:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Recall:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

F1 value:

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
wherein TP denotes True Positive, i.e. a positive sample predicted as positive by the model (an entity word correctly labeled as an entity), which may be called a correct detection;
FP denotes False Positive, i.e. a negative sample predicted as positive by the model (a non-entity word labeled as an entity), which may be called a false alarm;
FN denotes False Negative, i.e. a positive sample predicted as negative by the model (an entity word labeled as a non-entity), which may be called a miss.
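The three metrics can be computed from exact-match comparisons of gold and predicted entities, as in the following self-contained sketch (illustrative code, not the evaluation script used in the experiments):

```python
def precision_recall_f1(gold_entities, pred_entities):
    """Each entity is (start, end, type); a prediction counts only on an exact boundary-and-type match."""
    gold, pred = set(gold_entities), set(pred_entities)
    tp = len(gold & pred)          # entity words correctly labeled
    fp = len(pred - gold)          # non-entity spans labeled as entities
    fn = len(gold - pred)          # gold entities the model missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: two gold entities, one found exactly plus one spurious prediction -> P = R = F1 = 0.5
print(precision_recall_f1({(0, 2, "PER"), (3, 4, "GPE")}, {(0, 2, "PER"), (2, 3, "ORG")}))
```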
The experimental results for this example are shown in table 4:
TABLE 4 Experimental results
(The body of Table 4 appears only as an image in the source; it reports the evaluation metrics for the In-domain, INIT and MULT baselines and for the ablations of the proposed model discussed below.)
As can be seen from Table 4, the variant of the model of the present invention that uses only the feature extraction module and the NER classification module for fine-tuning (row 5 of Table 4) performs similarly to the In-domain, INIT and MULT methods, because fine-tuning with only these two modules is essentially a standard transfer learning method. On this basis, adding the domain distinguishing module so that it forms an adversarial learning network with the feature extraction module (row 6 of Table 4) improves performance, and adding both the domain distinguishing module and the target-domain automatic coding module (row 7 of Table 4) performs better than adding the domain distinguishing module alone, which shows that introducing the automatic coding module for the target domain retains domain-specific features and yields better performance. The experimental results show that the proposed model is of great help for cross-domain named entity recognition.
According to an embodiment of the present invention, there is provided a named entity recognition method based on a named entity recognition model obtained with the training method of the foregoing embodiments, where the named entity recognition model includes a feature extraction module 31 and a recognition module 32. The named entity recognition method includes: B1, acquiring character-level embedding features and word embedding features of the text to be recognized through the feature extraction module 31 of the named entity recognition model, and concatenating them in series to obtain the word vectors of the words in the text to be recognized; B2, inputting the text to be recognized, represented in word-vector form, into the recognition module 32 of the named entity recognition model to obtain the named entity recognition result of the text to be recognized.
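Steps B1 and B2 can be summarized in the following minimal sketch; the three callables stand for the trained character-level encoder, Word2Vec lookup and recognition module, and their interfaces are assumptions rather than the patent's actual API.

```python
import torch

def recognize(tokens, char_encoder, word_embedder, recognizer):
    char_feats = char_encoder(tokens)          # (len, 25)  character-level embedding features (B1)
    word_feats = word_embedder(tokens)         # (len, 200) word embedding features (B1)
    word_vectors = torch.cat([char_feats, word_feats], dim=-1)   # series concatenation (B1)
    return recognizer(word_vectors.unsqueeze(0))   # BiLSTM-CRF tags, e.g. B-PER I-PER O B-GPE (B2)
```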
It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (13)

1. A named entity recognition model training method, characterized in that the method comprises:
a1, constructing a first training model, wherein the first training model comprises a feature extraction module, a recognition module and a field distinguishing module;
a2, performing multiple rounds of training on the first training model, wherein in each round of training, the first data set is used for training the recognition module, the first data set and the second data set are used for performing adversarial training on the feature extraction module and the domain distinguishing module, parameters of the feature extraction module are adjusted at least according to the loss function of the recognition module and the loss function of the domain distinguishing module after each round of training, the first data set and the second data set are updated simultaneously, and the updated first data set and the updated second data set are used for performing the next round of training, wherein the first data set is a source domain labeled data set with entity labels represented in a word vector form, and the second data set is a target domain unlabeled data set without entity labels represented in a word vector form;
a3, constructing a second training model, wherein the second training model comprises a feature extraction module and an identification module, the initial parameters of the feature extraction module of the second training model are set by the parameters of the feature extraction module of the first training model trained in the step A2, and the initial parameters of the identification module are set in a random initialization mode;
and A4, performing parameter fine-tuning on the feature extraction module and the recognition module of the second training model constructed in the step A3 by using a third data set in a supervised training mode, and taking the second training model after parameter fine-tuning as the named entity recognition model, wherein the third data set is a target domain labeled data set with entity labels represented in a word vector form.
2. The named entity recognition model training method of claim 1, wherein the size of the source domain labeled dataset is the same or approximately the same as the size of the target domain unlabeled dataset, and the size of the target domain labeled dataset is smaller than the size of the target domain unlabeled dataset.
3. The named entity recognition model training method according to claim 2, wherein the same or approximately the same scale means that the ratio of the data amount of the source domain labeled data set to that of the target domain unlabeled data set is in the range of 10:14 to 10:9.
4. The named entity recognition model training method of claim 2, wherein the feature extraction module in the first training model comprises a preprocessing layer, a CNN model, a Word2Vec model, and a BiLSTM model including a forward LSTM and a backward LSTM, wherein the forward LSTM and the backward LSTM each comprise a plurality of sequentially connected LSTM units;
the feature extraction module respectively processes the source domain labeled data set, the target domain unlabeled data set and the target domain labeled data set represented in a non-word-vector form to obtain the first data set, the second data set and the third data set as follows:
preprocessing the words of the data set by using the preprocessing layer, wherein the preprocessing comprises case normalization and stop-word removal;
extracting the character level embedding characteristics of each word in the data set by using a CNN model;
extracting Word embedding characteristics of each Word in the data set by using a Word2Vec model;
the character level embedding characteristics and the word embedding characteristics of each word in the data set are spliced in series to obtain vector representation of each word;
and inputting the vector representation of each word in the data set into the BiLSTM model of the feature extraction module for processing, to obtain the data set containing context information and represented in the word-vector form.
5. The named entity recognition model training method according to any one of claims 1 to 4, wherein the recognition modules of the first and second training models each comprise a BiLSTM-CRF model, wherein the label value space of the CRF layer of the BiLSTM-CRF model of the recognition module in the first training model is set using the entity labels of the source domain labeled data set, and the label value space of the CRF layer of the BiLSTM-CRF model of the recognition module of the second training model is set using the entity labels of the target domain labeled data set.
6. The named entity recognition model training method of claim 4, wherein the first training model further comprises a gradient inversion layer, and during the adversarial training of the feature extraction module and the domain distinguishing module, the feature extraction module and the domain distinguishing module of the first training model perform standard stochastic gradient descent through the gradient inversion layer during forward propagation, and during backward propagation the gradient inversion layer automatically negates the gradient of the loss function of the domain distinguishing module before returning it to the feature extraction module, so that the feature extraction module extracts features common to words in the source domain labeled data set and the target domain unlabeled data set.
7. The method for training the named entity recognition model according to claim 6, wherein the first training model further comprises an automatic coding module, the automatic coding module is trained by using the second data set, and after each round of training, parameters of the feature extraction module are updated according to a loss function of the automatic coding module, a loss function of the recognition module and a loss function of the domain distinguishing module.
8. The named entity recognition model training method of claim 7, wherein the auto-encoding module comprises an encoder and a decoder,
in each round of training, the encoder acquires the hidden states of the last LSTM unit of the forward LSTM and the last LSTM unit of the backward LSTM, which the BiLSTM model of the feature extraction module produces for the words of the target domain unlabeled data set, and combines them into the initial state features of the decoder, and the initial state features together with the embedding feature of the previous word are used as the input of the decoder, so that the automatic coding module is trained to extract the private features of the target domain.
9. The method for training the named entity recognition model of claim 8, wherein in step a2, the parameters of the feature extraction module of the first training model are adjusted as follows:
$$\theta_f = \theta'_f - \mu\left(\alpha\frac{\partial L_{task}}{\partial \theta_f} - \omega\beta\frac{\partial L_{type}}{\partial \theta_f} + \gamma\frac{\partial L_{target}}{\partial \theta_f}\right)$$
wherein θ_f represents the parameters of the feature extraction module after the current adjustment, θ'_f represents the parameters of the feature extraction module before the current adjustment, μ represents the learning rate, L_task represents the loss function of the recognition module, L_type represents the loss function of the domain distinguishing module, L_target represents the loss function of the automatic coding module, -ω represents the gradient flipping parameter, and α, β, γ represent weights set by the user.
10. The training method of the named entity recognition model according to claim 9, wherein the step a2 further comprises: after each round of training, parameters of the recognition module, the domain distinguishing module and the automatic coding module of the first training model are adjusted according to the following modes:
the corresponding parameter adjustment mode of the identification module is as follows:
$$\theta_y = \theta'_y - \mu\alpha\frac{\partial L_{task}}{\partial \theta_y}$$
the parameter adjusting mode corresponding to the domain distinguishing module is as follows:
$$\theta_d = \theta'_d - \mu\beta\frac{\partial L_{type}}{\partial \theta_d}$$
the corresponding parameter adjusting mode of the automatic coding module is as follows:
$$\theta_r = \theta'_r - \mu\gamma\frac{\partial L_{target}}{\partial \theta_r}$$
wherein θ_y represents the parameters of the recognition module after the current adjustment, θ'_y represents the parameters of the recognition module before the current adjustment, θ_d represents the parameters of the domain distinguishing module after the current adjustment, θ'_d represents the parameters of the domain distinguishing module before the current adjustment, θ_r represents the parameters of the automatic coding module after the current adjustment, θ'_r represents the parameters of the automatic coding module before the current adjustment, μ represents the learning rate, and α, β, γ represent the weights set by the user.
11. A named entity recognition method based on a named entity recognition model trained on the training method of the named entity recognition model according to any one of the preceding claims 1 to 10, wherein the named entity recognition model comprises a feature extraction module and a recognition module,
the named entity identification method comprises the following steps:
b1, acquiring character level embedded features and word embedded features of the text to be recognized through a feature extraction module of the named entity recognition model, and performing series connection and splicing to obtain word vectors of words in the text to be recognized;
b2, inputting the text to be recognized represented in the form of word vector into the recognition module of the named entity recognition model, and obtaining the named entity recognition result of the text to be recognized.
12. A computer-readable storage medium, having embodied thereon a computer program, the computer program being executable by a processor to perform the steps of the method of any one of claims 1 to 11.
13. An electronic device, comprising:
one or more processors; and
a memory, wherein the memory is to store one or more executable instructions;
the one or more processors are configured to implement the steps of the method of any one of claims 1-11 via execution of the one or more executable instructions.
CN202010541415.5A 2020-06-15 2020-06-15 Named entity recognition model training method, named entity recognition method and medium Active CN111738003B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010541415.5A CN111738003B (en) 2020-06-15 2020-06-15 Named entity recognition model training method, named entity recognition method and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010541415.5A CN111738003B (en) 2020-06-15 2020-06-15 Named entity recognition model training method, named entity recognition method and medium

Publications (2)

Publication Number Publication Date
CN111738003A true CN111738003A (en) 2020-10-02
CN111738003B CN111738003B (en) 2023-06-06

Family

ID=72649103

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010541415.5A Active CN111738003B (en) 2020-06-15 2020-06-15 Named entity recognition model training method, named entity recognition method and medium

Country Status (1)

Country Link
CN (1) CN111738003B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150286629A1 (en) * 2014-04-08 2015-10-08 Microsoft Corporation Named entity recognition
CN108664589A (en) * 2018-05-08 2018-10-16 苏州大学 Text message extracting method, device, system and medium based on domain-adaptive
CN109918644A (en) * 2019-01-26 2019-06-21 华南理工大学 A kind of Chinese medicine health consultation text name entity recognition method based on transfer learning
CN110765775A (en) * 2019-11-01 2020-02-07 北京邮电大学 Self-adaptive method for named entity recognition field fusing semantics and label differences
CN111241837A (en) * 2020-01-04 2020-06-05 大连理工大学 Theft case legal document named entity identification method based on anti-migration learning
CN111222339A (en) * 2020-01-13 2020-06-02 华南理工大学 Medical consultation named entity identification method based on anti-multitask learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
盛剑 (Sheng Jian): "Application of Transfer Learning in Named Entity Recognition", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215007B (en) * 2020-10-22 2022-09-23 上海交通大学 Organization named entity normalization method and system based on LEAM model
CN112215007A (en) * 2020-10-22 2021-01-12 上海交通大学 Organization named entity normalization method and system based on LEAM model
CN114548103A (en) * 2020-11-25 2022-05-27 马上消费金融股份有限公司 Training method of named entity recognition model and recognition method of named entity
CN114548103B (en) * 2020-11-25 2024-03-29 马上消费金融股份有限公司 Named entity recognition model training method and named entity recognition method
CN112597774A (en) * 2020-12-14 2021-04-02 山东师范大学 Chinese medical named entity recognition method, system, storage medium and equipment
CN112597774B (en) * 2020-12-14 2023-06-23 山东师范大学 Chinese medical named entity recognition method, system, storage medium and equipment
CN112906857A (en) * 2021-01-21 2021-06-04 商汤国际私人有限公司 Network training method and device, electronic equipment and storage medium
CN112906857B (en) * 2021-01-21 2024-03-19 商汤国际私人有限公司 Network training method and device, electronic equipment and storage medium
CN112926665A (en) * 2021-03-02 2021-06-08 安徽七天教育科技有限公司 Text line recognition system based on domain self-adaptation and use method
CN113221575B (en) * 2021-05-28 2022-08-02 北京理工大学 PU reinforcement learning remote supervision named entity identification method
CN113221575A (en) * 2021-05-28 2021-08-06 北京理工大学 PU reinforcement learning remote supervision named entity identification method
CN113723102B (en) * 2021-06-30 2024-04-26 平安国际智慧城市科技股份有限公司 Named entity recognition method, named entity recognition device, electronic equipment and storage medium
CN113723102A (en) * 2021-06-30 2021-11-30 平安国际智慧城市科技股份有限公司 Named entity recognition method and device, electronic equipment and storage medium
CN113516196A (en) * 2021-07-20 2021-10-19 云知声智能科技股份有限公司 Method, device, electronic equipment and medium for named entity identification data enhancement
CN113516196B (en) * 2021-07-20 2024-04-12 云知声智能科技股份有限公司 Named entity recognition data enhancement method, named entity recognition data enhancement device, electronic equipment and named entity recognition data enhancement medium
CN113887227B (en) * 2021-09-15 2023-05-02 北京三快在线科技有限公司 Model training and entity identification method and device
CN113887227A (en) * 2021-09-15 2022-01-04 北京三快在线科技有限公司 Model training and entity recognition method and device
CN114580415A (en) * 2022-02-25 2022-06-03 华南理工大学 Cross-domain graph matching entity identification method for education examination
CN114580415B (en) * 2022-02-25 2024-03-22 华南理工大学 Cross-domain graph matching entity identification method for educational examination
CN114548109B (en) * 2022-04-24 2022-09-23 阿里巴巴达摩院(杭州)科技有限公司 Named entity recognition model training method and named entity recognition method
CN114548109A (en) * 2022-04-24 2022-05-27 阿里巴巴达摩院(杭州)科技有限公司 Named entity recognition model training method and named entity recognition method
CN116720519A (en) * 2023-06-08 2023-09-08 吉首大学 Seedling medicine named entity identification method
CN116720519B (en) * 2023-06-08 2023-12-19 吉首大学 Seedling medicine named entity identification method

Also Published As

Publication number Publication date
CN111738003B (en) 2023-06-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant