CN115186670A - Method and system for identifying domain named entities based on active learning


Info

Publication number: CN115186670A (application CN202211092071.XA)
Authority: CN (China)
Prior art keywords: text, texts, field, recognized, named entity
Legal status: Granted; currently Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to its accuracy)
Other languages: Chinese (zh)
Other versions: CN115186670B (en)
Inventors: 王海泉, 杜博文, 孙磊磊, 颜炜
Assignee (current and original): Beihang University (the listed assignees may be inaccurate)
Filing: application CN202211092071.XA filed by Beihang University
Publication: CN115186670A; application granted and published as CN115186670B


Classifications

    • G: Physics
    • G06: Computing; calculating or counting
    • G06F: Electric digital data processing
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295: Named entity recognition
    • G06F 16/00: Information retrieval; database structures therefor; file system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/35: Clustering; classification
    • G06F 16/353: Clustering; classification into predefined classes
    • G06N: Computing arrangements based on specific computational models
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Abstract

The invention relates to a method and system for domain named entity recognition based on active learning, belonging to the technical field of domain named entity recognition. The texts in a general text set are clustered according to the distance between each text in the general text set and the text of the field to be recognized, to obtain a text set; each text in this text set, together with the text of the field to be recognized, forms an extended text set. Self-supervised learning is performed on a pre-training model according to the extended text set to obtain a trained pre-training model and the text feature vector corresponding to the field to be recognized. A domain named entity recognition model is then constructed and trained by an active learning method, according to the extended text set and the text feature vector corresponding to the field to be recognized, to obtain the trained domain named entity recognition model. The method can transfer general text features to specific domain tasks without labeling a large amount of data.

Description

Method and system for identifying domain named entities based on active learning
Technical Field
The invention relates to the technical field of named entity recognition, and in particular to a method and system for domain named entity recognition based on active learning.
Background
In recent years, deep learning based methods have dominated the field of named entity recognition. A deep learning named entity recognition method can be decomposed into three parts, from input sequence to tag sequence: 1. Distributed representation of the input, i.e., converting the input into vectors: each input word is mapped to a low-dimensional vector while preserving its semantic properties; the vector representations used include word vectors and hybrid vectors. 2. The context encoder, i.e., the model component that mines the associated information of the text. Current mainstream methods include recurrent neural networks and their variants, gated recurrent units and long short-term memory networks, and Transformer-based language models that learn initial parameters through unsupervised tasks on unlabeled data and encode by combining contextual and static features. 3. The tag decoder, which maps the high-dimensional features output by the model to label classes, for example using multi-layer neurons and Softmax as the tag decoder, i.e., treating the output-to-label mapping as a multi-classification task in which each output is mapped independently. However, when such methods are applied to domain text, the following problems mainly exist:
1. Domain text feature extraction is insufficient or relies on manually constructed domain features. Existing text feature extraction mainly relies on establishing a self-supervised task to capture relations between texts and thereby obtain feature vectors that preserve textual semantics. Such self-supervised models, commonly referred to as pre-trained models, are trained on large corpora to obtain text vectors suitable for most downstream tasks. However, domain text usually differs from general text in theme, genre, style, and so on; relying only on text vectors pre-trained on general text therefore limits task accuracy. For this reason, in named entity recognition tasks on domain text, domain-specific features are often added when extracting text feature vectors in order to obtain higher model accuracy. But constructing such features manually for every domain or every task is impractical, so a method that can automatically migrate general text features to specific domain tasks is needed.
2. Named entity recognition models based on deep learning rely on large amounts of annotated data, but manual annotation is expensive. Deep learning based models are widely used in named entity recognition tasks because their deep structure allows complex features to be learned from data. The drawback is that these models contain a large number of parameters: to obtain a model with satisfactory accuracy, a large amount of labeled data must first be collected so that the model can perform supervised learning, with parameters updated by gradient descent. In research settings there are many public labeled data sets available. In application environments, however, the named entity recognition task is not limited to entities such as people, places, and dates, and may be further refined according to business scenarios; labeled data must then be collected anew, which requires manual annotation and a corresponding investment of manpower. A domain named entity recognition method that migrates general text features to specific domain tasks without requiring a large amount of labeled data is therefore urgently needed.
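The tag-decoding step described in the background above (multi-layer neurons plus Softmax, with each output mapped to a label independently) can be illustrated with a minimal pure-Python sketch; the labels, weights, and two-dimensional features below are illustrative assumptions, not part of the invention:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def decode_tags(features, weights, labels):
    """Map each token's feature vector independently to its most
    probable label (the multi-classification view of tag decoding)."""
    tags = []
    for feat in features:
        # linear layer: one score per label
        scores = [sum(w * x for w, x in zip(weights[lab], feat))
                  for lab in labels]
        probs = softmax(scores)
        tags.append(labels[probs.index(max(probs))])
    return tags

# toy example: 2-dim features, three BIO-style labels
labels = ["O", "B-ENT", "I-ENT"]
weights = {"O": [1.0, 0.0], "B-ENT": [0.0, 1.0], "I-ENT": [0.5, 0.5]}
features = [[2.0, 0.1], [0.1, 2.0]]
print(decode_tags(features, weights, labels))  # → ['O', 'B-ENT']
```

Each token is decoded in isolation, which is exactly the independence assumption the background paragraph attributes to Softmax-style decoders.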
Disclosure of Invention
The invention aims to provide a domain named entity recognition method and system based on active learning, which can transfer general text features to specific domain tasks and do not require a large amount of labeled data.
In order to achieve the purpose, the invention provides the following scheme:
a domain named entity recognition method based on active learning comprises the following steps:
acquiring a general text set and the text of the field to be recognized;
clustering the texts in the general text set according to the distance between the texts in the general text set and the texts in the field to be recognized to obtain a text set;
taking each text in the text set, together with the text of the field to be recognized, as the expanded text of the field to be recognized, to form an extended text set;
performing self-supervised learning on a pre-training model according to the extended text set to obtain a trained pre-training model and the text feature vector corresponding to the field to be recognized, wherein the pre-training model comprises a context encoder, a feedforward neural network and a softmax layer connected in sequence;
constructing a domain named entity recognition model; the domain named entity recognition model comprises a context encoder and a tag decoder which are sequentially connected; the context encoder is a context encoder in the trained pre-training model;
and training the domain named entity recognition model according to the extended text set and the text feature vector corresponding to the domain to be recognized by adopting an active learning method to obtain a trained domain named entity recognition model, wherein the trained domain named entity recognition model is used for performing domain named entity recognition on the text of the domain to be recognized.
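The pre-training model named in the steps above (a context encoder, a feedforward neural network, and a softmax layer connected in sequence) can be sketched at the level of its output head; the context encoder itself is abstracted into a ready-made feature vector, and all dimensions and weights below are illustrative assumptions:

```python
import math

def feedforward(x, weights, bias):
    """One feedforward layer: y_i = sum_j W[i][j] * x[j] + b[i]."""
    return [sum(w * xj for w, xj in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pretrain_head(encoder_feature, weights, bias):
    """Context-encoder feature -> feedforward layer -> softmax
    distribution over the self-supervised prediction targets."""
    return softmax(feedforward(encoder_feature, weights, bias))

# toy 3-dim feature with identity weights: a valid distribution results
probs = pretrain_head([0.2, -0.1, 0.4],
                      weights=[[1, 0, 0], [0, 1, 0], [0, 0, 1]],
                      bias=[0.0, 0.0, 0.0])
print(round(sum(probs), 6))  # → 1.0
```

The self-supervised objective (which targets the head predicts) is not fixed by the claim, so it is left abstract here.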
Optionally, the clustering, according to the distance between each text in the general text set and the text of the field to be recognized, each text in the general text set to obtain a text set specifically includes:
determining a text vector of each text in the general text set and a text vector of the text of the field to be recognized;
clustering the text vectors of the texts in the general text set according to the distance between the text vector of each text in the general text set and the text vector of the text of the field to be recognized, to obtain a text vector set;
and determining the text corresponding to each text vector in the text vector set as a text set.
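As a hedged illustration of this clustering step: the patent clusters general texts by their distance to the domain text, and the sketch below stands in for that with a simple nearest-k selection under cosine distance (the metric, the value of k, and the two-dimensional vectors are assumptions for illustration, not the patent's prescribed choices):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def select_nearest_texts(general_vecs, domain_vec, k):
    """Return indices of the k general-text vectors closest to the
    domain text vector: the 'text set' used for expansion."""
    ranked = sorted(range(len(general_vecs)),
                    key=lambda i: cosine_distance(general_vecs[i], domain_vec))
    return ranked[:k]

# toy example: pick the 2 of 3 general texts nearest the domain text
general_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
domain_vec = [1.0, 0.05]
print(select_nearest_texts(general_vecs, domain_vec, k=2))  # → [0, 1]
```

Any proper clustering algorithm (e.g. k-means over the joint vector space) could replace the nearest-k selection; the point is that membership is decided purely by vector distance to the domain text.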
Optionally, the determining the text vector of each text in the general text set and the text vector of the text in the field to be recognized specifically includes:
respectively performing word segmentation on each text in the general text set and the text in the field to be recognized to obtain a word segmentation set corresponding to each text;
and respectively inputting the word segmentation sets corresponding to the texts into an encoder to obtain the text vectors of the texts in the general text set and the text vectors of the texts in the field to be recognized.
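A minimal sketch of this vectorization step, with whitespace splitting standing in for word segmentation and an averaged hash-derived embedding standing in for the encoder (the patent does not fix a particular encoder; every detail below is an illustrative assumption):

```python
import hashlib

DIM = 8  # assumed embedding dimension for the toy encoder

def word_embedding(word):
    """Deterministic stand-in embedding derived from a hash of the word."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:DIM]]

def text_vector(text):
    """Segment the text into words and average their embeddings
    to obtain a single text vector."""
    words = text.split()  # word segmentation (toy: whitespace split)
    vecs = [word_embedding(w) for w in words]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

vec = text_vector("domain named entity recognition")
print(len(vec))  # → 8
```

In practice the word segmentation set would come from a real segmenter and the encoder from the pre-training model; the hash embedding only makes the pipeline runnable end to end.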
Optionally, taking each text in the text set, together with the text of the field to be recognized, as the expanded text of the field to be recognized to form an extended text set specifically includes:
inputting each text vector in the text vector set into a decoder to obtain the text corresponding to each text vector;
and determining the texts corresponding to the text vectors, together with the text of the field to be recognized, as the expanded texts of the field to be recognized, forming the extended text set.
Optionally, the training of the domain named entity recognition model according to the extended text set and the text feature vector corresponding to the domain to be recognized by using the active learning method to obtain the trained domain named entity recognition model specifically includes:
under the current number of iterations, for any text in the extended text set, inputting the text feature vector corresponding to the field to be recognized and the text into the domain named entity recognition model to obtain the label sequence of the text and the prediction probability of each label for each word segment in the word segmentation set corresponding to the text; the label sequence of the text comprises the label corresponding to each word segment obtained by segmenting the text;
determining the information content of the text according to the prediction probabilities corresponding to the prediction labels in the prediction label sequence of the text; the prediction label sequence of the text comprises the prediction label of each word segment in the word segmentation set corresponding to the text, the prediction label of any word segment being the label with the maximum prediction probability among all labels for that word segment;
sorting all texts in the extended text set in a descending order according to the information content of each text in the extended text set;
selecting the first M texts to label the domain named entities to obtain the labeled texts;
and training the domain named entity recognition model according to the labeled text to obtain a domain named entity recognition model under the next iteration number, determining the unlabeled text in the extended text set as the extended text set under the next iteration number, and entering the next iteration until an iteration stop condition is reached to obtain the trained domain named entity recognition model.
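The iterative procedure above can be sketched as follows; the least-confidence scoring, the placeholder predict/annotate/train callbacks, and the fixed round budget used as the stop condition are assumptions for illustration, not the patent's prescribed choices:

```python
def information_content(token_probs):
    """Least-confidence style score: sum over word segments of
    (1 - max predicted label probability). Higher = more informative."""
    return sum(1.0 - max(p.values()) for p in token_probs)

def active_learning_loop(pool, predict, annotate, train, m, rounds):
    """pool: unlabeled texts; predict(text) -> per-segment label-prob dicts;
    annotate(text) -> gold labels; train(labeled) -> updated model state."""
    labeled = []
    model_state = None
    for _ in range(rounds):
        if not pool:
            break  # iteration stop condition: pool exhausted
        # rank unlabeled texts by information content, descending
        pool.sort(key=lambda t: information_content(predict(t)), reverse=True)
        batch, pool = pool[:m], pool[m:]          # take the first M texts
        labeled.extend((t, annotate(t)) for t in batch)
        model_state = train(labeled)              # retrain on labeled texts
    return model_state, labeled, pool

# toy run: "uncertain" texts (flat probabilities) get labeled first
def predict(text):
    flat = {"O": 0.5, "ENT": 0.5} if "?" in text else {"O": 0.9, "ENT": 0.1}
    return [flat for _ in text.split()]

state, labeled, rest = active_learning_loop(
    pool=["sure one", "unsure ?", "sure two"],
    predict=predict, annotate=lambda t: ["O"] * len(t.split()),
    train=lambda data: len(data), m=1, rounds=1)
print([t for t, _ in labeled])  # → ['unsure ?']
```

The text the model is least confident about is selected for annotation first, which is the mechanism by which model computation replaces part of the manual labeling cost.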
A domain named entity recognition system based on active learning, comprising:
the acquisition module is used for acquiring a general text set and the text of the field to be recognized;
the clustering module is used for clustering the texts in the general text set according to the distance between each text in the general text set and the text of the field to be recognized to obtain a text set;
the expansion module is used for taking each text in the text set, together with the text of the field to be recognized, as the expanded text of the field to be recognized, forming an extended text set;
the pre-training module is used for performing self-supervised learning on a pre-training model according to the extended text set to obtain a trained pre-training model and the text feature vector corresponding to the field to be recognized, wherein the pre-training model comprises a context encoder, a feedforward neural network and a softmax layer connected in sequence;
the construction module is used for constructing a domain named entity recognition model; the domain named entity recognition model comprises a context encoder and a tag decoder connected in sequence; the context encoder is the context encoder in the trained pre-training model;
and the training module is used for training the domain named entity recognition model by an active learning method according to the extended text set and the text feature vector corresponding to the field to be recognized to obtain the trained domain named entity recognition model, which is used for performing domain named entity recognition on the text of the field to be recognized.
Optionally, the clustering module specifically includes:
the text vector calculation unit is used for determining the text vector of each text in the general text set and the text vector of the text of the field to be recognized;
the clustering unit is used for clustering the text vectors of the texts in the general text set according to the distance between the text vector of each text in the general text set and the text vector of the text of the field to be recognized, to obtain a text vector set;
and the text set determining unit is used for determining the text corresponding to each text vector in the text vector set as a text set.
Optionally, the text vector calculating unit specifically includes:
the word segmentation subunit is used for performing word segmentation on each text in the general text set and on the text of the field to be recognized, to obtain the word segmentation set corresponding to each text;
and the text vector calculation subunit is used for inputting the word segmentation set corresponding to each text into an encoder to obtain the text vector of each text in the general text set and the text vector of the text of the field to be recognized.
Optionally, the expansion module specifically includes:
the decoding unit is used for inputting each text vector in the text vector set into a decoder to obtain the text corresponding to each text vector;
and the extension unit is used for determining the texts corresponding to the text vectors, together with the text of the field to be recognized, as the expanded texts of the field to be recognized, forming the extended text set.
Optionally, the training module specifically includes:
the probability determining unit is used for, under the current number of iterations and for any text in the extended text set, inputting the text feature vector corresponding to the field to be recognized and the text into the domain named entity recognition model to obtain the label sequence of the text and the prediction probability of each label for each word segment in the word segmentation set corresponding to the text; the label sequence of the text comprises the label corresponding to each word segment obtained by segmenting the text;
the information content calculation unit is used for determining the information content of the text according to the prediction probabilities corresponding to the prediction labels in the prediction label sequence of the text; the prediction label sequence of the text comprises the prediction label of each word segment in the word segmentation set corresponding to the text, the prediction label of any word segment being the label with the maximum prediction probability among all labels for that word segment;
the sorting unit is used for sorting all texts in the extended text set in descending order according to the information content of each text in the extended text set;
the marking unit is used for selecting the first M texts to mark the domain named entities to obtain marked texts;
and the training unit is used for training the domain named entity recognition model according to the labeled text to obtain the domain named entity recognition model under the next iteration number, determining the text which is not labeled in the extended text set as the extended text set under the next iteration number, and entering the next iteration until an iteration stop condition is reached to obtain the trained domain named entity recognition model.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects. The texts in the general text set are clustered according to the distance between each text in the general text set and the text of the field to be recognized, to obtain a text set; each text in the text set, together with the text of the field to be recognized, is taken as the expanded text of the field to be recognized, forming an extended text set. Self-supervised learning is performed on the pre-training model according to the extended text set to obtain a trained pre-training model and the text feature vector corresponding to the field to be recognized. A domain named entity recognition model is constructed and trained by an active learning method according to the extended text set and the text feature vector corresponding to the field to be recognized, to obtain the trained domain named entity recognition model. By clustering general texts according to their distance to the domain text to obtain the extended texts, general text features are transferred to the specific domain task; by training the model with active learning, part of the cost of manual labeling is replaced with the computing power of the model, and a high-accuracy model is obtained with as few labeled texts as possible.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention; other drawings can be obtained from them by those skilled in the art without inventive effort.
Fig. 1 is a detailed flowchart of a domain-named entity recognition method based on active learning according to an embodiment of the present invention;
FIG. 2 is a general flowchart of a domain named entity recognition method based on active learning according to an embodiment of the present invention;
FIG. 3 is a block diagram of a pre-training model;
FIG. 4 is a block diagram of a domain named entity recognition model;
FIG. 5 is a flow chart for training a model using active learning.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, the present invention is described in detail with reference to the accompanying drawings and the detailed description thereof.
As shown in fig. 1, an embodiment of the present invention discloses a domain named entity identification method based on active learning, including:
step 101: and acquiring a general text set and a text of the field to be recognized.
Step 102: and clustering the texts in the universal text set according to the distance between the texts in the universal text set and the texts in the field to be recognized to obtain a text set.
Step 103: and determining each text in the text set and the text of the field to be recognized as the text after the field to be recognized is expanded to form an expanded text set.
Step 104: self-supervision learning is carried out on a pre-training model according to the extended text set to obtain a trained pre-training model and a text feature vector corresponding to the field to be recognized, and the pre-training model comprises a context encoder, a feedforward neural network and a text feature vector which are connected in sequencesoftmaxA layer.
Step 105: constructing a domain named entity recognition model; the domain named entity recognition model comprises a context coder and a tag decoder which are sequentially connected; the context encoder is a context encoder in the trained pre-training model.
Step 106: and training the domain named entity recognition model according to the extended text set and the text feature vector corresponding to the domain to be recognized by adopting an active learning method to obtain a trained domain named entity recognition model, wherein the trained domain named entity recognition model is used for performing domain named entity recognition on the text of the domain to be recognized.
In practical application, clustering the texts in the general text set according to the distance between each text in the general text set and the text of the field to be recognized to obtain a text set specifically includes:
determining the text vector of each text in the general text set and the text vector of the text of the field to be recognized;
clustering the text vectors of the texts in the general text set according to the distance between the text vector of each text in the general text set and the text vector of the text of the field to be recognized, to obtain a text vector set;
and determining the text corresponding to each text vector in the text vector set as the text set.
In practical application, the determining the text vector of each text in the general text set and the text vector of the text in the field to be recognized specifically includes:
and respectively carrying out word segmentation on each text in the general text set and the text in the field to be recognized to obtain a word segmentation set corresponding to each text.
And respectively inputting the word segmentation set corresponding to each text into an encoder to obtain a text vector of each text in the general text set and a text vector of the text in the field to be identified.
In practical application, taking each text in the text set, together with the text of the field to be recognized, as the expanded text of the field to be recognized to form an extended text set specifically includes:
inputting each text vector in the text vector set into a decoder to obtain the text corresponding to each text vector;
and determining the texts corresponding to the text vectors, together with the text of the field to be recognized, as the expanded texts of the field to be recognized, forming the extended text set.
In practical application, the training of the domain named entity recognition model by the active learning method according to the extended text set and the text feature vector corresponding to the domain to be recognized to obtain the trained domain named entity recognition model specifically includes:
under the current number of iterations, for any text in the extended text set, inputting the text feature vector corresponding to the field to be recognized and the text into the domain named entity recognition model to obtain the label sequence of the text and the prediction probability of each label for each word segment in the word segmentation set corresponding to the text; the label sequence of the text comprises the label corresponding to each word segment obtained by segmenting the text. Specifically, the text feature vector corresponding to the field to be recognized and the word segmentation set corresponding to the text are input into the domain named entity recognition model, yielding the label corresponding to each word segment and the prediction probability of each label for each word segment.
The information content of the text is determined according to the prediction probabilities corresponding to the prediction labels in the prediction label sequence of the text; the prediction label sequence of the text comprises the prediction label of each word segment in the word segmentation set corresponding to the text, the prediction label of any word segment being the label with the maximum prediction probability among all labels for that word segment.
And sequencing all texts in the extended text set in a descending order according to the information quantity of each text in the extended text set.
And selecting the first M texts to label the domain named entities to obtain the labeled texts.
And training the domain named entity recognition model according to the labeled text to obtain a domain named entity recognition model under the next iteration number, determining the unlabeled text in the extended text set as the extended text set under the next iteration number, and entering the next iteration until an iteration stop condition is reached to obtain the trained domain named entity recognition model.
The embodiment has the following technical effects:
1. Because the prior art must be further refined according to the business scenario when training a named entity recognition model, labeled data has to be collected first, which requires investing in manual annotation. Active learning is therefore proposed: part of the cost of manual labeling is replaced with the computing power of the model, and a high-accuracy model is obtained by labeling as few texts as possible.
2. The embodiment provides a text extension method aimed at named entity recognition in a specific field. Texts are converted into feature vectors and the distances between these feature vectors are calculated, so that the degree of correlation between a text and the field can be determined even when the corpus sources and the styles of the subject matter differ; the corpus data for the specific field is then expanded according to this correlation, realizing the function of finding text data for a specific direction within a large general data set.
3. The embodiment specifies the definition and calculation method of the optimal text in the active learning strategy and constructs a model iteration method based on the optimal text. The information content of each text is determined by calculating the amount of information it carries, and a text screening scheme is determined with information content as the measure. This effectively reduces the number of training samples required for model training in named entity recognition, improves training efficiency, and reduces training cost.
For the above method, the invention also provides a domain named entity recognition system based on active learning, comprising:
and the acquisition module is used for acquiring the general text set and the text of the field to be identified.
And the clustering module is used for clustering all texts in the universal text set according to the distance between each text in the universal text set and the text of the field to be recognized to obtain a text set.
And the expansion module is used for determining each text in the text set and the text of the field to be recognized as the expanded text of the field to be recognized to form an expanded text set.
The pre-training module is used for carrying out self-supervision learning on a pre-training model according to the extended text set to obtain a trained pre-training model and a text feature vector corresponding to the field to be recognized, and the pre-training model comprises a context encoder, a feedforward neural network and a text feature vector which are sequentially connectedsoftmaxAnd (3) a layer.
The construction module is used for constructing a domain named entity recognition model; the domain named entity recognition model comprises a context encoder and a tag decoder which are sequentially connected; the context encoder is a context encoder in the trained pre-training model.
And the training module is used for training the domain named entity recognition model according to the extended text set and the text feature vector corresponding to the field to be recognized by adopting an active learning method to obtain a trained domain named entity recognition model, and the trained domain named entity recognition model is used for recognizing the domain named entity of the text of the field to be recognized.
As an optional implementation manner, the clustering module specifically includes:
a text vector calculation unit, configured to determine the text vector of each text in the general text set and the text vector of the text of the field to be recognized;
a clustering unit, configured to cluster the text vectors of the texts in the general text set according to the distance between the text vector of each text in the general text set and the text vector of the text of the field to be recognized, to obtain a text vector set;
a text set determining unit, configured to determine the texts corresponding to the text vectors in the text vector set as the text set.
As an optional implementation manner, the text vector calculation unit specifically includes:
a word segmentation subunit, configured to segment each text in the general text set and the text of the field to be recognized respectively, to obtain the word set corresponding to each text;
a text vector calculation subunit, configured to input the word set corresponding to each text into an encoder respectively, to obtain the text vectors of the texts in the general text set and the text vector of the text of the field to be recognized.
As an optional implementation manner, the expansion module specifically includes:
an encoding unit, configured to input each text vector in the text vector set into a decoder respectively, to obtain the text corresponding to each text vector;
an expansion unit, configured to determine the texts corresponding to the text vectors and the text of the field to be recognized as the expanded texts of the field to be recognized, forming an expanded text set.
As an optional implementation manner, the training module specifically includes:
a probability determining unit, configured to, for any text in the expanded text set at the current iteration, input the text feature vectors corresponding to the field to be recognized and the text into the domain named entity recognition model, to obtain the tag sequence of the text and the prediction probability of each tag at each word of the word set corresponding to the text, where the tag sequence of the text comprises the tags corresponding to the words obtained by segmenting the text;
an information amount calculation unit, configured to determine the information amount of the text according to the prediction probabilities corresponding to the predicted tags in the predicted tag sequence of the text, where the predicted tag sequence comprises the predicted tag of each word in the word set corresponding to the text, and the predicted tag of any word is the tag with the maximum prediction probability among all tags at that word;
a sorting unit, configured to sort all texts in the expanded text set in descending order of their information amounts;
a labeling unit, configured to select the first M texts for domain named entity labeling, to obtain labeled texts;
a training unit, configured to train the domain named entity recognition model on the labeled texts to obtain the domain named entity recognition model for the next iteration, determine the unlabeled texts in the expanded text set as the expanded text set for the next iteration, and enter the next iteration until an iteration stop condition is reached, obtaining the trained domain named entity recognition model.
The invention provides a more specific embodiment for training a domain named entity recognition model, and the overall design of the embodiment is shown in FIG. 2.
As shown in fig. 2, the domain-oriented named entity recognition method based on active learning mainly includes three parts: 1) text feature extraction based on domain-adaptive pre-training; 2) a named entity recognition model based on deep learning; 3) a text screening algorithm based on active learning. Together, the three parts of the framework realize text feature representation, named entity recognition, active-learning text screening, and related functions.
In this embodiment, the number of pre-training texts is increased by corpus expansion of the domain text, and the domain text features are then extracted through pre-training. The pre-trained model is used to initialize the input vectors and context encoder of the named entity recognition model. The named entity recognition model maps the text into vectors and then maps the vectors onto entity labels to obtain the prediction probability of each entity; these prediction probabilities are then used to compute the text information amount in the active learning strategy. The active learning strategy selects the texts with the largest information amount for labeling. Finally, the labeled texts are used to update the model until it converges or a satisfactory criterion is reached, at which point the process ends, yielding a named entity recognition model optimized for the domain text.
1) Text feature extraction based on domain-adaptive pre-training:
First, each text \(x\) (the text of the field to be recognized and the general texts) is segmented into words. For an English data set, segmentation can be done on whitespace. For a Chinese data set, the jieba segmentation tool is used to segment each text into a word set, words without real meaning are removed using a common Chinese stop-word list, and a vocabulary of size V is finally generated. Each text is then represented as a vector of length V whose entries are the in-text frequencies of the corresponding vocabulary words (entries for words that do not appear in the text are 0), where \(x_n\) denotes the n-th word obtained by segmenting text \(x\).
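The segmentation and frequency-vector step above can be sketched as follows. This is a minimal sketch: the stop-word set is an illustrative stand-in for a full Chinese stop-word list, and `jieba.cut` is the segmentation call of the jieba tool named in the text.

```python
from collections import Counter

# Illustrative stand-in; in practice a full Chinese stop-word file would be loaded.
STOP_WORDS = {"的", "了", "和", "是", "在"}

def segment(text, is_chinese=True):
    """Segment a text: jieba for Chinese, whitespace for English."""
    if is_chinese:
        import jieba  # segmentation tool named in the text
        words = list(jieba.cut(text))
    else:
        words = text.split()
    return [w for w in words if w.strip() and w not in STOP_WORDS]

def frequency_vector(words, vocab):
    """Length-V vector of in-text word frequencies; 0 for absent words."""
    counts = Counter(words)
    return [counts.get(w, 0) for w in vocab]
```

A text is thus reduced to a fixed-length vector over the generated vocabulary, suitable for the distance measurements that follow.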
Second, the domain text is expanded. An encoder is constructed that encodes a text into an easily represented form which can be decoded back to the original real text as losslessly as possible. The word set of each text from the first step is fed into the encoder to obtain a high-dimensional vector containing the topic information of the text. This embodiment introduces a noise parameter when constructing the encoder, so that for each text the encoding covers the whole encoding space, with the encoding probability highest near the original encoding and decreasing with distance from it. Domain-related texts among the general texts are then identified by distance measurement on the encoder outputs: for each domain text, the scheme uses the kNN algorithm to retrieve the k nearest general texts as related texts, with Euclidean distance as the distance measure. The general-text vectors obtained by this clustering are fed into the decoder, decoded back into the original texts, and added to the domain texts, thus expanding the domain corpus.
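The kNN retrieval over encoder outputs described above can be sketched with plain NumPy. This is an illustrative sketch; the encoder is assumed to have already produced the text vectors.

```python
import numpy as np

def knn_related_texts(domain_vecs, general_vecs, k):
    """For each domain text vector, return the indices of the k nearest
    general text vectors under Euclidean distance (the union over all
    domain texts forms the related-text candidate set)."""
    related = set()
    for d in domain_vecs:
        dists = np.linalg.norm(general_vecs - d, axis=1)  # Euclidean distances
        related.update(np.argsort(dists)[:k].tolist())
    return sorted(related)
```

The returned indices select which general texts are decoded back and appended to the domain corpus.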
Third, the domain text is pre-trained. The trained pre-training model is obtained by running a self-supervised learning task on the expanded domain text, and text feature vectors suited to the domain are obtained during training. The self-supervised task of the pre-training model is random masking: some words are randomly masked in the input corpus and predicted from their context. Because a sentence is fed into the model several times for parameter learning, and so that the downstream task sees the masked original word with some probability, once a word has been chosen for masking it is replaced by [Mask] 80% of the time, replaced by any other word 10% of the time, and kept unchanged 10% of the time.
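The 80%/10%/10% masking scheme can be sketched as follows. This is a minimal sketch; the function and parameter names are illustrative, and `mask_prob` plays the role of the per-word masking decision.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Randomly choose tokens to corrupt; of the chosen tokens,
    80% become [Mask], 10% become a random other word, 10% stay
    unchanged. Returns the corrupted sequence and prediction targets
    (None where the model has nothing to predict)."""
    rng = rng or random.Random()
    out, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok  # the model must predict the original word here
            r = rng.random()
            if r < 0.8:
                out[i] = "[Mask]"
            elif r < 0.9:
                out[i] = rng.choice(vocab)
            # else: keep the original token
    return out, targets
```

Keeping 10% of chosen tokens unchanged is what lets the downstream task occasionally see the real word at a masked position.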
The input vector representation of the pre-training model is the element-wise sum of three feature vectors: a word vector, a position vector, and a segment vector. For Chinese, the word vector is the vector representation of a word; for English, it is the vector representation of each token after WordPiece segmentation. The position vector encodes the position information of the word as a feature vector, compensating for the loss incurred by abandoning the traditional RNN and CNN structures. The segment vector distinguishes two sentences, e.g., whether sentence B is the continuation of sentence A.
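The element-wise sum of the three embeddings can be sketched as below. The dimensions are illustrative and the embedding tables are randomly initialized here; in a real model these tables are learned parameters.

```python
import numpy as np

d = 8                                   # embedding dimension (illustrative)
vocab_size, max_len = 100, 32
rng = np.random.default_rng(0)
word_emb = rng.normal(size=(vocab_size, d))   # one row per vocabulary word
pos_emb = rng.normal(size=(max_len, d))       # one row per position
seg_emb = rng.normal(size=(2, d))             # sentence A vs. sentence B

def input_representation(token_ids, segment_ids):
    """Element-wise sum of word, position, and segment embeddings."""
    positions = np.arange(len(token_ids))
    return word_emb[token_ids] + pos_emb[positions] + seg_emb[segment_ids]
```

Each input token thus contributes one d-dimensional vector combining its identity, its position, and which sentence it belongs to.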
FIG. 3 illustrates the structure of the pre-training model. The lowest layer is the input vector representation, the element-wise sum of the three embeddings above; Trm denotes a Transformer unit and Ti the feature vector of the i-th character; the output layers are a feed-forward neural network and a Softmax layer. In the pre-training task, Ti is used to predict the original word, and the objective is to reduce the cross-entropy between the prediction and the real word.
2) Named entity recognition
As shown in fig. 4, the domain-named entity recognition model mainly includes two modules: a context encoder and a tag decoder.
The encoder still adopts the stacked Transformer structure of the pre-training model, and its initialization parameters all come from the pre-training model. The Transformer generates vectorized representations of input and output entirely through a self-attention mechanism, without relying on a traditional recurrent or convolutional network, which brings two advantages: first, the computation at one time step does not depend on the previous time step, so the model's parallelism can be exploited; second, the distance between any two positions in the sequence is constant, which mitigates the long-range dependence problem. Abandoning the traditional neural networks may cost the model some ability to capture local features, so position information of the characters is added to the input feature vector to compensate.
The main goal of the context encoder module is to map the input text into text feature vectors. Assume the original text is \(x = (x_1, x_2, \ldots, x_n)\); the mapped feature vector can be expressed as \(T = (T_1, T_2, \ldots, T_n)\), where \(T_i\) denotes the feature vector of the i-th token after segmenting text x. As noted above, Token1 in FIGS. 3 and 4 is the first token \(x_1\) of x. The initial word vectors are the text feature vectors obtained by pre-training on the domain text, and they are fine-tuned in the downstream task together with the Transformer parameters. The text feature vectors output by this module therefore contain not only the features of the text but also the relation between the text and the labels; when the domain named entity recognition model recognizes a text, they are used to judge whether each token of the text is related to a label. After passing through the context encoder, a new feature vector T is generated as the input of the tag decoder, and the final predicted label probabilities are obtained through the decoder.
The main goal of the tag decoder module is to map the text feature vectors T onto the entity tag sequence, which can be regarded as a multi-classification task. Suppose text \(x = (x_1, \ldots, x_n)\) has tag sequence \(y = (y_1, \ldots, y_n)\), and there are k labels in total. The text feature vector is reduced to k dimensions by a linear layer, and the prediction probability of each label at each token is obtained through the Softmax layer, where \(P(y_j \mid x_i)\) denotes the prediction probability of label \(y_j\) at token \(x_i\). The predicted tag of token \(x_i\) is the label with the maximum prediction probability:

\[
y_i^{*} = \arg\max_{j} P(y_j \mid x_i).
\]
The model will use a cross-entropy loss function to optimize the performance of the network.
For a given data set \(D = \{(x^{(m)}, y^{(m)})\}_{m=1}^{M}\), the objective function of the network is the cross-entropy loss

\[
\mathcal{L} = -\sum_{m=1}^{M} \sum_{i=1}^{n_m} \log P\left(y_i^{(m)} \mid x_i^{(m)}\right),
\]

and subsequent model training minimizes this objective.
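The linear layer, Softmax, per-token argmax, and cross-entropy objective described above can be sketched in NumPy. This is a minimal sketch with illustrative names; a real implementation would use a deep-learning framework with automatic differentiation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def decode(T, W, b):
    """Map text feature vectors T (n x d) to label probabilities.
    W (d x k) and b (k,) form the linear layer. Returns the k-dim
    probabilities P(y_j | x_i) per token and the argmax tags y_i*."""
    probs = softmax(T @ W + b)               # linear layer, then Softmax
    pred_tags = probs.argmax(axis=-1)        # y_i* = argmax_j P(y_j | x_i)
    return probs, pred_tags

def cross_entropy(probs, gold_tags):
    """Cross-entropy loss summed over the tokens of one text."""
    n = len(gold_tags)
    return -np.log(probs[np.arange(n), gold_tags]).sum()
```

The same per-token probabilities feed the information-amount computation of the active learning strategy below.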
The network realizes that under the condition of considering text field characteristics, sequence characteristics and text-label mapping, the context characteristic vector of the text is extracted by adopting a network structure of a self-attention mechanism, the mapping relation between the context characteristic vector and the label sequence is mined by taking reduction of reality and prediction of label difference as targets, and thus the task of named entity identification is completed.
3) Text screening method based on active learning
The main aim of the screening method is to reach the highest model accuracy with as little labeled data as possible. Through the tag decoder above, for a text \(x\) we can obtain the predicted tag sequence \(y^{*} = (y_1^{*}, \ldots, y_n^{*})\) and the probability \(P(y_i^{*} \mid x_i)\) of each predicted label. For each text x in the unlabeled set U, the information amount \(I(x)\) is defined as

\[
I(x) = -\log P(y^{*} \mid x),
\]

where \(P(y^{*} \mid x)\) is the probability of the predicted tag sequence of text x. The predicted tag probabilities output by the tag decoder are assumed independent of each other, so

\[
P(y^{*} \mid x) = \prod_{i=1}^{n} P(y_i^{*} \mid x_i), \qquad
I(x) = -\sum_{i=1}^{n} \log P(y_i^{*} \mid x_i).
\]

So that the strategy does not favor longer sequences, the above is normalized by the sequence length:

\[
I(x) = -\frac{1}{n} \sum_{i=1}^{n} \log P(y_i^{*} \mid x_i).
\]

The best text defined by the policy is

\[
x^{*} = \arg\max_{x \in U} I(x),
\]

i.e., the Maximum Normalized Log-Probability (MNLP) criterion. The texts are then sorted in descending order of \(I(x)\); after each round of model training converges, the top several texts are selected for labeling and used in the next round of learning.
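The normalized information amount and descending sort can be sketched as follows. This is a minimal sketch; `token_probs` stands for the per-token probabilities \(P(y_i^{*} \mid x_i)\) produced by the tag decoder.

```python
import numpy as np

def information_amount(token_probs):
    """MNLP information amount of one text: the negative mean log
    probability of its predicted tags, so that the strategy does
    not favor longer sequences."""
    p = np.asarray(token_probs)          # P(y_i* | x_i) for each token
    return -np.log(p).mean()

def rank_by_information(unlabeled):
    """Sort texts in descending order of information amount.
    `unlabeled` maps a text id to its predicted-tag probabilities."""
    scored = {t: information_amount(p) for t, p in unlabeled.items()}
    return sorted(scored, key=scored.get, reverse=True)
```

Texts whose predicted tags the model is least sure of come first and are labeled first.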
After the text information amount is defined, the information amount of each unlabeled text can be measured, so that the best texts are screened out for model updating. The model update in this scheme is a full update: all labeled texts so far are used as the training set for the next round, and each round starts training from the best model saved in the previous round. FIG. 5 illustrates the flow of model updates in the active learning process.
First, the model is trained on the labeled text until convergence. The unlabeled texts are then divided into batches of size B, the predicted entity-label probabilities of each batch are obtained from the model, and the information amount of each text is computed. All texts are then sorted by information amount from large to small, the top M are selected for labeling and added to the labeled set, and the next round of training begins. This repeats until a stopping criterion is reached, finally yielding a named entity recognition model of satisfactory precision with as few labeled texts as possible.
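The update flow above can be sketched as a skeleton loop. All components here (`model`, `info`, `annotate`, `stop`) are caller-supplied placeholders standing in for the trained model, the information-amount function, the human labeling step, and the stopping criterion; this is not an implementation of the patent's model itself.

```python
def active_learning_loop(model, labeled, unlabeled, M, B, stop, info, annotate):
    """Skeleton of the model-update flow: train to convergence, score
    the unlabeled pool in batches of size B, label the top-M texts,
    and repeat until the stopping criterion holds."""
    while not stop(model, labeled):
        model.fit(labeled)                        # full update on all labeled texts
        scores = {}
        pool = list(unlabeled)
        for i in range(0, len(pool), B):          # process the pool in batches of B
            for text in pool[i:i + B]:
                scores[text] = info(model.predict_proba(text))
        top = sorted(scores, key=scores.get, reverse=True)[:M]
        labeled += [annotate(t) for t in top]     # human labeling step
        unlabeled -= set(top)
    return model
```

Each round grows the labeled set by the M most informative texts and shrinks the unlabeled pool accordingly.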
The stopping criterion depends on the specific situation: for example, the model's F-score reaches a satisfactory level, or the cost of further labeling far exceeds the resulting improvement in model precision. In practice, evaluation is generally performed at 25%, 50%, 75%, and 100% of the training set; if the last 25% brings no obvious improvement in F-score, labeling stops.
The invention has the following technical effects:
The domain-oriented named entity recognition method based on active learning provided by this embodiment extracts text context features through domain text pre-training, realizes entity prediction through the named entity recognition model, screens out the best texts for labeling according to the active learning strategy, and updates the model accordingly.
Word vectors are used to evaluate the distance between data in the general text data set and the domain-specific corpus, realizing automatic expansion of the domain-specific data set with a multi-source general data set and automatic migration of the general model to a domain-specific model.
The invention provides an automatic feature migration mode based on the difference between domain text and general text, used for extracting and representing domain text features. At the same time, a deep-learning-based named entity recognition algorithm establishes the context relations of the text features and maps them onto entity labels to complete the named entity recognition task, while the active learning strategy selects the most informative texts for labeling. An active learning framework for domain-oriented named entity recognition is thus constructed, obtaining a high-precision model by labeling as few texts as possible. The invention solves the problem of automatically migrating general text features to a specific domain and fitting a deep model with as few annotated texts as possible.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principle and embodiments of the present invention are explained herein by specific examples; the above description of the embodiments is only intended to help understand the method and core idea of the invention. Meanwhile, for a person skilled in the art, the specific embodiments and the application range may be changed according to the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the invention.

Claims (10)

1. A domain named entity recognition method based on active learning is characterized by comprising the following steps:
acquiring a general text set and a text of a field to be identified;
clustering the texts in the general text set according to the distance between the texts in the general text set and the texts in the field to be recognized to obtain a text set;
determining each text in the text set and the text of the field to be recognized as the text after the field to be recognized is expanded to form an expanded text set;
performing self-supervised learning on a pre-training model according to the expanded text set to obtain a trained pre-training model and text feature vectors corresponding to the field to be recognized, the pre-training model comprising a context encoder, a feed-forward neural network and a softmax layer connected in sequence;
constructing a domain named entity recognition model; the domain named entity recognition model comprises a context encoder and a tag decoder which are sequentially connected; the context encoder is a context encoder in the trained pre-training model;
and training the domain named entity recognition model according to the extended text set and the text feature vector corresponding to the domain to be recognized by adopting an active learning method to obtain a trained domain named entity recognition model, wherein the trained domain named entity recognition model is used for performing domain named entity recognition on the text of the domain to be recognized.
2. The method for recognizing a domain named entity based on active learning according to claim 1, wherein the clustering of the texts in the generic text set according to the distance between the texts in the generic text set and the text of the domain to be recognized specifically comprises:
determining a text vector of each text in the general text set and a text vector of the text of the field to be recognized;
clustering the text vectors of the texts in the general text to obtain a text vector set according to the distance between the text vector of each text in the general text set and the text vector of the text in the field to be recognized;
and determining texts corresponding to the text vectors in the text vector set as a text set.
3. The method as claimed in claim 2, wherein the determining the text vector of each text in the generic text set and the text vector of the text in the domain to be recognized specifically includes:
respectively performing word segmentation on each text in the general text set and the text in the field to be recognized to obtain a word segmentation set corresponding to each text;
and respectively inputting the word segmentation sets corresponding to the texts into an encoder to obtain the text vectors of the texts in the general text set and the text vectors of the texts in the field to be recognized.
4. The method according to claim 3, wherein the determining of each text in the text set and the text of the field to be recognized as the text after the field to be recognized is expanded to form an expanded text set comprises:
inputting each text vector in the text vector set into a decoder respectively to obtain a text corresponding to each text vector;
and determining texts corresponding to the text vectors and the texts in the field to be recognized as the texts in the field to be recognized after being expanded to form an expanded text set.
5. The method for recognizing the domain-named entity based on active learning according to claim 1, wherein the method for active learning is used for training the domain-named entity recognition model according to the extended text set and the text feature vector corresponding to the domain to be recognized to obtain the trained domain-named entity recognition model, and specifically comprises:
under the current iteration times, inputting a text feature vector corresponding to a field to be recognized and the text into the field named entity recognition model to obtain a label sequence of the text and the prediction probability of each label in the label sequence under each participle in a participle set corresponding to the text for any text in the extended text set; the label sequence of the text comprises labels corresponding to the participles of the text after the participles are participled;
determining the information content of the text according to the prediction probabilities corresponding to all the prediction labels in the prediction label sequence of the text; the prediction label sequence of the text comprises prediction labels of all participles in a participle set corresponding to the text, and the prediction label of any participle is a label corresponding to the maximum prediction probability of all labels in the prediction probabilities under the participles;
sorting all texts in the extended text set in a descending order according to the information content of each text in the extended text set;
selecting the first M texts to label the domain named entities to obtain the labeled texts;
and training the domain named entity recognition model according to the labeled text to obtain the domain named entity recognition model under the next iteration number, determining the unlabeled text in the expanded text set as the expanded text set under the next iteration number, and entering the next iteration until an iteration stop condition is reached to obtain the trained domain named entity recognition model.
6. A domain-named entity recognition system based on active learning, comprising:
the acquisition module is used for acquiring a general text set and a text of a field to be identified;
the clustering module is used for clustering the texts in the general text set according to the distance between the texts in the general text set and the texts in the field to be identified to obtain a text set;
the expansion module is used for determining each text in the text set and the text of the field to be recognized as the text after the field to be recognized is expanded to form an expanded text set;
the pre-training module is used for performing self-supervised learning on the pre-training model according to the expanded text set to obtain the trained pre-training model and the text feature vectors corresponding to the field to be recognized, the pre-training model comprising a context encoder, a feed-forward neural network and a softmax layer connected in sequence;
the construction module is used for constructing a domain named entity recognition model; the domain named entity recognition model comprises a context encoder and a tag decoder which are sequentially connected; the context encoder is a context encoder in the trained pre-training model;
and the training module is used for training the domain named entity recognition model according to the extended text set and the text feature vector corresponding to the field to be recognized by adopting an active learning method to obtain a trained domain named entity recognition model, and the trained domain named entity recognition model is used for recognizing the domain named entity of the text of the field to be recognized.
7. The system according to claim 6, wherein the clustering module specifically comprises:
the text vector calculation unit is used for determining the text vectors of all texts in the general text set and the text vectors of the texts in the field to be recognized;
the clustering unit is used for clustering the text vectors of the texts in the general text to obtain a text vector set according to the distance between the text vector of each text in the general text set and the text vector of the text in the field to be identified;
and the text set determining unit is used for determining the text corresponding to each text vector in the text vector set as a text set.
8. The system according to claim 7, wherein the text vector calculation unit specifically includes:
the word segmentation subunit is used for respectively segmenting each text in the general text set and the text in the field to be recognized to obtain a word segmentation set corresponding to each text;
and the text vector calculation subunit is used for respectively inputting the word segmentation sets corresponding to the texts into an encoder to obtain the text vectors of the texts in the general text set and the text vectors of the texts in the field to be recognized.
9. The system of claim 8, wherein the expansion module comprises:
the encoding unit is used for respectively inputting each text vector in the text vector set into a decoder to obtain a text corresponding to each text vector;
and the expansion unit is used for determining the texts corresponding to the text vectors and the texts in the field to be recognized as the expanded texts in the field to be recognized to form an expanded text set.
10. The system according to claim 6, wherein the training module specifically comprises:
a probability determining unit, configured to, for any text in the extended text set, input a text feature vector corresponding to a field to be identified and the text into the field named entity identification model under the current iteration number to obtain a tag sequence of the text and a prediction probability of each tag in the tag sequence under each participle in a participle set corresponding to the text; the label sequence of the text comprises labels corresponding to the participles of the text after the participles are participled;
the information quantity calculating unit is used for determining the information quantity of the text according to the prediction probabilities corresponding to all the prediction labels in the prediction label sequence of the text; the prediction label sequence of the text comprises prediction labels of all participles in a participle set corresponding to the text, and the prediction label of any participle is a label corresponding to the maximum prediction probability of all labels in the prediction probabilities under the participles;
the sorting unit is used for sorting all texts in the extended text set in a descending order according to the information quantity of each text in the extended text set;
the marking unit is used for selecting the first M texts to mark the domain named entities to obtain marked texts;
and the training unit is used for training the domain named entity recognition model according to the labeled text to obtain the domain named entity recognition model under the next iteration number, determining the text which is not labeled in the extended text set as the extended text set under the next iteration number, and entering the next iteration until an iteration stop condition is reached to obtain the trained domain named entity recognition model.
CN202211092071.XA 2022-09-08 2022-09-08 Method and system for identifying domain named entities based on active learning Active CN115186670B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211092071.XA CN115186670B (en) 2022-09-08 2022-09-08 Method and system for identifying domain named entities based on active learning


Publications (2)

Publication Number Publication Date
CN115186670A true CN115186670A (en) 2022-10-14
CN115186670B CN115186670B (en) 2023-01-03

Family

ID=83522463

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211092071.XA Active CN115186670B (en) 2022-09-08 2022-09-08 Method and system for identifying domain named entities based on active learning

Country Status (1)

Country Link
CN (1) CN115186670B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116070700A (en) * 2023-02-02 2023-05-05 北京交通大学 Biomedical relation extraction method and system integrating iterative active learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800766A (en) * 2021-01-27 2021-05-14 华南理工大学 Chinese medical entity identification and labeling method and system based on active learning
WO2021218024A1 (en) * 2020-04-29 2021-11-04 平安科技(深圳)有限公司 Method and apparatus for training named entity recognition model, and computer device
CN113919358A (en) * 2021-11-03 2022-01-11 厦门市美亚柏科信息股份有限公司 Named entity identification method and system based on active learning
CN114266254A (en) * 2021-12-24 2022-04-01 上海德拓信息技术股份有限公司 Text named entity recognition method and system
WO2022095682A1 (en) * 2020-11-04 2022-05-12 腾讯科技(深圳)有限公司 Text classification model training method, text classification method and apparatus, device, storage medium, and computer program product

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu, Tong et al.: "Named Entity Recognition for Emergency Plans Based on Semi-supervised Learning and CRF", 《软件导刊》 (Software Guide) *
Zhang, Lei: "Research on Named Entity Recognition Methods for Specific Domains", 《计算机与现代化》 (Computer and Modernization) *

Also Published As

Publication number Publication date
CN115186670B (en) 2023-01-03

Similar Documents

Publication Publication Date Title
CN111694924B (en) Event extraction method and system
CN108959252B (en) Semi-supervised Chinese named entity recognition method based on deep learning
CN108984526B (en) Document theme vector extraction method based on deep learning
CN109992782B (en) Legal document named entity identification method and device and computer equipment
CN112541355B (en) Entity boundary type decoupling few-sample named entity recognition method and system
CN109753660B (en) LSTM-based winning bid web page named entity extraction method
CN111291185A (en) Information extraction method and device, electronic equipment and storage medium
CN112183064B (en) Text emotion reason recognition system based on multi-task joint learning
CN113591483A (en) Document-level event argument extraction method based on sequence labeling
CN114298158A (en) Multi-mode pre-training method based on image-text linear combination
WO2022198750A1 (en) Semantic recognition method
CN109960728A (en) A kind of open field conferencing information name entity recognition method and system
CN110263325A (en) Chinese automatic word-cut
CN110555084A (en) remote supervision relation classification method based on PCNN and multi-layer attention
CN111274804A (en) Case information extraction method based on named entity recognition
CN111475622A (en) Text classification method, device, terminal and storage medium
CN112966525B (en) Law field event extraction method based on pre-training model and convolutional neural network algorithm
CN110580287A (en) Emotion classification method based ON transfer learning and ON-LSTM
CN116432655B (en) Method and device for identifying named entities with few samples based on language knowledge learning
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN113961666B (en) Keyword recognition method, apparatus, device, medium, and computer program product
CN114911945A (en) Knowledge graph-based multi-value chain data management auxiliary decision model construction method
CN111177402A (en) Evaluation method and device based on word segmentation processing, computer equipment and storage medium
CN115831102A (en) Speech recognition method and device based on pre-training feature representation and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant