CN115130475A - Extensible universal end-to-end named entity identification method - Google Patents

Extensible universal end-to-end named entity identification method

Info

Publication number
CN115130475A
Authority
CN
China
Prior art keywords
entity
text
model
word
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210617397.3A
Other languages
Chinese (zh)
Inventor
李祥学
李轩舟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN202210617397.3A priority Critical patent/CN115130475A/en
Publication of CN115130475A publication Critical patent/CN115130475A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Abstract

The invention discloses an extensible universal end-to-end named entity recognition method comprising a text preprocessing process, a model M, a training process, and a prediction and entity parsing process using model M, where model M consists of an input layer, a context coding layer and a graph modeling layer. Text preprocessing receives a text input and an entity category and generates an input sequence. Training the model comprises acquiring a data set, converting it into a training set, and training the model on that set for multiple rounds. After training, the input sequence produced by text preprocessing is fed into model M, the graph modeling layer of model M computes the connection relations between words, and finally the entities identified in the graph are parsed out. The method handles recognition under entity overlap and entity discontinuity, and adapts to requirement changes such as newly added entity categories without modifying the model structure, making it easy to extend and to transfer across domains.

Description

Extensible universal end-to-end named entity identification method
Technical Field
The invention relates to the technical field of natural language processing, in particular to an extensible universal end-to-end named entity identification method.
Background
Named Entity Recognition (NER) is an important component of natural language processing. It refers to the process of identifying, in text, the names or symbols of things with specific meanings; named entities mainly include names of people, places and organizations, dates, proper nouns, and the like. Many downstream NLP tasks and applications rely on NER for information extraction, such as question answering, relation extraction, event extraction, and entity linking. The more accurately the named entities in a text are recognized, the better a computer can understand the semantics of the language and execute its tasks, improving the human-computer interaction experience.
The Transformer was proposed in the 2017 Google Brain paper "Attention Is All You Need". It is built from attention mechanisms and fully connected layers; unlike the serial computation structure of recurrent neural networks, it can fully exploit parallel computation.
Deep-neural-network-based named entity recognition methods generally treat NER as a multi-class classification task or a sequence labeling task, and can be divided into an input representation layer, a context coding layer, and a label decoding process. Input representation can be character-level, word-level, or mixed, depending on the coding object, and yields a vector representation of each word. Semantic coding generally applies a deep neural network, such as a bidirectional long short-term memory (BiLSTM) network, Bidirectional Encoder Representations from Transformers (BERT), and the like, so that the vector of each word in the text contains contextual information. Label decoding is done by a classifier, which usually uses a fully connected neural network with a Softmax layer, or a conditional random field with the Viterbi algorithm, to derive the label of each word.
Sequence-labeling named entity recognition models mostly use a CRF as the label decoding layer, performing global optimization by adding a label transition score matrix and sequence-level prediction scores; however, when the number of labels is large, CRF performance drops noticeably and its time complexity is high. Span-based recognition methods appeared later: computing the start and end positions of entities can handle recognition under entity overlap, but cannot identify discontinuous entities.
In practice, named entity recognition scenarios are often accompanied by corpus shortage and changing requirements; for example, a requirement change may demand one more entity category. One possible approach is to initialize the model's weights with the trained weights of a model from a similar scenario task, reusing the knowledge that model has learned. However, most named entity recognition models have a final classification layer specially designed for one application scenario, and different numbers of entity categories lead to different output dimensions of that layer. Because of the difference in task scenarios, the output dimensions of the two models' last layers will most likely differ, so the last layer must be discarded and only the learned weights of the earlier layers used as the model's training starting point.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide an extensible universal end-to-end named entity recognition method that can be applied, without modifying the model, to task scenarios with different entity categories or with changing requirements, and is therefore very easy to extend and to migrate to other fields. After the model has been trained on another task scenario, training on the next scenario's task can proceed directly, without discarding a scenario-dependent final classification layer, so the knowledge the model learned from other tasks is retained. When requirements change, for example when several entity categories are newly added, the model need not be modified and retrained; only training data for the newly added categories must be provided.
The method is suitable for recognizing discontinuous entities as well as overlapping entities.
The specific technical scheme for realizing the purpose of the invention is as follows:
an extensible universal end-to-end named entity recognition method comprises the following specific steps:
step 1: the text preprocessing process generates an input sequence, and specifically comprises the following steps:
receiving a text input and an entity category, adding a symbol at the head and at the tail of the text respectively, and appending the entity category after the tail symbol;
segmenting the text, with the head/tail symbols and entity category added, to obtain a word sequence;
mapping the word sequence to numbers, where the mapping between numbers and words is one-to-one and satisfies a bijective relation, and outputting the resulting number sequence as the input sequence;
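The preprocessing steps above can be sketched as follows (a minimal sketch of ours, not the patent's reference code; the symbol names [S]/[E] follow the embodiment, and each character is treated as one word as in the embodiment):

```python
def preprocess(text, category, vocab):
    """Add head/tail symbols, append the entity category, tokenize
    (one character = one word), and map words to numbers through a
    bijective word<->id dictionary that grows on first sight."""
    tokens = ["[S]"] + list(text) + ["[E]"] + list(category)
    ids = []
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)  # assign the next unused id
        ids.append(vocab[tok])
    return ids, tokens
```

Because each distinct word receives exactly one id and each id maps back to one word, the mapping is a bijection, as step 1 requires.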
step 2: constructing a model M, comprising:
using a context coding layer to receive the input sequence produced by the text preprocessing process, generating a group of word vectors via a self-attention mechanism, and discarding the word vectors corresponding to the entity category name;
modeling the directed connection relation between words with a directed connection graph: computing the directed connection graph between the words from the word vector group, representing the graph as a matrix, and outputting the matrix-represented graph;
Step 3: training the model M;
Step 4: predicting with the model M;
Step 5: performing entity parsing on the model output of step 4, specifically:
receiving the graph output by model M and traversing it from the head symbol; excluding the path where the head symbol connects directly to the tail symbol, every path that starts at the head symbol and ends at the tail symbol corresponds to one entity of the queried category, whose words are given by the path's node sequence in order; the set of parsed entities is output.
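The path traversal of step 5 can be sketched as a depth-first search over a thresholded adjacency matrix (our own minimal sketch; the 0.5 threshold is an assumption, not stated in the patent):

```python
def parse_entities(adj, tokens, threshold=0.5):
    """Enumerate every path from the head symbol (index 0) to the
    tail symbol (last index); the intermediate words of each path
    form one recognized entity. The direct head->tail edge is skipped."""
    n = len(tokens)
    end = n - 1
    entities = []

    def dfs(node, path):
        for nxt in range(node + 1, n):
            if adj[node][nxt] >= threshold:
                if nxt == end:
                    if path:  # ignore the direct [S] -> [E] path
                        entities.append("".join(tokens[i] for i in path))
                else:
                    dfs(nxt, path + [nxt])

    dfs(0, [])
    return entities
```

Because an edge may skip over positions, discontinuous entities appear simply as paths with gaps, and overlapping entities share a common path prefix or suffix.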
Step 2, modeling the directed connection relation between the entity words by using the directed connection graph, and calculating the directed connection graph between each word by using the word vector group, wherein the method specifically comprises the following steps:
if a word is the beginning of an entity, establishing a directed edge from the sentence-start symbol to that word;
if, within an entity, a word B follows a word A, establishing a directed edge from word A to word B;
if a word is the end of an entity, establishing a directed edge from that word to the tail symbol;
the words except the head and tail symbols are called intermediate words, and the corresponding word vectors are intermediate word vectors;
calculating the connection relation between the start symbol and each intermediate word from the head-symbol word vector and the intermediate word's vector, representing the probability that the intermediate word begins an entity;
calculating the connection relation between any two intermediate words from their word vectors;
calculating the connection relation between each intermediate word and the sentence-end symbol from the tail-symbol word vector and the intermediate word's vector, representing the probability that the intermediate word ends an entity;
and after the calculation is finished, obtaining a directed connection graph represented by a matrix between the words.
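The edge rules above also define how a ground-truth adjacency matrix is built from labeled entities. A sketch (the helper and its position-list representation are our own assumptions; positions index tokens after the head symbol has been prepended at index 0):

```python
def gold_adjacency(n, entity_positions):
    """Build a 0/1 adjacency target. n counts all tokens including
    the head symbol (index 0) and tail symbol (index n-1). Each
    entity is a list of word positions, possibly non-contiguous."""
    E = [[0] * n for _ in range(n)]
    head, tail = 0, n - 1
    for pos in entity_positions:
        E[head][pos[0]] = 1              # head symbol -> first word
        for a, b in zip(pos, pos[1:]):   # word -> next word in entity
            E[a][b] = 1
        E[pos[-1]][tail] = 1             # last word -> tail symbol
    return E
```

With the embodiment's worked example (entities at positions 4-5 and 12-13 in an 18-token sequence), this produces exactly the six 1-entries described later in the text.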
The training model M specifically comprises the following steps:
acquiring a labeled data set, wherein each piece of data in the data set comprises a text t and a label y, all entity category names contained in the text and a corresponding entity set are recorded in the label y, and if the text t does not contain any type of entity, the label y is empty;
converting the data set to a training set:
defining the set S of all entity category names appearing in the data set, and letting S contain n entity category names; for each piece of data (t, y) in the data set, where t is the text and y is the label, and for each category s in S: if label y contains entities of category s, that is, text t contains a non-empty entity set e belonging to category s, then category s together with entity set e forms a label y', and the text t with label y' becomes one piece of training data; if text t contains no entity of category s, then category s together with an empty entity set e' forms a label y', and the text t with label y' becomes one piece of training data;
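The conversion rule can be sketched as follows (a minimal sketch under a data representation of our own choosing, where a label is a dict mapping category names to entity lists):

```python
def to_training_set(dataset, categories):
    """Expand each (text, label) pair into one training sample per
    category: (text, category, entity set). The entity set is empty
    when the text contains no entity of that category."""
    train = []
    for text, label in dataset:
        for s in categories:
            train.append((text, s, label.get(s, [])))
    return train
```

Note that every text yields one sample per category, so the model also sees explicit negative evidence (category present in the input, empty entity set as target).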
performing multiple rounds of training on the model M using the training set, each round of training comprising:
dividing the data of the training set into a plurality of batches, extracting a batch of data from the training set each time, and generating a true value of an adjacency matrix of the batch of data by using an entity set in a label for each extracted batch of data;
for each piece of data, processing the entity categories in the text and the label in the piece of data by using the text preprocessing process in the step 1 to generate an input sequence;
inputting the input sequence into a model, calculating the connection relation among all words including symbols by the model, and outputting an adjacency matrix;
and finally, calculating loss by using a matrix predicted by the model and a true value matrix generated by the label, and updating the weight of the model according to the loss.
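The loss of the last step is, per the embodiment, binary cross-entropy between the predicted and true adjacency matrices. A numpy sketch; restricting the loss to the strictly upper-triangular entries is our reading of the embodiment's statement that only the part above the main diagonal is used:

```python
import numpy as np

def adjacency_bce(pred, target, eps=1e-7):
    """Binary cross-entropy over the strictly upper-triangular
    entries of the predicted (probabilities in (0,1)) and true
    (0/1) adjacency matrices."""
    iu = np.triu_indices(pred.shape[0], k=1)
    p = np.clip(pred[iu], eps, 1 - eps)
    t = target[iu]
    return float(np.mean(-(t * np.log(p) + (1 - t) * np.log(1 - p))))
```

A near-perfect prediction yields a loss close to 0, while an inverted prediction is heavily penalized.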
The prediction by using the model M specifically comprises the following steps:
inputting a piece of text from which entities are to be extracted; the input includes no labels or other information;
selecting an entity category, i.e. the category of entities to search for in the text;
inputting the text and the entity type into the text preprocessing process in the step 1 to obtain an input sequence;
the input sequence is input into a model M, which outputs a graph of the adjacency matrix representation.
Compared with the prior art, the invention provides an extensible universal end-to-end named entity recognition method that suits different task scenarios without modifying the model, and therefore migrates easily to other fields. After the model has been trained on other task scenarios, no scenario-dependent final classification layer needs to be discarded and no modification of the model is needed; it can be trained directly on the next scenario's task, so the knowledge learned from other tasks is reused. When requirements change, for example when several entity categories are newly added, a conventional model must have the output of its final classification layer modified and be retrained, whereas the model of this method needs no modification and only requires training data for the newly added categories. The method models the connection relations between entity words with a directed graph, and is suitable for recognizing discontinuous entities as well as overlapping entities.
Drawings
FIG. 1 is a block diagram of a named entity recognition model according to an embodiment of the present invention;
FIG. 2 is a text pre-processing process flow diagram of an embodiment of the invention;
FIG. 3 is a flow chart of an embodiment of the present invention for converting a data set to a training set;
FIG. 4 is a training flow diagram of a method of named entity recognition in accordance with an embodiment of the present invention;
FIG. 5 is a flow diagram of a method of named entity identification, according to an embodiment of the invention.
Detailed Description
Examples
The method of the present invention is illustrated below. For simplicity, the word segmenter treats each Chinese character as a word, and [S] and [E] are used as the head and tail characters:
the embodiment provides an extensible universal end-to-end named entity identification method, which comprises the following steps:
(1) The text preprocessing program receives two inputs, an input text and an entity category, as shown in FIG. 2. The symbols [S] and [E] are added at the beginning and end of the text respectively; the entity category serves as a prompt telling the model which category of entities to extract from the input text. The text with head/tail symbols added, together with the entity category, is converted into a word sequence through word segmentation and dictionary mapping; the words are mapped to numbers one-to-one, satisfying a bijective relation, and the resulting number sequence is used as the input sequence;
(2) Build the model. As shown in FIG. 1, the model is divided into an input layer, a context coding layer and a graph modeling layer; in this embodiment BERT serves as the context coding layer. The input sequence produced by the text preprocessing program is fed into the context coding layer; each word vector in the word vector group output by the BERT layer is 768-dimensional and is reduced to 64 dimensions through a fully connected layer. Finally, the connection relations between words are computed from the word vector group, with a Sigmoid function as the activation function of the last layer, modeling a directed connection graph represented by an adjacency matrix; the main diagonal and the part below it are set to 0, and only the part above the main diagonal is used;
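The graph modeling computation of this step can be sketched in numpy, with random vectors standing in for the BERT output. The 768-to-64 projection, the sigmoid activation, and the upper-triangle restriction are from the embodiment; the bilinear form of the pairwise score is our own assumption, since the patent does not fix the exact pairwise scoring function:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_bert, d_proj = 18, 768, 64

H = rng.normal(size=(seq_len, d_bert))           # stand-in for BERT outputs
W_down = rng.normal(size=(d_bert, d_proj)) * 0.01
U = H @ W_down                                    # reduce 768 -> 64

W_pair = rng.normal(size=(d_proj, d_proj)) * 0.01
scores = U @ W_pair @ U.T                         # pairwise word-to-word scores
probs = 1.0 / (1.0 + np.exp(-scores))             # Sigmoid activation

adj = np.triu(probs, k=1)  # zero the main diagonal and below; keep upper part
```

The resulting `adj` is the matrix-represented directed connection graph: entry (i, j) with i < j is the predicted probability of a directed edge from word i to word j.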
(3) Acquire a labeled data set. In most currently public data sets, each piece of data comprises a text t and a label y; the label y records all entity categories contained in the text and the corresponding entity sets, and if the text t contains no entity of any category, the label y is empty;
(4) As shown in FIG. 3, convert the data set into a training set: define the set S of all entity category names appearing in the data set, and let S contain n entity category names. For each piece of data (t, y) in the data set, where t is the text and y is the label, and for each category s in S: if label y contains entities of category s, that is, text t contains a non-empty entity set e belonging to category s, then category s together with entity set e forms a label y', and the text t with label y' becomes one piece of training data; if text t contains no entity of category s, then category s together with an empty entity set e' forms a label y', and the text t with label y' becomes one piece of training data;
(5) as shown in fig. 4, the set-up model is subjected to multiple rounds of training using a training set, each round of training comprising:
(51) dividing the data of the training set into a plurality of batches, extracting a batch of data from the training set each time, and generating a true value of the adjacency matrix by using an entity set in the label for each extracted data;
(52) processing each piece of data with the text preprocessing program: adding the [S] symbol at the head of the text and the [E] symbol at the tail, taking the category name out of the piece's label and splicing it after [E], then segmenting the text with the special symbols and category name added, and generating the input sequence;
(53) inputting the input sequence into a model, calculating the connection relation among all words including special characters by the model, and outputting an adjacency matrix;
(54) and finally, calculating the binary cross entropy loss by using a matrix predicted by the model and a true value matrix generated by the label, and updating the weight of the model according to the loss.
(6) After model training is finished, the model is used for prediction, and the specific process is as follows:
(61) selecting a piece of text to extract an entity therein;
(62) selecting an entity category to determine an entity category to search for in the text, which may be an entity category that has not been present in the training set;
(63) inputting the text and the entity type into a text preprocessing process to obtain an input sequence;
(64) inputting the input sequence into a model M;
(65) the entities identified in the graph are parsed using an entity parsing process.
Through steps (1) to (5) the model is trained. When performing entity recognition on a text, each entity category in the category set must be queried once in turn; each time, the model outputs only the entities belonging to that category, and parsing the paths of the graph output by the model yields the entities the model has recognized in the text.
The invention also takes the named-entity category as part of the input, the aim being to extract the entities of that category from the text. Take the input text "Xiaoming works at A-Soft and Xiaohong works at B-Song." as an example. After adding the special head and tail characters [S] and [E], the category name "company" is spliced after the text, so that the text becomes "[S]Xiaoming works at A-Soft and Xiaohong works at B-Song.[E]company". "A-Soft" and "B-Song" are the entities of category "company" contained in this text, and the target of the model is to output "A-Soft" and "B-Song". Indexing from 0, in the character sequence "[S]Xiaoming works at A-Soft and Xiaohong works at B-Song.[E]", "A" and "Soft" are at positions 4 and 5, "B" and "Song" are at positions 12 and 13, and the two special characters "[S]" and "[E]" are at positions 0 and 17. The adjacency matrix E is an 18x18 matrix in which E(0,4), E(4,5) and E(5,17) have value 1, and E(0,12), E(12,13) and E(13,17) are likewise 1. E(0,4) and E(0,12) denote that entities begin with "A" and "B" respectively; E(5,17) and E(13,17) indicate "Soft" and "Song" as ending words of entities; and E(4,5) and E(12,13) indicate that the word after "A" is "Soft" and the word after "B" is "Song". The first row of the matrix records the beginning words of the entities, and the last column records the ending words. Although "Xiaoming" and "Xiaohong" belong to the person-name category, when the input category is "company" these two words should not appear in the output graph.
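The worked example above can be written down directly as a check (our own verification snippet; token indices follow the description):

```python
import numpy as np

# 18 tokens: [S] at index 0, [E] at 17;
# "A","Soft" at positions 4,5 and "B","Song" at positions 12,13.
E = np.zeros((18, 18), dtype=int)
for a, b in [(0, 4), (4, 5), (5, 17), (0, 12), (12, 13), (13, 17)]:
    E[a, b] = 1

starts = np.flatnonzero(E[0])     # first row: entity beginning words
ends = np.flatnonzero(E[:, 17])   # last column: entity ending words
```

Reading off the first row and last column recovers exactly the beginning words (positions 4 and 12) and ending words (positions 5 and 13) stated in the text.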
As the above example shows, modeling the connections between words easily copes with entity overlap and discontinuity: entities with an overlapping part share a common sub-path in the graph, and for a discontinuous entity a connection relation exists even when two words are separated by some distance. The entities in the graph output by the model are exactly the paths that start with "[S]" and end with "[E]".
Because the category information is part of the input, categories that never appeared in the training set can also be spliced after the text and fed into the model. With the model trained on the CLUENER2020 dataset, which contains 10 categories (name, address, organization, company, government, book, game, movie, job, sight), inputting "[S]There are many animals on the great grassland, such as lions, tigers and antelopes.[E]animal" makes the model correctly output "lion", "tiger" and "antelope", showing that the model can use the knowledge learned in training on other tasks to predict entity categories it has never seen.
It should be understood by those skilled in the art that the features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations are described, but any combination of these features that involves no contradiction should be considered within the scope of this disclosure.
The above examples merely illustrate several embodiments of the present invention; their description is specific and detailed, but should not therefore be construed as limiting the scope of the invention. A person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the scope of the invention. The protection scope of the present invention is therefore subject to the appended claims.

Claims (4)

1. An extensible universal end-to-end named entity recognition method is characterized by comprising the following specific steps:
step 1: the text preprocessing process generates an input sequence, and specifically comprises the following steps:
receiving a text input and an entity category, adding a symbol at the head and at the tail of the text respectively, and appending the entity category after the tail symbol;
segmenting the text, with the head/tail symbols and entity category added, to obtain a word sequence;
mapping the word sequence to numbers, where the mapping between numbers and words is one-to-one and satisfies a bijective relation, and outputting the resulting number sequence as the input sequence;
step 2: constructing a model M, comprising:
using a context coding layer to receive the input sequence produced by the text preprocessing process, generating a group of word vectors via a self-attention mechanism, and discarding the word vectors corresponding to the entity category name;
modeling the directed connection relation between words with a directed connection graph: computing the directed connection graph between the words from the word vector group, representing the graph as a matrix, and outputting the matrix-represented graph;
Step 3: training the model M;
Step 4: predicting with the model M;
Step 5: performing entity parsing on the model output of step 4, specifically:
receiving the graph output by model M and traversing it from the head symbol; excluding the path where the head symbol connects directly to the tail symbol, every path that starts at the head symbol and ends at the tail symbol corresponds to one entity of the queried category, whose words are given by the path's node sequence in order; the set of parsed entities is output.
2. The method for identifying generic end-to-end named entities according to claim 1, wherein step 2 is to model the directed connection relationship between entity words by using a directed connection graph, and calculate the directed connection graph between words by using a word vector group, specifically:
if a word is the beginning of an entity, establishing a directed edge from the sentence-start symbol to that word;
if, within an entity, a word B follows a word A, establishing a directed edge from word A to word B;
if a word is the end of an entity, establishing a directed edge from that word to the tail symbol;
the words except the head and tail symbols are called intermediate words, and the corresponding word vectors are intermediate word vectors;
calculating the connection relation between the start symbol and each intermediate word from the head-symbol word vector and the intermediate word's vector, representing the probability that the intermediate word begins an entity;
calculating the connection relation between any two intermediate words from their word vectors;
calculating the connection relation between each intermediate word and the sentence-end symbol from the tail-symbol word vector and the intermediate word's vector, representing the probability that the intermediate word ends an entity;
and after the calculation is finished, obtaining a directed connection graph represented by a matrix between the words.
3. The method for identifying generic end-to-end named entities as claimed in claim 1, wherein the training model M is specifically:
acquiring a labeled data set, wherein each piece of data in the data set comprises a text t and a label y, all entity categories and corresponding entity sets contained in the text are recorded in the label y, and if the text t does not contain any type of entity, the label y is empty;
converting the data set to a training set:
defining the set S of all entity category names appearing in the data set, and letting S contain n entity category names; for each piece of data (t, y) in the data set, where t is the text and y is the label, and for each category s in S: if label y contains entities of category s, that is, text t contains a non-empty entity set e belonging to category s, then category s together with entity set e forms a label y', and the text t with label y' becomes one piece of training data; if text t contains no entity of category s, then category s together with an empty entity set e' forms a label y', and the text t with label y' becomes one piece of training data;
performing multiple rounds of training on the model M using the training set, each round of training comprising:
dividing the data of the training set into a plurality of batches, extracting one batch of data from the training set each time, and generating a true value of an adjacency matrix of the batch of data by using an entity set in a label for each extracted batch of data;
for each piece of data, processing the text in the piece of data and the entity category in the label by using the text preprocessing process in the step 1 to generate an input sequence;
inputting the input sequence into a model, calculating the connection relation among all words including symbols by the model, and outputting an adjacency matrix;
and finally, calculating loss by using a matrix predicted by the model and a true value matrix generated by the label, and updating the weight of the model according to the loss.
4. The method for generic end-to-end named entity recognition according to claim 1, wherein the prediction is performed using a model M, specifically:
inputting a piece of text from which entities are to be extracted; the input includes no labels or other information;
selecting an entity category, i.e. the category of entities to search for in the text;
inputting the text and the entity type into the text preprocessing process in the step 1 to obtain an input sequence;
the input sequence is input into a model M, which outputs a graph of the adjacency matrix representation.
CN202210617397.3A 2022-06-01 2022-06-01 Extensible universal end-to-end named entity identification method Pending CN115130475A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210617397.3A CN115130475A (en) 2022-06-01 2022-06-01 Extensible universal end-to-end named entity identification method

Publications (1)

Publication Number Publication Date
CN115130475A true CN115130475A (en) 2022-09-30

Family

ID=83378459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210617397.3A Pending CN115130475A (en) 2022-06-01 2022-06-01 Extensible universal end-to-end named entity identification method

Country Status (1)

Country Link
CN (1) CN115130475A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115438658A (en) * 2022-11-08 2022-12-06 浙江大华技术股份有限公司 Entity recognition method, recognition model training method and related device

Similar Documents

Publication Publication Date Title
CN110852087B (en) Chinese error correction method and device, storage medium and electronic device
CN107729309B (en) Deep learning-based Chinese semantic analysis method and device
CN111222305B (en) Information structuring method and device
CN109960728B (en) Method and system for identifying named entities of open domain conference information
CN110134946B (en) Machine reading understanding method for complex data
CN112231447B (en) Method and system for extracting Chinese document events
CN109543181B (en) Named entity model and system based on combination of active learning and deep learning
CN110196982B (en) Method and device for extracting upper-lower relation and computer equipment
CN109684642B (en) Abstract extraction method combining page parsing rule and NLP text vectorization
CN107797987B (en) Bi-LSTM-CNN-based mixed corpus named entity identification method
CN111709242B (en) Chinese punctuation mark adding method based on named entity recognition
CN112149421A (en) Software programming field entity identification method based on BERT embedding
CN111274804A (en) Case information extraction method based on named entity recognition
CN112966525B (en) Law field event extraction method based on pre-training model and convolutional neural network algorithm
CN113591483A (en) Document-level event argument extraction method based on sequence labeling
CN115292463B (en) Information extraction-based method for joint multi-intention detection and overlapping slot filling
CN109983473B (en) Flexible integrated recognition and semantic processing
CN107977353A (en) A kind of mixing language material name entity recognition method based on LSTM-CNN
CN111159485A (en) Tail entity linking method, device, server and storage medium
CN114282527A (en) Multi-language text detection and correction method, system, electronic device and storage medium
CN113128203A (en) Attention mechanism-based relationship extraction method, system, equipment and storage medium
CN113742733B (en) Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type
CN114239574A (en) Miner violation knowledge extraction method based on entity and relationship joint learning
CN113065349A (en) Named entity recognition method based on conditional random field
CN115544303A (en) Method, apparatus, device and medium for determining label of video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination