CN114239584A - Named entity identification method based on self-supervision learning - Google Patents

Named entity identification method based on self-supervision learning

Info

Publication number
CN114239584A
Authority
CN
China
Prior art keywords
entity
output
embedding
vector
main network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111539122.4A
Other languages
Chinese (zh)
Inventor
周仁杰
胡强
万健
张纪林
殷昱煜
蒋从锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202111539122.4A priority Critical patent/CN114239584A/en
Publication of CN114239584A publication Critical patent/CN114239584A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a named entity recognition method based on self-supervised learning, which comprises the following steps: preprocess a data set; construct positive and negative example sentence pairs from the processed data; and encode the sentences in the positive and negative example pairs with an embedding encoder. In a named entity recognition model based on self-supervised learning, the different senses of an entity in different contexts are learned from distinct entity features and a similarity matrix, and the feature vectors of the model are trained on the similarity of positive and negative example sentence pairs, accommodating the linguistic differences among corpora. The invention improves the accuracy of named entity recognition and uses a public knowledge graph to correct entity-type errors caused by word abbreviations in the output, so that entities and entity types are predicted more accurately and the embedding vector of an ambiguous word better represents its sense in the current context.

Description

Named entity identification method based on self-supervision learning
Technical Field
The invention relates to named entity recognition methods, and in particular to a named entity recognition method based on self-supervised learning.
Background
With the arrival of the big data era, research on named entity recognition has gradually become one of the emerging international frontiers at the intersection of cognitive science, information science and intelligence science. In recent years, developed Western countries have attached increasing importance to named entity recognition, and open-source information extraction has become an important basis for national defense policy, strategic decision-making and command operations. Named entity recognition is also rapidly becoming one of the leading international research hotspots in informatics.
Existing named entity recognition methods mostly extract entities and entity types from text. The main task of named entity recognition is to identify and classify proper nouns such as person names and place names, as well as meaningful numerical phrases such as times and dates, in text. There are three main approaches to named entity recognition: rule-based methods, statistics-based methods, and supervised-learning-based methods.
Rule-based methods extract entities from text by means of manually constructed rules. They achieve high accuracy in certain specific domains, but this also brings strong limitations, such as poor cross-domain portability, because the rules only work well in the domains for which they were written. Statistics-based methods perform statistics over text and mine word features from a corpus; they place high demands on the corpus, yet general-purpose corpora suitable for evaluating large named entity recognition tasks are still scarce, which limits their development to some extent. Supervised-learning-based methods train a classifier on labeled data and apply it to new entity recognition; they partly overcome the domain limitation of rule-based methods and the corpus requirements of statistics-based methods, but they do not learn the sense of an ambiguous word in its current context well during the word embedding stage.
The invention uses self-supervised learning to further model ambiguous words, proposes a named entity recognition method based on self-supervised learning, and builds a complete named entity recognition model.
Disclosure of Invention
The invention aims to solve the problem that existing named entity recognition techniques do not learn the sense of an ambiguous word in its current context well during the word embedding stage, and provides a named entity recognition method based on self-supervised learning.
The technical scheme adopted by the invention is as follows:
step 1: preprocessing the data set;
1-1, composing sentences from the entity-type-annotated words and connecting words in the data set;
1-2, translating the sentence s_i of step 1-1 into a sentence a_i in an arbitrary intermediate language, and then translating a_i back into the original language to obtain the positive example sentence s_i^+;
Step 2: constructing the positive and negative example sentence pair sets from the sentences processed in step 1, wherein the positive example sentence pair set consists of the pairs (s_i, s_i^+), s_i^+ being the back-translation of s_i obtained in step 1-2, and the negative example sentence pair set consists of the pairs (s_i, s_j^+) with j ≠ i; that is, a negative example sentence pair consists of an original sentence and the back-translation of a different sentence in the corpus;
Step 3: encoding the sentences in the positive and negative example sentence pairs respectively with an embedding encoder;
Step 4: inputting the embedding-encoded word vectors into a deep neural network (DNN) layer;
Step 5: performing similarity calculation on the output vectors of the positive and negative example sentence pairs obtained in step 4 and concatenating the results row-wise into a new similarity matrix M_sim; then optimizing the parameters of the embedding encoder f_k of step 3 with a contrastive loss function l through back propagation and gradient descent;
Step 6: obtaining sentences composed of entity-type-annotated words, constructing a data set, and dividing the data set into a training set and a test set;
Step 7: building a named entity recognition model based on self-supervised learning, the model comprising a main network and a correction module cascaded in sequence; then training the main network with the training set, testing the trained main network with the test set, and finally correcting the output result of the tested main network with the correction module;
the main network comprises the embedding encoder f_k optimized in step 5, a bidirectional LSTM layer and a CRF layer;
the correction module comprises a phrase retrieval module and an entity type modification module; the phrase retrieval module obtains the candidate entities of the main network input, screens out those that exist in the public knowledge graph, and combines the retained entities with their entity types into a potential entity set PE; the potential entity set comprises single words, phrases formed of several words, and the entity types corresponding to these words and phrases; the entity type modification module receives the potential entity set PE output by the phrase retrieval module and the entity type labels output by the main network, and compares the entity type output by the main network with the entity type recorded in PE for each potential entity appearing in the main network input; if they are consistent, no modification is needed, and if they are inconsistent, the entity type modification module modifies the output result of the main network;
Step 8: performing named entity recognition on text with the tested named entity recognition model based on self-supervised learning.
It is a further object of the present invention to provide a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the above-mentioned method.
It is a further object of the present invention to provide a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method described above.
The technical scheme provided by the invention has the following beneficial effects:
according to the invention, positive and negative example sentence pairs are constructed by translating sentences through multiple intermediate languages. In practice the negative examples are stored in a queue: the data of the current mini-batch enters the queue and the oldest mini-batch data is removed from it. The queue decouples the number of negative examples from the batch size, i.e., the queue size is not constrained by the batch size, which effectively addresses the need for a large number of negative examples in self-supervised learning;
a similarity function is used to measure the similarity of the word embedding vectors of sentences in the representation space, and the parameters of the embedding encoder are updated slowly by a momentum moving average. This avoids the loss of feature consistency that would be caused by drastic changes in the encoder parameters while still keeping the encoder up to date, so that in the word embedding stage ambiguous words are encoded in a way that better matches their sense in the current context;
the invention uses a public knowledge graph to correct entity-type recognition errors caused by word abbreviations in the output result, further improving the accuracy of entity recognition.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a flow chart of an embedding encoder optimization for self-supervised learning;
FIG. 3 is a diagram of a named entity recognition model architecture based on self-supervised learning according to the present invention;
FIG. 4 is a diagram of a correction module in the named entity recognition model based on self-supervised learning according to the present invention;
Detailed Description
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings. The specific flow description is shown in fig. 1, wherein:
step 1: preprocessing the data set;
1-1, composing sentences from the entity-type-annotated words and connecting words in the data set;
an entity is a proper noun in the text, such as a person name, a place name or an organization name;
1-2, translating the sentence s_i of step 1-1 into a sentence a_i in an arbitrary intermediate language, and then translating a_i back into the original language to obtain the positive example sentence s_i^+;
Step 2: constructing the positive and negative example sentence pair sets from the sentences processed in step 1, wherein the positive example sentence pair set consists of the pairs (s_i, s_i^+), s_i^+ being the back-translation of s_i obtained in step 1-2, and the negative example sentence pair set consists of the pairs (s_i, s_j^+) with j ≠ i; that is, a negative example sentence pair consists of an original sentence and the back-translation of a different sentence in the corpus;
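As an illustration of steps 1 and 2, the sketch below builds positive example pairs by back-translation and negative example pairs by pairing a sentence with the back-translation of a different sentence. The translate helper is a placeholder for any machine translation service (the patent does not prescribe one), and the pivot language is an arbitrary assumption.

```python
import random

def back_translate(sentence, translate, pivot_lang="de"):
    """Translate a sentence into a pivot language and back into the original language.
    `translate(text, src, tgt)` is a hypothetical MT helper supplied by the caller."""
    pivot = translate(sentence, src="en", tgt=pivot_lang)
    return translate(pivot, src=pivot_lang, tgt="en")

def build_sentence_pairs(sentences, translate):
    """Return positive pairs (s_i, s_i+) and negative pairs (s_i, s_j+) with j != i."""
    back = [back_translate(s, translate) for s in sentences]
    positives = list(zip(sentences, back))
    negatives = []
    for i, s in enumerate(sentences):
        j = random.choice([k for k in range(len(sentences)) if k != i])
        negatives.append((s, back[j]))  # back-translation of a *different* corpus sentence
    return positives, negatives
```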
Step 3: encoding the sentences in the positive and negative example sentence pairs with the embedding encoders, specifically as follows:
the sentence s_i is input to the embedding encoder f_q (query encoder) for word embedding encoding, yielding the encoded result q_i; at the same time, the positive and negative example sentences s_i^+ and s_j^+ corresponding to s_i are input to the embedding encoder f_k (key encoder) for word embedding encoding, yielding the encoded results k_i^+ and k_i^-;
the embedding encoders f_q and f_k have the same initialization parameters θ_q and θ_k;
Step 4: inputting the embedding-encoded word vectors into a deep neural network (DNN) layer;
the deep neural network layer includes a first fully-connected layer, a Relu layer, and a second fully-connected layer.
(1) First fully-connected layer: converts the embedding vector output by the not-yet-optimized embedding encoder into an output vector of the same dimensionality through a linear transformation:
o_dense1 = W·x_input + b
where o_dense1 denotes the output vector, x_input denotes the embedding vector output by the unoptimized embedding encoder, W denotes a weight matrix, and b denotes a bias vector;
(2) ReLU layer: the output vector of the first fully-connected layer is passed through a ReLU activation function, which helps keep the convergence of the model stable:
o_dense2 = max(o_dense1, 0)
where o_dense2 denotes the output vector of the ReLU layer;
(3) Second fully-connected layer: converts the output vector of the ReLU layer into an output vector whose dimensionality equals the number of predicted entity types;
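For concreteness, a minimal PyTorch sketch of this DNN layer (linear layer, ReLU, linear layer projecting to the number of entity types) is given below; the embedding dimension and the number of entity types are illustrative assumptions, not values fixed by the patent.

```python
import torch.nn as nn

class ProjectionDNN(nn.Module):
    """First fully-connected layer -> ReLU -> second fully-connected layer."""
    def __init__(self, embed_dim=768, num_entity_types=9):
        super().__init__()
        self.dense1 = nn.Linear(embed_dim, embed_dim)          # o_dense1 = W x_input + b (same dimensionality)
        self.relu = nn.ReLU()                                   # o_dense2 = max(o_dense1, 0)
        self.dense2 = nn.Linear(embed_dim, num_entity_types)    # project to the number of entity types

    def forward(self, x):
        return self.dense2(self.relu(self.dense1(x)))
```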
Step 5: performing similarity calculation on the output vectors of the positive and negative example sentence pairs obtained in step 4 and concatenating the results row-wise into a new similarity matrix M_sim; then optimizing the parameters of the embedding encoder f_k of step 3 with a contrastive loss function l through back propagation and gradient descent. The specific operations are as follows:
5-1, the output vectors of the DNN corresponding to q_i, k_i^+ and k_i^- are compared with a similarity function sim(·,·), giving the positive example similarity r^+ = sim(q_i, k_i^+) of similar sentences and the negative example similarity r^- = sim(q_i, k_i^-) of dissimilar sentences; r^+ and r^- are then aggregated row-wise to obtain the similarity matrix M_sim = [r^+; r^-];
5-2, the similarity of the positive and negative example sentence pairs in the vector representation space is measured with the following contrastive loss function l:

l = -log( exp(r^+ / τ) / sum(exp(M_sim / τ)) )

where τ is a hyper-parameter whose role is to scale the similarities to a suitable input magnitude, exp(·) denotes the exponential function with natural base e, and sum(·) denotes summing the matrix elements along rows;
5-3, the parameters of the embedding encoder f_k are optimized with the contrastive loss l through back propagation and gradient descent;
here f_k updates θ_k by a momentum moving average, and a queue is used to store the mini-batch data (keys) encoded by f_k, which serve as the negative examples k_i^- for the current sentence s_i; the data of the current mini-batch enters the queue while the oldest mini-batch data is removed from it. The momentum moving average is:

θ_k ← m·θ_k + (1 - m)·θ_q

where m is the momentum coefficient;
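The following PyTorch-style sketch puts step 5 together under stated assumptions: cosine similarity is used for sim(·,·), the loss is implemented in the InfoNCE-style form given above, the negative keys come from a fixed-size queue, and f_k is updated with the momentum moving average θ_k ← m·θ_k + (1 - m)·θ_q. The queue size, temperature and momentum values are placeholders rather than values specified by the patent.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, k_pos, queue, tau=0.07):
    """q, k_pos: (B, D) mini-batch outputs of f_q / f_k; queue: (K, D) stored negative keys.
    Builds the row-wise similarity matrix M_sim = [r+, r-] and returns the contrastive loss l."""
    q = F.normalize(q, dim=1)
    k_pos = F.normalize(k_pos, dim=1)
    r_pos = (q * k_pos).sum(dim=1, keepdim=True)          # positive similarities r+
    r_neg = q @ F.normalize(queue, dim=1).t()              # negative similarities r- against the queue
    m_sim = torch.cat([r_pos, r_neg], dim=1)                # similarity matrix M_sim, one row per sentence
    # l = -log( exp(r+/tau) / sum(exp(M_sim/tau)) ), averaged over the mini-batch
    return -(r_pos.squeeze(1) / tau - torch.logsumexp(m_sim / tau, dim=1)).mean()

@torch.no_grad()
def momentum_update(f_q, f_k, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q."""
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

@torch.no_grad()
def update_queue(queue, new_keys, max_size=4096):
    """Enqueue the keys of the current mini-batch and drop the oldest ones."""
    return torch.cat([new_keys.detach(), queue], dim=0)[:max_size]
```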
FIG. 2 is a flow chart of an embedding encoder optimization for self-supervised learning;
Step 6: obtaining sentences composed of entity-type-annotated words, constructing a data set, and dividing the data set into a training set and a test set;
Step 7: building a named entity recognition model based on self-supervised learning which, as shown in FIG. 3, comprises a main network and a correction module cascaded in sequence; then training the main network with the training set, testing the trained main network with the test set, and finally correcting the output result of the tested main network with the correction module;
a) the main network comprises the embedding encoder f_k optimized in step 5, a bidirectional LSTM layer and a CRF layer;
1) the optimized embedding encoder f_k encodes each word of a sentence into a word embedding vector; its input is a complete sentence and its output is the word embedding vector of every word in the sentence;
2) the bidirectional LSTM layer is used for learning dependency information among words; the input is a word embedding vector, and the output is a word embedding vector containing dependency information among words;
Compared with an RNN, which has only a hidden state h_t, the LSTM additionally maintains a cell state c_t; together the hidden state and the cell state can store all valid information at and before time t. The LSTM protects and controls information through three gating units: an input gate, a forget gate, and an output gate. The first step of the LSTM is to discard part of the information retained during long-sequence training. This step is performed by the forget gate, which reads h_{t-1} and x_t and outputs a value between 0 and 1 through a sigmoid activation function, where 0 means completely discard and 1 means completely keep. The forget gate is computed as:

f_t = σ(W_f·h_{t-1} + U_f·x_t + b_f)

where x_t is the word embedding vector output by the embedding encoder f_k, h_{t-1} is the hidden state of the LSTM at time t-1, W_f and U_f are the weight matrices of the forget gate for h_{t-1} and x_t respectively, b_f is the bias vector of the forget gate, σ(·) denotes the sigmoid activation function, and f_t is the output of the forget gate;
the second step is to update the cell state ct. In updating ctPreviously it was necessary to determine which information needs to be updated by means of the input gate and to determine the candidate update content (candidate value vector z) by means of a tanh layer. The calculation mode through the input gate and the tanh layer is similar to that of the forgetting gate, and the calculation mode is as follows:
it=σ(Wiht-1+Uixt+bi)
z=tanh(Wzht-1+Uzxt+bz)
wherein WiAnd UiAre respectively an input gate ht-1And xtWeight matrix of biIs the offset vector of the input gate,itis the output of the input gate; wzAnd UzRespectively h in the candidate value vectort-1And xtWeight matrix of bzAn offset vector that is a vector of candidate values;
The cell state c_t is then updated by element-wise multiplication:

c_t = f_t ⊙ c_{t-1} + i_t ⊙ z

where ⊙ denotes the element-wise (Hadamard) product;
the last step through LSTM is to update the hidden state ht. Update htThe cell state c needs to be determinedtThe state h is updated by processing the tanh layer to obtain a value between-1 and 1, and multiplying the value by the output point of the output gatetThe output gate is similar to the forgetting gate and the input gate in calculation mode.
ot=σ(Woht-1+Uoxt+bo)
ht=ot⊙tanh(ct)
Wherein WoAnd UoRespectively in the output gatet-1And xtWeight matrix of boIs an offset vector of the output gate, otIs the output of the output gate;
for many sequence tagging tasks, it makes sense to access both past and future information, whereas the hidden state of the one-way LSTM can only obtain information from the past. In order to obtain both past and future information, a bi-directional LSTM is used. The output of the bi-directional LSTM is the score of each token belonging to each class of tags, which needs to be normalized by softmax after calculation:
Figure BDA0003413389280000061
wherein gamma isiNormalized result of the label score representing the ith token, xiA label score vector representing the ith token, n being the size of the label category;
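To make the gate equations above concrete, here is a single time step of the LSTM cell written out in NumPy, following the formulas for f_t, i_t, z, c_t, o_t and h_t, together with the softmax normalization of the label scores. The weight layout (dictionaries keyed by gate name) is an illustrative choice; in practice a library bidirectional LSTM would be used, running one such recurrence forward and one backward over the sentence.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b are dicts keyed by 'f', 'i', 'z', 'o'."""
    f_t = sigmoid(W['f'] @ h_prev + U['f'] @ x_t + b['f'])   # forget gate f_t
    i_t = sigmoid(W['i'] @ h_prev + U['i'] @ x_t + b['i'])   # input gate i_t
    z   = np.tanh(W['z'] @ h_prev + U['z'] @ x_t + b['z'])   # candidate value vector z
    c_t = f_t * c_prev + i_t * z                              # cell state update (element-wise products)
    o_t = sigmoid(W['o'] @ h_prev + U['o'] @ x_t + b['o'])   # output gate o_t
    h_t = o_t * np.tanh(c_t)                                  # hidden state update
    return h_t, c_t

def softmax(label_scores):
    """Normalize the per-token label scores output by the bidirectional LSTM."""
    e = np.exp(label_scores - label_scores.max())
    return e / e.sum()
```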
3) a CRF layer for further correcting the recognition result; its input is the output vector of the bidirectional LSTM layer and its output is the entity label of each word;
A CRF is a typical discriminative model whose role here is to further correct the recognition result. In a named entity recognition task, the output may contain meaningless label sequences because the model above it does not consider the dependencies between labels; the CRF combines context information to capture these label dependencies, so that the recognized entities obey the labeling rules;
in CRF, there are two very important scores, namely an emulsion score and a Transition score. Wherein the emision score is derived from the output of the bi-directional LSTM model, specifically predicting for each token a score for each class of labels; while Transition score is the probability of Transition from a certain class of tags to another, the Transition matrix is the Transition probability that can be trained to change the internal tags. With the occurrence score and the Transition score, the Path score Path of the current output sequence can be calculated, as shown in the formula:
Figure BDA0003413389280000071
Ti,j=emi+transi,j
em thereiniAnd transi,jAn emulation score of the ith token and a Transition score, T, of the label transferred from the ith token to the jth token in a sentence, respectivelyi,jThe sum of the number of the emulation score and the Transition score in a sentence. CRF training is performed by:
Figure BDA0003413389280000072
wherein, PathrealPath score for the correct Path in the training process, PathiFor the path score of the ith possible path, loss represents the loss function of the CRF layer;
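A small sketch of the CRF scoring just described: the score of a labeled path sums the emission and transition scores along the sentence, and the loss normalizes the correct path's score against all possible paths. All paths are enumerated brute-force here for clarity (real CRF layers compute the normalizer with the forward algorithm); array shapes and names are illustrative assumptions.

```python
import itertools
import numpy as np

def path_score(emissions, transitions, tags):
    """emissions: (T, L) per-token label scores; transitions: (L, L); tags: length-T label sequence."""
    score = emissions[np.arange(len(tags)), tags].sum()                              # emission scores em_i
    score += sum(transitions[tags[t], tags[t + 1]] for t in range(len(tags) - 1))    # transition scores trans_{i,j}
    return score

def crf_loss(emissions, transitions, gold_tags):
    """loss = -log( exp(Path_real) / sum_i exp(Path_i) ), Path_i ranging over all label paths."""
    seq_len, n_labels = emissions.shape
    scores = np.array([path_score(emissions, transitions, list(p))
                       for p in itertools.product(range(n_labels), repeat=seq_len)])
    log_z = scores.max() + np.log(np.exp(scores - scores.max()).sum())               # stable log-sum-exp
    return log_z - path_score(emissions, transitions, list(gold_tags))
```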
b) the correction module comprises a phrase retrieval module and an entity type modification module, as shown in FIG. 4;
1) the phrase retrieval module obtains the candidate entities of the main network input and screens out those that exist in the public knowledge graph; the candidate entities include single words and phrases composed of several words.
The input is a sentence and the output is the potential entity set PE of that sentence; the retrieval proceeds as follows:
i. enumerate all contiguous word combinations (phrases) in the sentence; for example, the sentence "The European Commission" yields the candidate set Pe = {The, European, Commission, The European, European Commission, The European Commission};
ii. query each candidate entity of the set Pe obtained in step i against the public knowledge graph; if the candidate entity and its corresponding entity type can be retrieved from the knowledge graph, add the candidate entity and its entity type to the potential entity set PE;
the resulting potential entity set PE is, for example, {The European Commission: Organization, …};
2) the entity type modification module receives the potential entity set PE output by the phrase retrieval module and the entity type labels output by the main network, and then compares the entity type output by the main network with the entity type recorded in PE for each potential entity appearing in the main network input; if they are consistent, no modification is needed, and if they are inconsistent, the output result of the main network is modified;
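A sketch of the correction module under the assumption that the public knowledge graph can be queried like a dictionary mapping surface forms to entity types (a real system would query DBpedia); BIO prefixes are ignored here for simplicity, and all names are illustrative.

```python
def enumerate_spans(tokens):
    """All contiguous word combinations of a sentence, e.g. ['The', 'European', 'Commission'] ->
    {'The', 'European', 'Commission', 'The European', 'European Commission', 'The European Commission'}."""
    return {" ".join(tokens[i:j]) for i in range(len(tokens)) for j in range(i + 1, len(tokens) + 1)}

def retrieve_potential_entities(tokens, knowledge_graph):
    """Keep only spans found in the knowledge graph, together with their entity type (the set PE)."""
    return {span: knowledge_graph[span] for span in enumerate_spans(tokens) if span in knowledge_graph}

def correct_labels(tokens, predicted_types, knowledge_graph):
    """Override the main network's entity type wherever it disagrees with the knowledge graph."""
    corrected = list(predicted_types)
    for span, kg_type in retrieve_potential_entities(tokens, knowledge_graph).items():
        span_tokens = span.split()
        for start in range(len(tokens) - len(span_tokens) + 1):
            if tokens[start:start + len(span_tokens)] == span_tokens:
                for k in range(start, start + len(span_tokens)):
                    if corrected[k] != kg_type:
                        corrected[k] = kg_type
    return corrected
```

For example, with knowledge_graph = {"The European Commission": "Organization"}, a token of that phrase mislabeled by the main network would be overwritten with "Organization".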
Step 8: performing named entity recognition on text with the tested named entity recognition model based on self-supervised learning (MBBCD).
The performance evaluation of the invention uses the CoNLL-2003 English public data set; the following table shows the size of the data set:
                    Number of articles    Number of sentences    Number of words
Training set               946                  14987                203621
Development set            216                   3466                 51362
Test set                   231                   3684                 46435
The data set comprises four entity types, namely place name, person name, organization name and other entities, and the entity labeling method adopts a BIO labeling method: the BIO notation specifies that all named entities begin with the B label, I denotes inside the named entity, O denotes outside the named entity, e.g., if a word in the corpus is labeled B/I-XXX, B/I denotes that the word belongs to the beginning or inside of the named entity, i.e., the word is part of the named entity, and XXX denotes the type of the named entity. The following table shows the specific distribution of the number of entities in the training set, development set, and test set in the data set:
                    Place name    Person name    Organization name    Other entities
Training set           7140           6600              6321               3438
Development set        1837           1842              1341                922
Test set               1668           1617              1661                702
In step 7, the DBpedia English knowledge graph is used to correct entity-type recognition errors caused by word abbreviations in the output result; the following table shows the entity recognition results of the invention on the test set:
(Entity recognition results on the test set: table provided as a figure in the original publication.)
In the entity recognition result table, CNN denotes character-level encoding, GloVe provides pre-trained word vectors, and the named entity recognition model based on self-supervised learning (MBBCD) is the method proposed by the invention. Precision, Recall and Micro-F1 are used as the performance evaluation metrics for entity recognition in the experiments. The labeling scheme for named entities determines both the entity boundary and the entity type, and the recognition of an entity counts as correct only when both its boundary and its type are labeled correctly. From the counts of True Positives (TP), False Positives (FP) and False Negatives (FN), the Precision, Recall and F1-score of the named entity recognition task can be computed. TP are entities whose boundary and type are both identified correctly, FP are outputs recognized as entities whose boundary or type is wrong, and FN are entities that should have been recognized but were not.
According to the definition of Precision: for a given data set, precision is the proportion of predicted positives that are correct; in the named entity recognition task it is computed as:

Precision = TP / (TP + FP)
according to the definition of Recall: recall is the proportion of actual positive cases that the classifier judges to be positive; in the named entity recognition task it is computed as:

Recall = TP / (TP + FN)
according to the definition of the micro-averaged F1 value (Micro-F1): Micro-F1 is the harmonic mean of precision and recall and is a composite metric that balances their influence; it is computed as:

Micro-F1 = 2 × Precision × Recall / (Precision + Recall)
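The three metrics follow directly from the TP, FP and FN counts; a small helper, for illustration:

```python
def ner_metrics(tp, fp, fn):
    """Entity-level precision, recall and micro-F1 from TP/FP/FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```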

Claims (9)

1. A named entity recognition method based on self-supervised learning, characterized by comprising the following steps:
step 1: preprocessing the data set;
1-1, composing sentences from the entity-type-annotated words and connecting words in the data set;
1-2, translating the sentence s_i of step 1-1 into a sentence a_i in an arbitrary intermediate language, and then translating a_i back into the original language to obtain the positive example sentence s_i^+;
Step 2: constructing the positive and negative example sentence pair sets from the sentences processed in step 1, wherein the positive example sentence pair set consists of the pairs (s_i, s_i^+), s_i^+ being the back-translation of s_i obtained in step 1-2, and the negative example sentence pair set consists of the pairs (s_i, s_j^+) with j ≠ i; that is, a negative example sentence pair consists of an original sentence and the back-translation of a different sentence in the corpus;
Step 3: encoding the sentences in the positive and negative example sentence pairs respectively with an embedding encoder;
Step 4: inputting the embedding-encoded word vectors into a deep neural network (DNN) layer;
Step 5: performing similarity calculation on the output vectors of the positive and negative example sentence pairs obtained in step 4 and concatenating the results row-wise into a new similarity matrix M_sim; then optimizing the parameters of the embedding encoder f_k of step 3 with a contrastive loss function l through back propagation and gradient descent;
Step 6: obtaining sentences composed of entity-type-annotated words, constructing a data set, and dividing the data set into a training set and a test set;
Step 7: building a named entity recognition model based on self-supervised learning, the model comprising a main network and a correction module cascaded in sequence; then training the main network with the training set, testing the trained main network with the test set, and finally correcting the output result of the tested main network with the correction module;
the main network comprises the embedding encoder f_k optimized in step 5, a bidirectional LSTM layer and a CRF layer;
the correction module comprises a phrase retrieval module and an entity type modification module; the phrase retrieval module obtains the candidate entities of the main network input, screens out those that exist in the public knowledge graph, and combines the retained entities with their entity types into a potential entity set PE; the potential entities comprise single words and phrases formed of several words; the entity type modification module receives the potential entity set PE output by the phrase retrieval module and the entity type labels output by the main network, and compares the entity type output by the main network with the entity type recorded in PE for each potential entity appearing in the main network input; if they are consistent, no modification is needed, and if they are inconsistent, the entity type modification module modifies the output result of the main network;
Step 8: performing named entity recognition on text with the tested named entity recognition model based on self-supervised learning.
2. The named entity recognition method based on self-supervised learning as claimed in claim 1, wherein the embedding encoders f_q and f_k have the same initialization parameters θ_q and θ_k.
3. The named entity recognition method based on self-supervised learning as recited in claim 1, wherein the step 3 specifically comprises:
inputting the sentence s_i into the embedding encoder f_q for word embedding encoding to obtain the encoded result q_i; at the same time, inputting the positive and negative example sentences s_i^+ and s_j^+ corresponding to s_i into the embedding encoder f_k for word embedding encoding to obtain the encoded results k_i^+ and k_i^-.
4. The named entity recognition method based on self-supervised learning as recited in claim 1, wherein the deep neural network layer comprises a first fully-connected layer, a Relu layer and a second fully-connected layer;
(1) First fully-connected layer: converts the embedding vectors output by the embedding encoders f_q and f_k into output vectors of the same dimensionality through a linear transformation:
o_dense1 = W·x_input + b
where o_dense1 denotes the output vector, x_input denotes the embedding vector output by the unoptimized embedding encoder, W denotes a weight matrix, and b denotes a bias vector;
(2) ReLU layer: the output vector of the first fully-connected layer is passed through a ReLU activation function, which helps keep the convergence of the model stable:
o_dense2 = max(o_dense1, 0)
where o_dense2 denotes the output vector of the ReLU layer;
(3) Second fully-connected layer: converts the output vector of the ReLU layer into an output vector whose dimensionality equals the number of predicted entity types.
5. The named entity recognition method based on self-supervised learning as recited in claim 1, wherein the specific operations in step 5 are as follows:
5-1, the output vectors of the DNN corresponding to q_i, k_i^+ and k_i^- are compared with a similarity function sim(·,·), giving the positive example similarity r^+ = sim(q_i, k_i^+) of similar sentences and the negative example similarity r^- = sim(q_i, k_i^-) of dissimilar sentences; r^+ and r^- are then aggregated row-wise to obtain the similarity matrix M_sim = [r^+; r^-];
5-2, the similarity of the positive and negative example sentence pairs in the vector representation space is measured with the following contrastive loss function l:

l = -log( exp(r^+ / τ) / sum(exp(M_sim / τ)) )

where τ is a hyper-parameter, exp(·) denotes the exponential function with natural base e, and sum(·) denotes summing the matrix elements along rows;
5-3, the parameters of the embedding encoder f_k are optimized with the contrastive loss l through back propagation and gradient descent.
6. The named entity recognition method based on self-supervised learning as claimed in claim 1, wherein the embedding encoder f_k updates θ_k by a momentum moving average, as follows:

θ_k ← m·θ_k + (1 - m)·θ_q

where m is the momentum.
7. The named entity recognition method based on self-supervised learning as claimed in claim 1, wherein the optimized embedding encoder f_k is used for encoding each word of a sentence into a word embedding vector; its input is a complete sentence and its output is the word embedding vector of each word in the sentence;
the bidirectional LSTM layer is used for learning dependency information among words; the input is a word embedding vector, and the output is a word embedding vector containing dependency information among words;
the CRF layer is used for further correcting the recognition result; the input is the output vector of the bi-directional LSTM layer and the output is the entity type label for each word.
8. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-7.
9. A computing device comprising a memory having executable code stored therein and a processor that, when executing the executable code, implements the method of any of claims 1-7.
CN202111539122.4A 2021-12-15 2021-12-15 Named entity identification method based on self-supervision learning Pending CN114239584A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111539122.4A CN114239584A (en) 2021-12-15 2021-12-15 Named entity identification method based on self-supervision learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111539122.4A CN114239584A (en) 2021-12-15 2021-12-15 Named entity identification method based on self-supervision learning

Publications (1)

Publication Number Publication Date
CN114239584A true CN114239584A (en) 2022-03-25

Family

ID=80756701

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111539122.4A Pending CN114239584A (en) 2021-12-15 2021-12-15 Named entity identification method based on self-supervision learning

Country Status (1)

Country Link
CN (1) CN114239584A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115688777A (en) * 2022-09-28 2023-02-03 北京邮电大学 Named entity recognition system for nested and discontinuous entities of Chinese financial text
CN115688777B (en) * 2022-09-28 2023-05-05 北京邮电大学 Named entity recognition system for nested and discontinuous entities of Chinese financial text

Similar Documents

Publication Publication Date Title
CN108628823B (en) Named entity recognition method combining attention mechanism and multi-task collaborative training
CN108733792B (en) Entity relation extraction method
CN109902145B (en) Attention mechanism-based entity relationship joint extraction method and system
CN109657239B (en) Chinese named entity recognition method based on attention mechanism and language model learning
CN109726389B (en) Chinese missing pronoun completion method based on common sense and reasoning
CN112733541A (en) Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN109800437A (en) A kind of name entity recognition method based on Fusion Features
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN111222318A (en) Trigger word recognition method based on two-channel bidirectional LSTM-CRF network
CN111104509A (en) Entity relation classification method based on probability distribution self-adaption
CN113948217A (en) Medical nested named entity recognition method based on local feature integration
Deng et al. Self-attention-based BiGRU and capsule network for named entity recognition
CN111753088A (en) Method for processing natural language information
CN113821635A (en) Text abstract generation method and system for financial field
Ye et al. Chinese named entity recognition based on character-word vector fusion
CN113160917B (en) Electronic medical record entity relation extraction method
CN108875024B (en) Text classification method and system, readable storage medium and electronic equipment
CN112699685B (en) Named entity recognition method based on label-guided word fusion
CN114239584A (en) Named entity identification method based on self-supervision learning
CN115906846A (en) Document-level named entity identification method based on double-graph hierarchical feature fusion
CN115600595A (en) Entity relationship extraction method, system, equipment and readable storage medium
CN115600597A (en) Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination