CN115809666B - Named entity recognition method integrating dictionary information and attention mechanism - Google Patents

Named entity recognition method integrating dictionary information and attention mechanism

Info

Publication number
CN115809666B
Authority
CN
China
Prior art keywords
word
matrix
relation
pair
words
Prior art date
Legal status
Active
Application number
CN202211546653.0A
Other languages
Chinese (zh)
Other versions
CN115809666A (en)
Inventor
姜明
陈跃晨
张旻
Current Assignee
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202211546653.0A
Publication of CN115809666A
Application granted
Publication of CN115809666B

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a named entity recognition method integrating dictionary information and an attention mechanism, comprising the following steps. Step (1): perform BERT word embedding and Bi-LSTM context fusion to obtain character features and their sequence. Step (2): process the character feature sequence to obtain a relation feature matrix G, a distance matrix E_d, a region matrix E_t and a vocabulary grid matrix E_w, and splice the features of the four matrices to obtain the grid relation feature. Step (3): perform global feature fusion on the grid relation feature of step (2) through a global attention mechanism to obtain Word-Pair features fused with global information. Step (4): jointly predict from the Word-Pair features and the character features to obtain a Word-Pair relation matrix. Step (5): decode the Word-Pair relation matrix to obtain the final entity words and their types. The method expresses word-level information as a Word-Pair word embedding relation matrix integrated into the model, improving the recognition accuracy of entity boundaries.

Description

Named entity recognition method integrating dictionary information and attention mechanism
Technical Field
The invention relates to the technical field of named entity recognition within information extraction, and in particular to a named entity recognition method integrating dictionary information and an attention mechanism, which can effectively extract named entities from text.
Background
Named entity recognition is a classical text information extraction task. It aims to extract entity information with specific meaning from unstructured text, including proper nouns such as person names, place names and organization names, and numerical information such as amounts, times and quantities, so that people can obtain the information they need more efficiently.
At present, mainstream named entity recognition methods include unsupervised methods based on dictionaries or rules, supervised methods based on sequence labeling, supervised methods based on reading comprehension, and so on. Matching against an entity dictionary is the fastest and most widely used method; it is highly efficient, but its effectiveness is strongly affected by the completeness of the entity vocabulary, and it generally achieves high precision but low recall. The core of rule-template matching is to summarize rule templates, that is, grammar templates for entity words, and to patch the places where they fall short. Supervised methods were proposed later, mainly comprising sequence-labeling methods and reading-comprehension methods. Sequence-labeling methods can recognize continuous entities in text well, but cannot recognize nested or discontinuous entities; many sequence-labeling approaches also fuse dictionary information to enhance the recognition of entity boundaries. Reading-comprehension methods do not give all entity answers at once, but return the corresponding entities according to questions. Unlike traditional sequence-labeling entity recognition, machine reading comprehension can effectively handle the recognition of overlapping and discontinuous entities, but carries great uncertainty in practical application.
Later, W²NER was proposed: a named entity recognition method that uniformly models Word-Pair relations and can recognize continuous, nested and discontinuous entities. The method defines two word-word relations: the adjacency relation, meaning that two words can be adjacent within an entity, and the entity relation, meaning that two words serve as the head and tail of an entity respectively. The method can be divided into three steps. First, the input sentence is encoded with BERT and Bi-LSTM to obtain a word embedding sequence. Second, a Conditional Layer Normalization (CLN) layer produces a feature representation matrix of the relations between words; this matrix is concatenated with a distance matrix and an upper-triangular matrix, and a multi-layer perceptron and dilated convolutions capture the interactions between words at different distances, yielding a grid representation of word pairs. Third, two classifiers, a multi-layer perceptron and a biaffine predictor, predict the relation between every two words, and the word-pair relations are then decoded to obtain the named entities in the text. However, the method uses only character-level information and ignores word information, and the dilated convolution loses information while enlarging the receptive field, which is unfavorable for global feature extraction and weakens the relations between some word pairs.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by integrating word-level information into the W²NER model through a method that fuses dictionary information, improving the boundary recognition accuracy of named entity recognition. At the same time, a Criss-Cross Attention module is used to capture the interactions between short-distance and long-distance words, and the attention mechanism further encodes the grid representation, solving the information loss problem of the dilated convolution in the original W²NER model.
The technical scheme adopted by the invention for solving the technical problems comprises the following steps:
Step (1): perform BERT word embedding and Bi-LSTM context fusion to obtain character features and their sequence;
Step (2): process the character feature sequence of step (1) to obtain a relation feature matrix G, a distance matrix E_d, a region matrix E_t and a vocabulary grid matrix E_w, and splice the features of the four matrices to obtain the grid relation feature;
Step (3): perform global feature fusion on the grid relation feature of step (2) through a global attention mechanism to obtain Word-Pair features fused with global information;
Step (4): perform joint prediction using the Word-Pair features obtained in step (3) and the character features obtained in step (1) to obtain a Word-Pair relation matrix;
Step (5): decode the Word-Pair relation matrix of step (4) to obtain the final entity words and their types.
The beneficial effects of the invention are as follows:
the invention provides a named entity recognition method integrating dictionary information and attention mechanism, which is characterized in that Word-level information is expressed as Word-Pair Word embedded relation matrix and is integrated into a model, so that the recognition accuracy of entity boundaries is improved, and simultaneously, the REcurrent Criss-cross recognition is applied to capturing short-distance and long-distance wordsAnd interaction between the two devices improves the extraction capability of global features. The result shows that the method has better robustness and self-adaption capability. Testing is carried out on a Resume Chinese data set, and the accuracy, recall rate and F1 value are respectively as follows: 97.01%, 96.56% and 96.78%. Testing is carried out on Conll03 English data sets, and the accuracy rate, recall rate and F1 value are respectively as follows: 92.88%, 93.59% and 0.9323%. Both results are compared with the original W 2 The NER model has good effect and belongs to the leading level in the field.
Drawings
FIG. 1 is a flow chart of an overall embodiment of the invention
FIG. 2 is a schematic diagram of a word embedding relationship matrix of the present invention
FIG. 3 is a schematic diagram of the Criss-Cross Attention structure of the present invention
FIG. 4 is a word pair relationship diagram of the present invention
FIG. 5 is a schematic view of the overall structure of the present invention
Detailed description of the preferred embodiments
The invention is further described below with reference to the accompanying drawings.
Referring to FIG. 1, a flowchart of an overall embodiment of the present invention, a named entity recognition method integrating dictionary information and an attention mechanism includes the following steps:
step (1): perform word embedding conversion on each character of the input sentence using BERT to obtain character-level embedded representations, then input them into a Bi-LSTM to fuse global context information, obtaining the final character features and their sequence;
step (2): use CLN to express the character feature sequence of step (1) as the word-pair grid relation feature matrix G; at the same time, calculate the distance between every two characters and express it as the grid matrix E_d, together with the upper/lower triangular region matrix E_t; use the dictionary to match the words in the input sentence, look up the embedded representations of these words in a pre-trained word embedding model, and construct the word embedding relation matrix E_w; then splice the features of the four grid matrices.
Step (3) inputting the grid relation features of the step (2) into a current Criss-cross section module, and fusing global features by using a global attention mechanism;
step (4): jointly predict the Word-Pair relation matrix using the Word-Pair features fused with global information from step (3) and the character features from step (1);
and (5): decode the Word-Pair relation matrix of step (4) to obtain the final entity words and their types.
Further, the specific implementation process of the step (1) is as follows:
performing word embedding conversion on each character of the input sentence using BERT to obtain character-level embedded representations; for an input sentence X = {x_1, x_2, ..., x_N} ∈ R^N, each character x_i is encoded with BERT and then context-encoded through a bidirectional LSTM, obtaining the character feature sequence H = {h_1, h_2, ..., h_N}.
For example: the input sentence "urethra, bladder, renal colic" contains 9 characters, and the feature representations of these 9 characters are obtained after BERT and Bi-LSTM.
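As an illustration of step (1), the following is a minimal sketch in PyTorch with the HuggingFace transformers library; the model name bert-base-chinese, the hidden sizes and the sample sentence (an assumed Chinese rendering of the running example) are illustrative choices, not specified by the invention.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class CharEncoder(nn.Module):
    def __init__(self, bert_name="bert-base-chinese", lstm_hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        # Bi-LSTM fuses global context over the BERT character embeddings
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)

    def forward(self, input_ids, attention_mask):
        # Chinese BERT tokenizes at character level, so each position is one character
        emb = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state
        h, _ = self.lstm(emb)   # character feature sequence H: (B, N, 2*lstm_hidden)
        return h

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
batch = tokenizer(["尿道、膀胱、肾绞痛"], return_tensors="pt")  # assumed sample sentence
H = CharEncoder()(batch["input_ids"], batch["attention_mask"])
```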
Further, the specific implementation process of the step (2) is as follows:
2-1. Process the character feature sequence H with CLN conditional normalization to obtain the word-pair relation feature matrix G, where each element G_ij is calculated as:

G_ij = CLN(h_i, h_j) = γ_ij ⊙ ((h_j - μ) / σ) + λ_ij

G_ij is the representation of h_j conditioned on h_i, where the gain γ_ij = W_α h_i + b_α and the bias λ_ij = W_β h_i + b_β are obtained by training, and μ and σ are the mean and standard deviation of the elements of h_j.
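A minimal sketch of the CLN computation above, vectorized over all word pairs; the module layout and dimension names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CLN(nn.Module):
    def __init__(self, d_h):
        super().__init__()
        self.gain = nn.Linear(d_h, d_h)   # gamma_ij = W_alpha h_i + b_alpha
        self.bias = nn.Linear(d_h, d_h)   # lambda_ij = W_beta h_i + b_beta

    def forward(self, h):
        # h: (N, d_h) character features -> G: (N, N, d_h), G[i, j] = CLN(h_i, h_j)
        hi = h.unsqueeze(1)               # (N, 1, d_h), the condition
        hj = h.unsqueeze(0)               # (1, N, d_h), the normalized input
        mu = hj.mean(dim=-1, keepdim=True)
        sigma = hj.std(dim=-1, keepdim=True)
        return self.gain(hi) * (hj - mu) / (sigma + 1e-6) + self.bias(hi)
```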
after this step, a characteristic representation between two characters in the sentence "urinary tract, bladder, renal colic" can be obtained.
2-2. Calculate the distance between characters and express it as the grid matrix E_d. For an input sentence X, the distance between two characters (x_i, x_j) is expressed as the absolute distance |i - j|, which is then passed through an embedding layer (a conventional linear neural network) to obtain a distributed representation of the distance.
For example: in the sentence "urinary tract, bladder, renal colic" the absolute distance between "urine" and "tract" is 1 and the distance between "tract" and "pain" is 7.
2-3. The region matrix E_t is generated manually: an N×N matrix is set up in which all values in the upper triangular region are 1 and all values in the lower triangular region are 2; it is then passed through an embedding layer to obtain a distributed representation of the upper and lower triangular regions.
Because the relation label types of the upper and lower triangular regions differ in the model's output Word-Pair relation matrix, the region matrix is provided so that the model can learn this difference, as sketched below.
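A minimal sketch of building E_d and E_t as described in 2-2 and 2-3; the embedding sizes, the cap on the embedded distance and the assignment of the diagonal to the upper region are illustrative assumptions.

```python
import torch
import torch.nn as nn

N, d_emb = 9, 20
dist_emb = nn.Embedding(512, d_emb)      # assumed cap on |i - j|
region_emb = nn.Embedding(3, d_emb)      # index 1 = upper region, 2 = lower region

idx = torch.arange(N)
dist = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs()        # |i - j|, shape (N, N)
E_d = dist_emb(dist)                                      # (N, N, d_emb)

region = torch.ones(N, N, dtype=torch.long).triu()        # upper (incl. diagonal) = 1
region += 2 * torch.ones(N, N, dtype=torch.long).tril(-1) # strict lower = 2
E_t = region_emb(region)                                  # (N, N, d_emb)
```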
2-4. Match the words in the input sentence against the dictionary: build a dictionary tree (trie) of all words in the dictionary, traverse the input sentence X, and match all possible words from the trie, including both continuous and discontinuous words. For a matched word of length m, the relation between every two adjacent characters inside the word is represented by the special relation NNW, and the cell linking the last character of the word back to its first character is represented by the word itself; the embedded representations of these words and of the special relation NNW are looked up in the pre-trained word embedding model and placed at the corresponding positions in the matrix. This constructs the word embedding relation matrix E_w.
For example: matching the sentence "urethra, bladder, renal colic" against the dictionary yields words such as "urethra", "bladder", "urethral pain", "bladder pain" and "renal colic"; embedding these words into the word embedding relation matrix as described above gives the matrix shown in FIG. 2.
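A minimal sketch of the trie construction and matching in 2-4, restricted to continuous words (discontinuous matching and real pre-trained word vectors are omitted); the placement of the word vector at [tail, head] and of NNW between adjacent in-word characters follows the description above, and the sample dictionary and sentence are assumed for illustration.

```python
import torch

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["#"] = w                      # end-of-word marker
    return root

def match_words(sentence, trie):
    """Yield (start, end, word) for every contiguous dictionary word."""
    for i in range(len(sentence)):
        node = trie
        for j in range(i, len(sentence)):
            node = node.get(sentence[j])
            if node is None:
                break
            if "#" in node:
                yield i, j, node["#"]

sentence = "尿道、膀胱、肾绞痛"              # assumed sample sentence
trie = build_trie(["尿道", "膀胱", "肾绞痛"])
d_w = 20
E_w = torch.zeros(len(sentence), len(sentence), d_w)
nnw_vec = torch.randn(d_w)                 # stand-in for the NNW relation embedding
for i, j, word in match_words(sentence, trie):
    E_w[j, i] = torch.randn(d_w)           # stand-in for the word's embedding, at [tail, head]
    for k in range(i, j):
        E_w[k, k + 1] = nnw_vec            # NNW between adjacent characters in the word
```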
2-5. Splice the features of the four grid matrices to obtain the grid relation feature C = [G; E_d; E_t; E_w].
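The splice in 2-5 is a plain channel-wise concatenation; a sketch with assumed shapes:

```python
import torch
N = 9
G, E_d = torch.randn(N, N, 512), torch.randn(N, N, 20)
E_t, E_w = torch.randn(N, N, 20), torch.randn(N, N, 20)
C = torch.cat([G, E_d, E_t, E_w], dim=-1)   # grid relation feature, (N, N, 572)
```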
Further, the specific implementation process of the step (3) is as follows:
inputting the grid relation feature C of step (2) into the Recurrent Criss-Cross Attention module and fusing global features using the global attention mechanism;
the implementation manner of the Criss-cross attribute is also based on an Attention mechanism, as shown in fig. 3, firstly, the output C of the backbone network is subjected to convolution operation of three convolution modules to obtain three matrixes, namely Q (query), K (key) and V (value), wherein Q refers to the query, the content K corresponding to the decoder refers to the key, the content V corresponding to the encoder refers to the value, and the content corresponding to the encoder. Wherein the method comprises the steps ofd' c Set to d c Then, Q and K are calculated by Affinity operation to generate an attention matrix A;
for Affinity operations: at each position u in Q, we can be at d' c The axis obtains a vector, and simultaneously, we can extract the vector in the same row and column with the position u from KThen the parameter of the i-th position is Ω i,u . For the Affinity calculation formula:
the generated D is activated by Softmax to obtain A epsilon R N×N×(N+N-1)
For the generated V, at each position u a vector set Φ_u of the vectors lying in the same row and column as u is likewise obtained along the d_c axis. Multiplying this vector set with the generated A completes the Aggregation operation, and finally the original input C is added to output the generated P':

P'_u = Σ_i A_{i,u} Φ_{i,u} + C_u
In order to make each position u attend to every other position, Criss-Cross Attention is calculated twice, i.e. Recurrent Criss-Cross Attention: it suffices to calculate Criss-Cross Attention again on P' and output P''. At this time:

P''_u = Σ_i A'_{i,u} Φ'_{i,u} + P'_u

where A' and Φ' are computed from P' in the same way.
This yields the final Word-Pair feature matrix P = P''.
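A minimal sketch of (Recurrent) Criss-Cross Attention over the N×N grid, following the formulation above; the channel reduction d'_c = d_c/8 and the masking of the duplicated center position are assumptions carried over from the original Criss-Cross Attention design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrissCrossAttention(nn.Module):
    def __init__(self, d_c):
        super().__init__()
        d_red = max(d_c // 8, 1)                      # assumed d'_c
        self.q = nn.Conv2d(d_c, d_red, 1)
        self.k = nn.Conv2d(d_c, d_red, 1)
        self.v = nn.Conv2d(d_c, d_c, 1)

    def forward(self, x):
        # x: (B, d_c, N, N); every position attends over its own row and column
        B, _, H, W = x.shape
        Q, K, V = self.q(x), self.k(x), self.v(x)
        row = torch.einsum("bcij,bcik->bijk", Q, K)   # Affinity along the row
        col = torch.einsum("bcij,bckj->bijk", Q, K)   # Affinity along the column
        # mask the column copy of position u itself so it is not counted twice,
        # matching the N+N-1 positions in the text
        eye = torch.eye(H, dtype=torch.bool, device=x.device).view(1, H, 1, H)
        col = col.masked_fill(eye, float("-inf"))
        A = F.softmax(torch.cat([row, col], dim=-1), dim=-1)
        A_row, A_col = A[..., :W], A[..., W:]
        out = (torch.einsum("bijk,bcik->bcij", A_row, V) +
               torch.einsum("bijk,bckj->bcij", A_col, V))
        return out + x                                # Aggregation plus the input

class RecurrentCCA(nn.Module):
    """Two passes let every grid cell attend (indirectly) to the whole grid."""
    def __init__(self, d_c, loops=2):
        super().__init__()
        self.cca = CrissCrossAttention(d_c)
        self.loops = loops

    def forward(self, x):
        for _ in range(self.loops):
            x = self.cca(x)
        return x
```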
Further, the specific implementation process of the step (4) is as follows:
and (3) predicting the Word-Pair relationship type by utilizing the Word-Pair characteristics fused with the global information in the step (3) and the character characteristics in the step (1).
Specifically, for the Word-Pair feature matrix P, a multi-layer perceptron (MLP) is used to predict the relation between each pair of words. This MLP predictor, operating on the word-pair grid representation, is enhanced by cooperating with a biaffine predictor for relation classification. Both predictors are therefore used to calculate relation scores for each word pair (x_i, x_j), which are combined into the final prediction.
The biaffine predictor performs relation classification on the output H of step (1), which can be regarded as a residual connection; by preventing model degradation and relieving gradient explosion and vanishing, it makes the model train better. Given the word representations H, two MLPs are used to calculate the subject (x_i) and object (x_j) word representations s_i and o_j respectively. A biaffine classifier then calculates the relation score y'_ij between a pair of subject and object words (x_i, x_j):
s_i = MLP_1(h_i)
o_j = MLP_2(h_j)
y'_ij = s_i^T U o_j + W [s_i ; o_j] + b

where U, W and b are trainable parameters.
In addition, the Word-Pair feature matrix P is input into a multi-layer perceptron to calculate the relation score y''_ij between a pair of subject and object words (x_i, x_j):

y''_ij = MLP(P_ij)
Finally, adding the relation score of the MLP layer and the Biaffine relation score, and taking the label with the highest score as the final result of the joint prediction through Softmax:
y_ij = Softmax(y'_ij + y''_ij)
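A minimal sketch of this joint prediction, combining the biaffine scorer over H with the grid MLP over P; the biaffine parameterization and all dimensions are assumptions consistent with the formulas above.

```python
import torch
import torch.nn as nn

class JointPredictor(nn.Module):
    def __init__(self, d_h, d_c, d_biaf, n_labels):
        super().__init__()
        self.subj = nn.Sequential(nn.Linear(d_h, d_biaf), nn.GELU())  # MLP_1
        self.obj = nn.Sequential(nn.Linear(d_h, d_biaf), nn.GELU())   # MLP_2
        # biaffine: y'_ij = s_i^T U o_j + W [s_i ; o_j] + b
        self.U = nn.Parameter(torch.randn(n_labels, d_biaf, d_biaf))
        self.W = nn.Linear(2 * d_biaf, n_labels)
        self.grid_mlp = nn.Linear(d_c, n_labels)                      # y''_ij = MLP(P_ij)

    def forward(self, H, P):
        # H: (N, d_h) character features; P: (N, N, d_c) Word-Pair features
        s, o = self.subj(H), self.obj(H)
        bilinear = torch.einsum("id,rde,je->ijr", s, self.U, o)
        pair = torch.cat([s.unsqueeze(1).expand(-1, o.size(0), -1),
                          o.unsqueeze(0).expand(s.size(0), -1, -1)], dim=-1)
        y1 = bilinear + self.W(pair)          # biaffine score y'_ij
        y2 = self.grid_mlp(P)                 # grid MLP score y''_ij
        return y1 + y2                        # logits; Softmax/argmax applied outside
```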
a Word-Pair relationship matrix is obtained as shown in fig. 4.
Finally, standard cross entropy is used as the loss function to train the model, with the formula:

L = -(1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{r∈R} ŷ_ij^r log y_ij^r

where ŷ_ij^r is the binary indicator of the correct label of word pair (x_i, x_j), y_ij^r is the predicted probability of label r, and R is the predefined set of all relation labels.
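As a sketch, the same loss in PyTorch over the grid of logits, with assumed shapes:

```python
import torch
import torch.nn.functional as F

N, n_labels = 9, 10
logits = torch.randn(N, N, n_labels)        # y'_ij + y''_ij from the joint predictor
gold = torch.randint(0, n_labels, (N, N))   # gold relation label id for each cell
loss = F.cross_entropy(logits.view(-1, n_labels), gold.view(-1))  # mean over N*N cells
```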
Further, the specific implementation process of the step (5) is as follows:
and (3) decoding the Word-Pair relationship in the step (4) to obtain a final entity Word and the type thereof. The relation R of all word pairs is used as input, and the decoding goal is to find all the entity word index sequences and their corresponding categories. We construct a graph in which nodes are words and edges are NNW relationships. Then we find all paths from head Word to tail Word by depth first search algorithm, namely Word index sequence of corresponding entity, then take head index i and tail index j of the entity, and find THW-relation from Word-Pair grid [ j, i ] position, namely category of the entity.
This embodiment is evaluated on two named entity recognition datasets, Resume and Conll-2003. Resume contains 4759 sentences, with entity labels divided into 8 categories including name, nationality, ethnicity, profession, education, organization and job title; Conll-2003 contains 24197 sentences, with entity labels divided into 4 categories: person, organization, location and miscellaneous. After training and prediction with the above model, the following results were obtained:
the result shows that the method has better robustness and self-adaption capability. Testing is carried out on a Resume Chinese data set, and the accuracy, recall rate and F1 value are respectively as follows: 97.01%, 96.56% and 96.78%. Testing is carried out on Conll03 English data sets, and the accuracy rate, recall rate and F1 value are respectively as follows: 92.88%, 93.59% and 0.9323%. Both results are compared with the original W 2 The NER model has good effect and belongs to the leading level in the field.

Claims (4)

1. A named entity recognition method integrating dictionary information and attention mechanisms is characterized by comprising the following steps:
step (1): performing BERT word embedding and Bi-LSTM context fusion to obtain character features and their sequence;
step (2): obtaining a relation feature matrix G, a distance matrix E_d, a region matrix E_t and a vocabulary grid matrix E_w by processing the character feature sequence of step (1), and splicing the features of the four matrices to obtain the grid relation feature,
the acquisition method of the relation characteristic matrix G comprises the following steps:
processing the character feature sequence H by using CLN conditional normalization to obtain the word-pair relation feature matrix G, wherein each G_ij is calculated as:

G_ij = CLN(h_i, h_j) = γ_ij ⊙ ((h_j - μ) / σ) + λ_ij

wherein G_ij is the representation of h_j conditioned on h_i, the gain γ_ij = W_α h_i + b_α and the bias λ_ij = W_β h_i + b_β are obtained by training, and μ and σ are the mean and standard deviation of h_j;
the distance matrix E d The acquisition method of (1):
calculating the distance between each character feature to obtain grid matrixFor an input sentence X, two words (X i ,x j ) The distance between the two words is expressed as absolute distance I-j I, and then the two words pass through an embedding layer to obtain a grid matrix of distributed expression of the distance>
The region matrix E_t is generated manually: an N×N matrix is set up with all values in the upper triangular region set to 1 and all values in the lower triangular region set to 2, and passed through an embedding layer to obtain the region matrix E_t of distributed representations of the upper and lower triangular regions;
The vocabulary grid matrix E_w is obtained as follows:
matching the words in the input sentence using the dictionary: building a dictionary tree of all words in the dictionary, traversing the input sentence X, and matching all possible words from the dictionary tree, including continuous and discontinuous words, thereby constructing the word embedding relation matrix E_w;
Step (3) carrying out global feature fusion on the grid relation features in the step (2) through a global attention mechanism to obtain Word-Pair features fused with global information,
specifically, the grid relation feature C of the step is input to a current Criss-cross section module, and global features are fused by using a global attention mechanism;
the implementation mode of Criss-cross Attention is also based on an Attention mechanism, firstly, the output C of a main network is subjected to convolution operation of three convolution modules to respectively obtain three matrixes Q, K and V, whereind' c Set to d c Then, Q and K are calculated by Affinity operation to generate an attention matrix A;
for Affinity operations: each position u in Q can be at d' c The axis obtains a vector, and simultaneously, we can extract the vector in the same row and column with the position u from KThen the parameter of the i-th position is Ω i,u For the Affinity calculation formula:
the generated D is activated by Softmax to obtain A epsilon R N×N×(N+N-1)
For the generated V, at each position u a vector set Φ_u of the vectors lying in the same row and column as u is likewise obtained along the d_c axis; multiplying this vector set with the generated A completes the Aggregation operation, and finally the original input C is added to output the generated P':

P'_u = Σ_i A_{i,u} Φ_{i,u} + C_u
In order to make each position u attend to every other position, Criss-Cross Attention is calculated twice, i.e. Criss-Cross Attention is calculated again on P', and the output P'' is:

P''_u = Σ_i A'_{i,u} Φ'_{i,u} + P'_u

wherein A' and Φ' are computed from P';
obtaining the final Word-Pair feature matrix P = P'';
Step (4) performs joint prediction by utilizing the Word-Pair characteristics obtained in the step (3) and the character characteristics obtained in the step (1) to obtain a Word-Pair relation matrix,
the specific implementation process is as follows:
for the Word-Pair feature matrix P, the relation between each pair of words is predicted using a multi-layer perceptron; the MLP predictor is enhanced by cooperating with a biaffine predictor for relation classification, and the two predictors are used to calculate relation scores for each word pair (x_i, x_j), which are combined into a final prediction,
wherein the biaffine predictor performs relation classification on the output H of step (1), regarded as a residual connection; given the word representations H, two MLPs are used to calculate the subject (x_i) and object (x_j) word representations s_i and o_j respectively, and then the relation score y'_ij between a pair of subject and object words (x_i, x_j) is calculated:
s_i = MLP_1(h_i)
o_j = MLP_2(h_j)
y'_ij = s_i^T U o_j + W [s_i ; o_j] + b

wherein U, W and b are trainable parameters;
in addition, the Word-Pair feature matrix P is input into a multi-layer perceptron to calculate the relation score y''_ij between a pair of subject and object words (x_i, x_j):

y''_ij = MLP(P_ij)
Finally, adding the relation score of the MLP layer and the Biaffine relation score, and taking the label with the highest score as the final result of the joint prediction through Softmax:
y_ij = Softmax(y'_ij + y''_ij)
obtaining a Word-Pair relation matrix;
finally, a standard cross entropy is used as a loss function training model, and the formula is as follows:
L = -(1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{r∈R} ŷ_ij^r log y_ij^r

wherein ŷ_ij^r is the binary indicator of the correct label, y_ij^r is the predicted probability of label r, and R is the predefined set of all relation labels;
and (5) decoding the Word-Pair relation matrix in the step (4) to obtain a final entity Word and the type thereof.
2. The named entity recognition method integrating dictionary information and attention mechanisms according to claim 1, wherein the specific implementation process of the step (1) is as follows:
using BERT to perform word embedding conversion on each character of the input sentence to obtain character-level embedded representations; for an input sentence X = {x_1, x_2, ..., x_N} ∈ R^N, each character x_i is encoded with BERT and then context-encoded through a bidirectional LSTM to obtain the character feature sequence H = {h_1, h_2, ..., h_N}.
3. The named entity recognition method integrating dictionary information and attention mechanisms according to claim 1, wherein the grid relation feature is obtained by splicing the features of the relation feature matrix G, the distance matrix E_d, the region matrix E_t and the vocabulary grid matrix E_w.
4. The named entity recognition method integrating dictionary information and attention mechanisms according to claim 1, wherein the specific implementation process of the step (5) is as follows:
the relations R of all word pairs are taken as input, and the goal of decoding is to find all entity word index sequences and their corresponding categories; a graph is constructed whose nodes are words and whose edges are NNW relations; all paths from a head word to a tail word are found using a depth-first search algorithm, each path being the word index sequence of a corresponding entity; then the head index i and tail index j of the entity are taken, and the THW relation, i.e. the category of the entity, is found at position [j, i] of the Word-Pair grid.
CN202211546653.0A 2022-12-05 2022-12-05 Named entity recognition method integrating dictionary information and attention mechanism Active CN115809666B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211546653.0A CN115809666B (en) 2022-12-05 2022-12-05 Named entity recognition method integrating dictionary information and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211546653.0A CN115809666B (en) 2022-12-05 2022-12-05 Named entity recognition method integrating dictionary information and attention mechanism

Publications (2)

Publication Number Publication Date
CN115809666A CN115809666A (en) 2023-03-17
CN115809666B true CN115809666B (en) 2023-08-08

Family

ID=85484917

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211546653.0A Active CN115809666B (en) 2022-12-05 2022-12-05 Named entity recognition method integrating dictionary information and attention mechanism

Country Status (1)

Country Link
CN (1) CN115809666B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116306657B (en) * 2023-05-19 2023-08-22 之江实验室 Entity extraction method and system based on square matrix labeling and double affine layers attention


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871955A (en) * 2019-01-22 2019-06-11 中国民航大学 A kind of aviation safety accident causality abstracting method
CN109840331A (en) * 2019-01-31 2019-06-04 沈阳雅译网络技术有限公司 A kind of neural machine translation method based on user-oriented dictionary
CN114638228A (en) * 2022-03-14 2022-06-17 南京航空航天大学 Chinese named entity recognition method based on word set self-attention

Also Published As

Publication number Publication date
CN115809666A (en) 2023-03-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant