CN115600597A - Named entity recognition method, device and system based on attention mechanism and intra-word semantic fusion, and storage medium


Publication number
CN115600597A
Authority: CN (China)
Prior art keywords: word, sub, semantic, words, features
Legal status: Pending
Application number: CN202211271734.4A
Other languages: Chinese (zh)
Inventors
王媛媛
胡荣林
董甜甜
邱军林
曹昆
郭俊莹
张海艳
冯万利
王忆雯
Current Assignee: Huaiyin Institute of Technology
Original Assignee: Huaiyin Institute of Technology
Filing date: 2022-10-18
Publication date: 2023-01-13
Application filed by Huaiyin Institute of Technology
Priority to CN202211271734.4A
Publication of CN115600597A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G06F 40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a named entity recognition method, device, system and storage medium based on an attention mechanism and intra-word semantic fusion. The method comprises the following steps: S1, inputting a sentence sequence into a sub-word adapter to match sub-word embedding information; S2, inputting the matched sub-word embedding information into a CNN semantic network to extract the internal semantic features of the sub-words; S3, obtaining a character-level text representation with the CHINESE-BERT model and feeding it into a Bi-LSTM network to learn the global context features of the sentence; S4, inputting the obtained sub-word internal semantic features and the global context features into a WordFusionAttention module to extract the key context features after fusing the internal features of words; and S5, inputting the fused key context features into a CRF decoder to predict the entity labels. Compared with the prior art, the method effectively improves named entity recognition accuracy and alleviates the difficulty of recognizing out-of-vocabulary words.

Description

Named entity recognition method, device and system based on attention mechanism and intra-word semantic fusion, and storage medium
Technical Field
The invention relates to Chinese named entity recognition in computer natural language processing, and in particular to a named entity recognition method, device, system and storage medium based on an attention mechanism and intra-word semantic fusion.
Background
Named entity recognition is a popular research direction in the field of natural language processing. The technology has achieved competitive results in the general domain, but its application to Chinese text still faces significant problems and challenges.
With the wide application of deep learning, models based on deep neural networks have been widely adopted for the named entity recognition task. These methods take the character or word vectors of a sentence as input and use a deep neural network to extract text features, such as the Bi-LSTM + CRF model proposed by Huang et al. (Huang Z., Xu W., Yu K. Bidirectional LSTM-CRF Models for Sequence Tagging. Computer Science, 2015). Word ambiguity, which occurs frequently in Chinese text, degraded recognition performance until BERT pre-trained models appeared; BERT represents the semantic information of a sentence well, effectively improves model accuracy, and thus alleviates the ambiguity problem. However, these methods are trained on large corpora and large amounts of labeled data. At present there is little research, at home or abroad, on recognizing Chinese text entities in specialized fields such as chemistry, and it is difficult to improve recognition performance when only a small number of samples are available, mainly for the following two reasons:
First, existing general-purpose named entity recognition based on pre-trained models uses large-scale training corpora containing knowledge from many domains, so the models pay little attention to the specialized vocabulary of a specific domain, and the imbalance of samples across domains leads to poor recognition of entity types by general models in specialized domains.
Second, entities in specialized domains differ from general entities: specialized vocabularies have complex naming rules and frequently produce new words, and existing sequence-labeling models ignore the internal composition of important words in a sentence and struggle to recognize out-of-vocabulary words.
Disclosure of Invention
Purpose of the invention: aiming at the problems identified in the background art, the invention discloses a named entity recognition method, device, system and storage medium based on an attention mechanism and intra-word semantic fusion.
The technical scheme is as follows: the named entity recognition method based on an attention mechanism and intra-word semantic fusion of the invention comprises the following steps:
S1, segmenting the text data to obtain a text sequence in units of sentences;
S2, inputting the text sequence obtained in step S1 into a sub-word adapter to obtain sub-word representations of the words in the text;
S3, learning the local semantic information in the sub-word representations obtained in step S2 with a CNN semantic network and extracting the internal features of the sub-words;
S4, extracting a character-level text representation of the text sequence of step S1 with the CHINESE-BERT model;
S5, learning long-distance context information from the character-level text representation of step S4 with a Bi-LSTM and extracting global context semantic features;
S6, inputting the sub-word internal features of step S3 and the global context semantic features of step S5 into a WordFusionAttention module to obtain key context features;
and S7, inputting the key context features obtained in step S6 into a CRF decoder, learning the internal feature constraints of the text, and obtaining the entity recognition labels.
Further, the sub-word adapter in step S2 matches words present in a lexicon against the text and concatenates the sub-word representations beginning at the same character to form the sub-word representations of the text, specifically comprising the following steps:
S2.1, building the lexicon into a dictionary tree T, where each node of the dictionary tree stores one Chinese character of a word: the root node stores the first character of a word, the back pointer of a node points to the next character of the word, and the front pointer of a node points to the previous character;
S2.2, traversing each character of the text sequence obtained in step S1 and searching the dictionary tree T of step S2.1, taking each character of the input sequence S as the start of a word, to obtain the word set W_{i,j}, i ∈ [1, n], j ∈ [1, l], corresponding to each character, where W_{i,j} denotes the j-th matched word beginning with the i-th character of the sentence and l denotes the number of matched words beginning with the i-th character; a none value is written into the set of characters with no matched words;
S2.3, inputting the sub-words of the word set obtained in step S2.2 into the CHINESE-BERT model to obtain a sub-word matrix CW_i^j based on character embedding; padding and zero-matrix splicing are applied to the sub-word matrix of each character to obtain the three-dimensional sub-word embedding spatial information W_i, whose first dimension is the number of sub-words corresponding to each character, second dimension is the length of each sub-word, and third dimension is the vector dimension of the characters in the sub-word:

CW_i^j = e(W_{i,j})
W_i = Ex(CW_i^j)

where e(·) denotes the loaded CHINESE-BERT model and Ex(·) denotes the padding and splicing operation.
Further, the CNN semantic network in step S3 comprises two convolutional layers and one pooling layer: the convolution kernel size of the first convolutional layer is 9 × 9; the convolution kernel size of the second convolutional layer is 3 × 3; a max-pooling operation with a window of 1 × 3 follows the first convolutional layer. The sub-word embedding vector is input into the first convolutional layer to obtain the shallow semantic features inside words; the shallow semantic feature vector is down-sampled by the pooling layer to obtain a semantic feature vector; and the deep semantic features inside words are extracted from the semantic feature vector by the second convolutional layer.
Further, in step S6 the WordFusionAttention module computes the similarity between the sentence context and the sub-word features by a dot-product operation to weight the sentence dynamically, specifically comprising the following steps:
Step 4.1, the global context feature X is passed through two linear transformations to obtain two feature matrices K and V:

K(X), V(X) = x^T·E_k, x^T·W_v

Step 4.2, the similarity is computed by a dot product of the feature matrix K and the sub-word feature q, scaled by the factor μ and normalized with the tanh() function:

H(K, q) = tanh(μ·K·q)

Step 4.3, the transformed global context features are re-weighted by the similarity:

Att = softmax(H(K, q))·V
V̂ = Att

where W_v and E_k are the weight matrices to be learned, μ is the scaling factor, x denotes the input global context features, q denotes the sub-word internal feature vector, and V̂ denotes the context feature vector after fusing the sub-word internal features.
Further, in step S7 the CRF decoder predicts the entity label type by computing the label transition features and extracting the relationship features between entity combinations and labels.
The invention also discloses a named entity recognition system based on an attention mechanism and intra-word semantic fusion, comprising the following modules:
a data preprocessing module, which segments the text to obtain the sentence sequence required as model input;
an embedding module, which obtains the character-embedded sentence sequence, matches the words in the sentences, and obtains the sub-word embedding vectors;
an encoding module, which extracts context features from the character-embedded sentence sequence and extracts the internal semantic features of the sub-words;
a WordFusionAttention module, which dynamically fuses the sub-word internal semantic features into the context features of the character-embedded sentence sequence through an improved attention mechanism, enriching the sentence features and helping the model understand the semantic information of words;
and a decoding module, which extracts the label transition features and predicts the labels.
The invention also discloses a named entity recognition device based on an attention mechanism and intra-word semantic fusion, comprising a memory and a processor, wherein:
the memory stores a computer program capable of running on the processor;
and the processor, when running the computer program, executes the steps of the named entity recognition method based on an attention mechanism and intra-word semantic fusion described above.
The invention also discloses a storage medium on which a computer program is stored; when the computer program is executed by at least one processor, the steps of the named entity recognition method based on an attention mechanism and intra-word semantic fusion described above are implemented.
Beneficial effects:
The method matches the specialized-word information of an input sentence against a specialized lexicon, obtains character-vector-based sub-word representations through the large-scale pre-trained CHINESE-BERT model, obtains high-dimensional sub-word embedding spatial information through splicing and padding, learns the intra-word semantic features of the multiple sub-words corresponding to each character of the sentence through a CNN semantic network, and fuses the intra-word semantic features into the character-vector-based context features through an attention mechanism.
Drawings
FIG. 1 is a flow chart of the entity recognition task of the invention;
FIG. 2 is a block diagram of a CNN semantic network;
FIG. 3 is a block diagram of a WordFusionAttention module;
FIG. 4 is a model diagram of the named entity recognition method of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. The following embodiments merely illustrate the technical concepts and features of the invention; they are intended to enable those skilled in the art to understand and implement the invention, not to limit its scope. All equivalents and modifications made according to the spirit of the invention are intended to fall within its scope.
As shown in FIG. 1, the named entity recognition method based on an attention mechanism and intra-word semantic fusion of the invention comprises the following steps:
S1, inputting the text data into a preprocessing module to obtain an input sequence in units of sentences, S = {x_1, x_2, ..., x_n}, where x_i denotes the i-th character of the sentence.
In this embodiment, so that the model input has a fixed dimension, the input sequence S is tail-padded with 0 to a sentence length of 50.
S2, inputting the input sequence S of step S1 into the sub-word adapter for dynamic lookup to obtain the sub-word embedding vectors.
In the embodiment of the invention, the sub-word adapter operates in the following steps:
S2.1, building a dictionary tree T from the specialized dictionary of step S1.
S2.2, searching the dictionary tree T of step S2.1 for the words corresponding to each character of the input sequence S to obtain the word set W_{i,j}, i ∈ [1, n], j ∈ [1, l], corresponding to each character, where W_{i,j} denotes the j-th matched word beginning with the i-th character of the sentence and l denotes the number of matched words beginning with the i-th character; a none value is written into the set of characters with no matched words.
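The dictionary-tree lookup of steps S2.1 and S2.2 can be illustrated with a minimal Python sketch. The lexicon words and the sentence below are illustrative placeholders, and the trie uses a plain child map rather than the front/back pointers described above; this is a sketch of the matching idea, not the patented implementation.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> TrieNode
        self.is_word = False  # marks the end of a lexicon word

class Trie:
    def __init__(self, lexicon):
        self.root = TrieNode()
        for word in lexicon:
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True

    def match_from(self, sentence, i):
        """Return every lexicon word that starts at character i of the sentence."""
        matches, node = [], self.root
        for j in range(i, len(sentence)):
            node = node.children.get(sentence[j])
            if node is None:
                break
            if node.is_word:
                matches.append(sentence[i:j + 1])
        return matches

trie = Trie(["聚乙烯", "乙烯", "乙烯基"])   # toy specialized lexicon
sentence = "聚乙烯是一种材料"
# word set W_{i,j} per character; a none value where nothing matches
word_sets = [trie.match_from(sentence, i) or [None] for i in range(len(sentence))]
print(word_sets)  # [['聚乙烯'], ['乙烯'], [None], ...]
```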
S2.3, inputting the sub-words of the word set obtained in step S2.2 into the CHINESE-BERT model to obtain the embedding vector CW_i^j of each sub-word; then tail-padding with 0 and zero-matrix splicing are applied to the embedding vector of each sub-word to obtain the sub-word embedding vector W_i corresponding to the final character:

CW_i^j = e(W_{i,j})
W_i = Ex(CW_i^j)

where e(·) denotes the loaded CHINESE-BERT model; CW_i^j has dimension l_{i,j} × 768, with l_{i,j} the length of the j-th sub-word corresponding to the i-th character; Ex(·) denotes the splicing operation; and W_i has dimension 16 × 32 × 768.
In the embodiment of the invention, so that the sub-word embedding vectors W_i of all characters have the same dimensions, the maximum length of a single word is set to 32 and the maximum number of sub-words per character is set to 16.
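The padding and zero-matrix splicing of step S2.3 can be sketched as follows, assuming the per-sub-word CHINESE-BERT embeddings have already been computed; the limits 16 and 32 and the hidden size 768 follow the embodiment above.

```python
import torch

MAX_SUBWORDS, MAX_LEN, HIDDEN = 16, 32, 768  # limits from the embodiment

def build_subword_tensor(subword_embeddings):
    """subword_embeddings: list of tensors CW_i^j, each of shape (sub-word length, 768)."""
    W_i = torch.zeros(MAX_SUBWORDS, MAX_LEN, HIDDEN)   # zero matrices fill unused slots
    for j, cw in enumerate(subword_embeddings[:MAX_SUBWORDS]):
        length = min(cw.size(0), MAX_LEN)
        W_i[j, :length] = cw[:length]                  # tail-pad each sub-word with 0
    return W_i

# toy input: two matched sub-words of lengths 3 and 2
W_i = build_subword_tensor([torch.randn(3, HIDDEN), torch.randn(2, HIDDEN)])
print(W_i.shape)  # torch.Size([16, 32, 768])
```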
S3, inputting the word sequence obtained in step S2 into the CNN semantic network to obtain the sub-word internal feature vector V_W.
Further, the CNN module in step S3 comprises two convolutional layers and one pooling layer. The first convolutional layer has 7 convolution kernels of size 16 × 9 × 9, with padding 0 and stride 2; the second convolutional layer has 7 convolution kernels of size 1 × 3 × 3, with padding 1 and stride 2; a max-pooling operation with window 1 × 3 and stride 1 follows the first convolutional layer. The sub-word embedding vector is input into the first convolutional layer to obtain the shallow semantic features F1 inside words; the shallow semantic feature vector is down-sampled by the pooling layer to obtain the semantic feature vector F2; and the deep semantic features V_W, of dimension 1 × 6 × 32, are extracted from F2 by the second convolutional layer. The specific steps are:
S3.1, the sub-word embedding is first input into the 9 × 9 convolutional layer to extract the shallow semantic features F1 of the relations between the characters within a sub-word:

F1 = k_i · x

where k_i is the i-th convolution kernel parameter and x denotes the input value.
S3.2, the key semantic features F2 of the characters within the sub-words are extracted from the shallow semantic features F1 by a pooling operation with window 1 × 3.
S3.3, the deep semantic features F3 within the sub-words are extracted from the key semantic features F2 by the 3 × 3 convolutional layer and then passed through an activation function, whose formula appears in the original only as an image, with x the input feature and ε the impact factor.
FIG. 2 is a block diagram of the intra-word (CNN) semantic network, which comprises two convolutional layers and one max-pooling layer. The sub-word embedding vector of dimension 16 × 32 × 768 first passes through the 9 × 9 convolutional layer, extracting 7 × 12 × 130 shallow semantic features of the specialized words; the dimension is then reduced by the max-pooling operation to obtain 7 × 12 × 64 low-dimensional features; finally, a standard convolution with one 3 × 3 kernel yields the 1 × 6 × 32 deep semantic features within the specialized words.
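A minimal PyTorch sketch of this two-convolution network is given below. The kernel sizes (9 × 9 convolution, 1 × 3 max pooling, 3 × 3 convolution) follow the text; the strides, paddings and the activation are assumptions, since the quoted feature-map sizes cannot all be reproduced from the stated hyperparameters and the activation formula appears only as an image, so the printed shape is illustrative rather than the 1 × 6 × 32 of the embodiment.

```python
import torch
import torch.nn as nn

class CNNSemanticNet(nn.Module):
    def __init__(self, in_channels=16):  # 16 sub-word slots treated as input channels
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 7, kernel_size=9, stride=2)   # shallow features F1
        self.pool = nn.MaxPool2d(kernel_size=(1, 3), stride=1)            # key features F2
        self.conv2 = nn.Conv2d(7, 1, kernel_size=3, stride=2, padding=1)  # deep features F3
        self.act = nn.ReLU()  # stand-in: the patent's activation is given only as an image

    def forward(self, w):     # w: (batch, 16, 32, 768) sub-word embedding W_i
        f1 = self.conv1(w)
        f2 = self.pool(f1)
        return self.act(self.conv2(f2))

net = CNNSemanticNet()
v_w = net(torch.randn(1, 16, 32, 768))
print(v_w.shape)  # deep intra-word semantic feature map
```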
S4, inputting the input sequence S obtained in step S1 into the CHINESE-BERT model to obtain the character-level sentence sequence vector S_c, of dimension 1 × 50 × 768.
S5, inputting the sentence sequence vector S_c of step S4 into the Bi-LSTM network to obtain the character-level global context features V_C.
In the embodiment of the invention, the Bi-LSTM network consists of a forward LSTM and a backward LSTM, capturing context features from left to right and from right to left respectively, so as to better capture the global context information of the sentence. The LSTM network comprises input-gate, forget-gate and output-gate mechanisms.
The input gate is defined as:

i_t = σ([h_{t-1}, s_t]·W_i + b_i)
C̃_t = tanh([h_{t-1}, s_t]·W_c + b_c)

The forget gate is defined as:

f_t = σ([h_{t-1}, s_t]·W_f + b_f)

The output gate is defined as:

O_t = σ([h_{t-1}, s_t]·W_o + b_o)
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
h_t = O_t ⊙ tanh(C_t)

where ⊙ denotes element-wise multiplication of vectors.

The sentence sequence vector S_c is fed in both the forward and the reverse direction; the two different hidden-layer representations obtained are spliced as the output of the hidden layer:

h_t^fw = LSTM_fw(s_t, h_{t-1}^fw)
h_t^bw = LSTM_bw(s_t, h_{t+1}^bw)
h_t = [h_t^fw ; h_t^bw]

Through the last hidden layer, the global context features V_C = {h_1, h_2, ..., h_n} are obtained, with dimension 1 × 6 × 32.
S6, inputting the sub-word internal feature vector V_W obtained in step S3 and the character-level global context features V_C obtained in step S5 into the WordFusionAttention module to obtain the final context feature vector V_S of the text sentence.
Further, the WordFusionAttention module in step S6 consists of an improved dot-product attention mechanism that dynamically adjusts the global context features V_C of step S5 in combination with the semantic information inside the sub-words, as follows:
Step 6.1, the global context feature X is passed through two linear transformations to obtain two feature matrices K and V:

K(X), V(X) = x^T·E_k, x^T·W_v

Step 6.2, the similarity is computed by a dot product of the feature matrix K and the sub-word feature q, scaled by the factor μ and normalized with the tanh() function:

H(K, q) = tanh(μ·K·q)

Step 6.3, the transformed global context features are re-weighted by the similarity:

Att = softmax(H(K, q))·V
V̂ = Att

where W_v and E_k are the weight matrices to be learned, μ is the scaling factor, x denotes the input global context features, q denotes the sub-word internal feature vector, and V̂ denotes the context feature vector after fusing the sub-word internal features.
FIG. 3 is a block diagram of the WordFusionAttention module. The E_k and W_v matrices spatially transform the global context features, and the similarity between the sentence's global context features and the sub-word internal features is then computed by dot product; this enhances the local features of specialized words and fuses their boundary information.
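A sketch of the WordFusionAttention computation under the formulas above is given below. The exact shapes of the context features x and the sub-word feature q are not fully specified in the text, so the dimensions here, and the constant value chosen for μ, are assumptions.

```python
import torch
import torch.nn as nn

class WordFusionAttention(nn.Module):
    def __init__(self, dim, mu=0.1):
        super().__init__()
        self.E_k = nn.Linear(dim, dim, bias=False)  # learned weight E_k
        self.W_v = nn.Linear(dim, dim, bias=False)  # learned weight W_v
        self.mu = mu                                # scaling factor (assumed constant)

    def forward(self, x, q):
        # x: (batch, seq_len, dim) global context; q: (batch, dim) sub-word feature
        k, v = self.E_k(x), self.W_v(x)                    # two linear transformations
        h = torch.tanh(self.mu * (k @ q.unsqueeze(-1)))    # H(K, q): (batch, seq_len, 1)
        att = torch.softmax(h, dim=1)                      # similarity weights
        return att * v                                     # re-weighted context features

wfa = WordFusionAttention(dim=256)
fused = wfa(torch.randn(2, 50, 256), torch.randn(2, 256))
print(fused.shape)  # (2, 50, 256) fused context feature vector
```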
S7, inputting the final context feature vector obtained in step S6 into the CRF decoder to learn the feature constraints inside the sentence and output the entity label sequence.
Further, the CRF decoder in step S7 comprises a label transition matrix, a scoring function, and a loss function. The label transition matrix M is a trained weight, and the score of a predicted label sequence l for the input sentence S is:

score(S, l) = Σ_i M_{l_{i-1}, l_i} + Σ_i P_{i, l_i}

Normalizing over all possible label sequences gives the probability of a predicted sequence l; the objective function is:

p(l | S) = exp(score(S, l)) / Σ_{l'} exp(score(S, l'))

The loss function is:

log p(l | S) = score(S, l) - log Σ_{l'} exp(score(S, l'))

The final prediction is:

l* = argmax_{l'} score(S, l')
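The CRF scoring and loss above can be sketched as follows: score(S, l) sums the emission scores P and the label-transition scores M, and the negative log-likelihood is computed with the forward algorithm. Tag counts and inputs are illustrative; the start/end transitions and the Viterbi decoding of the final prediction are omitted for brevity.

```python
import torch

def crf_score(emissions, transitions, tags):
    """score(S, l): emissions (T, num_tags), transitions M (num_tags, num_tags), tags (T,)."""
    score = emissions[0, tags[0]]
    for t in range(1, tags.size(0)):
        score = score + transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    return score

def crf_log_partition(emissions, transitions):
    """log of the sum over all tag sequences, via the forward algorithm."""
    alpha = emissions[0]  # (num_tags,)
    for t in range(1, emissions.size(0)):
        # alpha_i + M_{i,j}, summed out over the previous tag i in log space
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[t]
    return torch.logsumexp(alpha, dim=0)

T, num_tags = 50, 9                     # e.g. BIO tags over 4 entity types (assumed)
emissions = torch.randn(T, num_tags)    # scores derived from the fused features
transitions = torch.randn(num_tags, num_tags)
tags = torch.randint(num_tags, (T,))
nll = crf_log_partition(emissions, transitions) - crf_score(emissions, transitions, tags)
print(nll)  # negative log-likelihood, i.e. -log p(l|S)
```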
the invention also discloses a named entity recognition system based on attention mechanism and in-word semantic fusion, which comprises the following modules:
and the data preprocessing module is used for dividing the text to obtain a sentence sequence required by the model input.
And the embedding module comprises a sentence sequence for acquiring embedded characters, matching words in the sentences and acquiring sub-word embedded vectors.
And the coding module is used for extracting context characteristics based on the word embedded sentence sequence and extracting the internal semantic characteristics of the sub-words.
The WordFusionAttention module dynamically fuses the semantic features inside the sub-words through an improved attention mechanism for the context features based on the word-embedded sentence sequence, enriches the sentence features, and facilitates the model to understand the semantic information of the professional words.
And the decoding module is used for extracting the transfer characteristics of the label and predicting the label.
The implementation functions of the modules of the named entity recognition system are implemented by a named entity recognition method based on attention mechanism and intra-word semantic fusion, which are not described herein again.
The invention also discloses a named entity recognition device based on an attention mechanism and intra-word semantic fusion, comprising a memory and a processor, wherein:
the memory stores a computer program capable of running on the processor;
and the processor, when running the computer program, executes the steps of the named entity recognition method based on an attention mechanism and intra-word semantic fusion described above.
The invention further discloses a storage medium on which a computer program is stored; when the computer program is executed by at least one processor, the steps of the named entity recognition method based on an attention mechanism and intra-word semantic fusion described above are implemented.
The above embodiments merely illustrate the technical concepts and features of the invention; they are intended to enable those skilled in the art to understand and implement the invention, not to limit its scope of protection. All equivalent changes and modifications made according to the spirit of the invention shall fall within its scope of protection.

Claims (8)

1. A named entity recognition method based on an attention mechanism and intra-word semantic fusion, characterized by comprising the following steps:
S1, segmenting the text data to obtain a text sequence in units of sentences;
S2, inputting the text sequence obtained in step S1 into a sub-word adapter to obtain sub-word representations of the words in the text;
S3, learning the local semantic information in the sub-word representations obtained in step S2 with a CNN semantic network and extracting the internal features of the sub-words;
S4, extracting a character-level text representation of the text sequence of step S1 with the CHINESE-BERT model;
S5, learning long-distance context information from the character-level text representation of step S4 with a Bi-LSTM and extracting global context semantic features;
S6, inputting the sub-word internal features of step S3 and the global context semantic features of step S5 into a WordFusionAttention module to obtain key context features;
and S7, inputting the key context features obtained in step S6 into a CRF decoder, learning the internal feature constraints of the text, and obtaining the entity recognition labels.
2. The named entity recognition method based on an attention mechanism and intra-word semantic fusion according to claim 1, characterized in that the sub-word adapter in step S2 matches words present in a lexicon against the text and concatenates the sub-word representations beginning at the same character to form the sub-word representations of the text, specifically comprising the following steps:
S2.1, building the lexicon into a dictionary tree T, where each node of the dictionary tree stores one Chinese character of a word: the root node stores the first character of a word, the back pointer of a node points to the next character of the word, and the front pointer of a node points to the previous character;
S2.2, traversing each character of the text sequence obtained in step S1 and searching the dictionary tree T of step S2.1, taking each character of the input sequence S as the start of a word, to obtain the word set W_{i,j}, i ∈ [1, n], j ∈ [1, l], corresponding to each character, where W_{i,j} denotes the j-th matched word beginning with the i-th character of the sentence and l denotes the number of matched words beginning with the i-th character; a none value is written into the set of characters with no matched words;
S2.3, inputting the sub-words of the word set obtained in step S2.2 into the CHINESE-BERT model to obtain a sub-word matrix CW_i^j based on character embedding; padding and zero-matrix splicing are applied to the sub-word matrix of each character to obtain the three-dimensional sub-word embedding spatial information W_i, whose first dimension is the number of sub-words corresponding to each character, second dimension is the length of each sub-word, and third dimension is the vector dimension of the characters in the sub-word:

CW_i^j = e(W_{i,j})
W_i = Ex(CW_i^j)

where e(·) denotes the loaded CHINESE-BERT model and Ex(·) denotes the padding and splicing operation.
3. The named entity recognition method based on an attention mechanism and intra-word semantic fusion according to claim 1, characterized in that the CNN semantic network in step S3 comprises two convolutional layers and one pooling layer: the convolution kernel size of the first convolutional layer is 9 × 9; the convolution kernel size of the second convolutional layer is 3 × 3; a max-pooling operation with a window of 1 × 3 follows the first convolutional layer; the sub-word embedding vector is input into the first convolutional layer to obtain the shallow semantic features inside words; the shallow semantic feature vector is down-sampled by the pooling layer to obtain a semantic feature vector; and the deep semantic features inside words are extracted from the semantic feature vector by the second convolutional layer.
4. The named entity recognition method based on an attention mechanism and intra-word semantic fusion according to claim 1, characterized in that in step S6 the WordFusionAttention module computes the similarity between the sentence context and the sub-word features by a dot-product operation to weight the sentence dynamically, specifically comprising the following steps:
Step 4.1, the global context feature X is passed through two linear transformations to obtain two feature matrices K and V:

K(X), V(X) = x^T·E_k, x^T·W_v

Step 4.2, the similarity is computed by a dot product of the feature matrix K and the sub-word feature q, scaled by the factor μ and normalized with the tanh() function:

H(K, q) = tanh(μ·K·q)

Step 4.3, the transformed global context features are re-weighted by the similarity:

Att = softmax(H(K, q))·V
V̂ = Att

where W_v and E_k are the weight matrices to be learned, μ is the scaling factor, x denotes the input global context features, q denotes the sub-word internal feature vector, and V̂ denotes the context feature vector after fusing the sub-word internal features.
5. The named entity recognition method based on an attention mechanism and intra-word semantic fusion according to claim 1, characterized in that in step S7 the CRF decoder predicts the entity label type by computing the label transition features and extracting the relationship features between entity combinations and labels.
6. A named entity recognition system based on an attention mechanism and intra-word semantic fusion, characterized by comprising the following modules:
a data preprocessing module, which segments the text to obtain the sentence sequence required as model input;
an embedding module, which obtains the character-embedded sentence sequence, matches the words in the sentences, and obtains the sub-word embedding vectors;
an encoding module, which extracts context features from the character-embedded sentence sequence and extracts the internal semantic features of the sub-words;
a WordFusionAttention module, which dynamically fuses the sub-word internal semantic features into the context features of the character-embedded sentence sequence through an improved attention mechanism, enriching the sentence features and helping the model understand the semantic information of words;
and a decoding module, which extracts the label transition features and predicts the labels.
7. A named entity recognition device based on an attention mechanism and intra-word semantic fusion, characterized by comprising a memory and a processor;
the memory storing a computer program capable of running on the processor;
and the processor performing, when running the computer program, the steps of the named entity recognition method based on an attention mechanism and intra-word semantic fusion according to any one of claims 1 to 5.
8. A storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by at least one processor, the steps of the named entity recognition method based on an attention mechanism and intra-word semantic fusion according to any one of claims 1 to 5 are implemented.
CN202211271734.4A (filed 2022-10-18, priority 2022-10-18) Named entity recognition method, device and system based on attention mechanism and intra-word semantic fusion and storage medium. Pending. CN115600597A.

Priority Applications (1)

Application Number: CN202211271734.4A; Priority/Filing Date: 2022-10-18; Title: Named entity recognition method, device and system based on attention mechanism and intra-word semantic fusion and storage medium

Publications (1)

Publication Number: CN115600597A; Publication Date: 2023-01-13

Family

ID=84846080

Family Applications (1)

Application Number: CN202211271734.4A; Title: Named entity recognition method, device and system based on attention mechanism and intra-word semantic fusion and storage medium

Country Status (1)

Country: CN; Document: CN115600597A

Cited By (2)

* Cited by examiner, † Cited by third party

Publication number: CN116486420A; Priority date: 2023-04-12; Publication date: 2023-07-25; Assignee: Beijing Baidu Netcom Science and Technology Co., Ltd.; Title: Entity extraction method, device and storage medium of document image
Publication number: CN116486420B; Priority date: 2023-04-12; Publication date: 2024-01-12; Assignee: Beijing Baidu Netcom Science and Technology Co., Ltd.; Title: Entity extraction method, device and storage medium of document image


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination