CN114154504A - Chinese named entity recognition algorithm based on multi-information enhancement - Google Patents

Chinese named entity recognition algorithm based on multi-information enhancement Download PDF

Info

Publication number
CN114154504A
CN114154504A (application CN202111472663.XA)
Authority
CN
China
Prior art keywords
information
layer
entity
speech
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111472663.XA
Other languages
Chinese (zh)
Inventor
黄胜
廖星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202111472663.XA priority Critical patent/CN114154504A/en
Publication of CN114154504A publication Critical patent/CN114154504A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Character Discrimination (AREA)
  • Machine Translation (AREA)

Abstract

Chinese named entity recognition methods that combine character and word information currently achieve good results, and methods that further enhance the input with glyph information have improved performance further. However, two problems remain unsolved: insufficient semantic information in the input, and recognition errors caused by nested entities. To address these issues, the MIEM (Multi-Information Enhancement Method) model is proposed herein. MIEM first enriches the input features by adding part-of-speech information to the embedding layer, and adds a nested-entity position matrix, encoded from a binary-tree structure, to the positional encoding. The embedded information is then encoded with a self-attention mechanism, and a designed MD layer (More Details layer) replaces the conventional residual structure to widen the model's receptive field and capture additional information. This design enriches the input representation, strengthens entity-boundary information, and mitigates the problems of unclear entity boundaries and of nested entities degrading recognition accuracy. Finally, a neural network built on the enhanced embedding and positional-encoding information is constructed to resolve the recognition errors caused by nested entities in Chinese named entity recognition.

Description

Chinese named entity recognition algorithm based on multi-information enhancement
Technical Field
The invention relates to the fields of deep learning and natural language processing, and in particular to a named entity recognition method based on multi-information enhancement.
Background
As artificial intelligence develops, natural language processing (NLP) is applied ever more widely. Named Entity Recognition (NER) is a fundamental NLP technology whose accuracy determines the quality of many downstream tasks, such as machine translation, question answering, search matching, and semantic analysis. The entities recognized by NER fall into 3 broad categories (entity, time, and numeric), 7 subcategories (person name, place name, organization name, time, date, currency, and percentage), plus proper nouns. NER is essentially a sequence-labeling problem whose goal is to accurately identify entities in text and assign each to a class; at present, however, recognition accuracy on social media such as microblogs remains low.
On the one hand, Chinese characters have more complex semantics than English, and the same word can be expressed in more varied ways. English words carry some natural part-of-speech information: "action", "education", and "organization" share the suffix "-tion" and are all nouns; likewise "adjustable", "respectable", and "reasonable" share the suffix "-able" and are all adjectives. Many similar patterns exist, so English words carry part-of-speech cues that Chinese words lack. On the other hand, entity nesting is common in ordinary usage: among the entities appearing in a text, a shorter entity may be contained inside a longer one, and many sentences contain such nested entities. For example, "American Project Management Association" is a nested entity: the whole phrase names an organization, while "America" within it is a place name. Such cases make entity recognition difficult, and the existence of nested entities is an important factor limiting recognition accuracy.
In natural language processing, named entity recognition was first performed on segmented words, with the main drawback that segmentation errors propagate; later, character-based methods avoided this problem but lost the underlying word information. Since both purely character-based and purely segmentation-based Chinese NER have shortcomings, Zhang and Yang (Yue Zhang, Jie Yang. Chinese NER Using Lattice LSTM [C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL, 2018: 1554-1564.) combined character and word information for the Chinese NER task. More recently, Wu et al. (Shuang Wu, Xiaoning Song, Zhenhua Feng. MECT: Multi-Metadata Embedding based Cross-Transformer for Chinese Named Entity Recognition [C]// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, ACL, 2021: 1529-1539.) added character radical information to the input at the embedding layer, obtaining a certain improvement. Despite this research trend, the accuracy of Chinese NER still needs improvement in some domains, such as social media.
In summary, given the problems of nested entities and low accuracy in deep-learning-based NER networks, a named entity recognition method based on multi-information enhancement is designed. By enhancing both the embedded information and the position information, the model learns richer input features as well as the structure of nested entities, improving the accuracy of Chinese named entity recognition.
Disclosure of Invention
The invention aims to design a multi-information-enhanced Chinese named entity recognition algorithm that accurately extracts entities from text, and, based on this method, to fine-tune a pre-trained model for the specific target domain of named entity recognition so as to achieve the best effect.
The invention provides a Chinese named entity recognition method based on multi-information enhancement. An embedding module processes the input sentence: part-of-speech information is added to the input, transferred from word level to character level, and fused in the embedding layer with character and word information to form the input features. A nested-entity position matrix encoded from a binary-tree structure is fed, together with the input features, into a self-attention mechanism that models the embedded information; a feedforward neural network with the proposed novel residual structure then captures finer details of the attention output to obtain a deep representation. Finally, a conditional random field learns the dependencies between labels and produces the final entity predictions.
The invention mainly comprises two parts: an embedded information enhancement method and a position coded information enhancement method.
The method specifically comprises the following steps:
1. Acquire an input sentence, perform part-of-speech tagging on it, transfer the tags to character level, and fuse character, word, and part-of-speech information into the final input features;
2. Construct the multi-information-enhanced Chinese named entity recognition network, whose core components are part-of-speech information enhancement and nested-entity matrix information enhancement;
3. Pre-train the network on open-source datasets;
4. Fine-tune the pre-trained network, via transfer learning, on a small self-built labeled Chinese NER dataset;
5. Run the fine-tuned network on the prepared test set to obtain the final predicted entities.
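The part-of-speech transfer in step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the (word, tag) pairs are hypothetical tagger output standing in for spaCy, and `pos_to_char_level` is an invented helper name.

```python
# Spread word-level part-of-speech tags down to character level before
# fusing with character and word embeddings (step 1 above).

def pos_to_char_level(tagged_words):
    """Give every character of a word that word's POS tag."""
    chars, char_pos = [], []
    for word, pos in tagged_words:
        for ch in word:
            chars.append(ch)
            char_pos.append(pos)
    return chars, char_pos

# "重庆市" / "长江" / "大桥" with assumed (illustrative) tags
tagged = [("重庆市", "PROPN"), ("长江", "PROPN"), ("大桥", "NOUN")]
chars, char_pos = pos_to_char_level(tagged)
# chars    -> ['重', '庆', '市', '长', '江', '大', '桥']
# char_pos -> ['PROPN'] * 5 + ['NOUN'] * 2
```

Each character now carries the tag of the word it came from, so character, word, and part-of-speech features can be fused into one input representation.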
The multi-information-enhanced Chinese named entity recognition network in the above steps is the core of the invention; it provides a dual enhancement of embedded information and positional-encoding information.
In the embedding layer, the input is first preprocessed: words corresponding to the characters are matched, and part-of-speech information is added with the natural-language-processing library spaCy. The input lemmas are then matched against pre-trained character and word vectors, and the resulting embeddings, after passing through a linear layer, serve as the model input.
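The lemma-matching step above, including random initialization for tokens without a pre-trained vector, can be sketched as follows. The tiny vector table is hypothetical; a real system would load pre-trained Chinese character and word vectors instead.

```python
import numpy as np

# Embedding lookup with random initialization for out-of-vocabulary tokens,
# as described for the embedding layer. The table below is a stand-in for a
# pre-trained vocabulary.

rng = np.random.default_rng(42)
dim = 4
pretrained = {
    "重": rng.normal(size=dim),
    "庆": rng.normal(size=dim),
    "重庆": rng.normal(size=dim),
}

def lookup(token):
    """Return the pre-trained vector, or a random init for OOV tokens."""
    vec = pretrained.get(token)
    return vec if vec is not None else rng.normal(size=dim)

# "市长" has no pre-trained vector here, so it is randomly initialized.
emb = np.stack([lookup(t) for t in ["重", "庆", "重庆", "市长"]])
assert emb.shape == (4, dim)
```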
The attention module adopts the Transformer-XL attention calculation. For the positional-encoding part, the binary-tree-based nested-entity matrix encoding is combined with the positional encoding of the FLAT network, so that nested-entity information is preserved without losing the relations between the other lemmas. The attention scores are computed as:

A_{ij} = Q_i K_j^T + Q_i R_{ij}^T + u K_j^T + v R_{ij}^T

Att(A, V) = softmax(A) V

where i denotes the i-th lemma and ij the relation between the i-th and j-th lemmas; Q, K, and V are different linear transformations of the input matrix, which contains the character, word, and part-of-speech features fused in the embedding layer; u and v are learnable parameters. The positional encodings R_Binary and R_FLAT in the attention mechanism model the positions of the lemmas in the input sentence (the encoding scheme of R_Binary is shown in the figure), and the complete positional encoding is realized by concatenating the two:

R_{ij} = [R_Binary_{ij} ; R_FLAT_{ij}]
In the feedforward neural network module, the learned distributed feature representation is mapped to the label space by a linear layer. To learn finer-grained features, the proposed MD layer replaces the original residual structure, captures detail features, and outputs a feature matrix. For the overall output structure, the outputs of two parallel networks are summed as the input to the CRF, reducing error and improving the robustness of the network.
Due to the adoption of the technical scheme, the invention has the following advantages:
1. english words have some natural part-of-speech information, such as the word suffix "-tion", "-able" indicates nouns and adjective parts-of-speech. Compared with English, Chinese characters have more complex semantics, and the expression of the same word has more diversity, but has no such characteristics. Then, if the part-of-speech information is added for the input of the Chinese named entity recognition, the model can learn more abundant information, and can learn more semantic information by adding the part-of-speech information, so that the performance of the entity recognition model is improved. Therefore, the invention uses the natural language processing tool spaCy to label part of speech information, adds part of speech information to the embedding layer, and transfers label information of words to character information to better endow semantic features of input information, and the obtained form is shown as the embedding layer in the drawing. For the embedding mode of input, the invention uses the pre-trained word list to match the input character and word vector, and for the condition that no character or word vector exists, the invention carries out random initialization processing. The original character-based representation is represented by the word matched with the character, and finally the part-of-speech information obtained by the natural language processing tool is added to obtain the total input, wherein the input comprises the character information, the matched word information and the part-of-speech information.
2. The invention provides a positional encoding that carries entity-nesting information, combining the relative positions of lemmas with the positional relations between nested entities, to counter the effect of nested entities on Chinese NER accuracy. In the self-attention module, the position information is fused with the input information, so the model can actively attend to both the semantic and the positional relations among lemmas.
3. For the residual part of the feedforward neural network, to obtain a larger receptive field, the invention proposes a novel residual structure, the MD Layer (More Details Layer), to capture more hidden information; its position in the model is shown in the drawings. The drawing shows the MD layer implementation: the input features are first amplified N-fold by a linear layer, the amplified features are then sliced, and the slices are summed to obtain the final output, keeping the dimensionality unchanged.
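The amplify-slice-sum procedure of the MD layer can be sketched numerically as follows. This is an illustrative sketch, not the patent's code: the weight matrix is a random placeholder where a trained model would have learned parameters, and the patent reports N = 2 as the best setting.

```python
import numpy as np

# MD (More Details) layer sketch: amplify features N-fold through a linear
# layer, slice the result into N equal chunks, and sum the chunks so the
# output dimension matches the input.

def md_layer(x, W, n):
    """x: (seq_len, d); W: (d, n*d) amplifying linear map; returns (seq_len, d)."""
    amplified = x @ W                         # (seq_len, n*d)
    chunks = np.split(amplified, n, axis=-1)  # n slices of shape (seq_len, d)
    return np.sum(chunks, axis=0)             # sum of slices: dimension preserved

rng = np.random.default_rng(0)
d, n = 8, 2
x = rng.normal(size=(5, d))
W = rng.normal(size=(d, n * d))
out = md_layer(x, W, n)
assert out.shape == x.shape  # (5, 8): dimensionality unchanged
```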
Drawings
To make the purpose, technical scheme, and beneficial effects of the invention clearer, the following drawings are provided:
FIG. 1 is a flow chart of the Chinese named entity recognition method based on multi-information enhancement according to the present invention;
FIG. 2 is a schematic diagram of a binary tree based position encoding structure according to the present invention;
FIG. 3 is a schematic diagram of a binary tree structure of a matrix form of position information encoding according to the present invention;
FIG. 4 is a schematic diagram of the attention-mechanism calculation module of the present invention;
fig. 5 is a schematic diagram of an MD layer implementation of the present invention.
Detailed description of the preferred embodiments
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
The invention provides a Chinese named entity recognition algorithm based on multi-information enhancement, which specifically comprises the following steps as shown in figure 1:
Step 1: input a sentence and perform two simple preprocessing operations on it, word matching and part-of-speech matching;
Step 2: construct a neural network fusing part-of-speech information and nested-entity positional encoding, and feed the lemmas into the network for learning;
Step 3: train attention over the input features with the attention mechanism, so that the model automatically attends to a position when the feature appears again later;
Step 4: feed the self-attention output into a linear layer for feature learning, where features are encoded by the MD layer to capture finer details;
Step 5: feed the encoder output into a CRF (Conditional Random Field) to obtain the final predicted entities.
Detailed Description
Step 1: acquire the input sentence and preprocess it in the input preprocessing module with operations such as vocabulary matching and part-of-speech matching to enhance the input representation.
Step 2: feed the preprocessed sentence into the self-attention module, in which a binary-tree-based positional-encoding structure is constructed. A solid circle indicates that the current node forms a word with the next node of its left subtree; taking the sentence "Chongqing Yangtze River Bridge" (重庆市长江大桥) in FIG. 2 as an example, the words formed by two consecutive characters are "Chongqing" (重庆), "mayor" (市长), "Yangtze" (长江), and "bridge" (大桥), each circled with a solid oval. A dotted circle indicates that the current node forms a word with several nodes of its left subtree, namely "Chongqing City" (重庆市) and "Yangtze River Bridge" (长江大桥) in FIG. 2. The binary-tree positional encoding of FIG. 2 is represented by the matrix of FIG. 3: the dotted diagonal represents the links between the nodes of the left subtree, a downward solid arrow marks a word the current node forms with the next node of its left subtree, and a rightward solid arrow marks a word the current node forms with several left-subtree nodes. After this processing, the binary-tree entity position encoding is mapped to a matrix representation; the lemmas of the sentence are encoded accordingly, and the resulting matrix input is shown in FIG. 3. The feature-extraction module is a Transformer encoder whose positional encoding has been replaced as described.
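One plausible reading of the matrix construction above can be sketched in code. The numeric scheme (1 for a two-character word, 2 for a longer word) is an assumption for illustration; the patent fixes only the structure, not the values.

```python
# Illustrative nested-entity position matrix for FIG. 2/3: mark each matched
# word span at (start_char, end_char), distinguishing two-character words
# from longer (nesting) words.

def nested_entity_matrix(n_chars, word_spans):
    """word_spans: (start, end) inclusive character indices of matched words."""
    m = [[0] * n_chars for _ in range(n_chars)]
    for start, end in word_spans:
        m[start][end] = 1 if end - start == 1 else 2
    return m

# "重庆市长江大桥": two-char words 重庆/市长/长江/大桥, longer words 重庆市/长江大桥
spans = [(0, 1), (2, 3), (3, 4), (5, 6), (0, 2), (3, 6)]
m = nested_entity_matrix(7, spans)
assert m[0][1] == 1 and m[3][4] == 1   # two-character words
assert m[0][2] == 2 and m[3][6] == 2   # nested longer words
```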
Step 3: the Chinese named entity recognition network is built with the PyTorch framework. The position of the multi-head attention mechanism in the overall architecture is shown in FIG. 1, and its calculation diagram (Multi-Head Attention) in FIG. 4. The overall calculation is:

A_{ij} = Q_i K_j^T + Q_i R_{ij}^T + u K_j^T + v R_{ij}^T

Att(A, V) = softmax(A) V

where Q, K, and V are different linear transformations of the input vectors, u and v are learnable parameters, and the fused positional encoding R_{ij} is:

R_{ij} = [R_Binary_{ij} ; R_FLAT_{ij}]

R_FLAT_{ij} is computed as:

R_FLAT_{ij} = ReLU( W_r ( p_{d_{ij}^{hh}} ⊕ p_{d_{ij}^{ht}} ⊕ p_{d_{ij}^{th}} ⊕ p_{d_{ij}^{tt}} ) )

where h_i − h_j gives d_{ij}^{hh} and, analogously, t_i − t_j gives d_{ij}^{tt}. The sinusoidal embeddings p_d^{(2k)} and p_d^{(2k+1)} are calculated as:

p_d^{(2k)} = sin( d / 10000^{2k / d_model} )

p_d^{(2k+1)} = cos( d / 10000^{2k / d_model} )

where d_model is the model dimension and the distance d is one of:

d_{ij}^{hh} = head[i] − head[j], d_{ij}^{ht} = head[i] − tail[j], d_{ij}^{th} = tail[i] − head[j], d_{ij}^{tt} = tail[i] − tail[j]

In these formulas, hh denotes the distance from head[i] to head[j] and tt the distance from tail[i] to tail[j], where i and j index the i-th and j-th lemmas.
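The span distances and sinusoidal embeddings in step 3 can be checked numerically with the sketch below. W_r and the ReLU fusion are omitted, and the head/tail values are illustrative (two characters plus one matched word), not taken from the patent.

```python
import numpy as np

# Relative span distances between lemmas i and j, and the sinusoidal
# embedding p_d, following the formulas in step 3.

def span_distances(head, tail, i, j):
    """Return (d_hh, d_ht, d_th, d_tt) for lemmas i and j."""
    return (head[i] - head[j], head[i] - tail[j],
            tail[i] - head[j], tail[i] - tail[j])

def sinusoid(d, d_model):
    """p_d^(2k) = sin(d / 10000^(2k/d_model)), p_d^(2k+1) = cos(same)."""
    k = np.arange(d_model // 2)
    angle = d / np.power(10000.0, 2 * k / d_model)
    p = np.empty(d_model)
    p[0::2] = np.sin(angle)
    p[1::2] = np.cos(angle)
    return p

head = [0, 0, 2]   # lemma 1 is a word spanning characters 0..2
tail = [0, 2, 2]
d_hh, d_ht, d_th, d_tt = span_distances(head, tail, 1, 2)
assert (d_hh, d_ht, d_th, d_tt) == (-2, -2, 0, 0)
p = sinusoid(d_tt, 16)
assert p.shape == (16,) and p[1] == 1.0  # cos(0) = 1 when the distance is 0
```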
Step 4: in the feedforward part of the network, to obtain a larger receptive field, the invention proposes a novel residual structure, the MD Layer (More Details Layer), to capture more hidden information; its position in the model is shown in FIG. 1. FIG. 5 shows the MD layer implementation: the input features are first amplified N-fold by a linear layer, the amplified features are then sliced, and the slices are summed to produce the final output, keeping the dimensionality unchanged. For the present Chinese NER task the value of N was determined experimentally, with N = 2 performing best; in addition, to prevent overfitting during training, layer normalization (LayerNorm) is added to the feedforward part.
Step 5: feed the encoder output into the CRF layer, whose constraint learning over the label information yields the final predicted entities.
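The CRF layer's constrained decoding in step 5 can be illustrated with a minimal Viterbi decode over emission and transition scores. All scores here are toy values chosen for the example; a trained CRF would supply learned transitions (for instance, heavily penalizing an I- tag that does not follow its B- tag).

```python
import numpy as np

# Minimal Viterbi decoding over per-step label scores plus a transition
# matrix, standing in for the CRF layer's prediction.

def viterbi(emissions, transitions):
    """emissions: (T, L) per-step label scores; transitions: (L, L) from->to."""
    T, L = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        total = score[:, None] + transitions + emissions[t]  # (L, L)
        back[t] = np.argmax(total, axis=0)   # best previous label per label
        score = np.max(total, axis=0)
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

labels = ["O", "B-LOC", "I-LOC"]
emissions = np.array([[0.1, 2.0, 0.0],
                      [0.2, 0.0, 1.5],
                      [1.0, 0.3, 0.2]])
transitions = np.array([[0.0, 0.0, -10.0],   # O -> I-LOC strongly penalized
                        [0.0, -1.0, 1.0],
                        [0.5, 0.0, 0.0]])
decoded = [labels[i] for i in viterbi(emissions, transitions)]
assert decoded == ["B-LOC", "I-LOC", "O"]
```

The transition penalty is what lets the CRF enforce label consistency that a per-token classifier cannot.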
Step 6: train the constructed Chinese NER network. Using transfer learning, the network is first pre-trained on open-source data from related domains and then fine-tuned on a self-built labeled Chinese entity recognition dataset.

Claims (6)

1. A Chinese named entity recognition method based on multi-information enhancement, characterized in that text content is processed to obtain the required proper nouns, comprising the following steps:
step 1, collecting text sentences which need to be identified by a user, adding part-of-speech labels to input words through a natural language processing tool spaCy, transferring part-of-speech information of the words to a character level, and fusing characters, words and part-of-speech information to serve as embedded information;
step 2, constructing a Chinese named entity recognition network based on multi-information enhancement, which mainly comprises a part-of-speech information embedding module, a position information coding module of a nested entity matrix and a novel feedforward neural network module based on a detail capturing layer;
and 3, carrying out named entity recognition on the input sentence on the trained neural network to obtain the required entity type.
2. The method for Chinese named entity recognition based on multi-information enhancement as claimed in claim 1, wherein the constructed network comprises an information embedding module, a self-attention module based on nested-entity-matrix position information, a novel feedforward neural network module, and a CRF label-constraint module. The embedding module obtains embedded vector representations of characters and words by matching a pre-trained vocabulary, then adds part-of-speech tagging information and transfers it to the character-level representation; out-of-vocabulary words are randomly initialized. The self-attention module feeds the embedded information and the nested-entity-matrix position information into the self-attention mechanism to obtain the final feature input, where the position-enhancement part adopts a binary-tree-based embedded entity position encoding fused with the positional encoding of the FLAT network. In the feedforward module, the proposed More Details Layer replaces the common residual layer to capture deeper feature information and relearn the features produced by the self-attention mechanism. The CRF (Conditional Random Field) label-constraint module models the dependencies and constraints within the label sequence, learns the relations between labels, and outputs the final prediction.
3. The method as claimed in claim 2, wherein part-of-speech information is added in the embedding layer through spaCy, transferred to the characters, and fused there with the character and word information, providing richer features for the network model.
4. The method for Chinese named entity recognition based on multi-information enhancement as claimed in claim 2, wherein the self-attention module encodes the embedded information with a multi-head attention mechanism and learns long- and short-range dependencies between the input lemmas, the attention being calculated as:

A_{ij} = Q_i K_j^T + Q_i R_{ij}^T + u K_j^T + v R_{ij}^T

Att(A, V) = softmax(A) V

wherein i denotes the i-th lemma and ij the relation between the i-th and j-th lemmas; Q, K, and V are different linear transformations of the input matrix; u and v are learnable parameters; the positional encodings R_Binary and R_FLAT model the positions of the lemmas in the input sentence, and the complete positional encoding is their concatenation:

R_{ij} = [R_Binary_{ij} ; R_FLAT_{ij}]
5. The method as claimed in claim 2, wherein the feedforward neural network module maps the attention output through a linear layer, and the proposed More Details Layer replaces the common residual structure to obtain more detailed feature information.
6. The method for Chinese named entity recognition based on multi-information enhancement as claimed in claim 1, wherein the recognition operation mainly comprises: part-of-speech tagging the input sentence, transferring the tags to the character-level representation, fusing character, word, and part-of-speech information as the embedding-layer output, learning from the embedding information and the nested-entity matrix information in the self-attention mechanism, performing feature mapping through the improved novel feedforward network to obtain an output sequence, and finally feeding the output sequence into the CRF layer for label-constraint learning to obtain the named entities.
CN202111472663.XA 2021-12-06 2021-12-06 Chinese named entity recognition algorithm based on multi-information enhancement Pending CN114154504A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111472663.XA CN114154504A (en) 2021-12-06 2021-12-06 Chinese named entity recognition algorithm based on multi-information enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111472663.XA CN114154504A (en) 2021-12-06 2021-12-06 Chinese named entity recognition algorithm based on multi-information enhancement

Publications (1)

Publication Number Publication Date
CN114154504A true CN114154504A (en) 2022-03-08

Family

ID=80452741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111472663.XA Pending CN114154504A (en) 2021-12-06 2021-12-06 Chinese named entity recognition algorithm based on multi-information enhancement

Country Status (1)

Country Link
CN (1) CN114154504A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115329766A (en) * 2022-08-23 2022-11-11 中国人民解放军国防科技大学 Named entity identification method based on dynamic word information fusion
CN115688777A (en) * 2022-09-28 2023-02-03 北京邮电大学 Named entity recognition system for nested and discontinuous entities of Chinese financial text
CN115688777B (en) * 2022-09-28 2023-05-05 北京邮电大学 Named entity recognition system for nested and discontinuous entities of Chinese financial text

Similar Documents

Publication Publication Date Title
CN111310471B (en) Travel named entity identification method based on BBLC model
CN112989834B (en) Named entity identification method and system based on flat grid enhanced linear converter
CN113128229B (en) Chinese entity relation joint extraction method
CN111738004A (en) Training method of named entity recognition model and named entity recognition method
CN110866401A (en) Chinese electronic medical record named entity identification method and system based on attention mechanism
CN117151220B (en) Entity link and relationship based extraction industry knowledge base system and method
CN111309918A (en) Multi-label text classification method based on label relevance
CN114154504A (en) Chinese named entity recognition algorithm based on multi-information enhancement
CN113255320A (en) Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
CN113657123A (en) Mongolian aspect level emotion analysis method based on target template guidance and relation head coding
CN114153973A (en) Mongolian multi-mode emotion analysis method based on T-M BERT pre-training model
CN112818698A (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN115169349A (en) Chinese electronic resume named entity recognition method based on ALBERT
CN116340513A (en) Multi-label emotion classification method and system based on label and text interaction
CN115545033A (en) Chinese field text named entity recognition method fusing vocabulary category representation
CN115238693A (en) Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory
Park et al. Natural language generation using dependency tree decoding for spoken dialog systems
CN111145914A (en) Method and device for determining lung cancer clinical disease library text entity
CN116522165A (en) Public opinion text matching system and method based on twin structure
CN117390131A (en) Text emotion classification method for multiple fields
CN112989839A (en) Keyword feature-based intent recognition method and system embedded in language model
CN115906854A (en) Multi-level confrontation-based cross-language named entity recognition model training method
CN115759102A (en) Chinese poetry wine culture named entity recognition method
CN113434698B (en) Relation extraction model establishing method based on full-hierarchy attention and application thereof
CN112733526B (en) Extraction method for automatically identifying tax collection object in financial file

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination