CN111178074A - Deep learning-based Chinese named entity recognition method - Google Patents

Deep learning-based Chinese named entity recognition method Download PDF

Info

Publication number
CN111178074A
CN111178074A (application CN201911271419.XA; grant CN111178074B)
Authority
CN
China
Prior art keywords
vector
layer
word
vectors
lstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911271419.XA
Other languages
Chinese (zh)
Other versions
CN111178074B (en)
Inventor
罗韬
冯爽
徐天一
赵满坤
于健
喻梅
于瑞国
李雪威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201911271419.XA priority Critical patent/CN111178074B/en
Publication of CN111178074A publication Critical patent/CN111178074A/en
Application granted granted Critical
Publication of CN111178074B publication Critical patent/CN111178074B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to a deep learning-based Chinese named entity recognition method, characterized in that the recognition method comprises the following steps: 1) embedding a mixed vector of character, word and position information for the text data; 2) inputting the vectors obtained in step 1) into a Bi-LSTM layer for encoding, modeling the time series to capture long-range dependencies between vectors; 3) inputting the vectors output by the Bi-LSTM layer into a self-attention layer, explicitly learning the dependency between any two characters in a sentence and capturing the internal structural information of the sentence; 4) inputting the output vector sequence into a CRF layer, which makes independent tagging decisions and decodes the labels. The method is scientifically and reasonably designed, can operate on multiple data sets, has strong applicability and high accuracy, and can serve as a named entity recognition model for multi-domain text.

Description

Deep learning-based Chinese named entity recognition method
Technical Field
The invention belongs to the technical fields of natural language processing, knowledge graphs and sequence labeling, relates to deep learning and sequence labeling technology, and particularly relates to a deep learning-based Chinese named entity recognition method.
Background
Named entity recognition is one of the sequence labeling tasks and a basic task of natural language processing. It mainly refers to finding entities with specific meanings in text, including names of people, places, organizations and certain proper nouns. The task consists of two parts: identifying entity boundaries and determining entity categories (person name, place name, organization name, etc.). Named entities are basic elements of text and basic units for understanding the content of an article. Moreover, named entity recognition serves as an upstream basic task for processing text data such as knowledge graphs, where its accuracy directly influences the final quality of knowledge graph construction. A knowledge graph is built on the relationships between entities; if entity extraction is wrong, the subsequent entity relationships cannot be determined. The same applies to automatic summarization and question-answering systems: when performing semantic analysis on a sentence, the related named entities in the text must first be found. Thus, named entity recognition is extremely critical and important for text data processing, especially natural language processing.
At present, popular named entity recognition methods include CRF models, LSTM models, and models combining LSTM and CRF. Compared with a single model, the hybrid LSTM-CRF model combines the advantages of both: it can memorize dependencies between long-distance sequence elements while also exploiting the labeling strengths of the CRF. It is therefore widely used in the field of named entity recognition and has been repeatedly optimized and improved. Zhang et al. studied a new dynamic meta-embedding method in 2019 and applied it to the Chinese NER task. The method creates dynamic, data-specific and task-specific meta-embeddings, since the meta-embedding of the same character differs across sentence sequences. Experiments on the MSRA and LiteratureNER data sets verified the effectiveness of the model, and state-of-the-art results were obtained on LiteratureNER.
Although many methods have been proposed in recent years, they generally do not produce good results across multiple data sets; at the same time, there is no universal named entity recognition model with strong applicability and high accuracy that can be applied to multiple fields.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a deep learning-based Chinese named entity recognition method that can operate on multiple data sets, has strong applicability and high accuracy, and can serve as a named entity recognition model for multi-domain text.
The technical problem to be solved by the invention is realized by the following technical scheme:
a Chinese named entity recognition method based on deep learning is characterized in that: the identification method comprises the following steps:
1) embedding a word position information mixed vector into a data text;
2) inputting the vectors obtained in the step into a Bi-LSTM layer for vector coding, and simulating a long-term relation between time series capture vectors;
3) inputting the vector output by the Bi-LSTM layer into a self-attention layer, definitely learning the dependency relationship between any two characters in a sentence, and capturing the internal structure information of the sentence;
4) and inputting the output vector sequence into a CRF layer, making an independent marking decision, and decoding the label.
Moreover, the specific operation of step 1) is as follows:
a. establishing a dictionary from the training data set and obtaining a one-hot vector for each character, whose length is the dictionary length V, then mapping the one-hot vectors into low-dimensional dense vectors through a look-up layer using a pre-trained character position vector matrix;
b. splicing three vectors (the character vector, the vector of the word to which the character belongs, and the word position vector) into one vector, which serves as the input of the network model; for a Chinese token sequence
X = (x1, x2, x3, ..., xn)
checking whether each token exists in the word lookup table and the character lookup table; when the token exists in both tables, i.e. the token consists of one character, the vectors of the two embeddings are combined as the distributed representation of the token; otherwise, only the embedding from the one lookup table is used as the output of the embedding layer, and the word position vector is initialized to the word vector of the word in which the character is located.
Moreover, the specific operation of step 2) is as follows: the character-word mixed vector of each character in the input sequence is fed into the Bi-LSTM layer, one vector per time step, to extract global features. The bidirectional LSTM network produces the forward hidden sequence (h→1, h→2, ..., h→n) and the backward hidden sequence (h←1, h←2, ..., h←n). The two hidden sequences are spliced position-wise to obtain the complete hidden sequence
ht = [h→t ; h←t], t = 1, ..., n
and this hidden sequence is taken as the input of the next layer.
Moreover, the specific operation of step 3) is as follows: let H = (h1, h2, ..., hn) denote the output of the Bi-LSTM hidden layer for the input time steps. Following the principle of the multi-head attention mechanism, the input vectors are first linearly transformed and then scaled dot-product attention is applied, where the attention formula is:
Attention(Q, K, V) = softmax(QKᵀ / √d)V
wherein: Q ∈ R^(n×2dh) is the query matrix; K ∈ R^(n×2dh) is the key matrix; V ∈ R^(n×2dh) is the value matrix; and d is the dimension of the Bi-LSTM hidden units, numerically equal to 2dh.
Setting Q = K = V = H, multi-head attention first linearly projects the queries, keys and values h times with different learned projections, then applies scaled dot-product attention to the h projections in parallel, and finally concatenates the attention results and projects them once more to obtain the new representation.
Moreover, the specific operation of step 4) is as follows: the result is fed into the CRF layer, which contains a transition matrix representing the transition scores between all labels. In the CRF layer, the score of each label for each word consists of two parts: the unary emission score output by the LSTM model and the binary transition score from the CRF. The transition matrix imposes legal constraints between predicted labels, increasing the grammatical consistency of the label sequence; finally, Viterbi decoding is used to infer the label sequence with the highest score.
The invention has the advantages and beneficial effects that:
the Chinese named entity recognition method based on deep learning can operate multiple data sets, is high in applicability and accuracy, and can be applied to a named entity recognition model of multi-field texts.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a plot of model F1 value against the number of iterations on the MSRA data set;
FIG. 3 is a plot of model F1 value against the number of iterations on the LiteratureNER data set;
FIG. 4 is a plot of model Accuracy against the number of iterations on the MSRA data set;
FIG. 5 is a plot of model Accuracy against the number of iterations on the LiteratureNER data set.
Detailed Description
The present invention is further illustrated by the following specific examples, which are illustrative rather than limiting and are not intended to limit the scope of the invention.
As shown in fig. 1, a Chinese named entity recognition method based on deep learning is characterized in that the recognition method comprises the following steps:
1) embedding a mixed vector of character, word and position information for the text data;
a. establishing a dictionary from the training data set and obtaining a one-hot vector for each character, whose length is the dictionary length V, then mapping the one-hot vectors into low-dimensional dense vectors through a look-up layer using a pre-trained character position vector matrix;
b. splicing three vectors (the character vector, the vector of the word to which the character belongs, and the word position vector) into one vector, which serves as the input of the network model; for a Chinese token sequence
X = (x1, x2, x3, ..., xn)
checking whether each token exists in the word lookup table and the character lookup table; when the token exists in both tables, i.e. the token consists of one character, the vectors of the two embeddings are combined as the distributed representation of the token; otherwise, only the embedding from the one lookup table is used as the output of the embedding layer, and the word position vector is initialized to the word vector of the word in which the character is located;
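The embedding step above can be sketched as follows. This is an illustrative toy implementation, not the patent's code: the lookup tables, vector dimensions and sample tokens are hypothetical stand-ins for the pre-trained vectors.

```python
import numpy as np

# Hypothetical toy lookup tables; in the method these would hold
# pre-trained low-dimensional dense vectors.
rng = np.random.default_rng(0)
char_table = {c: rng.standard_normal(4) for c in "天津大学"}
word_table = {"天津": rng.standard_normal(4), "大学": rng.standard_normal(4)}

def embed(char, word):
    """Splice the character vector, the vector of the word containing it,
    and the word position vector (initialized to that word's vector)."""
    char_vec = char_table[char]
    word_vec = word_table[word]
    pos_vec = word_vec.copy()  # position vector initialized to the word vector
    return np.concatenate([char_vec, word_vec, pos_vec])

x = embed("天", "天津")
print(x.shape)  # (12,)
```

The spliced vector for one character is three times the base embedding width, matching the three-part splice described above.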
2) inputting the vectors obtained in step 1) into a Bi-LSTM layer for encoding, modeling the time series to capture long-range dependencies between vectors;
The character-word mixed vector of each character in the input sequence is fed into the Bi-LSTM layer, one vector per time step, to extract global features. The bidirectional LSTM network produces the forward hidden sequence (h→1, h→2, ..., h→n) and the backward hidden sequence (h←1, h←2, ..., h←n). The two hidden sequences are spliced position-wise to obtain the complete hidden sequence
ht = [h→t ; h←t], t = 1, ..., n
and this hidden sequence is taken as the input of the next layer;
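The position-wise splicing of the forward and backward hidden sequences reduces to a concatenation along the feature axis; a minimal sketch with random stand-in hidden states (the LSTM itself is omitted):

```python
import numpy as np

n, d_h = 5, 3                      # sequence length, per-direction hidden size
h_fwd = np.random.randn(n, d_h)    # forward LSTM hidden sequence (h→1 .. h→n)
h_bwd = np.random.randn(n, d_h)    # backward LSTM hidden sequence (h←1 .. h←n)

# Splice position-wise: h_t = [h→_t ; h←_t], giving shape (n, 2*d_h).
H = np.concatenate([h_fwd, h_bwd], axis=-1)
print(H.shape)  # (5, 6)
```

The resulting dimension 2*d_h is the quantity d used by the attention layer below.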
3) inputting the vectors output by the Bi-LSTM layer into a self-attention layer, explicitly learning the dependency between any two characters in a sentence and capturing the internal structural information of the sentence;
Let H = (h1, h2, ..., hn) denote the output of the Bi-LSTM hidden layer for the input time steps. Following the principle of the multi-head attention mechanism, the input vectors are first linearly transformed and then scaled dot-product attention is applied, where the attention formula is:
Attention(Q, K, V) = softmax(QKᵀ / √d)V
wherein: Q ∈ R^(n×2dh) is the query matrix; K ∈ R^(n×2dh) is the key matrix; V ∈ R^(n×2dh) is the value matrix; and d is the dimension of the Bi-LSTM hidden units, numerically equal to 2dh.
Setting Q = K = V = H, multi-head attention first linearly projects the queries, keys and values h times with different learned projections, then applies scaled dot-product attention to the h projections in parallel, and finally concatenates the attention results and projects them once more to obtain the new representation;
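The attention formula can be sketched as a single-head NumPy illustration under the setting Q = K = V = H; the multi-head variant would add the learned linear projections described above, which are omitted here for brevity.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilize the exponent
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

n, d = 4, 6                          # n tokens, hidden dimension d = 2*d_h
H = np.random.randn(n, d)            # stand-in Bi-LSTM outputs
out = scaled_dot_attention(H, H, H)  # self-attention: Q = K = V = H
print(out.shape)  # (4, 6)
```

Each output row is a weighted average of all value rows, which is how the layer links any two characters in the sentence regardless of distance.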
4) inputting the output vector sequence into a CRF layer, making independent tagging decisions and decoding the labels: the result is fed into the CRF layer, which contains a transition matrix representing the transition scores between all labels. In the CRF layer, the score of each label for each word consists of two parts: the unary emission score output by the LSTM model and the binary transition score from the CRF. The transition matrix imposes legal constraints between predicted labels, increasing the grammatical consistency of the label sequence; finally, Viterbi decoding is used to infer the label sequence with the highest score.
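Viterbi decoding over the emission and transition scores can be sketched as below; a minimal NumPy version with hypothetical toy scores rather than the trained model's parameters.

```python
import numpy as np

def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence.
    emissions:   (n, L) unary scores for each token/tag (from the LSTM).
    transitions: (L, L) binary scores; transitions[i, j] scores tag i -> tag j."""
    n, L = emissions.shape
    score = emissions[0].copy()           # best score ending in each tag
    back = np.zeros((n, L), dtype=int)    # backpointers
    for t in range(1, n):
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    best = [int(score.argmax())]
    for t in range(n - 1, 0, -1):         # follow backpointers
        best.append(int(back[t, best[-1]]))
    return best[::-1]

em = np.array([[2.0, 0.0], [0.0, 3.0], [2.0, 0.0]])
tr = np.array([[0.0, -1.0], [-1.0, 0.0]])  # discourage tag changes
print(viterbi(em, tr))  # [0, 1, 0]
```

The transition matrix is where the legal constraints between adjacent labels live: forbidden transitions would receive large negative scores.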
5) Model training:
a. The network reads the training samples and trains, iterating from 1; training stops when the iteration count exceeds the maximum number K. For each input batch of the training data set, the loss of the current output is computed from the loss function and used to measure the degree of training of the model. If the loss is greater than the preset minimum loss value, the model still needs training and adjustment, and the network parameters of each layer are updated in turn using the back-propagation algorithm; if the loss is less than the preset minimum loss value, the model has reached the training standard, training ends and the program exits.
b. After the current batch traversal of the training data set is completed, the degree of training is verified with the validation set. If the current validation result is better than the historical best, the current training is effective and model performance is still rising, so training can continue: the current data is recorded, the historical best is replaced by the current validation result, and the next round of training proceeds. If the validation result does not improve over M consecutive rounds, which may indicate that the learning-rate step is too large and keeps jumping across the loss minimum, the learning rate is reduced appropriately and training is attempted again; this is repeated until the learning rate falls below the system preset value, at which point training ends.
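The early-stopping and learning-rate-decay schedule in steps a and b can be sketched as follows. `train_one_epoch` and `evaluate` are hypothetical stand-ins for the actual training and validation passes, and the constants are illustrative defaults, not the patent's settings.

```python
def fit(train_one_epoch, evaluate, max_iters=100, patience=5,
        lr=0.01, min_lr=1e-5, decay=0.5):
    """Train up to max_iters, tracking the best validation score; after
    `patience` rounds with no improvement, shrink the learning rate,
    and stop once it falls below `min_lr`."""
    best, stale = float("-inf"), 0
    for _ in range(max_iters):
        train_one_epoch(lr)
        score = evaluate()
        if score > best:
            best, stale = score, 0   # new historical best: record and continue
        else:
            stale += 1
            if stale >= patience:    # M rounds without improvement
                lr *= decay          # reduce the learning rate and retry
                stale = 0
                if lr < min_lr:      # below the system preset: end training
                    break
    return best
```

For example, driving it with a fixed sequence of validation scores returns the best score seen before the schedule exhausts its budget.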
c. After model training is finished, the training quality of the model is tested. The testing procedure is as follows:
(1) load the network parameters obtained by training into the model and input the test data set;
(2) the network receives the test data set and obtains the final test output through the forward-propagation algorithm;
(3) the output sequence of the network model is compared against the correct labeled sequence;
(4) finally, the accuracy, F1 value and recall rate are computed.
The experiments of this example were performed on Microsoft's MSRA news data set and the public LiteratureNER data set, respectively.
MSRA comes from the SIGHAN 2006 shared task for Chinese named entity recognition. The data set contains 3 entity types: person, organization and location. Statistics show that it contains 48998 sentences for training and 4432 sentences for testing; because the MSRA data set lacks a validation set, this example takes one tenth of the training set as the validation set.
The LiteratureNER data set was constructed from hundreds of Chinese literature articles, excluding articles that were too short or too cluttered. There are 7 entity types: person, organization, location, abstraction, time, thing and metric. The data set is split as follows: 26320 training sentences, 2045 validation sentences and 3016 test sentences.
In this example, to demonstrate the superiority of the results, experimental methods from multiple publications are selected as baselines for comparison; the final performance of the model is compared against the baseline models in Tables 1 and 2.
This example adopts the evaluation indexes common in previous work, namely Precision (P), Recall (R) and F1-score (F1). Precision reflects the ratio of correctly predicted tokens to all predicted tokens; recall reflects the ratio of correctly predicted tokens to all tokens in the data used; F1 is the harmonic mean of precision and recall. The three indexes are computed as follows:
P = TP / (TP + FP)
R = TP / (TP + FN)
F1 = 2 × P × R / (P + R)
wherein: TP is the number of tokens the model predicts as positive that are actually positive;
FP is the number of tokens the model predicts as positive that are actually negative;
TN is the number of tokens the model predicts as negative that are actually negative;
FN is the number of tokens the model predicts as negative that are actually positive.
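The three indexes follow directly from the token counts; a minimal sketch:

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 from token counts, per the formulas above."""
    p = tp / (tp + fp)        # correct positives / predicted positives
    r = tp / (tp + fn)        # correct positives / actual positives
    f1 = 2 * p * r / (p + r)  # harmonic mean of precision and recall
    return p, r, f1

print(prf1(80, 20, 20))  # approximately (0.8, 0.8, 0.8)
```

Note that TN does not enter P, R or F1; it appears only in the Accuracy index used later for the training curves.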
Finally, the experiments show that without any hand-crafted feature templates the new model obtains better results: the F1 value reaches 91.37% on MSRA and 73.23% on the LiteratureNER data set, better than previous results and improved by 0.5% and 0.2% respectively over Zhang et al., reaching the current best performance on the task. The method therefore can run on multiple data sets, has strong applicability and high accuracy, and can be applied to multi-domain text.
Table 1 comparison of results on dataset MSRA
Table 2 comparison of results on dataset LiteratureNER
During model training, a researcher can judge the training state of the model from the per-label effect curves and the Accuracy curve over the iterations. Here the evaluation index Accuracy is the ratio of correctly predicted samples in the validation set to the total number of predicted samples, regardless of whether the samples are positive or negative; it is computed as:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Thus, this example selects 33 of the 100 iterations of the experimental setup for display and separately plots the F1 values of the model on the two data sets, as shown in figs. 2-5.
It can be seen from the figures that the F1 value of the model on the MSRA data set converges relatively quickly from the start of iteration, stabilizing after about 15 iterations and then fluctuating within a small range, while on the LiteratureNER data set the model approaches a stable state from the initial training stage. This is related to the composition and data volume of the two data sets: because the MSRA data set is relatively large, longer training is required to reach a stable state. Nevertheless, the F1 curves on both data sets show that the model converges quickly without falling into overfitting, and is well suited to the Chinese named entity recognition task.
Although the embodiments of the present invention and the accompanying drawings are disclosed for illustrative purposes, those skilled in the art will appreciate that various substitutions, changes and modifications are possible without departing from the spirit and scope of the invention and the appended claims; therefore, the scope of the invention is not limited to the disclosure of the embodiments and the accompanying drawings.

Claims (5)

1. A Chinese named entity recognition method based on deep learning, characterized in that the recognition method comprises the following steps:
1) embedding a mixed vector of character, word and position information for the text data;
2) inputting the vectors obtained in step 1) into a Bi-LSTM layer for encoding, modeling the time series to capture long-range dependencies between vectors;
3) inputting the vectors output by the Bi-LSTM layer into a self-attention layer, explicitly learning the dependency between any two characters in a sentence and capturing the internal structural information of the sentence;
4) inputting the output vector sequence into a CRF layer, making independent tagging decisions and decoding the labels.
2. The method for recognizing Chinese named entities based on deep learning of claim 1, wherein the specific operation of step 1) is as follows:
a. establishing a dictionary from the training data set and obtaining a one-hot vector for each character, whose length is the dictionary length V, then mapping the one-hot vectors into low-dimensional dense vectors through a look-up layer using a pre-trained character position vector matrix;
b. splicing three vectors (the character vector, the vector of the word to which the character belongs, and the word position vector) into one vector, which serves as the input of the network model; for a Chinese token sequence
X = (x1, x2, x3, ..., xn)
checking whether each token exists in the word lookup table and the character lookup table; when the token exists in both tables, i.e. the token consists of one character, the vectors of the two embeddings are combined as the distributed representation of the token; otherwise, only the embedding from the one lookup table is used as the output of the embedding layer, and the word position vector is initialized to the word vector of the word in which the character is located.
3. The method for recognizing Chinese named entities based on deep learning of claim 1, wherein the specific operation of step 2) is as follows: the character-word mixed vector of each character in the input sequence is fed into the Bi-LSTM layer, one vector per time step, to extract global features; the bidirectional LSTM network produces the forward hidden sequence (h→1, h→2, ..., h→n) and the backward hidden sequence (h←1, h←2, ..., h←n); the two hidden sequences are spliced position-wise to obtain the complete hidden sequence
ht = [h→t ; h←t], t = 1, ..., n
and this hidden sequence is taken as the input of the next layer.
4. The method for recognizing Chinese named entities based on deep learning of claim 1, wherein the specific operation of step 3) is as follows:
let H = (h1, h2, ..., hn) denote the output of the Bi-LSTM hidden layer for the input time steps; following the principle of the multi-head attention mechanism, the input vectors are first linearly transformed and then scaled dot-product attention is applied, where the attention formula is:
Attention(Q, K, V) = softmax(QKᵀ / √d)V
wherein: Q ∈ R^(n×2dh) is the query matrix; K ∈ R^(n×2dh) is the key matrix; V ∈ R^(n×2dh) is the value matrix; and d is the dimension of the Bi-LSTM hidden units, numerically equal to 2dh.
Setting Q = K = V = H, multi-head attention first linearly projects the queries, keys and values h times with different learned projections, then applies scaled dot-product attention to the h projections in parallel, and finally concatenates the attention results and projects them once more to obtain the new representation.
5. The method for recognizing Chinese named entities based on deep learning of claim 1, wherein the specific operation of step 4) is as follows:
the result is fed into the CRF layer, which contains a transition matrix representing the transition scores between all labels. In the CRF layer, the score of each label for each word consists of two parts: the unary emission score output by the LSTM model and the binary transition score from the CRF. The transition matrix imposes legal constraints between predicted labels, increasing the grammatical consistency of the label sequence; finally, Viterbi decoding is used to infer the label sequence with the highest score.
CN201911271419.XA 2019-12-12 2019-12-12 Chinese named entity recognition method based on deep learning Active CN111178074B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911271419.XA CN111178074B (en) 2019-12-12 2019-12-12 Chinese named entity recognition method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911271419.XA CN111178074B (en) 2019-12-12 2019-12-12 Chinese named entity recognition method based on deep learning

Publications (2)

Publication Number Publication Date
CN111178074A true CN111178074A (en) 2020-05-19
CN111178074B CN111178074B (en) 2023-08-25

Family

ID=70650181

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911271419.XA Active CN111178074B (en) 2019-12-12 2019-12-12 Chinese named entity recognition method based on deep learning

Country Status (1)

Country Link
CN (1) CN111178074B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111666427A (en) * 2020-06-12 2020-09-15 长沙理工大学 Entity relationship joint extraction method, device, equipment and medium
CN111967265A (en) * 2020-08-31 2020-11-20 广东工业大学 Chinese word segmentation and entity identification combined learning method capable of automatically generating data set
CN112037179A (en) * 2020-08-11 2020-12-04 深圳大学 Method, system and equipment for generating brain disease diagnosis model
CN112069823A (en) * 2020-09-17 2020-12-11 华院数据技术(上海)有限公司 Information processing method and device
CN112084336A (en) * 2020-09-09 2020-12-15 浙江综合交通大数据中心有限公司 Entity extraction and event classification method and device for expressway emergency
CN112084783A (en) * 2020-09-24 2020-12-15 中国民航大学 Entity identification method and system based on civil aviation non-civilized passengers
CN112464663A (en) * 2020-12-01 2021-03-09 小牛思拓(北京)科技有限公司 Multi-feature fusion Chinese word segmentation method
CN112685549A (en) * 2021-01-08 2021-04-20 昆明理工大学 Method and system for identifying entity of affair-related news element integrated with chapter semantics
CN113033206A (en) * 2021-04-01 2021-06-25 重庆交通大学 Bridge detection field text entity identification method based on machine reading understanding
CN113255294A (en) * 2021-07-14 2021-08-13 北京邮电大学 Named entity recognition model training method, recognition method and device
CN113283243A (en) * 2021-06-09 2021-08-20 广东工业大学 Entity and relation combined extraction method
CN113449524A (en) * 2021-04-01 2021-09-28 山东英信计算机技术有限公司 Named entity identification method, system, equipment and medium
CN113486173A (en) * 2021-06-11 2021-10-08 南京邮电大学 Text labeling neural network model and labeling method thereof
CN116151241A (en) * 2023-04-19 2023-05-23 湖南马栏山视频先进技术研究院有限公司 Entity identification method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107145483A (en) * 2017-04-24 2017-09-08 北京邮电大学 An adaptive Chinese word segmentation method based on embedding representations
CN109388807A (en) * 2018-10-30 2019-02-26 中山大学 Method, apparatus and storage medium for named entity recognition in electronic health records
CN109614614A (en) * 2018-12-03 2019-04-12 焦点科技股份有限公司 A self-attention-based BiLSTM-CRF product name recognition method
CN110083710A (en) * 2019-04-30 2019-08-02 北京工业大学 A word definition generation method based on recurrent neural networks and latent variable structure
CN110334339A (en) * 2019-04-30 2019-10-15 华中科技大学 A sequence labeling model and labeling method based on a location-aware self-attention mechanism
CN110377686A (en) * 2019-07-04 2019-10-25 浙江大学 An address information feature extraction method based on a deep neural network model
CN110399492A (en) * 2019-07-22 2019-11-01 阿里巴巴集团控股有限公司 Training method and device of a classification model for user question sentences

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Liu Xiaojun et al.: "Named Entity Recognition Based on Bi-LSTM and Attention Mechanism", Journal of Luoyang Institute of Science and Technology *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111666427B (en) * 2020-06-12 2023-05-12 长沙理工大学 Entity relationship joint extraction method, device, equipment and medium
CN111666427A (en) * 2020-06-12 2020-09-15 长沙理工大学 Entity relationship joint extraction method, device, equipment and medium
CN112037179A (en) * 2020-08-11 2020-12-04 深圳大学 Method, system and equipment for generating brain disease diagnosis model
CN111967265A (en) * 2020-08-31 2020-11-20 广东工业大学 Chinese word segmentation and entity identification combined learning method capable of automatically generating data set
CN111967265B (en) * 2020-08-31 2023-09-15 广东工业大学 Chinese word segmentation and entity recognition combined learning method for automatic generation of data set
CN112084336A (en) * 2020-09-09 2020-12-15 浙江综合交通大数据中心有限公司 Entity extraction and event classification method and device for expressway emergency
CN112069823A (en) * 2020-09-17 2020-12-11 华院数据技术(上海)有限公司 Information processing method and device
CN112084783B (en) * 2020-09-24 2022-04-12 中国民航大学 Entity identification method and system based on civil aviation non-civilized passengers
CN112084783A (en) * 2020-09-24 2020-12-15 中国民航大学 Entity identification method and system based on civil aviation non-civilized passengers
CN112464663A (en) * 2020-12-01 2021-03-09 小牛思拓(北京)科技有限公司 Multi-feature fusion Chinese word segmentation method
CN112685549B (en) * 2021-01-08 2022-07-29 昆明理工大学 Document-related news element entity identification method and system integrating discourse semantics
CN112685549A (en) * 2021-01-08 2021-04-20 昆明理工大学 Method and system for identifying entity of affair-related news element integrated with chapter semantics
CN113449524A (en) * 2021-04-01 2021-09-28 山东英信计算机技术有限公司 Named entity identification method, system, equipment and medium
CN113033206A (en) * 2021-04-01 2021-06-25 重庆交通大学 Bridge detection field text entity identification method based on machine reading understanding
CN113283243A (en) * 2021-06-09 2021-08-20 广东工业大学 Entity and relation combined extraction method
CN113283243B (en) * 2021-06-09 2022-07-26 广东工业大学 Entity and relationship combined extraction method
CN113486173A (en) * 2021-06-11 2021-10-08 南京邮电大学 Text labeling neural network model and labeling method thereof
CN113486173B (en) * 2021-06-11 2023-09-12 南京邮电大学 Text labeling neural network model and labeling method thereof
CN113255294A (en) * 2021-07-14 2021-08-13 北京邮电大学 Named entity recognition model training method, recognition method and device
CN116151241A (en) * 2023-04-19 2023-05-23 湖南马栏山视频先进技术研究院有限公司 Entity identification method and device

Also Published As

Publication number Publication date
CN111178074B (en) 2023-08-25

Similar Documents

Publication Publication Date Title
CN111178074B (en) Chinese named entity recognition method based on deep learning
CN109871535B (en) French named entity recognition method based on deep neural network
CN109960724B (en) Text summarization method based on TF-IDF
CN109344236B (en) Problem similarity calculation method based on multiple characteristics
CN107391486B (en) Method for identifying new words in field based on statistical information and sequence labels
Wang et al. Integrating extractive and abstractive models for long text summarization
US10891427B2 (en) Machine learning techniques for generating document summaries targeted to affective tone
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
Ayana et al. Recent advances on neural headline generation
CN112035730B (en) Semantic retrieval method and device and electronic equipment
CN111241294A (en) Graph convolution network relation extraction method based on dependency analysis and key words
CN105068997B (en) The construction method and device of parallel corpora
CN110704621A (en) Text processing method and device, storage medium and electronic equipment
CN109271524B (en) Entity linking method in knowledge base question-answering system
CN110688836A (en) Automatic domain dictionary construction method based on supervised learning
CN108363691B (en) Domain term recognition system and method for power 95598 work order
CN110096572B (en) Sample generation method, device and computer readable medium
CN112101027A (en) Chinese named entity recognition method based on reading understanding
CN110569505A (en) text input method and device
CN110688450A (en) Keyword generation method based on Monte Carlo tree search, keyword generation model based on reinforcement learning and electronic equipment
CN115146629A (en) News text and comment correlation analysis method based on comparative learning
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN113723103A (en) Chinese medical named entity and part-of-speech combined learning method integrating multi-source knowledge
CN112232055A (en) Text detection and correction method based on pinyin similarity and language model
CN115906805A (en) Long text abstract generating method based on word fine granularity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant